Find variable by description

Hi, I downloaded a Parquet file containing a large table from the PISA report, with a description of each variable stored under its name. The names are codes, so they don't make any sense on their own. I would like to find variables by the description under their names, e.g. find all the variables that have "television" in their description.

Is it possible?

The descriptions are almost certainly stored as attributes, either on each column or on the data.frame as a whole, so I can almost guarantee it's possible.
If you provide a (small) reprex as an example, we can try to do it.
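As a rough sketch of what "attributes on a column" means, the snippet below builds a toy data.frame and inspects it with base R. The column names and label texts are invented for illustration and are not taken from the PISA file; the attribute name "label" is an assumption about how such descriptions are commonly stored.

```r
# Toy data frame standing in for the real data; names and labels are made up
toy <- data.frame(VAR_A = c(1, 0, 1), VAR_B = c("a", "b", "c"))
attr(toy$VAR_A, "label") <- "Has a television at home"
attr(toy$VAR_B, "label") <- "Country code"

# Everything attached to one column
attributes(toy$VAR_A)

# Only the description of that column
attr(toy$VAR_A, "label")

# Attributes of the data.frame as a whole (names, class, row.names)
names(attributes(toy))
```

If a column prints something under `$label` in `attributes()`, that is where the description lives.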

Thank you.
I am not sure how to subset the dataset to make it reproducible.
I tried saving it as text (write.table), but when I import the data again the descriptions are lost.
I downloaded the Parquet file from here:

https://peterejkemp.github.io/#sec-loading-computer
On that page, search for: Download the PISA_2018_student_subset.parquet

Load your Parquet data so that it becomes an R data.frame (I assume you have already done this?).

Then, for the data.frame that you made,
try selecting the first 3 columns and the first 6 records, and use dput() to turn the object into a text definition.

my_df_from_para <- load_para(...) # placeholder: use whatever function you load the file with
# export to text that you can share on the forum
my_df_from_para[1:6, 1:3] |> dput()
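A small sketch of why dput() is the suggested route rather than write.table(): a plain text table has nowhere to store column attributes, whereas dput() writes them into the R code it prints. The toy column name and label are invented for illustration.

```r
# Toy data frame with one labelled column (made-up name and label)
toy <- data.frame(VAR_A = c(1, 0, 1))
attr(toy$VAR_A, "label") <- "Has a television at home"

# Round trip through a tab-separated text file: the label does not survive
tmp <- tempfile(fileext = ".txt")
write.table(toy, file = tmp, sep = "\t", row.names = FALSE)
from_text <- read.table(tmp, header = TRUE, sep = "\t")
is.null(attr(from_text$VAR_A, "label"))

# Round trip through dput(): the label is part of the printed definition
dput_text <- capture.output(dput(toy))
from_dput <- eval(parse(text = dput_text))
attr(from_dput$VAR_A, "label")
```

The text that dput() prints can be pasted into a forum post, and anyone can rebuild the same object from it, labels included.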

Sorry, I am a bit lost.

Here is my code so far:

library(arrow)
library(tidyverse)

PISA_2018 <- read_parquet("C:/Users/JMM/2023/sub/PISA_2018_student_subset.parquet")

typeof(PISA_2018)
# it is a list
PISA_2018 <- as.data.frame(PISA_2018)
typeof(PISA_2018)
# it is still a list

PISA_2018_SUBSET <- PISA_2018[c(1:11),c(1:7)]

write.table(PISA_2018_SUBSET, file = "PISA_SUBSET.txt", sep = "\t",
            row.names = TRUE, col.names = NA)

When I try to use load_para(), it doesn't work. Do I need to install a package to use it?

my_df_from_para <- load_para(""	"CNT"	"OECD"	"PV1MATH"	"PV1READ"	"PV1SCIE"	"OCUM"	"OCUF"
                             "1"	"Albania"	"No"	490.187	375.984	445.039	"Domestic housekeepers"	"Civil engineers"
                             "2"	"Albania"	"No"	462.464	434.352	421.731	"Police officers"	"Professional services managers not elsewhere classified"
                             "3"	"Albania"	"No"	406.949	359.191	392.223	"Domestic housekeepers"	"Building construction labourers"
                             "4"	"Albania"	"No"	482.501	425.131	515.942	"Housewife"	"Housewife"
                             "5"	"Albania"	"No"	459.804	306.028	328.261	"Manufacturing labourers not elsewhere classified"	"Building construction labourers"
                             "6"	"Albania"	"No"	367.165	352.271	284.263	"Other cleaning workers"	"Car, taxi and van drivers"
                             "7"	"Albania"	"No"	411.192	412.724	486.595	"Missing"	"Missing"
                             "8"	"Albania"	"No"	441.037	271.213	391.562	"Missing"	"Housewife"
                             "9"	"Albania"	"No"	506.093	373.022	389.218	"Invalid"	"Building construction labourers"
                             "10"	"Albania"	"No"	412.011	412.048	389.048	"Missing"	"Missing"
                             "11"	"Albania"	"No"	504.49	426.085	491.192	"Domestic housekeepers"	"Crop farm labourers"
                             )
# Error: unexpected string constant in "my_df_from_para <- load_para("" "CNT""

Thank you for your patience

I didn't know you were using read_parquet(); I wrote load_para() only as a placeholder for that kind of function.
I wanted to draw your attention to the step after your Parquet data has been loaded as a data.frame:
apply the subsetting to that object, and dput() it for sharing.
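On the typeof() check earlier in the thread: a data.frame is stored internally as a list of columns, so typeof() reporting "list" both before and after as.data.frame() is expected and does not mean the conversion failed. A minimal sketch on a toy object:

```r
toy <- data.frame(x = 1:3)

typeof(toy)         # storage type: "list"
class(toy)          # "data.frame"
is.data.frame(toy)  # TRUE
```

class() or is.data.frame() is the more informative check here.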

Thank you. This is what I get from

PISA_2018[1:6, 1:3] |> dput()
PISA_2018 <- structure(list(CNT = structure(c(1L, 1L, 1L, 1L, 1L, 1L), levels = c("Albania", 
"United Arab Emirates", "Argentina", "Australia", "Austria", 
"Belgium", "Bulgaria", "Bosnia and Herzegovina", "Belarus", "Brazil", 
"Brunei Darussalam", "Canada", "Switzerland", "Chile", "Colombia", 
"Costa Rica", "Czech Republic", "Germany", "Denmark", "Dominican Republic", 
"Spain", "Estonia", "Finland", "France", "United Kingdom", "Georgia", 
"Greece", "Hong Kong", "Croatia", "Hungary", "Indonesia", "Ireland", 
"Iceland", "Israel", "Italy", "Jordan", "Japan", "Kazakhstan", 
"Korea", "Kosovo", "Lebanon", "Lithuania", "Luxembourg", "Latvia", 
"Macao", "Morocco", "Moldova", "Mexico", "North Macedonia", "Malta", 
"Montenegro", "Malaysia", "Netherlands", "Norway", "New Zealand", 
"Panama", "Peru", "Philippines", "Poland", "Portugal", "Qatar", 
"Baku (Azerbaijan)", "B-S-J-Z (China)", "Cyprus", "Moscow City (RUS)", 
"Moscow Region (RUS)", "Tatarstan (RUS)", "Romania", "Russian Federation", 
"Saudi Arabia", "Singapore", "Serbia", "Slovak Republic", "Slovenia", 
"Sweden", "Chinese Taipei", "Thailand", "Turkey", "Ukraine", 
"Uruguay", "United States", "Vietnam"), label = "Country code 3-character", class = "factor"), 
    OECD = structure(c(1L, 1L, 1L, 1L, 1L, 1L), levels = c("No", 
    "Yes"), label = "OECD country", class = "factor"), ISCEDL = structure(c(3L, 
    3L, 3L, 3L, 3L, 3L), levels = c("ISCED level 1", "ISCED level 2", 
    "ISCED level 3", "ISCED level 4", "ISCED level 5", "Valid Skip", 
    "Not Applicable", "Invalid", "No Response"), label = "ISCED level", class = "factor")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))
# make a dictionary of variable names and labels that you can both look at and query
# (map_chr() expects every column to have a "label" attribute)
(mydict <- enframe(map_chr(PISA_2018, \(x) attr(x, "label"))))

# variable names whose labels contain "country"
(to_get <- filter(mydict,
                  str_detect(tolower(value),
                             pattern = fixed("country"))) |> pull(name))

select(PISA_2018, all_of(to_get))
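For the original "television" question, the same idea can be sketched in base R in a way that also tolerates columns without a label. This is only an illustration on toy data: the column names and label texts below are invented, and whether the real file has unlabelled columns is an assumption, not something checked here.

```r
# Toy stand-in for the real data; names and labels are made up
toy <- data.frame(VAR_A = c(1, 0), VAR_B = c(0, 1), ID = 1:2)
attr(toy$VAR_A, "label") <- "In your home: A television"
attr(toy$VAR_B, "label") <- "In your home: A desk to study at"
# ID deliberately has no label

# One label per column, NA where a column has none
labels <- vapply(toy, function(x) {
  lab <- attr(x, "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else as.character(lab)[1]
}, character(1))

# Columns whose label mentions "television", ignoring case
hits <- names(labels)[!is.na(labels) &
                        grepl("television", labels, ignore.case = TRUE)]
hits

# Keep only those columns
toy[, hits, drop = FALSE]
```

Swapping "television" for another word searches the descriptions for that word instead.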

Awesome!
Many thanks dude 🤩
