Extracting rows based on a list of IDs from a .csv

kylec1729 · May 26, 2020, 4:52am

I'm trying to extract rows of a dataset, data, using a column of values stored in a .csv file. I tried the following:

df <- data$data['rs146217251',1:486757,]
id <- c("1006573_1006573", "1008603_1008603", "1012222_1012222",...)
df[id,]

where I left out the rest of the values in id because there are 1800 of them. There's a character limit on how many values I can enter with id <- c("1006573_1006573", "1008603_1008603", "1012222_1012222",...), so I decided to try importing them from a .csv file instead.

How do I import the column from the .csv file into the above code so that I get a variable just like id <- c("1006573_1006573", "1008603_1008603", "1012222_1012222",...) but with all of the values I want included?

Or is there a better way of doing this?

FJCC · May 26, 2020, 5:41am

I would use the semi_join function from the dplyr package.

DF <- data.frame(ID = rep(LETTERS[1:4], each = 3), Value = 1:12, stringsAsFactors = FALSE)
DF
#>    ID Value
#> 1   A     1
#> 2   A     2
#> 3   A     3
#> 4   B     4
#> 5   B     5
#> 6   B     6
#> 7   C     7
#> 8   C     8
#> 9   C     9
#> 10  D    10
#> 11  D    11
#> 12  D    12
FilterDF <- data.frame(ID = c("A", "C"), stringsAsFactors = FALSE)
library(dplyr)

#Keep only ID == A and ID == C
DF <- semi_join(DF, FilterDF, by = "ID")
DF
#>   ID Value
#> 1  A     1
#> 2  A     2
#> 3  A     3
#> 4  C     7
#> 5  C     8
#> 6  C     9

Created on 2020-05-25 by the reprex package (v0.3.0)

kylec1729 · May 26, 2020, 5:48am

I'm not sure if this helps. My list is too large to define ID using ID = c("A", "C"), unless I imported the data from a .csv. This was the same problem I was trying to get around in the first place if I'm not mistaken.

FJCC · May 26, 2020, 12:18pm

I am sorry I was unclear. Instead of writing out FilterDF with the data.frame function, you can read it in using read.csv() or a similar function. The resulting object will be a data frame that you can then use in the same way that I used FilterDF. The code would look like this, assuming that DF, the data frame you want to filter, already exists.

FilterDF <- read.csv("MyFile.csv", stringsAsFactors= FALSE)
DF <- semi_join(DF, FilterDF, by = "ID")

That assumes that MyFile.csv has a column named ID containing the IDs that you want to keep and that the name of the corresponding column in DF is also ID.

If you explain the structure of the data frame in which you want to keep only some values, I could be more specific about the code.
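
To get a plain vector like id <- c(...) out of the file, the ID column can be pulled out of the data frame that read.csv() returns. This is only a minimal sketch in base R: the temporary file and its contents are made up for illustration, and it assumes the column is named ID, as above.

```r
# Hypothetical CSV with a single column named ID (file and contents are made up)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ID", "A", "C"), csv_path)

# Read every column as character so IDs are never converted to numbers
FilterDF <- read.csv(csv_path, colClasses = "character")

# Pull the column out as a plain character vector, like id <- c("A", "C")
id <- FilterDF$ID

# Same toy data frame as in the earlier example
DF <- data.frame(ID = rep(LETTERS[1:4], each = 3), Value = 1:12,
                 stringsAsFactors = FALSE)

# Base R equivalent of the semi_join: keep rows whose ID appears in id
kept <- DF[DF$ID %in% id, ]
kept
```

If the two columns have different names, semi_join also accepts a named by argument, for example by = c("eid" = "ID"), where the left-hand name belongs to DF and the right-hand name to FilterDF.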

nirgrahamuk · May 26, 2020, 12:47pm

#made up patient info
set.seed(42) # for reproducibility of the random numbers
(madeupinfo <- data.frame(
  patient_id = sample.int(n=10^6,
                          size=20,
                          replace=FALSE),
  some_measures = runif(20)
))

#write them to csv
write.csv(madeupinfo,"madeupinfo.csv")

#at this point I recommend opening up madeupinfo.csv
#first in Notepad and then Excel, and observing how the same data appears in both

got_again <- read.csv("madeupinfo.csv")

# let's make a new file that stores just the ids of 3 random patients

(pick3 <- sample(got_again$patient_id,3,replace=FALSE))

#enframe this vector as a dataframe and then write to a csv
library(tidyverse)
(enframed_3 <- enframe(pick3,
                       name = NULL,
                       value="patient_id"))
#write them to csv
write.csv(enframed_3,"3ids.csv")
#again recommend peeking at this (notepad/excel)

(get_ids <-  read.csv("3ids.csv"))

(ids_vec <- pull(get_ids,patient_id) )

(filtered_result <- filter(got_again,patient_id %in% ids_vec))

kylec1729 · May 27, 2020, 12:50am

I get the following error:

> DF <- semi_join(df, filterdf, by = "eid")
Error in UseMethod("semi_join") : 
  no applicable method for 'semi_join' applied to an object of class "c('matrix', 'double', 'numeric')"

where df <- data$data['rs146217251',1:486757,], filterdf <- read.csv("patientids.csv", stringsAsFactors= FALSE), and eid is the name of the column with all of the patient IDs. How do I fix this new error?

FJCC · May 27, 2020, 1:55am

The error message appears because df is not a data frame; it is a numeric matrix, and semi_join only has methods for data frames. That is what is meant by an object of class "c('matrix', 'double', 'numeric')". To confirm that, you can run

class(df)

Are the patient IDs numbers? They seemed to be characters in your first post.

While we are checking things, please post the results of these commands.

dim(df)

summary(filterdf)
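
In the meantime, here is a rough sketch of two ways this could go once df is known to be a matrix. It rests on an assumption that has not been confirmed in this thread: that the row names of df are the patient IDs (strings like "1006573_1006573"). The toy matrix m, its column names, and the id vector are all made up; in the real case id would come from filterdf$eid.

```r
# Toy stand-in for df; the real matrix comes from data$data['rs146217251', 1:486757, ]
m <- matrix(as.numeric(1:12), nrow = 4,
            dimnames = list(c("1006573_1006573", "1008603_1008603",
                              "1012222_1012222", "1013333_1013333"),
                            c("p0", "p1", "p2")))

# IDs to keep; the last one is deliberately absent from the matrix
id <- c("1006573_1006573", "1012222_1012222", "9999999_9999999")

# Option 1: index the matrix by row name.
# intersect() drops IDs that are not present, which would otherwise
# raise a "subscript out of bounds" error.
keep <- intersect(id, rownames(m))
sub_m <- m[keep, , drop = FALSE]

# Option 2: convert to a data frame with an explicit eid column,
# after which %in% (or semi_join) can be used.
DF <- data.frame(eid = rownames(m), m, row.names = NULL,
                 stringsAsFactors = FALSE)
DF_kept <- DF[DF$eid %in% id, ]
```

Option 1 keeps the object a matrix, while Option 2 yields a data frame that can be passed to semi_join(DF, filterdf, by = "eid"). If the row names turn out not to be the IDs, neither option applies as written.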
