Extracting rows based on a list of IDs from a .csv

I'm trying to extract rows of a dataset, data, using a column of values in a .csv file. I tried to do the following:

df <- data$data['rs146217251',1:486757,]
id <- c("1006573_1006573", "1008603_1008603", "1012222_1012222",...)
df[id,]

where I left out the rest of the values in id, because there are 1800 values I'm trying to include. There's a character limit on how many values I can supply with id <- c("1006573_1006573", "1008603_1008603", "1012222_1012222",...), so I decided to try importing them from a .csv file instead.

How do I import the column from the .csv file into the above code so that I get a variable just like id <- c("1006573_1006573", "1008603_1008603", "1012222_1012222",...) but with all of the values I want included?

Or is there a better way of doing this?

I would use the semi_join function from the dplyr package.

DF <- data.frame(ID = rep(LETTERS[1:4], each = 3), Value = 1:12, stringsAsFactors = FALSE)
DF
#>    ID Value
#> 1   A     1
#> 2   A     2
#> 3   A     3
#> 4   B     4
#> 5   B     5
#> 6   B     6
#> 7   C     7
#> 8   C     8
#> 9   C     9
#> 10  D    10
#> 11  D    11
#> 12  D    12
FilterDF <- data.frame(ID = c("A", "C"), stringsAsFactors = FALSE)
library(dplyr)

#Keep only ID == A and ID == C
DF <- semi_join(DF, FilterDF, by = "ID")
DF
#>   ID Value
#> 1  A     1
#> 2  A     2
#> 3  A     3
#> 4  C     7
#> 5  C     8
#> 6  C     9

Created on 2020-05-25 by the reprex package (v0.3.0)

I'm not sure if this helps. My list is too large to define ID using ID = c("A", "C"), unless I imported the data from a .csv. This was the same problem I was trying to get around in the first place if I'm not mistaken.

I am sorry I was unclear. Instead of writing out FilterDF using the data.frame function, you can read it in using read.csv() or another similar function. The resulting object will be a data frame that you can then use in the same way that I used FilterDF. The code would look like this, assuming that DF, the data frame you want to filter, already exists.

FilterDF <- read.csv("MyFile.csv", stringsAsFactors= FALSE)
DF <- semi_join(DF, FilterDF, by = "ID")

That assumes that MyFile.csv has a column named ID containing the IDs that you want to keep and that the name of the corresponding column in DF is also ID.
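
If what you want is a plain character vector like your original id, rather than a join, the column can be pulled out of the imported data frame. Here is a minimal sketch, assuming the file has a column named ID. The temporary file and the toy data stand in for MyFile.csv and your real data; they are made up for illustration.

```r
# Toy stand-in for MyFile.csv (hypothetical contents), written to a temp file
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(ID = c("A", "C")), csv_path, row.names = FALSE)

# Read the file and extract the ID column as a character vector
FilterDF <- read.csv(csv_path, stringsAsFactors = FALSE)
id <- as.character(FilterDF$ID)

# Toy data frame to filter, same shape as the example DF above
DF <- data.frame(ID = rep(LETTERS[1:4], each = 3), Value = 1:12,
                 stringsAsFactors = FALSE)

# Keep the rows whose ID appears in the vector (base R, no dplyr needed)
kept <- DF[DF$ID %in% id, ]
```

If the key column has a different name in each data frame, semi_join() also accepts a named vector, for example by = c("eid" = "ID"), where the name on the left refers to the first data frame.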

If you explain the structure of the data frame in which you want to keep only some values, I could be more specific about the code.

#made up patient info
set.seed(42) # for reproducibility of the random numbers
(madeupinfo <- data.frame(
  patient_id = sample.int(n=10^6,
                          size=20,
                          replace=FALSE),
  some_measures = runif(20)
))

#write them to csv
write.csv(madeupinfo,"madeupinfo.csv")

#at this point I recommend opening up madeupinfo.csv
#first in notepad and then excel, and observing the appearance of the same data in both

got_again <- read.csv("madeupinfo.csv")

# let's make a new file that stores just the ids of 3 random patients

(pick3 <- sample(got_again$patient_id,3,replace=FALSE))

#enframe this vector as a dataframe and then write to a csv
library(tidyverse)
(enframed_3 <- enframe(pick3,
                       name = NULL,
                       value="patient_id"))
#write them to csv
write.csv(enframed_3,"3ids.csv")
#again recommend peeking at this (notepad/excel)

(get_ids <-  read.csv("3ids.csv"))

(ids_vec <- pull(get_ids,patient_id) )

(filtered_result <- filter(got_again,patient_id %in% ids_vec))

I get the following error:

> DF <- semi_join(df, filterdf, by = "eid")
Error in UseMethod("semi_join") : 
  no applicable method for 'semi_join' applied to an object of class "c('matrix', 'double', 'numeric')"

where df <- data$data['rs146217251',1:486757,], filterdf <- read.csv("patientids.csv", stringsAsFactors= FALSE), and eid is the name of the column with all of the patient IDs. How do I fix this new error?

The error message appears because df is not a data frame; it is a numeric matrix, and semi_join has no method for matrices. That is what is meant by an object of class "c('matrix', 'double', 'numeric')". To confirm that, you can run

class(df)

Are the patient IDs numbers? They seemed to be characters in your first post.

While we are checking things, please post the results of these commands.

dim(df)
summary(filterdf)
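
Once class(df) confirms a matrix, there are two common ways forward. The sketch below assumes the patient IDs are stored as the row names of the matrix, which has not been confirmed in this thread. The toy matrix m and the vector ids are made up for illustration.

```r
# Toy numeric matrix standing in for df; ASSUMPTION: the IDs are its row names
m <- matrix(as.numeric(1:12), nrow = 4,
            dimnames = list(c("p1", "p2", "p3", "p4"), c("a", "b", "c")))

# IDs to keep; "p9" is deliberately absent from m
ids <- c("p1", "p3", "p9")

# Option 1: subset the matrix directly. Using %in% skips IDs that are not
# present, whereas m[ids, ] stops with "subscript out of bounds".
sub_m <- m[rownames(m) %in% ids, , drop = FALSE]

# Option 2: convert to a data frame with an explicit key column,
# which is the kind of object semi_join() expects
df2 <- data.frame(eid = rownames(m), m, row.names = NULL,
                  stringsAsFactors = FALSE)
sub_df <- df2[df2$eid %in% ids, ]
```

With dplyr loaded, semi_join(df2, filterdf, by = "eid") should then work in place of the last line, provided eid has the same type (character or numeric) in both data frames.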
