Question using dplyr package and apply family function on two datasets

Stay1 · December 9, 2021, 10:50am

Good morning,

I am starting to use RStudio and have a take-home evaluation to do, but I'm stuck on a question.
(I hope I'm in the right place to ask my question)

In brief, I have two objects called "mutations" and "clinicals".

In my "mutations" dataset, each row represents one mutation, with columns for the patient's name and the name of the mutated gene.
Patients can have several mutations on the same gene.

In my "clinicals" dataset, one row is unique per patient.

The question asks me to add a column to the "clinicals" object that contains "YES" if the individual has at least one mutation in "TP53" and "NO" otherwise, and I have to use the apply family of functions.

I would like to apply to each row a function that assigns "YES" if the mutated gene is "TP53" and "NO" otherwise. However, how should I handle patients who are not part of the "clinicals" dataset, or who have several TP53 mutations?

If you can give me any advice, thank you!

Homework Question Checklist
- I am not posting verbatim elements of my homework assignment.
- Where reasonable, I am asking with a reproducible example.

Equation · December 9, 2021, 10:55am

Can you provide a reprex (minimum working example) of how your datasets are structured?

Stay1 · December 9, 2021, 11:22am

Of course, but as a new member I can't upload attachments. How can I share a reprex?

nirgrahamuk · December 9, 2021, 11:32am

FAQ: How to do a minimal reproducible example (reprex) for beginners (Guides & FAQs)

A minimal reproducible example consists of the following items:
- A minimal dataset, necessary to reproduce the issue
- The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages.

Let's quickly go over each one of these with examples:

Minimal Dataset (Sample Data)
You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame
head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…
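As a rough illustration of the "minimal dataset" step, here is a small sketch using base R's `dput()`, which prints an object as R code that can be pasted into a post. The object name `small` is only illustrative, and other tools can produce the same kind of output.

```r
# Take a few rows of a built-in data frame as the sample data.
small <- head(iris, 3)

# Print the data frame as R code that anyone can copy and run.
dput(small)

# Check that the printed code rebuilds a data frame with the same shape.
rebuilt <- eval(parse(text = paste(capture.output(dput(small)), collapse = "\n")))
names(rebuilt)
nrow(rebuilt)
```

Pasting the `dput()` output into the post lets other users recreate the data without any attachment.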

Stay1 · December 9, 2021, 11:45am

Here is the reprex for the "mutations" dataset:
mutations <- data.frame(
stringsAsFactors = FALSE,
Hugo_Symbol = c("UBR4","PLA2G2D","MTF1",
"CLCA2","WNT2B","ENG","HSPA12A","OR51Q1","OR5D13",
"ZFP91","PYROXD1"),
Chromosome = c("chr1","chr1","chr1",
"chr1","chr1","chr9","chr10","chr11","chr11","chr11",
"chr12"),
Start_Position = c(19155452L,20115562L,
37822366L,86443922L,112525957L,127818738L,116701033L,
5422209L,55773911L,58614261L,21468533L),
End_Position = c(19155452L,20115562L,
37822366L,86443922L,112525957L,127818738L,116701033L,
5422209L,55773911L,58614261L,21468533L),
Variant_Type = c("SNP","SNP","SNP","SNP",
"SNP","SNP","SNP","SNP","SNP","SNP","SNP"),
Reference_Allele = c("C", "G", "A", "G", "T", "G", "G", "G", "C", "G", "C"),
Tumor_Seq_Allele1 = c("C", "G", "A", "G", "T", "G", "G", "G", "C", "G", "C"),
Tumor_Seq_Allele2 = c("T", "A", "G", "A", "C", "A", "A", "C", "T", "C", "G"),
t_depth = c(96L,147L,49L,134L,31L,
141L,96L,140L,262L,96L,74L),
t_ref_count = c(91L,135L,46L,127L,19L,
134L,87L,123L,242L,80L,63L),
t_alt_count = c(5L, 12L, 3L, 7L, 12L, 7L, 9L, 17L, 20L, 16L, 11L),
Consequence = c("missense_variant",
"synonymous_variant","missense_variant","missense_variant",
"downstream_gene_variant","missense_variant",
"synonymous_variant","missense_variant","synonymous_variant",
"missense_variant","missense_variant"),
PolyPhen = c("benign(0.25)",NA,
"benign(0)","probably_damaging(0.938)",NA,"benign(0)",NA,
"benign(0.002)",NA,"probably_damaging(0.996)",
"possibly_damaging(0.751)"),
IMPACT = c("MODERATE","LOW","MODERATE",
"MODERATE","MODIFIER","MODERATE","LOW","MODERATE",
"LOW","MODERATE","MODERATE"),
Sample = c("TCGA-49-4490",
"TCGA-49-4490","TCGA-49-4490","TCGA-49-4490","TCGA-49-4490",
"TCGA-49-4490","TCGA-49-4490","TCGA-49-4490",
"TCGA-49-4490","TCGA-49-4490","TCGA-49-4490")
)

and here for the "clinicals" dataset:

clinicals <- data.frame(
stringsAsFactors = FALSE,
bcr_patient_barcode = c("TCGA-86-8669","TCGA-44-3396",
"TCGA-35-4123","TCGA-75-5147",
"TCGA-78-7158","TCGA-44-6779",
"TCGA-50-6594"),
additional_studies = c(NA, NA, NA, NA, NA, NA, NA),
tissue_source_site = c(86L,44L,35L,
75L,78L,44L,50L),
patient_id = c(8669L,3396L,
4123L,5147L,7158L,6779L,
6594L),
bcr_patient_uuid = c("7b89166e-c5b8-481b-aa70-495141499b91",
"3bd6badb-27ff-4d8d-b206-4d28dc264862",
"6cf49cf0-de4c-4c90-8358-eae19c6206b0",
"d2824e6d-3784-45c2-9b0f-52b17356b5da",
"501c987e-d1eb-48a9-89eb-72a5062c90b4",
"cbbea9f1-396a-4bf3-b67c-2cac3394dceb",
"8504fd86-a70a-4cba-9ec8-25c9e60ca549"),
informed_consent_verified = c("YES","YES",
"YES","YES","YES","YES",
"YES"),
icd_o_3_site = c("C34.1",
"C34.1","C34.1","C34.1","C34.3",
"C34.9","C34.1"),
icd_o_3_histology = c(81403L,
81403L,81403L,82523L,82553L,
81403L,81403L),
icd_10 = c("C34.1",
"C34.1","C34.1","C34.1","C34.3",
"C34.1","C34.1"),
day_of_form_completion = c(30L,18L,20L,
6L,15L,31L,25L),
month_of_form_completion = c(8L, 10L, 12L, 4L, 9L, 8L, 8L),
year_of_form_completion = c(2012L,2010L,
2010L,2011L,2011L,2011L,
2011L),
tissue_prospective_collection_indicator = c("YES","YES",
"NO","NO","NO","NO","NO"),
tissue_retrospective_collection_indicator = c("NO","NO",
"YES","YES","YES","YES","YES"),
days_to_birth = c(-23443L,
-27073L,-14064L,NA,-21742L,
-18469L,-28924L),
gender = c("MALE",
"FEMALE","MALE","FEMALE","FEMALE",
"FEMALE","FEMALE"),
race_list = c("WHITE",
"WHITE","WHITE",NA,"WHITE",
"WHITE","BLACK OR AFRICAN AMERICAN"),
ethnicity = c("NOT HISPANIC OR LATINO",
"NOT HISPANIC OR LATINO","NOT HISPANIC OR LATINO",
NA,NA,NA,
"NOT HISPANIC OR LATINO"),
other_dx = c("No","No",
"No","No","No","No","No"),
history_of_neoadjuvant_treatment = c("No","No",
"No","No","No","No","No"),
vital_status = c("Alive",
"Alive","Alive","Alive","Dead",
"Dead","Dead"),
days_to_last_followup = c(34L, 311L, 182L, NA, NA, NA, NA),
days_to_death = c(NA,NA,NA,
NA,179L,500L,370L),
person_neoplasm_cancer_status = c("TUMOR FREE",
"TUMOR FREE","TUMOR FREE",
"TUMOR FREE","WITH TUMOR",
"WITH TUMOR","WITH TUMOR"),
has_new_tumor_events_information = c("NO","NO",
"NO","NO","NO","NO","NO"),
has_follow_ups_information = c("YES","YES",
"YES","YES","YES","YES",
"YES"),
has_drugs_information = c("YES","YES",
"NO","NO","YES","YES","NO"),
has_radiations_information = c("NO","NO",
"NO","NO","NO","NO","YES"),
stage_event_system_version = c("7th","7th",
"7th","6th","6th","6th",
"6th"),
stage_event_clinical_stage = c(NA, NA, NA, NA, NA, NA, NA),
stage_event_pathologic_stage = c("Stage IA",
"Stage IIIA","Stage IA",
"Stage IB","Stage IIIB","Stage IIB",
"Stage IIIA"),
stage_event_tnm_categories = c("T1bN0M0",
"T2N2M0","T1N0M0","T2N0M0",
"T4N2M0","T2N1MX","T3N2M0"),
stage_event_psa = c(NA, NA, NA, NA, NA, NA, NA),
stage_event_gleason_grading = c(NA, NA, NA, NA, NA, NA, NA),
stage_event_ann_arbor = c(NA, NA, NA, NA, NA, NA, NA),
stage_event_serum_markers = c(NA, NA, NA, NA, NA, NA, NA),
stage_event_igcccg_stage = c(NA, NA, NA, NA, NA, NA, NA),
stage_event_masaoka_stage = c(NA, NA, NA, NA, NA, NA, NA),
cohort = c("LUAD",
"LUAD","LUAD","LUAD","LUAD",
"LUAD","LUAD")
)

Stay1 · December 9, 2021, 11:49am

Hugo_Symbol is the gene name, Sample is the patient name in the mutations dataset, and bcr_patient_barcode is the patient name in clinicals.
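
One possible way to approach the question, sketched under assumptions rather than as the expected homework answer: it assumes that `Hugo_Symbol` and `Sample` are the gene and patient columns of the mutations data and that `bcr_patient_barcode` identifies patients in clinicals. The rows below are made-up toy values, and `TP53_mutated` and `tp53_patients` are hypothetical names. Looping over the clinicals rows (not the mutations rows) means that patients missing from clinicals are simply never visited, and `unique()` collapses repeated TP53 mutations.

```r
# Toy data: the values are invented, only the column names come from the thread.
mutations <- data.frame(
  stringsAsFactors = FALSE,
  Hugo_Symbol = c("TP53", "TP53", "KRAS", "EGFR", "TP53"),
  Sample = c("TCGA-86-8669", "TCGA-86-8669", "TCGA-44-3396",
             "TCGA-35-4123", "TCGA-00-0000")  # last patient is not in clinicals
)
clinicals <- data.frame(
  stringsAsFactors = FALSE,
  bcr_patient_barcode = c("TCGA-86-8669", "TCGA-44-3396", "TCGA-35-4123")
)

# Patients with at least one TP53 mutation; unique() removes repeats.
tp53_patients <- unique(mutations$Sample[mutations$Hugo_Symbol == "TP53"])

# vapply() (apply family) returns exactly one "YES"/"NO" per clinicals row.
clinicals$TP53_mutated <- vapply(
  clinicals$bcr_patient_barcode,
  function(id) if (id %in% tp53_patients) "YES" else "NO",
  character(1),
  USE.NAMES = FALSE
)

clinicals
```

Because `%in%` is vectorised, the same column could also be built without an explicit loop, for example with `ifelse()` or inside `dplyr::mutate()`; the `vapply()` form is shown only because the assignment asks for the apply family.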

system · December 30, 2021, 11:50am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.