Question using dplyr package and apply family function on two datasets

Good morning,

I am just starting to use RStudio and have a take-home evaluation to complete, but I'm stuck on one question.
(I hope this is the right place to ask.)

In brief, I have two objects called "mutations" and "clinicals".

In my "mutations" dataset, each row represents one mutation, with the patient's name and the name of the mutated gene as columns.
A patient can have several mutations in the same gene.

In my "clinicals" dataset, there is exactly one row per patient.

The question asks me to add a column to the "clinicals" object that contains "YES" if the individual has at least one mutation in "TP53" and "NO" otherwise, and I have to use the apply family of functions.

I would like to apply to each row a function that assigns "YES" if the mutated gene is "TP53" and "NO" otherwise. However, how do I deal with patients who are not in the "clinicals" dataset, or who have several TP53 mutations?

Any advice would be appreciated, thank you!

Homework Question Checklist
- I am not posting verbatim elements of my homework assignment.
- Where reasonable, I am asking with a [reproducible example]

Can you provide a reprex (minimum working example) of how your datasets are structured?

Of course, but as a new member I can't upload attachments. How can I share a reprex?

Here is the reprex for the "mutations" dataset:
mutations <- data.frame(
stringsAsFactors = FALSE,
Hugo_Symbol = c("UBR4","PLA2G2D","MTF1",
"CLCA2","WNT2B","ENG","HSPA12A","OR51Q1","OR5D13",
"ZFP91","PYROXD1"),
Chromosome = c("chr1","chr1","chr1",
"chr1","chr1","chr9","chr10","chr11","chr11","chr11",
"chr12"),
Start_Position = c(19155452L,20115562L,
37822366L,86443922L,112525957L,127818738L,116701033L,
5422209L,55773911L,58614261L,21468533L),
End_Position = c(19155452L,20115562L,
37822366L,86443922L,112525957L,127818738L,116701033L,
5422209L,55773911L,58614261L,21468533L),
Variant_Type = c("SNP","SNP","SNP","SNP",
"SNP","SNP","SNP","SNP","SNP","SNP","SNP"),
Reference_Allele = c("C", "G", "A", "G", "T", "G", "G", "G", "C", "G", "C"),
Tumor_Seq_Allele1 = c("C", "G", "A", "G", "T", "G", "G", "G", "C", "G", "C"),
Tumor_Seq_Allele2 = c("T", "A", "G", "A", "C", "A", "A", "C", "T", "C", "G"),
t_depth = c(96L,147L,49L,134L,31L,
141L,96L,140L,262L,96L,74L),
t_ref_count = c(91L,135L,46L,127L,19L,
134L,87L,123L,242L,80L,63L),
t_alt_count = c(5L, 12L, 3L, 7L, 12L, 7L, 9L, 17L, 20L, 16L, 11L),
Consequence = c("missense_variant",
"synonymous_variant","missense_variant","missense_variant",
"downstream_gene_variant","missense_variant",
"synonymous_variant","missense_variant","synonymous_variant",
"missense_variant","missense_variant"),
PolyPhen = c("benign(0.25)",NA,
"benign(0)","probably_damaging(0.938)",NA,"benign(0)",NA,
"benign(0.002)",NA,"probably_damaging(0.996)",
"possibly_damaging(0.751)"),
IMPACT = c("MODERATE","LOW","MODERATE",
"MODERATE","MODIFIER","MODERATE","LOW","MODERATE",
"LOW","MODERATE","MODERATE"),
Sample = c("TCGA-49-4490",
"TCGA-49-4490","TCGA-49-4490","TCGA-49-4490","TCGA-49-4490",
"TCGA-49-4490","TCGA-49-4490","TCGA-49-4490",
"TCGA-49-4490","TCGA-49-4490","TCGA-49-4490")
)

and here is the one for the "clinicals" dataset:

clinicals <- data.frame(
stringsAsFactors = FALSE,
bcr_patient_barcode = c("TCGA-86-8669","TCGA-44-3396",
"TCGA-35-4123","TCGA-75-5147",
"TCGA-78-7158","TCGA-44-6779",
"TCGA-50-6594"),
additional_studies = c(NA, NA, NA, NA, NA, NA, NA),
tissue_source_site = c(86L,44L,35L,
75L,78L,44L,50L),
patient_id = c(8669L,3396L,
4123L,5147L,7158L,6779L,
6594L),
bcr_patient_uuid = c("7b89166e-c5b8-481b-aa70-495141499b91",
"3bd6badb-27ff-4d8d-b206-4d28dc264862",
"6cf49cf0-de4c-4c90-8358-eae19c6206b0",
"d2824e6d-3784-45c2-9b0f-52b17356b5da",
"501c987e-d1eb-48a9-89eb-72a5062c90b4",
"cbbea9f1-396a-4bf3-b67c-2cac3394dceb",
"8504fd86-a70a-4cba-9ec8-25c9e60ca549"),
informed_consent_verified = c("YES","YES",
"YES","YES","YES","YES",
"YES"),
icd_o_3_site = c("C34.1",
"C34.1","C34.1","C34.1","C34.3",
"C34.9","C34.1"),
icd_o_3_histology = c(81403L,
81403L,81403L,82523L,82553L,
81403L,81403L),
icd_10 = c("C34.1",
"C34.1","C34.1","C34.1","C34.3",
"C34.1","C34.1"),
day_of_form_completion = c(30L,18L,20L,
6L,15L,31L,25L),
month_of_form_completion = c(8L, 10L, 12L, 4L, 9L, 8L, 8L),
year_of_form_completion = c(2012L,2010L,
2010L,2011L,2011L,2011L,
2011L),
tissue_prospective_collection_indicator = c("YES","YES",
"NO","NO","NO","NO","NO"),
tissue_retrospective_collection_indicator = c("NO","NO",
"YES","YES","YES","YES","YES"),
days_to_birth = c(-23443L,
-27073L,-14064L,NA,-21742L,
-18469L,-28924L),
gender = c("MALE",
"FEMALE","MALE","FEMALE","FEMALE",
"FEMALE","FEMALE"),
race_list = c("WHITE",
"WHITE","WHITE",NA,"WHITE",
"WHITE","BLACK OR AFRICAN AMERICAN"),
ethnicity = c("NOT HISPANIC OR LATINO",
"NOT HISPANIC OR LATINO","NOT HISPANIC OR LATINO",
NA,NA,NA,
"NOT HISPANIC OR LATINO"),
other_dx = c("No","No",
"No","No","No","No","No"),
history_of_neoadjuvant_treatment = c("No","No",
"No","No","No","No","No"),
vital_status = c("Alive",
"Alive","Alive","Alive","Dead",
"Dead","Dead"),
days_to_last_followup = c(34L, 311L, 182L, NA, NA, NA, NA),
days_to_death = c(NA,NA,NA,
NA,179L,500L,370L),
person_neoplasm_cancer_status = c("TUMOR FREE",
"TUMOR FREE","TUMOR FREE",
"TUMOR FREE","WITH TUMOR",
"WITH TUMOR","WITH TUMOR"),
has_new_tumor_events_information = c("NO","NO",
"NO","NO","NO","NO","NO"),
has_follow_ups_information = c("YES","YES",
"YES","YES","YES","YES",
"YES"),
has_drugs_information = c("YES","YES",
"NO","NO","YES","YES","NO"),
has_radiations_information = c("NO","NO",
"NO","NO","NO","NO","YES"),
stage_event_system_version = c("7th","7th",
"7th","6th","6th","6th",
"6th"),
stage_event_clinical_stage = c(NA, NA, NA, NA, NA, NA, NA),
stage_event_pathologic_stage = c("Stage IA",
"Stage IIIA","Stage IA",
"Stage IB","Stage IIIB","Stage IIB",
"Stage IIIA"),
stage_event_tnm_categories = c("T1bN0M0",
"T2N2M0","T1N0M0","T2N0M0",
"T4N2M0","T2N1MX","T3N2M0"),
stage_event_psa = c(NA, NA, NA, NA, NA, NA, NA),
stage_event_gleason_grading = c(NA, NA, NA, NA, NA, NA, NA),
stage_event_ann_arbor = c(NA, NA, NA, NA, NA, NA, NA),
stage_event_serum_markers = c(NA, NA, NA, NA, NA, NA, NA),
stage_event_igcccg_stage = c(NA, NA, NA, NA, NA, NA, NA),
stage_event_masaoka_stage = c(NA, NA, NA, NA, NA, NA, NA),
cohort = c("LUAD",
"LUAD","LUAD","LUAD","LUAD",
"LUAD","LUAD")
)

[…] is the gene name, […] is the patient name in the mutations dataset, and […] is the patient name in clinicals.
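
For what it's worth, here is a minimal sketch of one possible approach, not a definitive answer. It assumes the gene name is in Hugo_Symbol and the patient barcode is in Sample (mutations) and bcr_patient_barcode (clinicals), as the column contents of the reprex suggest. The data below are cut-down toy stand-ins with made-up gene assignments (the reprex itself contains no TP53 rows), and the new column name TP53_mutated is hypothetical. The idea is to loop over the clinicals patients rather than over the mutation rows, so that repeated TP53 mutations and patients missing from clinicals do not need special handling.

```r
# Toy stand-ins for the two objects; only the assumed key columns are kept.
# Gene assignments here are invented for illustration.
mutations <- data.frame(
  stringsAsFactors = FALSE,
  Hugo_Symbol = c("TP53", "TP53", "UBR4", "TP53", "ENG"),
  Sample = c("TCGA-86-8669", "TCGA-86-8669", "TCGA-44-3396",
             "TCGA-49-4490", "TCGA-35-4123")
)
clinicals <- data.frame(
  stringsAsFactors = FALSE,
  bcr_patient_barcode = c("TCGA-86-8669", "TCGA-44-3396", "TCGA-35-4123")
)

# Patients with at least one TP53 mutation.
# unique() collapses patients who carry several TP53 mutations.
tp53_patients <- unique(mutations$Sample[mutations$Hugo_Symbol %in% "TP53"])

# sapply() visits each clinicals patient exactly once, so patients that
# appear only in mutations (here TCGA-49-4490) are simply never looked up.
clinicals$TP53_mutated <- sapply(
  clinicals$bcr_patient_barcode,
  function(id) if (id %in% tp53_patients) "YES" else "NO",
  USE.NAMES = FALSE
)

clinicals
```

Without the apply-family requirement, the same column could be built in one vectorised step with ifelse(clinicals$bcr_patient_barcode %in% tp53_patients, "YES", "NO").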