Complex data cleaning

chickenpotpiie · September 21, 2021, 7:24am

Hi,

It's a lot to ask, I know but I'm relatively new to R and I have some complicated data cleaning to do that I've been stuck on for a while.

Consider the example data that includes data for 3 different individuals:

i_pid	a_lab_id	a_bacteria	a_source	a_sample_host	a_date	a_amocla_mic	a_amocla_result	a_ampicillin_mic	a_ampicillin_result
CA036HT041S	920707EMB1	ECOLI	STO	HUM	12/12/2020 15:52	<= 4	SUS	<= 4	SUS
CA036HT041S	920707EMB2	ECOLI	STO	HUM	12/12/2020 15:53	8	SUS	<= 4	SUS
CA036HT041S	920707EMB3	ECOLI	STO	HUM	12/12/2020 15:54	<= 4	SUS	<= 4	SUS
CA036HT041S	920707EMB4	ECOLI	STO	HUM	12/12/2020 15:54	<= 4	RES	<= 4	SUS
CA036HT041S	920707EMB5	ECOLI	STO	HUM	12/12/2020 15:55	<= 4	SUS	<= 4	SUS
CA036HT041S	920707ENT1	EFAEM	STO	HUM	9/1/2021 12:26			2	SUS
CA036HT041S	920707ENT2	EFAEM	STO	HUM	9/1/2021 12:27			2	INT
CA036HT041S	920707ENT3	EFAEM	STO	HUM	9/1/2021 12:27			2	RES
CA036HT041S	920707ENT4	EFAEM	STO	HUM	9/1/2021 12:28			2	SUS
CA036HT041S	920707ENT5	EFAEM	STO	HUM	9/1/2021 12:28			2	SUS
CA037HT041S	920710KBS1	KPNEU	STO	HUM	19/12/2020 17:17	8	SUS	64	RES
CA037HT041S	920710KBS2	KPNEU	STO	HUM	19/12/2020 17:18	8	SUS	64	RES
CA037HT041S	920710KBS3	KPNEU	STO	HUM	19/12/2020 17:18	8	SUS	32	RES
CA037HT041S	920710KBS4	KPNEU	STO	HUM	19/12/2020 17:19	64	RES	> 128	RES
CA037HT041S	920710KBS5	KPNEU	STO	HUM	19/12/2020 17:19	8	SUS	32	RES
CB011HA041N	930711STA1	SAURS	NAS	HUM	16/1/2021 14:54			1	NOCLSI
CB011HA041N	930711STA2	SAURS	NAS	HUM	16/1/2021 14:54			1	NOCLSI
CB011HA041N	930711STA3	SAURS	NAS	HUM	16/1/2021 14:53			1	NOCLSI
CB011HA041N	930711STA4	SAURS	NAS	HUM	16/1/2021 14:52			1	NOCLSI
CB011HA041N	930711STA5	SAURS	NAS	HUM	16/1/2021 14:52			1	NOCLSI

There are multiple isolates/'tests' for individuals done as repeats. I want to merge the data for each individual such that my data will show only the highest level of resistance for each measure (antibiotic) in the order of NOCLSI < SUS < INT < RES. The final data should be a significantly smaller data frame (in this case, it should be 3 rows, where the first 5 characters of 'i_pid' identifies the individual) that merges all the rows of each person into one row, with the corresponding 'mic' number to the cell with the highest level of resistance. In this dataset, every 2 columns correspond to each other - e.g. 'a_amocla_mic' corresponds to 'a_amocla_result'.

How do I do this!!!?
I barely have a code because I'm that confused, so I'm not even going to bother showing it. Thanks so much in advance, any help would be appreciated

nirgrahamuk · September 21, 2021, 8:41am

Please read the following guide to sharing example data with the forum.

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

system · October 12, 2021, 8:42am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.