Hi there! I am looking for a way to delete rows that appear two or three times, so that I can find the exact number of participants (N) by keeping only one row per ID (seqid). See screenshot for example:
I would like to keep only one row each for 4 and 5 (so delete, or exclude, the other two 4-rows and the other 5-row). This should work for a data.frame of N=5000, so I wouldn't want to have to delete each row/number separately.
Let's say your data frame is named DF. If you want to have a data frame with unique rows, you can run
```
DFuniq <- unique(DF)
```
Thank you for your suggestion! I tried it, but unfortunately it generates the same output as before. I believe this is because the numbers on the left (1, 2, 3, 4, 6) are already unique, whereas I am trying to reduce the data.frame according to the numbers in the first column (seqid), which are 4, 4, 4, 5, 5, 7. So I guess I would have to somehow include the column in the function?
The numbers at the far left are just aids for viewing the data and are not part of the data frame. Please post the output of
```
dput(head(DF, 10))
```
where you replace DF with the actual name of your data frame. That output will allow others to use the same data you have. Place lines with three back ticks just before and after the output, like this:
```
output goes here
```
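For illustration, here is a small sketch of what that command does. The data frame `toy` and its values are made up for this example and are not your data. dput() prints R code that rebuilds the object, so others can paste it into their own session and get the same data back:

```r
# Hypothetical toy data; the column names follow the thread, the values are invented
toy <- data.frame(
  seqid = c(4, 4, 4, 5, 5, 7),
  ageg10lfs = c(2, 2, 2, 3, 3, 1)
)

# Print a text representation of (at most) the first 10 rows
dput(head(toy, 10))

# Round trip: capture that text and evaluate it to rebuild the object
txt <- capture.output(dput(head(toy, 10)))
restored <- eval(parse(text = txt))
identical(restored, head(toy, 10))
```

The round trip at the end is only there to show that the printed text really is enough to recreate the data.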
Since my actual data is very large, the output doesn't fit here, which is why I had added the screenshot previously as a sort of reprex (I am still learning to create better reprexes).
If the data set is large in the number of rows, the use of head() in the command I posted will reduce the data to the first 10 rows. If the data set is large in the number of columns, then that probably explains why unique() did not reduce the number of rows. The unique() function will look at all of the columns. If there are columns beyond what you have shown, what do you want done with those data? Is it enough to drop the rows and lose the data?
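In case it helps, here is a minimal sketch of the difference, assuming a plain data frame. The name `DF`, the extra column `score`, and all values are hypothetical. It shows why unique() can leave every row in place when another column differs, and how base R's duplicated() can keep one row per seqid instead:

```r
# Hypothetical toy data mirroring the seqid values mentioned above
DF <- data.frame(
  seqid = c(4, 4, 4, 5, 5, 7),
  ageg10lfs = c(2, 2, 2, 3, 3, 1),
  score = c(10, 12, 11, 20, 21, 30)  # differs within an ID
)

# unique() compares whole rows, so nothing is dropped here
nrow(unique(DF))

# Keep only the first row for each seqid; the other rows are discarded
DF_one <- DF[!duplicated(DF$seqid), ]
nrow(DF_one)

# If only the participant count is needed, count the distinct IDs directly
n_participants <- length(unique(DF$seqid))
n_participants
```

Note that duplicated() keeps the first row it sees per ID, so any differing values in the later rows are lost. That is only appropriate if those rows really are not needed.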
Ahh, OK, I understand! Good to know! I don't really need the other columns, so I tried extracting some like this:
```
# Extract columns and rows in order to count "real" N
seqid_deu <- subset(sdf_deu_rebinded, select = c(seqid, ageg10lfs))
dim(seqid_deu)
```
but I receive the following error:
```
> # Extract columns and rows in order to count "real" N
> seqid_deu <- subset(sdf_deu_rebinded, select = c(seqid, ageg10lfs))
Error in iparse(condition_call, subsetEnv) :
argument "condition_call" is missing, with no default
>
```
This is a sign that you are using library(EdSurvey), and it is affecting how subset() works.
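One possible way around this is to select the columns with bracket indexing, which does not go through subset() at all. The sketch below assumes an ordinary data frame. The object `sdf_toy` and its values are hypothetical stand-ins, and I have not checked how this behaves on your actual EdSurvey object:

```r
# Hypothetical stand-in for the real data; assumed to be a plain data.frame
sdf_toy <- data.frame(
  seqid = c(4, 4, 4, 5, 5, 7),
  ageg10lfs = c(2, 2, 2, 3, 3, 1),
  other = c(10, 12, 11, 20, 21, 30)
)

# Select the two columns by name with [ , ] instead of subset()
seqid_toy <- sdf_toy[, c("seqid", "ageg10lfs")]
dim(seqid_toy)

# With only these columns left, repeated rows are identical and unique() drops them
seqid_toy_uniq <- unique(seqid_toy)
nrow(seqid_toy_uniq)
```

Once the differing columns are gone, unique() reduces the data to one row per seqid, provided ageg10lfs is constant within each ID.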
That was the problem, thank you for pointing it out! I have now successfully been able to use the previously recommended unique() function.