Issue with subsetting data

Hi,

I'm trying to subset some data to those who reported an improvement in mood, but for some reason the subset function misses out 4 people. Initially, I just added these observations to the new dataframe I wanted to create; however, now that I'm trying to run some analyses on this new dataframe, it's treating the 4 people I added as a separate group.

This is what I did to subset the data initially (which missed out 4 participants):

df_Improved <- subset(dataframe, Mood == "Improved")

I looked at the original dataframe (to see if the issue stemmed from there) and manually filtered it by the term Improved. This showed the correct number of observations, so it's only when I try to subset that it comes out with the incorrect number.

Is there a way around this, or is there a way that I can get R to read these 4 added observations as falling under the same group as all the other observations?

I hope this makes sense, thanks in advance!
Phoenix

Hello.
Thanks for providing code, but you could take further steps to make it more convenient for other forum users to help you.

Share some representative data that will enable your code to run and show the problematic behaviour.

You might use tools such as the datapasta package, or the base function dput(), to share a portion of data in code form, i.e. in a form that can be copied from the forum and pasted into an R session.
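
As a rough sketch of what that looks like (the small data frame below is made up purely for illustration), dput() prints R code that rebuilds an object, so other users can paste it straight into their own session:

```r
# Hypothetical example data; replace with your own data frame
example_df <- data.frame(
  id = 1:3,
  group = c("a", "b", "a"),
  stringsAsFactors = FALSE
)

# Print code that recreates the first rows of the data frame
dput(head(example_df, 2))

# Round trip: capture the printed code and evaluate it to rebuild the object
printed_code <- capture.output(dput(head(example_df, 2)))
rebuilt <- eval(parse(text = printed_code))
all.equal(rebuilt, head(example_df, 2))
```

Sharing only head() of the data is usually enough, as long as the rows you share still show the problem.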

Apologies, this is an example of the original dataframe:

Many thanks,
Phoenix

Hello,
I'm sure you shared this image with the best intentions, but perhaps you didn't realise what it implies.
If someone wished to use example data to test code against, they would type it out from your screenshot...

This is very unlikely to happen, and so it reduces the likelihood you will receive the help you desire.
Therefore, please see this guide on how to reprex data. Key to this is the use of either datapasta or dput() to share your data as code.

I do note that your code references a column 'Mood' that appears absent from your data.frame.

Hopefully this is right (original dataset):

data.frame(
  stringsAsFactors = FALSE,
  Total.Viewed.Image = c(10L,20L,10L,10L,20L,
                         10L,10L,10L,10L,20L,10L,17L,20L,30L,20L,10L,
                         10L,30L,20L,20L),
  Total.Liked.Image = c(NA,5L,4L,NA,9L,2L,7L,
                        7L,4L,11L,8L,5L,8L,12L,12L,2L,6L,15L,7L,4L),
  Mood.Change = c("Improved","Improved",
                  "Improved","Improved ","Neg_maintained",
                  "Neg_maintained","Neg_maintained","Improved","Improved","Improved",
                  "Pos_maintained","Pos_maintained","Pos_maintained",
                  "Improved","Improved","Improved","Improved",
                  "Declined","Improved ","Neg_maintained")
)

New dataset I created using df_Improved <- subset(dataframe, Mood.Change == "Improved"):
data.frame(
  stringsAsFactors = FALSE,
  row.names = c("4", "19", "21", "23", "1", "2", "3", "8", "9", "10"),
  Total.Viewed.Image = c(10L, 20L, 20L, 20L, 10L, 20L, 10L, 10L, 10L, 20L),
  Total.Liked.Image = c(NA, 7L, NA, 8L, NA, 5L, 4L, 7L, 4L, 11L),
  Mood.Change = c("Improved ","Improved ",
                  "Improved ","Improved ","Improved","Improved",
                  "Improved","Improved","Improved","Improved")
)

Yeah, my actual code used Mood.Change; I just changed it to Mood for the purposes of this post.

Many thanks,
Phoenix

Not sure if this is the case, but if you check the example data you provided, you will find there are two observations with "Improved " values instead of "Improved". A space at the end of the string makes it a different value.

You can just run table(dataframe$Mood.Change) with your original data and check the result.
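
A minimal sketch of that check, using a handful of values copied from the example data above rather than the full dataframe. Because a trailing space is hard to spot in printed output, the sketch also uses nchar() and trimws(), which are my additions and not part of the original suggestion:

```r
# A few values copied from the example data in this thread
mood <- c("Improved", "Improved ", "Neg_maintained", "Improved", "Improved ")

# table() counts each distinct string, so "Improved" and "Improved "
# are tallied as two separate values
counts <- table(mood)
print(counts)

# Character counts make the extra space visible
data.frame(
  value = names(counts),
  n_chars = nchar(names(counts)),
  n = as.vector(counts)
)

# Values that change when surrounding whitespace is trimmed
has_extra_space <- mood != trimws(mood)
unique(mood[has_extra_space])
```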

This sounds like a case where it will help to either do some additional data cleaning upstream, or make your subsetting step more robust to things like extra spaces. The interactive filter in RStudio filters for partial matches, but the == comparison inside subset() in base R only keeps exact matches, which means "Improved " is not the same as "Improved".

For instance, you might do something like:

library(dplyr)
# Trim leading/trailing whitespace in every column whose name contains "Mood"
df_clean = dataframe %>% 
  mutate(across(contains("Mood"), ~stringr::str_trim(.x)))

This will remove leading and trailing spaces (using str_trim from the stringr package) across all the columns that contain the word "Mood" in their column name.

Then, when you use this cleaned-up version for further analysis, you won't have mismatches due to those (presumably meaningless and unintended) extra spaces.
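
If you would rather not load dplyr and stringr, roughly the same cleaning can be sketched in base R with trimws(). This is only an illustration on a cut-down copy of the example data from this thread, and it assumes the stray spaces are the only inconsistency in the column:

```r
# Cut-down version of the example data from this thread
dataframe <- data.frame(
  Total.Liked.Image = c(NA, 5L, 4L, NA, 9L, 7L),
  Mood.Change = c("Improved", "Improved", "Improved", "Improved ",
                  "Neg_maintained", "Improved "),
  stringsAsFactors = FALSE
)

# Exact matching misses the two values with a trailing space
n_before <- nrow(subset(dataframe, Mood.Change == "Improved"))

# Trim leading and trailing whitespace once, up front
df_clean <- dataframe
df_clean$Mood.Change <- trimws(df_clean$Mood.Change)

# The same subset call now picks up every "Improved" row
df_Improved <- subset(df_clean, Mood.Change == "Improved")
n_after <- nrow(df_Improved)

c(before = n_before, after = n_after)
```

Because the values themselves are corrected, any later grouping or analysis sees a single "Improved" group and not two.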

Alternatively, you could make your subset more flexible, like:

library(dplyr)
library(stringr)  # provides str_detect()
dataframe %>% 
  filter(Mood.Change %>% str_detect("Improved"))

or using base R: (I don't find this as easy to read or remember)

dataframe[grepl('Improved', dataframe$Mood.Change), ]

This will keep all rows where the value in Mood.Change contains the string "Improved". The downside of this approach is it's extra faff and easy to forget as you're going, so my preference is to put all my data cleaning first at the top of a script, so everything downstream can be simpler.
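
One more thing to watch with partial matching: it keeps any value that merely contains the pattern. The sketch below uses made-up values (there is no "Not_Improved" category in the data posted here; it is purely hypothetical) to show the difference between a plain substring match and an anchored pattern that tolerates only surrounding whitespace:

```r
# Made-up values for illustration; "Not_Improved" is hypothetical
mood <- c("Improved", "Improved ", "Not_Improved", "Declined")

# Plain substring match: also catches "Not_Improved"
loose <- grepl("Improved", mood)

# Anchored match: the whole value must be "Improved",
# optionally with leading or trailing whitespace
anchored <- grepl("^[[:space:]]*Improved[[:space:]]*$", mood)

data.frame(mood, loose, anchored)
```

With the categories shown in this thread the two give the same rows, so this only matters if another category name happens to contain "Improved".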

All sorted now, thanks so much for your help!