I have a data frame, DF, with a column B whose values are a mix of numbers, letters, and punctuation. The letters may be lower case, upper case, or a mix of both.
B = c("10.1056/NEJMOA1505467", "10.1056/NEJMoa1505467", "10.1056/nejmoa1508375", "10.1056/NEJMOA1508375")
D = c("Paywall", "Paywall", "Paywall", "Paywall")
E = c(2015, 2012, 2010, 2011)
DF = data.frame(B, D, E)
DF
                      B       D    E
1 10.1056/NEJMOA1505467 Paywall 2015
2 10.1056/NEJMoa1505467 Paywall 2012
3 10.1056/nejmoa1508375 Paywall 2010
4 10.1056/NEJMOA1508375 Paywall 2011
I'm trying to identify duplicate values in B using dplyr's group_by and mutate, and then drop the rows with duplicate values in B using distinct. But because the letter case differs between otherwise identical values, group_by and distinct don't treat them as the same.
library(dplyr)

DF <- DF %>%
  group_by(B) %>%
  mutate(BCount = n())
DF
# A tibble: 4 x 4
# Groups:   B [4]
  B                     D           E BCount
  <fct>                 <fct>   <dbl>  <int>
1 10.1056/NEJMOA1505467 Paywall  2015      1
2 10.1056/NEJMoa1505467 Paywall  2012      1
3 10.1056/nejmoa1508375 Paywall  2010      1
4 10.1056/NEJMOA1508375 Paywall  2011      1
DF <- distinct(DF, B, .keep_all = TRUE)
DF
# A tibble: 4 x 4
# Groups:   B [4]
  B                     D           E BCount
  <fct>                 <fct>   <dbl>  <int>
1 10.1056/NEJMOA1505467 Paywall  2015      1
2 10.1056/NEJMoa1505467 Paywall  2012      1
3 10.1056/nejmoa1508375 Paywall  2010      1
4 10.1056/NEJMOA1508375 Paywall  2011      1
I've tried using tolower and toupper to make the case consistent, but this doesn't seem to change the values in my data frame; it seems to create a new vector instead. I want to keep my data frame and just convert the text in B (to either case, it doesn't matter). What am I getting wrong?
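
To make the goal concrete, here is a minimal sketch of the result I'm after. It assumes the missing step is simply assigning the converted vector back to column B; that is my guess, not something I've confirmed. The name DF_unique is made up for illustration.

```r
library(dplyr)

# Same example data as above.
B <- c("10.1056/NEJMOA1505467", "10.1056/NEJMoa1505467",
       "10.1056/nejmoa1508375", "10.1056/NEJMOA1508375")
D <- c("Paywall", "Paywall", "Paywall", "Paywall")
E <- c(2015, 2012, 2010, 2011)
DF <- data.frame(B, D, E)

# tolower() returns a new character vector and leaves DF untouched,
# so (assumption) the result has to be written back into column B.
# If B is a factor, tolower() converts it to character first.
DF <- DF %>%
  mutate(B = tolower(B))

# With consistent case, distinct() keeps the first row for each value of B.
# DF_unique is a hypothetical name used only in this sketch.
DF_unique <- distinct(DF, B, .keep_all = TRUE)
```

If this is right, the base R equivalent of the mutate step would presumably be DF$B <- tolower(DF$B), and the four example rows should collapse to two.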