Counting number of times a string appears UNIQUELY in a particular column

aiyer1217 · June 11, 2021, 2:10pm

Hi everyone,
I am working on a dataset where I want to count the number of times a string appears either in one OR multiple columns. I have attached an example file for reference.

In this file, I have 3 columns, A, B, C. In these columns, sometimes "Y" appears in multiple rows, sometimes it just appears in one. For example in row 11, "Y" just appears for column A. How would I count, say, just the number of times, for column A, "Y" appears only in that column? Conversely, how would I count the number of times "Y" appears in column A only when it appears in column B also, column C also, column B and C?

Please let me know if my question is unclear and I will be happy to provide clarification.

Thanks!

jrkrideau · June 11, 2021, 3:25pm

With a dataframe "dat1"

sum(dat1$a == "y")
sum(dat1 == "y")

aiyer1217 · June 11, 2021, 3:29pm

Thank you for your help My only concern is that it will count the total number of Y's in column A, but I do not want it to count the Ys in column A if there is a Y in column B or C. So, in this example, I would want the output for A to be 3.

Does that make sense?

jrkrideau · June 11, 2021, 3:56pm

Sorry, I misread your request.

Maybe

sum(dat1$a == "y" & dat1$b != "y" & dat1$c != "y")
# or
sum(dat1$a == "y" & dat1$b == "y" & dat1$c == "y")

There probably are more elegant ways to do this but I think this works . just change the EQUALS == or NOT EQUAL != as desired.

aiyer1217 · June 11, 2021, 8:00pm

No worries at all, thanks for replying!

That makes a lot of sense, I think the issue for me is that there are over 30 other columns! So it'd be a bit tedious to write out.

nirgrahamuk · June 11, 2021, 9:17pm

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

Short Version

You can share your data in a forum friendly way by passing the data to share to the dput() function.
If your data is too large you can use standard methods to reduce it before sending to dput().
When you come to share the dput() text that represents your data, please be sure to format your post with triple backticks on the line before your code begins to format it appropriately.

```
( example_df <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 
5, 4.4, 4.9), Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9, 3.4, 
3.4, 2.9, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 
1.4, 1.5, 1.4, 1.5), Petal.Width = c(0.2, 0.2, 0.2, 0.2, 0.2, 
0.4, 0.3, 0.2, 0.2, 0.1), Species = structure(c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("setosa", "versicolor", "virginica"
), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame")))
```

jrkrideau · June 11, 2021, 11:38pm

'I think the issue for me is that there are over 30 other columns!

As I said

For a small problem my approach is fine but I second @ nirgrahamuk,s request.

We need to get a feel for the data.

aiyer1217 · June 14, 2021, 1:33pm

No problem. I am attaching a file that is a reproducible example. I am not sure what code or packages would be most helpful here.

#For pasting
install.packages("datapasta")
library("datapasta")
reproex=read.csv("ReproducibleExample.csv")
reproex
dpasta(reproex)
reprex=data.frame(
  DrugA = c("C", "C", "P", NA, NA),
  DrugB = c(NA, "C", "P", "C", "C"),
  DrugC=c("P","c",NA,NA,"C")
  )
reprex

In the file attached, I would like to figure out how to do the following:

Count the number of rows under column A where "C" appears, but only in cases where C does not appear in any other column
Count the number of times in Row A where C appears but only if it is in all 3 columns. In the case of this request, you'll notice that under row 2 for DrugC, the "c" is lower case. Is there any way to make it so that the search criteria is NOT case sensitive? That is, the algorithm would still count a case where "c" appears in all 3 columns, regardless of case sensitivity?

I know in this case there are not many columns, and you could just count this manually, but I am working with a much larger dataset.

Please let me know if additional clarification is necessary, and I will do my best to provide what I can.

jrkrideau · June 14, 2021, 2:29pm

Thanks but I think that what we really need is a sample of your actual data if it is not confidential. That is we really need to see 30 columns of data and several rows of data.

A handy way to supply sample data is to use the dput() function. See ?dput. As you have a very large data set then something like head(dput(myfile), 100) which will give us 100 rows of all 30 columns of the data will likely supply more than enough data for us to work with.

The dput() function allows other users to see exactly the same file that you see.

nirgrahamuk · June 14, 2021, 3:05pm

A little convoluted but first drafts often can be.

library(tidyverse)

df1 <- data.frame(
  DrugA = c("C", "C", "P", NA, NA),
  DrugB = c(NA, "C", "P", "C", "C"),
  DrugC = c("P", "c", NA, NA, "C"),
  stringsAsFactors = FALSE
)

# get how many drugs we are doing
drugnum <- length(startsWith(names(df1), "Drug"))

(piv_df1 <- pivot_longer(df1 %>% mutate(rn = row_number()),
  cols = -rn
))

(sum_of_c <- piv_df1 %>%
  group_by(rn, name) %>%
  summarise(sum_c = sum(tolower(value) == "c",
                        na.rm = TRUE)) %>%
  ungroup() %>%
  pivot_wider(names_from = "name", values_from = "sum_c"))

(pre_result_df <- mutate(sum_of_c,
  c_row_count = rowSums(across(-rn))
))

(unique_c <- summarise(
  pre_result_df,
  across(
    starts_with("Drug"),
    ~ sum(. == c_row_count & . == 1)
  )
))

(all_c <- summarise(
  pre_result_df,
  across(
    starts_with("Drug"),
    ~ sum(c_row_count == drugnum & . == 1)
  )
))

aiyer1217 · June 14, 2021, 3:11pm

Thank you so much! Here is the "dput" output for you to reconstruct the data, at least partially.

structure(list(SDFASDF = c(9L, 9L, 1L, 1L, 1L, 1L), X1 = c("", 
"", "", "", "", "C"), X2 = c("P", "", "C", "", "C", "C"), X3 = c("", 
"", "", "", "", ""), X4 = c("P", "C", "", "", "", ""), X5 = c("", 
"", "", "", "", ""), X6 = c("", "", "", "", "", ""), X7 = c("", 
"", "", "", "", ""), X8 = c("", "", "", "", "", ""), X9 = c("", 
"C", "", "", "", ""), X10 = c("", "", "", "", "", ""), X11 = c("", 
"", "", "", "", ""), X12 = c("", "", "", "", "", ""), X13 = c("", 
"", "", "", "", ""), X14 = c("C", "C", "", "", "", "C"), X15 = c("", 
"", "", "", "", "C"), X16 = c("", "", "", "", "", ""), X17 = c("", 
"", "", "", "", ""), X18 = c("", "", "", "", "", ""), X19 = c("", 
"", "", "", "", ""), X20 = c("", "", "", "", "", ""), X21 = c("", 
"", "", "", "", ""), X22 = c("", "", "", "", "", ""), X23 = c("C", 
"", "", "", "", ""), X24 = c("", "", "", "", "", ""), X25 = c("", 
"", "", "", "", ""), X26 = c("", "", "", "", "", ""), X27 = c("", 
"", "", "", "", ""), X28 = c("", "", "", "", "", ""), X29 = c("", 
"", "", "", "", ""), X30 = c("", "", "", "C", "", ""), X31 = c("C", 
"", "C", "", "C", "C"), X32 = c("", "", "", "", "", ""), X33 = c("C", 
"", "", "", "", "C"), X34 = c("", "", "", "", "", ""), X35 = c("", 
"", "", "", "", ""), X36 = c("", "", "", "", "", ""), X37 = c("", 
"", "", "", "", ""), X38 = c("", "", "", "", "", ""), X39 = c(NA, 
NA, NA, NA, NA, NA), X40 = c(NA, NA, NA, NA, NA, NA), X41 = c("", 
"", "", "", "", ""), X42 = c("", "", "", "", "", ""), X43 = c("", 
"", "", "", "", ""), X44 = c("", "", "", "", "", ""), X45 = c("", 
"", "", "", "", ""), X = c(NA, NA, NA, NA, NA, NA), X.1 = c(NA, 
NA, NA, NA, NA, NA), X.2 = c(NA, NA, NA, NA, NA, NA)), row.names = c(NA, 
6L), class = "data.frame")

So, in this case, I would like to get (for example) the amount of rows in which only X2 has a C, then the amount of rows where only X2 and X30 have a C, etc...

So the goal here is to be able to be very specific in my column queries. I hope that makes more sense, and thanks, everyone, for your patience!

jrkrideau · June 14, 2021, 9:36pm

Hi aiyer;

Thanks for the data. It gives a much better idea of the structure of your data.

I suspect that an adaptation of @ nirgrahamuk

's solution will do the job but he is a lot better at this than I am, so I'll leave him to it.

I notice that you have five logical variables (X39, X40, X, X.1, X.2)
I assume they should be excluded?

arthur.t · June 15, 2021, 12:15pm

library(tidyverse)

df1 <- data.frame(
    DrugA = c("C", "C", "P", NA, NA),
    DrugB = c(NA, "C", "P", "C", "C"),
    DrugC = c("P", "c", NA, NA, "C"),
    stringsAsFactors = FALSE
)

# convenient to work without NAs and lower case letters
df2 <- df1 %>% 
    mutate_all(replace_na, "") %>%
    mutate_all(toupper)

# apply logic
df2$Condition1 <- (df2[, 1] == "C") & (!apply(df2[, 2:3] == "C", 1, any))
df2$Condition2 <- apply(df2[, 1:3] == "C", 1, all)
df2
#>   DrugA DrugB DrugC Condition1 Condition2
#> 1     C           P       TRUE      FALSE
#> 2     C     C     C      FALSE       TRUE
#> 3     P     P            FALSE      FALSE
#> 4           C            FALSE      FALSE
#> 5           C     C      FALSE      FALSE

# counts
df2[, c("Condition1", "Condition2")] %>% apply(2, sum)
#> Condition1 Condition2 
#>          1          1

^{Created on 2021-06-15 by the reprex package (v1.0.0)}

aiyer1217 · June 17, 2021, 5:46pm

Thank you, I am getting closer to a solution for sure!

This solution has helped a bit:

# apply logic
df2$Condition1 <- (df2[, 1] == "C") & (!apply(df2[, 2:3] == "C", 1, any))
df2$Condition2 <- apply(df2[, 1:3] == "C", 1, all)
df2
#>   DrugA DrugB DrugC Condition1 Condition2
#> 1     C           P       TRUE      FALSE
#> 2     C     C     C      FALSE       TRUE
#> 3     P     P            FALSE      FALSE
#> 4           C            FALSE      FALSE
#> 5           C     C      FALSE      FALSE

So, in this case, I can get the number of times C appears for drug A alone and the number of times c appears for all three. However, is there a way I can get the number of times C appears for drug A and drug b and NOT drug c? In the example above I know it would be relatively easy to do. But what if I wanted to say, in effect, "Select cases where C exists for drugb and C exists for drugC and where c exists for no other column. Is this possible?

Thank you again so much for all of your help and patience!

arthur.t · June 18, 2021, 1:22am

I think this will do it.

I think the part that you should isolate and learn about is the apply statement, which is what applies the logic across columns.

library(tidyverse)

df1 <- data.frame(
  DrugA = c("C", "C", "P", NA, NA),
  DrugB = c(NA, "C", "P", "C", "C"),
  DrugC = c("P", "c", NA, NA, "C"),
  stringsAsFactors = FALSE
)

# convenient to work without NAs and lower case letters
df2 <- df1 %>% 
  as_tibble() %>%
  mutate_all(replace_na, "") %>%
  mutate_all(toupper)

# apply logic
df2$Condition3 <- (df2[, 2] == "C") & 
  (df2[, 3] == "C") & 
  (!apply(df2[, -c(2,3)] == "C", 1, any))

^{Created on 2021-06-17 by the reprex package (v1.0.0)}

aiyer1217 · June 18, 2021, 1:45pm

Thank you! I was actually curious about the apply function, as well as different coding syntaxes.

To compare 3 examples, here is the original:

df2$Condition3 <- (df2[, 2] == "C") & 
  (df2[, 3] == "C") & 
  (!apply(df2[, -c(2,3)] == "C", 1, any))

If I had instead written:

df2$Condition3 <- (df2[, 2] == "C") & 
  (df2[, 3] == "C") & 
  (df2[, -c(2,3)] != "C")

Or if I had written

df2$Condition3 <- (df2[, 2] == "C") & 
  (df2[, 3] == "C") & 
  (df2[, -c(2,3)] != "C")

Are all 3 the same? Or would I have obtained different outputs in some of these cases?

Once again, thank you so much for your patience. I cannot tell you how much of a help this has been :).

system · July 9, 2021, 1:45pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.