Countifs in R Studio

Karim492 · December 16, 2019, 10:20am

Hi. I was trying to reproduce Excel's COUNTIFS function in R Studio, but so far without success. I have the following columns in my data set:

Policy ID   Main Entry   Counts
1           A            1
2           E            1
3           E            0
4           A            1
2           A            1
5           A            1

In excel, the formula is:

COUNTIFS($A:$A,A3,$Z:$Z,"A")

Column A is the policy ID. There are duplicates in this column.
Column Z is the Main Entry column.

Basically, I want each row to be assigned its count. How do I do this in R? The data set is pretty huge.

So far, every solution I have come across seems very complicated.

pieterjanvc · December 16, 2019, 1:20pm

Hi,

Welcome to the RStudio community!
I don't know if I completely get what you want (it's a bit confusing), but looking at the Excel function, here is an implementation using the tidyverse.

library(dplyr)

#The data
myData = data.frame(PolicyID = c(1:4,2,5), 
                    MainEntry = c("A", "E", "E","A", "A", "A"), 
                    Counts = c(1,1,0,1,1,1))


#Group by MainEntry, only consider Counts > 0 and count the number of entries
myData %>% group_by(MainEntry) %>% filter(Counts > 0) %>% summarise(counted = n())

# A tibble: 2 x 2
  MainEntry counted
  <fct>       <int>
1 A               4
2 E               1

Regardless of whether I chose the correct columns, you can filter and count easily with these functions. The final table shows how many times A and E occur among the rows where Counts > 0. If the columns are wrong, swap in the right ones to get different results.
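
As a side note, here is a shorter way to write the same filter-and-count. This is only a sketch and assumes dplyr 0.8 or later, for the `name` argument of `count()`:

```r
library(dplyr)

# Same toy data as above
myData <- data.frame(PolicyID = c(1:4, 2, 5),
                     MainEntry = c("A", "E", "E", "A", "A", "A"),
                     Counts = c(1, 1, 0, 1, 1, 1))

# count() is shorthand for group_by() followed by summarise(n())
counted <- myData %>%
  filter(Counts > 0) %>%
  count(MainEntry, name = "counted")
```

The result has one row per MainEntry value, just like the `summarise()` version.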

If this is not what you were looking for, please provide a clear example of the input and the output tables you expect.

Hope this helps,
PJ

Karim492 · December 16, 2019, 2:04pm

So basically the counts column is the required output.

Excel Formula used: =+COUNTIFS($A:$A,A3,$B:$B,"A")

Notice how policy 2 is counted just once and gives output of 1, since there is a condition for A.

How do I get this output?

pieterjanvc · December 17, 2019, 1:31pm

Hi,

I think the easiest and fastest way of doing this is the following:

library(tidyr)
library(dplyr)

myData = data.frame(Policy = c(1:3, 4,2,5), Main = c("A", "E","E", "A", "A", "A"))

myData %>% left_join(
    myData %>% filter(Main == "A") %>% 
      group_by(Policy) %>% summarise(output = n()), 
    by = c("Policy" = "Policy")) %>% 
  replace_na(list(output = 0))

> myData
  Policy Main output
1      1    A      1
2      2    E      1
3      3    E      0
4      4    A      1
5      2    A      1
6      5    A      1

EXPLANATION
I first calculated the counts per Policy number when the Main was 'A'

myData %>% filter(Main == "A") %>% 
  group_by(Policy) %>% summarise(output = n())

  Policy output
   <dbl>  <int>
1      1      1
2      2      1
3      4      1
4      5      1

Then I used a left join to add these numbers back into the original table.

myData %>% left_join(
    myData %>% filter(Main == "A") %>% 
      group_by(Policy) %>% summarise(output = n()), 
    by = c("Policy" = "Policy"))

  Policy Main output
1      1    A      1
2      2    E      1
3      3    E     NA
4      4    A      1
5      2    A      1
6      5    A      1

Note that Policy numbers with no "A" rows come out of the join as NA, which I finally replace with 0 (see the solution at the top).

Alternatively, you could use sapply to iterate over all the rows. This gives the same result, but it will be much slower on large datasets because it filters the whole table once per row:

myData$output = sapply(1:nrow(myData), function(i){
  myData %>% filter(Policy == myData$Policy[i], Main == "A") %>% nrow
})
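
For comparison, here is a vectorised base R sketch that avoids both the join and the row-by-row loop. It assumes the same toy `myData` as above and uses `ave()`, which applies a function within groups and returns a vector as long as its input:

```r
myData <- data.frame(Policy = c(1:3, 4, 2, 5),
                     Main = c("A", "E", "E", "A", "A", "A"))

# Per row: number of rows sharing this Policy whose Main is "A"
myData$output <- ave(as.numeric(myData$Main == "A"), myData$Policy, FUN = sum)
```

Because `ave()` returns one value per input row, no join or NA replacement is needed afterwards.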

Hope this helps,
PJ

Karim492 · December 18, 2019, 6:43am

PJ,

It looks like this will solve my problem. Thank you so much! The code is taking some time to run on a larger data set, but it's fine, as long as it is producing good results.

One question, the output column isn't showing up on my R Studio as a column. Is there a way I can get this output column attached to the myData, so it basically becomes an extension of it?

Yarnabrina · December 18, 2019, 8:46am

You'll have to assign the result first. Just use this:

myData <- myData %>% ...
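
To spell out why the column did not show up: a dplyr pipeline returns a new data frame and leaves the original untouched until the result is assigned. Here is a small sketch reusing the toy data from earlier in the thread (the helper name `counts` is only for illustration):

```r
library(dplyr)
library(tidyr)

myData <- data.frame(Policy = c(1:3, 4, 2, 5),
                     Main = c("A", "E", "E", "A", "A", "A"))

# Number of "A" rows per Policy
counts <- myData %>%
  filter(Main == "A") %>%
  group_by(Policy) %>%
  summarise(output = n())

# Without assignment the joined table is only printed; myData is unchanged
myData %>% left_join(counts, by = "Policy")
"output" %in% names(myData)   # FALSE

# With assignment the new column becomes part of myData
myData <- myData %>%
  left_join(counts, by = "Policy") %>%
  replace_na(list(output = 0L))
"output" %in% names(myData)   # TRUE
```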

If you want speed, explore the data.table package.

Edit:

@Karim492 , Here's an example:

library(data.table)

set.seed(seed = 47418)

n_rows <- 5e6

dt <- data.table(Policy = sample.int(n = 50,
                                     size = n_rows,
                                     replace = TRUE), 
                 Main = sample(x = c("A", "E"),
                               size = n_rows,
                               replace = TRUE,
                               prob = c(0.9, 0.1)))

system.time(expr = {dt[, output := sum(Main == "A"), keyby = Policy]})
#>    user  system elapsed 
#>   1.398   0.081   0.262

Created on 2019-12-19 by the reprex package (v0.3.0)

pieterjanvc · December 18, 2019, 12:18pm

You're welcome.

How large is your data set and how long does it take to run? Just curious...

PJ

Karim492 · December 18, 2019, 4:34pm

The data has 500K+ rows and 30 columns.

Surprisingly, when I followed Yarnabrina's advice and assigned the result first, the process became quicker. The code is working super quick now. Thanks!

pieterjanvc, I have another question regarding sumif function that we use in excel. I have a feeling the code you provided can handle sumif too.

Sample working provided above.

I did find a solution to it, but a rather long one. First I created a separate data set from the available data using the aggregate function, and then used the merge command to look up its values in the main data set.

So in excel terms, in R Studio, I created a pivot table, and then vlookup them.
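
For readers following along, the aggregate-then-merge route described here might look roughly like this in base R. The column names `Policy` and `Premium` and the toy values are assumptions for illustration, since the actual data was not posted:

```r
# Hypothetical toy data; the column names are assumed
myData <- data.frame(Policy = c(1, 2, 2, 2, 4, 3, 4, 8),
                     Premium = c(5, 6, 8, 7, 9, 2, 1, 4))

# Step 1, the "pivot table": total Premium per Policy
totals <- aggregate(Premium ~ Policy, data = myData, FUN = sum)
names(totals)[names(totals) == "Premium"] <- "output"

# Step 2, the "vlookup": bring each total back onto the matching rows
# (merge() sorts the result by Policy by default)
merged <- merge(myData, totals, by = "Policy")
```

This works, but it takes two steps and reorders the rows unless `sort = FALSE` is passed to `merge()`.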

Feel like there is a more efficient way.

pieterjanvc · December 18, 2019, 4:56pm

Hi,

First of all, I indeed forgot to assign the result to a variable like @Yarnabrina suggested, so thank you to him for noticing.

The code should indeed be very fast and efficient: I tested it with 5 million rows and the result was almost instantaneous.

library(tidyr)
library(dplyr)

nRows = 5000000
myData = data.frame(Policy = sample(1:50, nRows, replace = TRUE), 
                    Main = sample(c("A", "E"), nRows, replace = TRUE, prob = c(0.9, 0.1)))

myData = myData %>% left_join(
  myData %>% filter(Main == "A") %>% 
    group_by(Policy) %>% summarise(output = n()), 
  by = c("Policy" = "Policy")) %>% 
  replace_na(list(output = 0))

Regarding your new question: it's best to start a new topic next time, as this makes it easier for other folks to find when looking for help (the title refers to the first issue).

To answer it anyway: it's an even easier implementation given you can just use the group_by and mutate:

myData = data.frame(Policy = c(1,2,2,2,4,3,4,8),
                    Premium = c(5,6,8,7,9,2,1,4))

myData = myData %>% group_by(Policy) %>% mutate(output = sum(Premium))

  Policy Premium output
   <dbl>   <dbl>  <dbl>
1      1       5      5
2      2       6     21
3      2       8     21
4      2       7     21
5      4       9     10
6      3       2      2
7      4       1     10
8      8       4      4
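
The same pattern extends to a conditional sum, roughly what SUMIFS does in Excel. This sketch assumes a hypothetical `Main` column is also present (it is not in the example above) and subsets the values inside `sum()` with the condition:

```r
library(dplyr)

# Hypothetical data: Main is added here purely for illustration
myData <- data.frame(Policy = c(1, 2, 2, 2, 4, 3, 4, 8),
                     Main = c("A", "E", "A", "A", "A", "E", "E", "A"),
                     Premium = c(5, 6, 8, 7, 9, 2, 1, 4))

# Per row: total Premium over rows with the same Policy and Main == "A"
myData <- myData %>%
  group_by(Policy) %>%
  mutate(output = sum(Premium[Main == "A"])) %>%
  ungroup()
```

Policies with no "A" rows get 0, because `sum()` of an empty vector is 0.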

While writing this, I realised that I could write the original code the same way and make it even simpler (maybe slightly slower, but still within several seconds on 5 million rows):

myData = data.frame(Policy = c(1:3, 4,2,5), Main = c("A", "E","E", "A", "A", "A"))
myData = myData %>% group_by(Policy) %>% mutate(output = sum(Main == "A"))

  Policy Main  output
   <dbl> <fct>  <int>
1      1 A          1
2      2 E          1
3      3 E          0
4      4 A          1
5      2 A          1
6      5 A          1
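
One thing to be aware of with this form: the result is still a grouped tibble, so later dplyr operations keep working per Policy until `ungroup()` is called. A short sketch on the same toy data:

```r
library(dplyr)

myData <- data.frame(Policy = c(1:3, 4, 2, 5),
                     Main = c("A", "E", "E", "A", "A", "A"))

grouped <- myData %>%
  group_by(Policy) %>%
  mutate(output = sum(Main == "A"))
is_grouped_df(grouped)   # TRUE: still grouped by Policy

# Drop the grouping once the per-policy count has been added
result <- grouped %>% ungroup()
is_grouped_df(result)    # FALSE
```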

These magical functions are all part of the Tidyverse. I suggest you read more about those and you'll be able to replicate any Excel function faster and better

PJ

Karim492 · December 18, 2019, 5:37pm

PJ. You're a savior.

I'll try this at work tomorrow on my data set and check the results.

This does look like a better approach, since the code is small and neat as well.

Thanks!

Scoco · January 2, 2020, 12:09pm

@Yarnabrina's data.table expression is truly fast, but it sorts dt by Policy, which may be unhelpful.
If you want to retain the order, you can do it in base R with:

set.seed(seed = 47418)

n_rows <- 5e6

df <- data.frame(Policy = sample.int(n = 50,
                                     size = n_rows,
                                     replace = TRUE), 
                 Main = sample(x = c("A", "E"),
                               size = n_rows,
                               replace = TRUE,
                               prob = c(0.9, 0.1)))

tt <- tapply(df$Main, df$Policy, function(mm) sum(mm=="A"))
df$Count <- tt[df$Policy]

Timings on my machine are comparable to @Yarnabrina's expression (across a wide range of n, both take longer for larger n), but it retains the original ordering and tt gives a compact count of records for each distinct Policy number which might be useful if there are many repeats (~100,000 records per Policy in this example).
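
One caveat, sketched on a small made-up example: indexing `tt` with the raw policy numbers selects by position, which lines up above only because the policy numbers run from 1 to 50 with no gaps. For arbitrary IDs, a safer variant looks each row up by name instead:

```r
# Made-up data with non-contiguous policy IDs
df <- data.frame(Policy = c(10, 30, 30, 20, 10),
                 Main = c("A", "E", "A", "E", "A"))

# Count of "A" rows per distinct Policy; the names of tt are the policy IDs
tt <- tapply(df$Main, df$Policy, function(mm) sum(mm == "A"))

# Look up each row's count by the policy's name rather than its position
df$Count <- as.vector(tt[as.character(df$Policy)])
```

The character conversion adds some overhead, so the timings mentioned above would need to be re-checked for this variant.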

martin.R · January 2, 2020, 1:53pm

If you want to retain the order, just leave out the 'key', i.e.:

library(data.table)
dt[, output := sum(Main == "A"), by = Policy]
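
A small sketch of the difference, on made-up data. Each variant works on a `copy()` because `:=` modifies a data.table by reference. The comment on the `keyby` variant follows the behaviour described earlier in this thread:

```r
library(data.table)

# Made-up data, deliberately not sorted by Policy
dt <- data.table(Policy = c(3, 1, 2, 1, 3),
                 Main = c("A", "E", "A", "A", "A"))

# by = adds the per-policy count and keeps the rows in their original order
by_version <- copy(dt)[, output := sum(Main == "A"), by = Policy]

# keyby = adds the same count but, as noted above, also sorts the table by Policy
keyby_version <- copy(dt)[, output := sum(Main == "A"), keyby = Policy]
```

Comparing `by_version$Policy` with `keyby_version$Policy` shows whether the row order was preserved.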
