Countif help for a large table

Hi,

I am a beginner and have no programming experience. I am trying to use RStudio for a quick and easy project.

I have imported a table from Hive which has about 35k rows. A particular column "A" has multiple different values/names - John, Shawn, Todd, etc. I want to count how many times each of these names is repeated in the entire data set. For example, John was repeated 15 times, Shawn was repeated 300 times, etc.

In Excel this would be a simple countif formula (=countif(A:A,A2)) that you can drag down the ~35k cells at the click of a button to get an answer.

There are more sophisticated ways, but at its most basic you can use table():

dat1 <-  data.frame(A = sample(c("John", "Shawn", "Todd", "Mike"), 20, replace = TRUE),
                    B = rnorm(20, mean = 10, sd = 3))
table(dat1$A)

To get the counts out of the table as a plain vector (note that as.vector() drops the names, so the counts follow the sorted order of the names):

dat2 <-   as.vector(table(dat1$A))
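
If you would rather keep each name next to its count, or add an Excel-style countif column in base R, something along these lines should work. This is only a sketch: `dat`, `counts` and the column name `n` are made-up names for illustration, and the six-row data frame stands in for your real column A.

```r
# Hypothetical example data: a small, fixed stand-in for column A
dat <- data.frame(A = c("John", "Shawn", "Todd", "Shawn", "John", "Shawn"))

# One row per name, with the name kept alongside its count
counts <- as.data.frame(table(dat$A), stringsAsFactors = FALSE)
names(counts) <- c("A", "n")
counts

# Excel-style countif column: look up each row's name in the table of counts
dat$n <- as.vector(table(dat$A)[dat$A])
dat
```

The lookup in the last step works because indexing a table with a character vector matches on the names, so every row gets the count for its own name.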

Tidy option:

library(tidyverse)

dat1 <-  data.frame(A = sample(c("John", "Shawn", "Todd", "Mike"), 20, replace = TRUE),
                    B = rnorm(20, mean = 10, sd = 3))

# Summarises by name
dat1 %>% 
  group_by(A) %>% 
  count()
#> # A tibble: 4 x 2
#> # Groups:   A [4]
#>   A         n
#>   <chr> <int>
#> 1 John      5
#> 2 Mike      2
#> 3 Shawn     8
#> 4 Todd      5

# More like adding a `countif()` column in Excel
dat1 %>% 
  group_by(A) %>% 
  mutate(n = n())
#> # A tibble: 20 x 3
#> # Groups:   A [4]
#>    A         B     n
#>    <chr> <dbl> <int>
#>  1 Shawn  8.46     8
#>  2 Shawn 14.9      8
#>  3 Todd  12.7      5
#>  4 Todd  13.6      5
#>  5 Todd  10.7      5
#>  6 Todd  10.6      5
#>  7 John  11.8      5
#>  8 Shawn 11.3      8
#>  9 John   6.71     5
#> 10 Shawn 10.4      8
#> 11 John   8.46     5
#> 12 Mike  11.4      2
#> 13 Shawn  9.02     8
#> 14 John  11.4      5
#> 15 Todd   8.05     5
#> 16 Shawn  8.70     8
#> 17 Shawn  7.62     8
#> 18 Mike   7.81     2
#> 19 John   7.69     5
#> 20 Shawn 10.7      8

Created on 2021-10-05 by the reprex package (v2.0.0)
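
For what it's worth, dplyr also has shortcuts for both of the steps above: `count()` accepts the column directly, and `add_count()` adds the count column without leaving the result grouped. A minimal sketch, assuming dplyr is installed (it is loaded by `library(tidyverse)`); `dat`, `by_name` and `with_n` are hypothetical names and the data is a small fixed example, not the data from the reprex:

```r
library(dplyr)

# Hypothetical example data: a small, fixed stand-in for column A
dat <- data.frame(A = c("John", "Shawn", "Todd", "Shawn", "John", "Shawn"))

# Summary table, most frequent name first
by_name <- dat %>% count(A, sort = TRUE)
by_name

# Countif-style column added to every row
with_n <- dat %>% add_count(A)
with_n
```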

Thanks for your time and feedback. I will take some time to understand the syntax and implement it. I am just starting R, so I am not sure what would be the easiest way to break this down. I went to the Help section in RStudio to read about "data.frame" but couldn't make much of it. Maybe I need to go through some basic training?

Material for you to study.

  1. Tidyverse; you can quite quickly learn to be very effective in R by studying this package's syntax
    https://r4ds.had.co.nz/

  2. base R; my favoured course for learning base R is swirl.
    swirl: Learn R, in R. (swirlstats.com)

I would suggest you look at these in the order I present them here: R4DS to be able to do a lot fairly easily, and swirl to fill in the gaps in your R knowledge and to be thorough.
