Problems with counting columns in data table

Hello everyone,

I have a data table with more than 5'000'000 rows and 18 columns.
I would like to count all my rows for each year, if a certain column (occupation) is a specific value.
Fortunately, I was already able to successfully create the table, which obtains the desired occupation for each year and lists also the frequency.
Unfortunately, I struggle with counting the frequency of the occupation 9920 in each year. Therefore, my problem is that I don't know how to connect the data that is stored in my data table with the commands that count the frequency.

A selection of my data:

row   YEAR OCC2010
  1   1990    4600
  2   1990    2810
  3   1990    9920
  4   1990    9920
  5   1991    5600
  6   1991    5700
  7   1991    5160
...

My file is:

setwd("C:\\Users\\power\\Desktop\\Data Thesis")
library(ipumsr)
library(dplyr, warn.conflicts = FALSE)

cps_ddi <- read_ipums_ddi("cps_00002.xml")
cps_data <- read_ipums_micro(cps_ddi, verbose = FALSE)

View(cps_data)

# Problem
cps_data(year = (sample(c("1990", "1991", "1992", "1993"), 100000, replace=T)), 
               OCC2010 = rep("9920", 100000))

combns = expand.grid(c("1990", "1991", "1992", "1993"), "9920")

combns = cbind (combns, apply(combns, 1, function(x)sum(transp$YEAR==x[1] & transp$OCC2010==x[2])))
colnames(combns) = c("year", "occupation", "count")

View(combns)

My desired table, which I've created already succesfully

row   year  occupation count
  1   1990    9920       0
  2   1991    9920       0
  3   1992    9920       0 
  4   1992    9920       0

Many thanks in advance and I hope somebody could help me

Xx
Freddy

Can you provide a reproducible example (not a screenshot)?

1 Like

I am sorry but this code makes no sense.

a = data.table(cps_data=YEAR(c("1990", "1991", "1992", "1993"), 100000, replace=T), cps_data=OCC2010(c("9420", 100000, replace=T))

Perhaps you wanted something like this

a = data.table(year = (sample(c("1990", "1991", "1992", "1993"), 100000, replace=T)), 
      OCC2010 = rep("9420", 100000))

Also this seems meaningless,

sum(a$YEAR == "1990" & a$OCC2010=="9420")

What we need is some sample raw data, presumably from cps_data
and a description, in words of what you want to do.

Please have another look at the link willianl supplied.

Thanks a lot for your reply! I've already tried to implement your advise, but it did not work..
I think my problem is just super basic, namely that I don't know really how to connect the data from my table with the commands listed below #Problem.

Anyway
Thanks a lot for your help!

I have no idea of what you are trying to achieve but if this is what you want to do:

then

library(dplyr)
 cps_data    %>%  filter(OCC == 9920)   %>%  
       count(OCC, YEAR)  

should do it. The rest of the code has no context.

Given that the request is for data.table, the equivalent of the dplyr solution above would be:
cps_data[OCC == 9920, .(count = .N), .(OCC, YEAR)]

That's what I originally thought but the code is confusing enough that I changed my opinion. It well could be a data.table.

I think I only read the post title and your initial reply, so looking at the original request it does look like it was for dplyr. Nevermind, both would produce the same result.

1 Like

Thanks a lot! It worked !
Best regards
Freddy

Which solution worked? The dplyr one or the data.table one?

The dplyr one!
The data.table one did not work as it could not find a function for "."
Thanks again!

That is because your data is in a data.frame or a tibble. You would have to install and load the data.table package plus you would have to convert your data.frame to a data.table.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.