Problems with counting columns in data table

freddywit · February 8, 2021, 7:11pm

Hello everyone,

I have a data table with more than 5'000'000 rows and 18 columns.
I would like to count all my rows for each year, if a certain column (occupation) is a specific value.
Fortunately, I was already able to successfully create the table, which obtains the desired occupation for each year and lists also the frequency.
Unfortunately, I struggle with counting the frequency of the occupation 9920 in each year. Therefore, my problem is that I don't know how to connect the data that is stored in my data table with the commands that count the frequency.

A selection of my data:

row   YEAR OCC2010
  1   1990    4600
  2   1990    2810
  3   1990    9920
  4   1990    9920
  5   1991    5600
  6   1991    5700
  7   1991    5160
...

My file is:

setwd("C:\\Users\\power\\Desktop\\Data Thesis")
library(ipumsr)
library(dplyr, warn.conflicts = FALSE)

cps_ddi <- read_ipums_ddi("cps_00002.xml")
cps_data <- read_ipums_micro(cps_ddi, verbose = FALSE)

View(cps_data)

# Problem
cps_data(year = (sample(c("1990", "1991", "1992", "1993"), 100000, replace=T)), 
               OCC2010 = rep("9920", 100000))

combns = expand.grid(c("1990", "1991", "1992", "1993"), "9920")

combns = cbind (combns, apply(combns, 1, function(x)sum(transp$YEAR==x[1] & transp$OCC2010==x[2])))
colnames(combns) = c("year", "occupation", "count")

View(combns)

My desired table, which I've created already succesfully

row   year  occupation count
  1   1990    9920       0
  2   1991    9920       0
  3   1992    9920       0 
  4   1992    9920       0

Many thanks in advance and I hope somebody could help me

Xx
Freddy

williaml · February 8, 2021, 9:05pm

Can you provide a reproducible example (not a screenshot)?

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

jrkrideau · February 8, 2021, 10:58pm

I am sorry but this code makes no sense.

a = data.table(cps_data=YEAR(c("1990", "1991", "1992", "1993"), 100000, replace=T), cps_data=OCC2010(c("9420", 100000, replace=T))

Perhaps you wanted something like this

a = data.table(year = (sample(c("1990", "1991", "1992", "1993"), 100000, replace=T)), 
      OCC2010 = rep("9420", 100000))

Also this seems meaningless,

sum(a$YEAR == "1990" & a$OCC2010=="9420")

What we need is some sample raw data, presumably from cps_data
and a description, in words of what you want to do.

Please have another look at the link willianl supplied.

freddywit · February 8, 2021, 11:54pm

Thanks a lot for your reply! I've already tried to implement your advise, but it did not work..
I think my problem is just super basic, namely that I don't know really how to connect the data from my table with the commands listed below #Problem.

Anyway
Thanks a lot for your help!

jrkrideau · February 9, 2021, 3:05am

I have no idea of what you are trying to achieve but if this is what you want to do:

then

library(dplyr)
 cps_data    %>%  filter(OCC == 9920)   %>%  
       count(OCC, YEAR)

should do it. The rest of the code has no context.

martin.R · February 9, 2021, 10:28am

Given that the request is for data.table, the equivalent of the dplyr solution above would be:
cps_data[OCC == 9920, .(count = .N), .(OCC, YEAR)]

jrkrideau · February 9, 2021, 6:19pm

That's what I originally thought but the code is confusing enough that I changed my opinion. It well could be a data.table.

martin.R · February 9, 2021, 6:32pm

I think I only read the post title and your initial reply, so looking at the original request it does look like it was for dplyr. Nevermind, both would produce the same result.

freddywit · February 10, 2021, 11:30am

Thanks a lot! It worked !
Best regards
Freddy

jrkrideau · February 10, 2021, 2:14pm

Which solution worked? The dplyr one or the data.table one?

freddywit · February 10, 2021, 2:29pm

The dplyr one!
The data.table one did not work as it could not find a function for "."
Thanks again!

jrkrideau · February 10, 2021, 3:06pm

That is because your data is in a data.frame or a tibble. You would have to install and load the data.table package plus you would have to convert your data.frame to a data.table.

system · March 7, 2021, 9:23am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.