Summarized data using dplyr

srini · May 21, 2019, 11:19am

Hi I am trying to summarize the data using dplyr. I have a dataset with few columns the code has to filter the data with date_Created column and see if the conditions are matched within the datasets and output a new column with counts (which matches the filtering condition. Let me give an example
I have columns Date_created, Skills, grade. So the first row is picked in the dataset and checked if there are any similar conditions 6 months before if yes then count 1 is added to the column.
eg. Date_created skill grade
2016-09-01 maths 4
2016-10-01 physics 5
2016-03 -01 maths 5
2016-03-01 maths 4

so the output will be in this case the new column count will be 2 for the first row because date_created 2016-03-01 is less than 2016-09-01 twice and the skill is maths and grade is 4 and 4+1,
date_created sklll grade count
2016-09-01 maths 4 2
2016-10-01 physics 5
2016-03 -01 maths 5
2016-03-01 maths 4

Any help would be great, thanks

makrez · May 21, 2019, 1:41pm

Hi srini,

First of all, it would be nice if you could provide the data as code, then people save some time and don't have to build the data frame first.

I used the package lubridate.

# Building data frame
library(lubridate)

set_date <- function(date){
  as.Date(date, format = "%Y-%m-%d")
}

df <- tibble(
  test_date = c(set_date("2016-09-01"), set_date("2016-10-01"), set_date("2016-03-01"), set_date("2016-03-01")),
  subject = c("maths", "physics", "maths", "maths"),
  grade = c(4,5,5,4)
)

I usually use a helper data frame for such operations, which might be slow with big data sets. Maybe there is a better way how to do this in a single mutate() statement. Anyway this should work:

# Function definition

add_counting <- function(df, date_col, subject_col){
  d <- enquo(date_col)
  s <- enquo(subject_col)
  helper_df <- df %>% mutate(!!d := !!d %m+% months(6)) %>%  #add 6 months to the date
    group_by(!!d, !!s) %>% summarise(count = n())
  df %>% left_join(helper_df, by = c(quo_name(d),quo_name(s)))
}

add_counting(df, test_date, subject)

srini · May 21, 2019, 2:14pm

Hi @makrez,

Thank you for providing the solution. Its helpful. But when i am trying to replicate i am not getting the solution. Please find below the sample data and the code which i am trying.
Actually the code should run through the dataset and check every row and to see the conditions are matched. It should take for example Job ==2 and the Job ==3 ie. Job and Job +1.if yes the count is assigned.

skill	conditions	Job	Date created
Art	L	2	1/30/2016
science	E	3	2/27/2017
maths	L	4	3/20/2018
maths	L	5	4/22/2017
sciencce	L	6	5/26/2018

df1 %>%
mutate(row_id = row_number()) %>%
group_by(row_id) %>%
mutate(out = sum(Skill == Skill &
(job == job | job + 1 == job) &
conditions == "L" &
Date_Created > = 6month &
)) %>%
ungroup()

makrez · May 21, 2019, 3:01pm

Hi @srini,

what do you mean that you could not replicate the it? Did the code not work when tried with the data that I gave you or does it not work for the data that you will actually use?

I am sorry, but I am having a hard time understanding what you want to do. I thought that you wanted to check whether there are entries that are exactly 6 months apart and meet the condition "skills". Then theg the count of the most recent entry goes up by 1. Now I am not sure what you actually need.

If you want to follow this up, provide your data in code (i.e. data <- data.frame(skill=c(...) etc.)) as well as the expected output. Then give some clear explanations on which conditions the count of each row should go up by 1.

system · June 11, 2019, 3:01pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.