Hello,
I'm trying to see how the share held by each category changes over two decades in some data. Essentially, for each year I want to be able to say what ratio of that year's total each category holds, and then display those ratios across the two decades. Does anyone have any advice on how to start this?
Cheers,
J. Maxwell
It is hard to give specific advice without knowing more about the data you are starting with. For example, do you already have yearly values for the categories or does the yearly value have to be calculated? Please give more information about your data. It would help a lot if you could show a bit of the data. If it is in a data frame called DF, the result of the following command would be very helpful.
dput(head(DF))
Paste the result of that between two lines that consist of only three back ticks.
```
Your output here.
```
Each category doesn't have a specific yearly value. The categories are based off of title posts, and the date is recorded for each title post. I ran the dput(head(DF)), but because of the length of the titles the output is pretty unwieldy. What specifically are you looking for with the dput? Here is an example of some of the data. Does this help clear things up? I apologize for being unclear initially.
Category | month | replies | title | views | year |
---|---|---|---|---|---|
Education | 10 | 463 | NEW WHITE NATIONAL SCHOOL (K-12) FORMING.... | 210859 | 2005 |
Education | 5 | 72 | Homeschool lessons | 69853 | 2004 |
Children | 1 | 52 | Book suggestions for children and young adults | 34967 | 2012 |
Misc | 8 | 304 | Firefox - A Better Browser For Whites That's Spreading Like Wildfire | 166373 | 2005 |
Education | 8 | 46 | College textbooks: buy/sell | 38115 | 2007 |
Misc | 12 | 42 | Free Online Course on How to Start a Business | 43691 | 2004 |
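If long titles make the dput() output hard to read, one option is to keep only the columns that matter for the question before calling dput(). The sketch below uses a made-up data frame called `posts`; only the column names are taken from the table above, and the values are invented for illustration.

```r
# Hypothetical stand-in for the real data; the values are invented
posts <- data.frame(
  Category = c("Education", "Misc"),
  title    = c("A very long thread title ...", "Another long thread title ..."),
  views    = c(100L, 200L),
  year     = c(2004L, 2005L)
)

# Keep only the columns needed for the ratio calculation,
# then print R code that recreates the first few rows
small <- head(posts[, c("Category", "year", "views")])
dput(small)
```

Whoever answers can then paste that output straight into R and work with the same structure as the real data.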
I invented a small data set with only two categories, two years, and no extra columns, but I think that is sufficient to show the method.
library(dplyr)
DF <- data.frame(category = rep(LETTERS[1:2], each = 6),
views = c(143, 198, 87, 252, 632, 56, 484, 399, 144, 256, 532, 333),
year = rep(2010:2011, 6))
DF
#> category views year
#> 1 A 143 2010
#> 2 A 198 2011
#> 3 A 87 2010
#> 4 A 252 2011
#> 5 A 632 2010
#> 6 A 56 2011
#> 7 B 484 2010
#> 8 B 399 2011
#> 9 B 144 2010
#> 10 B 256 2011
#> 11 B 532 2010
#> 12 B 333 2011
# Total views per year
AnnualTotal <- DF %>% group_by(year) %>% summarize(Total = sum(views))
#> `summarise()` ungrouping output (override with `.groups` argument)
# Total views per category within each year
Cat_Year <- DF %>% group_by(category, year) %>%
  summarise(GroupTotal = sum(views))
#> `summarise()` regrouping output by 'category' (override with `.groups` argument)
# Attach each year's total to the matching category rows
Cat_Year <- inner_join(Cat_Year, AnnualTotal, by = "year")
Cat_Year
#> # A tibble: 4 x 4
#> # Groups: category [2]
#> category year GroupTotal Total
#> <chr> <int> <dbl> <dbl>
#> 1 A 2010 862 2022
#> 2 A 2011 506 1494
#> 3 B 2010 1160 2022
#> 4 B 2011 988 1494
# Each category's share of its year's total
Cat_Year <- Cat_Year %>% mutate(Ratio = GroupTotal/Total)
Cat_Year
#> # A tibble: 4 x 5
#> # Groups: category [2]
#> category year GroupTotal Total Ratio
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 A 2010 862 2022 0.426
#> 2 A 2011 506 1494 0.339
#> 3 B 2010 1160 2022 0.574
#> 4 B 2011 988 1494 0.661
Created on 2020-09-09 by the reprex package (v0.3.0)
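As an alternative sketch, not a correction of the code above, the same ratios can be computed without a separate join: sum the views per year and category, then divide each group total by the sum of group totals within its year. This assumes dplyr 1.0.0 or later, since it uses the `.groups` argument mentioned in the messages above. The name `Ratios` is just an illustrative choice.

```r
library(dplyr)

# Same invented data as above
DF <- data.frame(category = rep(LETTERS[1:2], each = 6),
                 views = c(143, 198, 87, 252, 632, 56, 484, 399, 144, 256, 532, 333),
                 year = rep(2010:2011, 6))

Ratios <- DF %>%
  group_by(year, category) %>%
  summarise(GroupTotal = sum(views), .groups = "drop") %>%  # one row per year/category
  group_by(year) %>%
  mutate(Total = sum(GroupTotal),          # yearly total, repeated on each row
         Ratio = GroupTotal / Total) %>%   # share of the year held by the category
  ungroup()
Ratios
```

Within each year the Ratio column should add up to 1, which is a quick way to check the result on real data.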
library(dplyr)
DF <- data.frame(category = rep(LETTERS[1:2], each = 6),
views = c(143, 198, 87, 252, 632, 56, 484, 399, 144, 256, 532, 333),
year = rep(2010:2011, 6))
For this section, is there an easier way to supply the views in your script? I have 3,169 different titles, and therefore a lot of different view counts, and typing each one out seems a bit much.
That part of the code is simply me inventing some data. You should use your own data set that you partially displayed in a previous post. Whatever it is named, substitute that name for DF wherever DF appears, starting with the line
AnnualTotal <- DF %>% group_by(year) %>% summarize(Total = sum(views))
I'm still struggling a bit to get your script to meld with my data. I probably should have prefaced that I am relatively new to R. Here's where I think I am getting stuck.
- You're creating an object (AnnualTotal) that should have the titles organized and then summarized. When I try to run that line on my data, however, I am met with this error: "Error in summarize(., Total = sum(views)) :
argument "by" is missing, with no default"
I do not know what the problem is. Please post the output of
dput(head(DF))
except replace DF with the name of your data frame. When you paste the output into your reply, put a line containing only three back ticks just before and after. Like this
```
Paste your output here
```
The back tick key is just to the left of the number 1 on US keyboards.
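One thing that may be worth checking in the meantime, offered as a guess rather than a diagnosis: dplyr's summarize() has no argument named `by`, so an error about a missing `by` could mean that a summarize() from another attached package is being called instead. Hmisc is one package that, as far as I know, has a summarize() with a `by` argument. The sketch below uses invented data and shows how to see which package the bare name resolves to and how to call the dplyr version explicitly.

```r
library(dplyr)

# Invented data, just to have something to summarise
DF <- data.frame(year = c(2010, 2010, 2011), views = c(1, 2, 3))

# Which package does the bare name summarize() come from in this session?
environmentName(environment(summarize))

# Calling the function through the dplyr namespace avoids picking up
# a same-named function from another attached package
AnnualTotal <- DF %>% group_by(year) %>% dplyr::summarise(Total = sum(views))
AnnualTotal
```

If the first call reports something other than dplyr, the explicit `dplyr::` prefix is one way around the clash.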
> dput(head(SF))
structure(list(Category = c("Education", "Education", "Children",
"Misc", "Education", "Misc"), month = c(10L, 5L, 1L, 8L, 8L,
12L), replies = c(463L, 72L, 52L, 304L, 46L, 42L), title = c("NEW WHITE NATIONAL SCHOOL (K-12) FORMING....",
"Homeschool lessons", "Book suggestions for children and young adults",
"Firefox - A Better Browser For Whites That's Spreading Like Wildfire",
"College textbooks: buy/sell", "Free Online Course on How to Start a Business"
), views = c(210859L, 69853L, 34967L, 166373L, 38115L, 43691L
), year = c(2005L, 2004L, 2012L, 2005L, 2007L, 2004L), moyr = c(50,
33, 125, 48, 72, 40)), row.names = c(NA, 6L), class = "data.frame")
The following code works for me using the data you posted. The only changes I made to my original code were to write Category with an upper case C to match your data and to add the arrange() function at the end to sort the final data frame so that data from each year are displayed together.
library(dplyr)
DF <- structure(list(Category = c("Education", "Education", "Children",
"Misc", "Education", "Misc"),
month = c(10L, 5L, 1L, 8L, 8L, 12L),
replies = c(463L, 72L, 52L, 304L, 46L, 42L),
title = c("NEW WHITE NATIONAL SCHOOL (K-12) FORMING....",
"Homeschool lessons", "Book suggestions for children and young adults",
"Firefox - A Better Browser For Whites That's Spreading Like Wildfire",
"College textbooks: buy/sell",
"Free Online Course on How to Start a Business"),
views = c(210859L, 69853L, 34967L, 166373L, 38115L, 43691L),
year = c(2005L, 2004L, 2012L, 2005L, 2007L, 2004L),
moyr = c(50,33, 125, 48, 72, 40)), row.names = c(NA, 6L), class = "data.frame")
# Total views per year
AnnualTotal <- DF %>% group_by(year) %>% summarize(Total = sum(views))
#> `summarise()` ungrouping output (override with `.groups` argument)
# Total views per Category within each year
Cat_Year <- DF %>% group_by(Category, year) %>%
  summarise(GroupTotal = sum(views))
#> `summarise()` regrouping output by 'Category' (override with `.groups` argument)
# Attach each year's total to the matching Category rows
Cat_Year <- inner_join(Cat_Year, AnnualTotal, by = "year")
Cat_Year
#> # A tibble: 6 x 4
#> # Groups: Category [3]
#> Category year GroupTotal Total
#> <chr> <int> <int> <int>
#> 1 Children 2012 34967 34967
#> 2 Education 2004 69853 113544
#> 3 Education 2005 210859 377232
#> 4 Education 2007 38115 38115
#> 5 Misc 2004 43691 113544
#> 6 Misc 2005 166373 377232
# Each Category's share of its year's total, sorted so each year's rows sit together
Cat_Year <- Cat_Year %>% mutate(Ratio = GroupTotal/Total) %>%
  arrange(year, Category)
Cat_Year
#> # A tibble: 6 x 5
#> # Groups: Category [3]
#> Category year GroupTotal Total Ratio
#> <chr> <int> <int> <int> <dbl>
#> 1 Education 2004 69853 113544 0.615
#> 2 Misc 2004 43691 113544 0.385
#> 3 Education 2005 210859 377232 0.559
#> 4 Misc 2005 166373 377232 0.441
#> 5 Education 2007 38115 38115 1
#> 6 Children 2012 34967 34967 1
Created on 2020-09-11 by the reprex package (v0.3.0)
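The original question also asked about displaying the ratios over time. One possible way to do that, assuming the ggplot2 package is installed, is a stacked bar per year. This is only a sketch: it rebuilds the six posted rows with just the three columns the calculation uses, and the axis labels and the object name `p` are illustrative choices.

```r
library(dplyr)
library(ggplot2)

# The six rows posted above, reduced to the columns the calculation needs
DF <- data.frame(
  Category = c("Education", "Education", "Children", "Misc", "Education", "Misc"),
  views    = c(210859L, 69853L, 34967L, 166373L, 38115L, 43691L),
  year     = c(2005L, 2004L, 2012L, 2005L, 2007L, 2004L)
)

# Share of each year's views held by each Category
Cat_Year <- DF %>%
  group_by(year, Category) %>%
  summarise(GroupTotal = sum(views), .groups = "drop") %>%
  group_by(year) %>%
  mutate(Ratio = GroupTotal / sum(GroupTotal)) %>%
  ungroup()

# One stacked bar per year; the segments of each bar add up to 1
p <- ggplot(Cat_Year, aes(x = factor(year), y = Ratio, fill = Category)) +
  geom_col() +
  labs(x = "Year", y = "Share of views")
p
```

With the full data set, the same code should give one bar for every year in the two decades.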
That worked. Thank you so much for your help.