Plot Error - Summarize function

Whenever I try to run the plot function, it shows me an error. How do I get rid of this error? Please help.

edx %>% group_by(movieId) %>% summarize(n = n()) %>% ggplot(aes(n)) +
geom_histogram(fill = "rosybrown2", col .... [TRUNCATED]
summarise() ungrouping output (override with .groups argument)

Code:

edx %>% group_by(movieId) %>%
summarize(n = n()) %>%
ggplot(aes(n)) +
geom_histogram(fill = "rosybrown2", color = "black", bins = 10) +
scale_x_log10() +
ggtitle("Total number of movies Ratings")

The summarise() line (the bit in bold) is an informational message, not an error.
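If the message is distracting, a minimal sketch of one way to silence it follows. It assumes dplyr 1.0.0 or later, where summarize() accepts a .groups argument, and it uses a made-up data frame DF as a stand-in for edx.

```r
library(dplyr)

set.seed(1)
# DF is a made-up stand-in for edx (assumption, not the real data)
DF <- data.frame(movieId = sample(1:100, size = 500, replace = TRUE),
                 rating = sample(1:5, size = 500, replace = TRUE))

# Stating .groups explicitly suppresses the informational message
counts <- DF %>%
  group_by(movieId) %>%
  summarize(n = n(), .groups = "drop")

head(counts)
```

The result is the same either way; the argument only makes the ungrouping explicit.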

This is also one of the errors I keep getting. I cannot understand my mistake. Could anybody please help?

edx %>%
  group_by(userId) %>% ggplot(aes(n)) +
  geom_histogram(color = "cyan", bins = 10) +
  scale_x_log10() + xlab("Number of ratings") + .... [TRUNCATED]
Error: Aesthetics must be valid data columns. Problematic aesthetic(s): x = n.
Did you mistype the name of a data column or forget to add after_stat()?

Try this simplified reproducible example. Does it work for you?

Please notice that I have included the data and the complete code, not images of them.

library(ggplot2)
library(dplyr)

DF <- data.frame(movieId = sample(1:100,size = 500, replace = TRUE),
                 rating = sample(1:5, size = 500, replace = TRUE))

#This works
DF %>% group_by(movieId) %>% 
  summarize(n = n()) %>% 
ggplot(aes(n)) + geom_histogram(fill = "rosybrown2")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


#This does not work because of the missing summarize()
DF %>% group_by(movieId) %>% 
  ggplot(aes(n)) + geom_histogram(fill = "rosybrown2")
#> Don't know how to automatically pick scale for object of type function. Defaulting to continuous.
#> Error: Aesthetics must be valid data columns. Problematic aesthetic(s): x = n. 
#> Did you mistype the name of a data column or forget to add after_stat()?

Created on 2020-06-13 by the reprex package (v0.3.0)

I created DF as a convenience because I do not have your data. You should substitute edx where I wrote

DF %>% group_by() %>% 

This calculation should not take so much time. I believe you have 9 million rows, so the code could take a noticeable amount of time but nothing like five hours. I increased my data frame DF to 9 million rows and the calculation ran in about 1 second on my laptop that has 8 GB of memory.
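A rough sketch of how such a timing check could be done is below. The data are synthetic, the number of distinct movies (n_movies) is an assumption for illustration only, and the elapsed time will differ from machine to machine.

```r
library(dplyr)

n_rows <- 9e6      # roughly the size mentioned above; lower this on a machine with little memory
n_movies <- 10000  # assumed number of distinct movies, for illustration only

big <- data.frame(movieId = sample(seq_len(n_movies), size = n_rows, replace = TRUE),
                  rating = sample(1:5, size = n_rows, replace = TRUE))

# system.time() reports how long the grouped count takes
timing <- system.time(
  big_counts <- big %>% group_by(movieId) %>% summarize(n = n())
)
print(timing)
```

If this step finishes in seconds, the slowdown is more likely in the plotting step or in the R session itself than in the summarize() call.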

The code runs, but the plot pane remains empty.
How do I speed up the plotting? It has been over half an hour since I ran the code you provided, and the plot pane is still empty.

This code

library(ggplot2)
library(dplyr)

DF <- data.frame(movieId = sample(1:100,size = 500, replace = TRUE),
                 rating = sample(1:5, size = 500, replace = TRUE))


DF %>% group_by(movieId) %>% 
  summarize(n = n()) %>% 
ggplot(aes(n)) + geom_histogram(fill = "rosybrown2")

should run very quickly. The data frame has 500 rows and the plotted data has 100 rows. Try breaking it up into three steps, run each step individually, and find which step is taking so long.

library(ggplot2)
library(dplyr)

DF <- data.frame(movieId = sample(1:100,size = 500, replace = TRUE),
                 rating = sample(1:5, size = 500, replace = TRUE))

DF2 <- DF %>% group_by(movieId) %>% 
  summarize(n = n()) 

ggplot(DF2, aes(n)) + geom_histogram(fill = "rosybrown2")

The plot is still not forming. Would you know how to speed up the plotting?

Run sessionInfo() in the console and share the output with us; it might reveal an issue.

Also try dev.off() to see whether it has any effect on the plot window.
If it does, run the plotting code again.

The plot formed. It took over an hour though.

A quick question on your "group_by(movieId) %>% summarize(n = n())":

Since you have a single column in group_by(), is that not the same as "count(movieId)", which would do away with group_by() + summarise()?

Yes, both are the same.
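A small sketch to check this on synthetic data (DF is a made-up data frame, not edx):

```r
library(dplyr)

set.seed(1)
# DF is a made-up stand-in for edx
DF <- data.frame(movieId = sample(1:100, size = 500, replace = TRUE),
                 rating = sample(1:5, size = 500, replace = TRUE))

by_summarize <- DF %>% group_by(movieId) %>% summarize(n = n(), .groups = "drop")
by_count <- DF %>% count(movieId)

# Both approaches give one row per movieId with the same counts
identical(by_summarize$movieId, by_count$movieId)
identical(by_summarize$n, by_count$n)
```

count() names its result column n by default, so the ggplot(aes(n)) call downstream does not need to change.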

Hey, could you please help me by running the code below in your RStudio and telling me whether you get the same error?

avg_users <- edx %>%
left_join(avg_movie_rating, by='movieId') %>%
group_by(userId) %>%
filter(n() >= 100) %>%
summarize(b_u = mean(rating - mu - b_i))
Error: cannot allocate a vector of size 52.9 MB

I do not have the objects edx or avg_movie_rating, so I cannot conclude anything from running that command. Try this

nrow(edx)
tmp <- edx %>%
left_join(avg_movie_rating, by='movieId')
nrow(tmp)

Does tmp have the number of rows you expect? You may get the error just by running that part of the code. If so, I would suspect that some movieId values appear more than once in avg_movie_rating, so the left_join ends up making multiple copies of the matching rows in edx. On the other hand, 52.9 MB is not very large. Do you have a computer with little memory?
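Here is a sketch of how you could check for duplicated keys. The two small data frames are made up for illustration (edx_demo and avg_demo are hypothetical stand-ins for edx and avg_movie_rating), so adapt the names to your own objects.

```r
library(dplyr)

# Made-up stand-ins for edx and avg_movie_rating
edx_demo <- data.frame(movieId = c(1, 1, 2, 3), rating = c(4, 5, 3, 2))
avg_demo <- data.frame(movieId = c(1, 2, 2, 3), b_i = c(0.5, -0.2, -0.2, -1.0))

# movieId values that appear more than once in the lookup table
dupes <- avg_demo %>% count(movieId) %>% filter(n > 1)
print(dupes)

# A duplicated key makes left_join() return more rows than edx_demo has
# (newer dplyr versions may also print a warning about this)
joined <- edx_demo %>% left_join(avg_demo, by = "movieId")
nrow(edx_demo)
nrow(joined)

# Keeping one row per movieId restores a one-to-one lookup
avg_unique <- avg_demo %>% distinct(movieId, .keep_all = TRUE)
joined_unique <- edx_demo %>% left_join(avg_unique, by = "movieId")
nrow(joined_unique)
```

If dupes comes back empty on your real data, the join is not multiplying rows and the memory error has another cause.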

This is what I get when I try to run the tmp code.

tmp <- edx %>%
left_join(avg_movie_rating, by='movieId')
Error: cannot allocate vector of size 68.7 Mb

RStudio accounts for 97% of my memory usage.
If you don't mind, could I please email you my full code? Could you tell me whether it runs?
I would truly appreciate the help.


But I don't get a plot. What must I do to get a plot? How do I modify the code?

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

In your latest example, it seems that the edx data frame does not have a column named n. Did you leave out the summarize step by mistake?

Please post a reproducible example, as requested by andresrcs, if you need more help. It is difficult to help you if we cannot work with the same data set you are using. It is a good idea to make a simplified data set to illustrate your problem. Please see the link provided earlier.

This is my code. At the bottom, the summarise() message pops up.
The plot is not created.

[Image: rsz_picture_1]