Adding unique rows in a table

I am trying to sum confirmed_cum over all rows that share the same date, in a file with a format like:

I need to bring it in to format like:

Date Confirmed_cum

25/01/2020 4
26/01/2020 4

I have got as far as the code below, but I am not sure which function to use, or how to sum the column confirmed_cum by the column date. Can anyone help?

covid <- read.csv(file = 'covid_au_state.csv')
ggplot(data = covid, aes(x =confirmed_cum , y = date)) +
geom_point(aes(color = confirmed))+ labs(x = 'Confirmed cases', y = 'date', title = 'Number of new confirmed cases daily throughout Australia')

I am also trying to filter the distinct dates:

covid_distinct_dates<- distinct(covid, covid$date)

Hello Maninder,

Welcome to the community. Happy to help if you can provide some data and the expected output (easiest if you just make a reprex: https://www.tidyverse.org/help/ )

Here is the code

covid <- read.csv(file = 'covid_au_state.csv')
dput(covid)
library(lubridate)
library(dplyr)
library(ggplot2)
covid %>%
mutate(date = dmy(date))
group_by(date) %>%
summarize(confirmed_cum = sum(confirmed_cum)) %>%
ggplot(aes(x =confirmed_cum , y = date)) +
geom_point(aes(color = confirmed)) +
labs(x = 'Confirmed cases', y = 'date',
title = 'Number of new confirmed cases daily throughout Australia')

I am getting an error:

> covid <- read.csv(file = 'covid_au_state.csv')
> dput(covid)
structure(list(date = c("25/01/2020", "26/01/2020", "27/01/2020"
), confirmed_cum = c(4L, 4L, 5L)), class = "data.frame", row.names = c(NA, 
-3L))
> library(lubridate)
> library(dplyr)
> library(ggplot2)
> covid %>%
+   mutate(date = dmy(date)) 
        date confirmed_cum
1 2020-01-25             4
2 2020-01-26             4
3 2020-01-27             5
> group_by(date) %>%       
+   summarize(confirmed_cum = sum(confirmed_cum)) %>% 
+   ggplot(aes(x =confirmed_cum , y = date)) +
+   geom_point(aes(color = confirmed)) + 
+   labs(x = 'Confirmed cases', y = 'date', 
+        title = 'Number of new confirmed cases daily throughout Australia')
Error in UseMethod("group_by_") : 
  no applicable method for 'group_by_' applied to an object of class "function"

Hi!
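You get the error because you are missing a pipe operator (%>%) between mutate(date = dmy(date)) and group_by(date). Without it, group_by(date) starts a new statement with no data frame piped in, so date is looked up as the function of that name, which is why the message says the method was applied to an object of class "function".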

If you use

covid %>%
    mutate(date = dmy(date))%>%
    group_by(date)%>%
    summarize(...)

you should get around that error.
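
For reference, here is a minimal sketch of the corrected pipeline on a small made-up data frame. The state column and the per-state numbers are assumptions for illustration (they are not from the real covid_au_state.csv); they are chosen so that each date sums to 4, as in the expected output in the question.

```r
library(dplyr)
library(lubridate)

# Hypothetical per-state rows; column "state" and all values are made up
covid <- data.frame(
  date = c("25/01/2020", "25/01/2020", "26/01/2020", "26/01/2020"),
  state = c("NSW", "VIC", "NSW", "VIC"),
  confirmed_cum = c(3L, 1L, 3L, 1L)
)

daily <- covid %>%
  mutate(date = dmy(date)) %>%                   # parse day/month/year strings into Date
  group_by(date) %>%                             # one group per calendar day
  summarize(confirmed_cum = sum(confirmed_cum))  # add up the rows within each day

daily
```

Every step is joined to the next by %>%, so group_by() receives the data frame rather than being evaluated on its own.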

Edit:
I would also recommend looking at the table you generate with

covid %>%
    mutate(date = dmy(date))%>%
    group_by(date)%>%
    summarize(confirmed_cum = sum(confirmed_cum))

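before you pipe it into the ggplot command. Based on the aesthetics you have specified there, it seems that you want a different table as the base for your plot: after summarize() only date and confirmed_cum remain, so color = confirmed refers to a column that is no longer there.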
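
A short sketch of that check, using the three rows from the dput() output posted above. Mapping the colour to confirmed_cum is only one possible choice, picked here because it is the column that survives the summary.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# The three rows from the dput() output earlier in the thread
covid <- data.frame(
  date = c("25/01/2020", "26/01/2020", "27/01/2020"),
  confirmed_cum = c(4L, 4L, 5L)
)

daily <- covid %>%
  mutate(date = dmy(date)) %>%
  group_by(date) %>%
  summarize(confirmed_cum = sum(confirmed_cum))

# Inspect the columns before plotting: only the grouping column
# and the summary column are left
names(daily)

# Use only columns that exist in the summarised table
p <- ggplot(daily, aes(x = date, y = confirmed_cum)) +
  geom_point(aes(color = confirmed_cum))
```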

Thanks for correcting.

I have switched to the confirmed column now, though. I have another question: how do we change the color of the top 3 values?

covid <- read.csv(file = 'covid_au_state.csv')
dput(covid)
library(lubridate)
library(dplyr)
library(ggplot2)
covid %>%
mutate(date = dmy(date)) %>%
group_by(date) %>%
summarize(confirmed = sum(confirmed))%>%
ggplot(aes(x =date , y = confirmed)) +
geom_point(aes(color = confirmed)) +
labs(x = 'Date', y = 'New Confirmed cases(every day)',
title = 'Total number of new confirmed cases daily throughout Australia')

I assume we use geom_point, but I am not sure of the correct implementation. Can you suggest something?

Okay. I used geom_point(aes(color = confirmed)) + points(x[y >= 600], y[y >= 600], pch = 4, col = "red", cex = 2) + labs, but it shows no change in the plot. Why?

If you could do a full reprex, it would help us to help you!

There's also a nice FAQ on how to do a minimal reprex for beginners, below:

I am not super familiar with the more advanced possibilities of ggplot2, but you could create a column in which you assign colours to the different points. If, for example, you want to draw all points above 600 in red and all others in green, you could do something like this:

covid %>% ... %>%
  mutate(colour = ifelse(confirmed > 600, "red", "green")) %>%
  ggplot(aes(x = date, y = confirmed)) +
  geom_point(aes(colour = colour), size = 5) +  # constant size belongs outside aes()
  scale_colour_identity()

Alternatively, you could order your data and give the top observations one colour and the other ones another.
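
A minimal sketch of that ranking idea, assuming a table with one row per date. The numbers are made up for illustration, apart from the peak of 717 on 5/08/2020 mentioned later in this thread, and rank_desc is just a helper column name chosen here.

```r
library(dplyr)
library(ggplot2)

# Made-up daily totals, for illustration only
daily <- data.frame(
  date = as.Date("2020-08-03") + 0:5,
  confirmed = c(429, 552, 717, 698, 445, 410)
)

coloured <- daily %>%
  mutate(
    rank_desc = min_rank(desc(confirmed)),            # 1 = largest value
    colour = ifelse(rank_desc <= 3, "red", "grey50")  # top 3 in red, rest in grey
  )

p <- ggplot(coloured, aes(x = date, y = confirmed)) +
  geom_point(aes(colour = colour), size = 5) +
  scale_colour_identity()  # use the colour names in the column as they are
```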

The problem with that approach is that you lose the continuous colour scale that you'd get from aes(colour = confirmed). But I don't know if there is any way to mix continuous and discrete plotting, or to overwrite the continuous plotting for selected observations.
So if anyone here has a good suggestion on how to do this, I am as eager as Maninder to hear about it.
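
One possible way to keep the continuous scale, offered as a sketch rather than a definitive answer: draw all points with aes(colour = confirmed), then add a second geom_point() layer that contains only the top rows and uses a fixed colour. The data are made up, and slice_max() assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(ggplot2)

# Made-up daily totals, for illustration only
daily <- data.frame(
  date = as.Date("2020-08-03") + 0:5,
  confirmed = c(429, 552, 717, 698, 445, 410)
)

# The three rows with the largest confirmed values, largest first
top3 <- slice_max(daily, confirmed, n = 3)

p <- ggplot(daily, aes(x = date, y = confirmed)) +
  geom_point(aes(colour = confirmed), size = 3) +  # continuous colour scale for all points
  geom_point(data = top3, colour = "red",          # fixed-colour ring around the top 3 only
             shape = 1, size = 6, stroke = 1.5)
```

Because the second layer sets colour outside aes(), it does not take part in the colour scale, so the legend for confirmed stays continuous.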

Okay, thanks @jms. I did try another way, using the gghighlight package:

covid <- read.csv(file = 'covid_au_state.csv')
dput(covid)
library(lubridate)
library(dplyr)
library(ggplot2)
library(gghighlight)
covid %>%
mutate(date = dmy(date)) %>%
group_by(date) %>%
summarize(confirmed = sum(confirmed))%>%
ggplot(aes(x =date , y = confirmed)) +
geom_line(aes(color = confirmed), size = 0.80, color= "blue")+
gghighlight(confirmed >= 610, label_key = confirmed, unhighlighted_params = list(colour="orange" ,size = .5)) +
labs(x = 'Date', y = 'New Confirmed cases(every day)',
title = 'Total number of new confirmed cases daily throughout Australia')

However, I was thinking of improving this visualization by redefining the x axis to highlight the dates of the 3 highest confirmed counts. Is that possible? If yes, please suggest some way to highlight the dates as well. I looked at many options, but all the functions I found are for continuous scales.

Do I understand you right in that you'd like to have something like this?

A graph that shows your counts and highlights the days below and above a certain threshold?

Yes, but I want to display the date on the x axis, say 5/08/2020, and the count on the y axis, confirmed = 717 (the highest). The user should be able to read the scale value for the highest points on both axes.

I hope you understand what I mean.

Unfortunately, I don't know how you could highlight certain values on the axes, except maybe by adding a straight line, e.g. geom_hline(yintercept = 717).
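
A rough sketch of the straight-line idea, extended with explicit axis breaks so that the peak date and the peak count are printed on the axes. The data are made up apart from the 717 on 5/08/2020 mentioned above, and the choice of the other break values is arbitrary.

```r
library(dplyr)
library(ggplot2)

# Made-up daily totals, for illustration only
daily <- data.frame(
  date = as.Date("2020-08-03") + 0:5,
  confirmed = c(429, 552, 717, 698, 445, 410)
)

# The single row with the highest count
peak <- slice_max(daily, confirmed, n = 1)

p <- ggplot(daily, aes(x = date, y = confirmed)) +
  geom_line(colour = "blue") +
  # dashed guide lines through the peak, taken from the peak row
  geom_hline(data = peak, aes(yintercept = confirmed), linetype = "dashed") +
  geom_vline(data = peak, aes(xintercept = date), linetype = "dashed") +
  # label the first date, the peak date and the last date on the x axis
  scale_x_date(breaks = c(min(daily$date), peak$date, max(daily$date)),
               date_labels = "%d/%m/%Y") +
  # add the peak count to a few round breaks on the y axis
  scale_y_continuous(breaks = c(400, 500, 600, peak$confirmed))
```

The same pattern should extend to the top 3 by passing a three-row table to the guide-line layers and adding all three dates and counts to the breaks, although closely spaced labels may overlap.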

If you still need help, it is probably best to create a new thread with those questions and include a full reprex, so that others that may be more familiar with ggplot2 can find it easily.