Plotting two data frames with different Y scales in the same line graph in R

Hello,
I have two graphs with the same X axis (Date) but with different Y axis:

df1: crime rate occurrence for 5 types of crime across 9 regions in England

structure(list(
  Date = c("2019-04", "2019-04", "2019-12", "2020-02", 
           "2019-09", "2019-10", "2020-05", "2020-07", "2019-07", "2019-05"), 
  Region = structure(c(7L, 1L, 3L, 7L, 7L, 7L, 3L, 7L, 1L, 7L), 
 .Label = c("South East", "South West", "London", "East of England", "East Midlands", "West 
            Midlands", "Yorkshire and The Humber", "North East", "North West"), class = "factor"), 
  Crime = c("Robbery", "Robbery", "Robbery", "Robbery", "Anti-social behaviour", 
            "Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
            "Robbery", "Anti-social behaviour")), 
 row.names = c(NA, -10L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 
 0x000001cfda8e1ef0>, index = integer(0))

df2: unemployment rate fluctuation across the same 9 regions (same time period too)

structure(list(
  Date = c("2020-02", "2019-12", "2019-10", "2020-10", "2020-06", "2019-11", "2019-07", "2020-05", 
           "2020-08", "2020-06"), 
  Region = structure(c(1L, 8L, 10L, 8L, 9L, 8L, 9L, 10L, 8L, 3L), 
  .Label = c("England", "South East", "South West", "London", "East of England", "East Midlands", 
             "West Midlands", "Yorkshire and The Humber", "North East", "North West"), 
  class = "factor"), 
  Unemployment.rate = c(4.04317280091498, 4.47398990035041, 3.99786361805527, 5.15177120913334, 
                      5.16820059074221, 4.34062792253313, 4.97071907922267, 3.79490967669574, 
                      4.16298001615593, 3.57267916967994)), 
  row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))

Desired output:

Plotting df2 line graph for each region on their respective region in df1 (with df2 Y axis scale on the right side of the graph)

df1 line graph:

enter image description here

df2 line graph:

enter image description here

The first graph in df2 represents the unemployment rate across England, it's not necessary to plot that on the df1 line graph.

Any help would be greatly appreciated, thanks in advance!

i recommend you convert your df1 data.table to a simple dataframe and try resharing, in its current form it is not readable by forum users.

.internal.selfref = <pointer: 
 0x000001cfda8e1ef0>

this fragment will cause parse errors, and it seems numeric information is absent and can not be used to make a chart like the first chart...

That said here are some examples of second axis

Thanks a lot for your quick reply. I merged the two data frames, and now I have df3; here's the data.table converted to a data frame:

structure(list(Region = c("West Midlands", "Yorkshire and The Humber", 
"West Midlands", "East Midlands", "Yorkshire and The Humber", 
"North East", "South East", "South West", "East Midlands", "South West"
), Month = c("2019-06", "2019-03", "2020-07", "2020-06", "2019-12", 
"2019-06", "2019-09", "2019-10", "2019-11", "2019-02"), Crime = c("Theft", 
"Burglary", "Anti-social behaviour", "Theft", "Violence and sexual offences", 
"Anti-social behaviour", "Burglary", "Burglary", "Burglary", 
"Anti-social behaviour"), Crime_occurrencies = c(11053L, 5001L, 
9015L, 3226L, 20977L, 7895L, 3713L, 1416L, 1393L, 6953L), Date = c("2019-01", 
"2019-05", "2020-08", "2019-02", "2019-07", "2019-11", "2020-09", 
"2020-04", "2020-07", "2019-03"), Unemployment.rate = c(0.201444838205937, 
4.45220611226308, 4.62628095567074, 4.24086209425895, 4.28827168334814, 
6.21086689536757, 3.21824018447441, 3.01193326177149, 4.38753537321373, 
2.69121673177636)), class = "data.frame", row.names = c(NA, -10L
))

Thanks for the link, but I am struggling with overlaying the linegraphs more than anything. I am trying to analyse any correlation between unemployment rate and crime occurrencies over time across 9 regions, and I can't seem to plot the unemployment rate for each region on the crime data faceted graph (df1 line graph).

I'm wondering if the join to make df3 caused issues relating to date/month as you previously had date fields, but now df3 has both date and month field, and they are not matching

One refers to the dates that were previously in the Crime dataset, and the other refers to the dates in the previous Unemployment dataset.

There is only one "Unemployment date" for each region for each month. However, the "Crime date" is repeated for every single reported crime, for every region (i.e. if there are three Theft crimes reported in the same month, even in they same region, they will appear three times in the data frame. Hence why there is a much higher number of dates under that column compared to "Unemployment dates"

Before merging, there were 8 million rows in the Crime dataset, and 260 in the Unemployment one due to the high number of dates repeated in the Crime dataset. The merged data frame has about 27k.

im not sure that your explanation makes sense when looked at in the context of what you have provided.
I.e. understanding that multiple crime events occur in a given month, yet it appears that you have aggregated, as each crime is not a single crime but rather a large Crime_occurrencies of so many thousands of events.
I propose we back up a step and think carefully about how to make df3 from df1 and df2.
Or else. I could steam ahead to providing you a plot function for this df3, but the results likely wont make much sense as the date confusion is a problem from my point of view.

First of all, thank you for taking the time to bear with my ignorance.

After merging the data, this df3 was born:

image

  • "Month" refers to when a type of crime was committed, and how many times it was committed that month (Crime_occurrencies) in a specific Region.
  • "Date" refers to the Unemployment rate, i.e. how it was on that Date(or Month), in a specific Region.

Since there are 5 types of crimes, even after aggregating them, the number of (Crime)Months are higher than (Unemployment rate)Dates.

So on the same Date/Month, e.g. 01-2019, I can have up to 5 rows for each region. I.e., one row for each one of the five types of crime in the dataset. As a result, the same Unemployment rate will be repeated an equal amount of times.

Does it make sense?

I was suggested to aggregate everything using

df3<- df2 %>%

count(Region, Month, Crime, name = 'Crime_occurrencies') %>% full_join(df1 %>% filter(Region != "England"))

I didn't need "England" from the Unemployment dataset since there's no "England" data for crime rates in my Crime dataset.

I would say that its probably a mistake to use full_join without specifying the by= parameter.
If you can provide df1 and df2 I can have a go at creating df3 if you wish

That would be extremely helpful, thanks in advance!

Crime dataset sample (sorry I can't seem to be able to format it more clearly):

structure(list(Month = c("2019-10", "2019-07", "2020-07", "2019-02", 
"2020-09", "2019-10", "2019-12", "2019-11", "2019-03", "2019-08"
), Falls.within = c("Metropolitan Police Service", "Metropolitan Police Service", 
"Dorset Police", "West Midlands Police", "Sussex Police", "Suffolk Constabulary", 
"Lincolnshire Police", "City of London Police", "West Mercia Police", 
"Metropolitan Police Service"), Region = c("London", "London", 
"South West", "West Midlands", "South East", "East of England", 
"East Midlands", "London", "West Midlands", "London"), Crime = c("Theft", 
"Anti-social behaviour", "Theft", "Theft", "Theft", "Violence and sexual offences", 
"Violence and sexual offences", "Anti-social behaviour", "Anti-social behaviour", 
"Theft")), class = "data.frame", row.names = c(NA, -10L))

Unemployment dataset (same as above):

structure(list(Date = c("2019-10", "2019-07", "2020-07", "2019-02", 
"2020-09", "2019-10", "2019-12", "2019-11", "2019-03", "2019-08"
), Region = structure(c(6L, 8L, 1L, 7L, 2L, 8L, 7L, 5L, 2L, 9L
), .Label = c("England", "South East", "South West", "London", 
"East of England", "East Midlands", "West Midlands", "Yorkshire and The Humber", 
"North East", "North West"), class = "factor"), Unemployment.rate = c(4.63665149464923, 
4.28827168334814, 4.00444047276053, 4.31748261760943, 2.99042563020473, 
3.75797094178592, 4.80322174913054, 4.2511880639675, 3.259676887651, 
6.1811363458787)), class = "data.frame", row.names = c(NA, -10L
))

I'm not sure how to produce two samples with matching dates, so I ended up copy-pasting df1 structure's dates into df2 structure.

the data provided was too sparse for me to make a plottable test, so I took the information in the tables provided and randomised data from them to make sufficient data to be 'plottable'

Unfortunately that means a great deal of the code at the beginning is an invention of df1 and df2, but then I have some useful code, about the join, pivot. calculating a linear transform of the second scale to the first scale and back so that the unemployed data will be plotted at a good height

df1_stub<-structure(list(Month = c("2019-10", "2019-07", "2020-07", "2019-02", 
                         "2020-09", "2019-10", "2019-12", "2019-11", "2019-03", "2019-08"
), Falls.within = c("Metropolitan Police Service", "Metropolitan Police Service", 
                    "Dorset Police", "West Midlands Police", "Sussex Police", "Suffolk Constabulary", 
                    "Lincolnshire Police", "City of London Police", "West Mercia Police", 
                    "Metropolitan Police Service"), Region = c("London", "London", 
                                                               "South West", "West Midlands", "South East", "East of England", 
                                                               "East Midlands", "London", "West Midlands", "London"), Crime = c("Theft", 
                                                                                                                                "Anti-social behaviour", "Theft", "Theft", "Theft", "Violence and sexual offences", 
                                                                                                                                "Violence and sexual offences", "Anti-social behaviour", "Anti-social behaviour", 
                                                                                                                                "Theft")), class = "data.frame", row.names = c(NA, -10L))

df2_stub <- structure(list(Date = c("2019-10", "2019-07", "2020-07", "2019-02", 
                        "2020-09", "2019-10", "2019-12", "2019-11", "2019-03", "2019-08"
), Region = structure(c(6L, 8L, 1L, 7L, 2L, 8L, 7L, 5L, 2L, 9L
), .Label = c("England", "South East", "South West", "London", 
              "East of England", "East Midlands", "West Midlands", "Yorkshire and The Humber", 
              "North East", "North West"), class = "factor"), Unemployment.rate = c(4.63665149464923, 
                                                                                    4.28827168334814, 4.00444047276053, 4.31748261760943, 2.99042563020473, 
                                                                                    3.75797094178592, 4.80322174913054, 4.2511880639675, 3.259676887651, 
                                                                                    6.1811363458787)), class = "data.frame", row.names = c(NA, -10L
                                                                                    ))
d1 <- summarise_all(tibble(df1_stub),~list(unique(.)))
d2 <- summarise_all(tibble(df2_stub) %>% 
                      mutate_if(is.factor,as.character)
                    ,~list(unique(.))) 
set.seed(42)
num_rows_to_invent <-100000
df1 <- data.frame(Month=sample(seq.Date(from=as.Date("2019-02-01"),
                                        to = as.Date("2020-07-01"),
                                        by = "1 months"),size = num_rows_to_invent,replace = TRUE),
                  Region=sample(unlist(d1$Region),size = num_rows_to_invent,replace = TRUE),
                  Crime=sample(unlist(d1$Crime),size = num_rows_to_invent,replace = TRUE))

df2 <- data.frame(Date=sample(seq.Date(from=as.Date("2019-02-01"),
                                       to = as.Date("2020-07-01"),
                                       by = "1 months"),size = num_rows_to_invent,replace = TRUE),
                  Region=sample(unlist(d1$Region),size = num_rows_to_invent,replace = TRUE),
                  ur=sample(unlist(d2$Unemployment.rate),size = num_rows_to_invent,replace = TRUE)) %>% 
  group_by(Date,Region) %>% slice(1) %>% ungroup


library(tidyverse)
df3<- df1 %>%
  count(Region, Month, Crime, name = 'Crime_occurrencies') %>% 
  full_join(df2, by=c("Region","Month"="Date"))  %>% filter(Region != "England") 

range(df3$Crime_occurrencies,na.rm = TRUE)
# [1] 249 623
range(df3$ur,na.rm = TRUE)
# [1] 2.990426 6.181136
(c(249,623) - 249)
1/374 * (6.181136 - 2.990426) 
#0.00853131
#therefore
(c(249,623) - 249) *  0.00853131 + 2.990426
# and the opposite way
(c(2.990426, 6.181136 )- 2.990426) /0.00853131 +249

df4 <- pivot_longer(df3 %>% mutate(ur_falsescaled=(ur - 2.990426) /0.00853131 +249),
                    cols=c(Crime_occurrencies ,ur_falsescaled)) %>% mutate(
                      Crime=case_when(name=='ur_falsescaled' ~ "NA",
                                       TRUE ~ Crime)
                    ) %>% distinct()


p <- ggplot(df4 , aes(x=Month, y=value,colour=Crime,linetype=name)) + facet_wrap(~Region)+
  geom_line()

# Create a simple secondary axis
p + scale_y_continuous(sec.axis = sec_axis(~ (.- 249) *  0.00853131 + 2.990426))

That's amazing, thank you so much for your help!

I tried running that code using my datasets but I get this error 9 times (one per region I guess) when I try plotting df4:

"geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?"

You've already done a lot to help me out but I don't want to waste any more of your time, I don't think I am able to understand how this all works at the moment.

Thanks again for your help, I really appreciate it.

I think the best way myself or other forum users can proceed to help, is if we now see the 'true' df4 you are working with, I should think this is small enough to share in its entirety ?

I used the code from the full_join bit onwards, and the plot looks like this:

  joined_data<- crime_data %>%
  count(Region, Month, Crime, name = 'Crime_occurrencies') %>% 
  full_join(urate_data, by=c("Region","Month"="Date"))  %>% filter(Region != "England")

 df4 <- pivot_longer(joined_data %>% mutate(ur_falsescaled=(ur - 2.990426) /0.00853131 +249),
                    cols=c(Crime_occurrencies ,ur_falsescaled)) %>% mutate(
                      Crime=case_when(name=='ur_falsescaled' ~ "NA",
                                      TRUE ~ Crime)
                    ) %>% distinct()

#At this point I get a lot of NA and "ur_falsescaled" rows, which I removed from the dataframe. I would get the same visualisation even if I kept the NA rows, the only difference would be one more graph labeled "NA" and more NA in the legend.

p <- ggplot(df4 , aes(x=Month, y=value,colour=Crime,linetype=name)) + facet_wrap(~Region)+
  geom_line()

# Create a simple secondary axis
p + scale_y_continuous(sec.axis = sec_axis(~ (.- 249) *  0.00853131 + 2.990426))```

I only have my 'version' of your data and code, that I shared in full, if you ran that as I pasted it you will see that although the numbers are nonsense, it does chart., and makes no complaints in the console.


Therefore the assumption is that with similarly structured and typed data your df4 would plot as well (with accurate numbers) however based on your response this is not the case, yet you havent provided a way for me to help you with that...
I cannot debug your code/data issues which is based on data I don't have. sorry.

Yeah I don't understand why it doesn't work with my datasets, sorry.
Thanks anyway for your help and sorry for wasting your time!

You don't have to but, Im posting just to remind you that you have the option to use dput(df4) to share it to the forum

Sure! Here's what comes out of it:

structure(list(Region = c("North West", "West Midlands", "South East", 
"London", "London", "South East", "London", "South East", "North West", 
"North East", "Yorkshire and The Humber", "South West", "East of England", 
"East Midlands", "West Midlands", "London", "South West", "South West", 
"West Midlands", "South East"), Month = c("2019-01", "2019-10", 
"2020-06", "2019-04", "2020-07", "2019-03", "Change", "2019-08", 
"2019-09", "2020-04", "2019-09", "2020-08", "Avg.Covid", "2019-09", 
"2019-08", "2019-03", "2019-08", "2019-08", "2019-06", "2020-10"
), Crime = c("Theft", "Robbery", "Theft", "Anti-social behaviour", 
"Robbery", "Anti-social behaviour", NA, "Violence and sexual offences", 
"NA", "Violence and sexual offences", "Anti-social behaviour", 
"NA", "NA", "Anti-social behaviour", "NA", "Theft", "Violence and sexual offences", 
"Burglary", "Robbery", "Theft"), Unemployment.rate = c(3.60769292318816, 
4.4653881330391, 3.25568537435778, 4.2765275404374, 4.98077486925899, 
3.18049050596868, 0.767920037443671, 3.08966319084375, 4.19768568734536, 
5.20229523339659, 3.90122545742185, 4.13354432440967, 3.93905085573782, 
4.46358064141772, 3.98279027057451, 4.36960549219723, 2.38897367671231, 
2.38897367671231, 4.56621383409869, 3.86912441545878), name = c("Crime_occurrencies", 
"Crime_occurrencies", "Crime_occurrencies", "Crime_occurrencies", 
"Crime_occurrencies", "Crime_occurrencies", "Crime_occurrencies", 
"Crime_occurrencies", "ur_falsescaled", "Crime_occurrencies", 
"Crime_occurrencies", "ur_falsescaled", "ur_falsescaled", "Crime_occurrencies", 
"ur_falsescaled", "Crime_occurrencies", "Crime_occurrencies", 
"Crime_occurrencies", "Crime_occurrencies", "Crime_occurrencies"
), value = c(14929, 913, 10945, 23632, 2178, 14461, NA, 23407, 
390.50929779194, 5445, 17433, 382.990949151967, 360.193340265191, 
6454, 365.320268584134, 32942, 7663, 1359, 843, 13794)), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))```

ok, theres definitely things to fix in here, to not overwhelm you lets cover one aspect at a time.
The first thing I notice is that the Month field is a character field (not a date type) and it contains data other than only 2019-08 etc but also 'Avg.Covid' and 'Change'

There are two ways to go about this.

  1. simply omiting from df4 what we dont want by use of filter, and then mutate the true month character string to date types with perhaps lubridate package
  2. understanding how the incorrect month entries have snuck in there and ensuring they dont propogate to df4 in the first place

How would you prefer to proceed ?

Oh yeah I created the "Avg. Covid", "Change" and "Pre Covid" columns in the Excel file (the original dataset) to check how the unemployment rate changed from the start of the pandemic. Whenever I plotted that dataframe in ggplot, I just ignored those rows by selecting the rows without them (i.e. [1:220, ]).

The reason why the Month column is a "character " class is because I read that as.Date does not work unless the date follows a format that includes the day, which I don't have. So I used strftime() to transform Excel dates to "readable" dates in R.

I checked lubridate and zoo packages too to work with dates, but when I converted the Month column to a date format by adding a day to the current format (which is %m-%Y) and I tried to import it in the data frame, it imported it as random numbers; just like it was when i first imported the original dataset from Excel and all the dates looked like random numbers.

So I sticked with strftime which did the job even though it kept the Month column as "character" class.

I am guessing I need to find a way to convert that Month column to a "date" type without R changing them to random numbers? I.e. option 2

The date part is straightforward try this

library(lubridate)

(df5 <- filter(df4,
              startsWith(Month,"2")) %>% 
        mutate(Month = lubridate::ymd(paste0(Month,"-01"))))

arrange(df5,
        Region,Crime,Month)

unfortunately this shows an underlying problem with your data if you desire to draw lines, there simply isnt enough data.

# A tibble: 18 x 6
   Region             Month      Crime                 Unemployment.ra~ name           value
   <chr>              <date>     <chr>                            <dbl> <chr>          <dbl>
 1 East Midlands      2019-09-01 Anti-social behaviour             4.46 Crime_occurr~  6454 
 2 London             2019-04-01 Anti-social behaviour             4.28 Crime_occurr~ 23632 
 3 London             2020-07-01 Robbery                           4.98 Crime_occurr~  2178 
 4 London             2019-03-01 Theft                             4.37 Crime_occurr~ 32942 
 5 North East         2020-04-01 Violence and sexual ~             5.20 Crime_occurr~  5445 
 6 North West         2019-09-01 NA                                4.20 ur_falsescal~   391.
 7 North West         2019-01-01 Theft                             3.61 Crime_occurr~ 14929 
 8 South East         2019-03-01 Anti-social behaviour             3.18 Crime_occurr~ 14461 
 9 South East         2020-06-01 Theft                             3.26 Crime_occurr~ 10945 
10 South East         2020-10-01 Theft                             3.87 Crime_occurr~ 13794 
11 South East         2019-08-01 Violence and sexual ~             3.09 Crime_occurr~ 23407 
12 South West         2019-08-01 Burglary                          2.39 Crime_occurr~  1359 
13 South West         2020-08-01 NA                                4.13 ur_falsescal~   383.
14 South West         2019-08-01 Violence and sexual ~             2.39 Crime_occurr~  7663 
15 West Midlands      2019-08-01 NA                                3.98 ur_falsescal~   365.
16 West Midlands      2019-06-01 Robbery                           4.57 Crime_occurr~   843 
17 West Midlands      2019-10-01 Robbery                           4.47 Crime_occurr~   913 
18 Yorkshire and The~ 2019-09-01 Anti-social behaviour             3.90 Crime_occurr~ 17433 

You have only a single fact about Yorkshire and the humber. a line cant be drawn for it, perhaps you can draw a point. and strangely the unemployment rates arent available throughout, but only for one moment of the west midlands , one for south west , one month of north west. this seems unlikely