Can't manage to perform a join or rename a column,

airlie · August 6, 2020, 5:04pm

Another newbie question, where i guess i'm missing something obvious but ive looked for so long i can't see the wood from the trees
I would like to join two csv files together - hb_join and deaths_join and rename a column "HB" in the hb_join to the same as deaths_join column "HBR" and to also rename "country" to "CA"
I've tried to do this through rename() and also through the join but fail on every try and i'm now pulling my hair out over it. Any help to point out the obvious would be fantastic

# read in health board dataset and read in deaths rates
deaths_join = read.csv("ui_deaths_2020.csv")
hb_join = read.csv("hb_lookup.csv")

hb_join %>%
_join(deaths_join, by = c("HB" = "hbr", "Country" = "CA"))

Also tried on it's own

hb_join %>%
rename(HB = hbr, Country = CA)

Error: Can't rename columns that don't exist. x Column hbr doesn't exist.

Thank you in anticipation

siddharthprabhu · August 6, 2020, 5:34pm

Can you share a few sample records from deaths_join and hb_join? Preferably in an easy-to-copy format by using dput() or any similar alternative.

airlie · August 6, 2020, 6:56pm

Sorry tried and failed with dput()
is this glimpse() any good???

glimpse(deaths_join)

Rows: 211,986
Columns: 14
Year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2... HBR "S92000003", "S92000003", "S92000003", "S92000003", "S9...
HBRQF <chr> "d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d", ... CA "S92000003", "S92000003", "S92000003", "S92000003", "S9...
CAQF <chr> "d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d", ... AgeGroup "All", "All", "All", "All", "All", "All", "All", "All",...
AgeGroupQF <chr> "d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d", ... Sex "All", "All", "All", "All", "All", "All", "All", "All",...
SexQF <chr> "d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d", ... InjuryLocation "All", "All", "All", "All", "All", "All", "All", "All",...
InjuryLocationQF <chr> "d", "d", "d", "d", "d", "d", "d", "d", "d", "", "", ""... InjuryType "All", "Land transport accidents", "Poisoning", "Falls"...
InjuryTypeQF <chr> "d", "", "", "", "", "", "", "", "", "d", "", "", "", "... NumberOfDeaths

siddharthprabhu · August 6, 2020, 7:11pm

Since your data frame is quite large, you'll have to pass a subset of rows. Can you try dput(head(deaths_join)) instead?

airlie · August 6, 2020, 7:15pm

I have somehow managed to update column names and so now i'm just working on join. The only thing i have changed is to stop for a while and restart R and now seems to work?!
structure(list(Year = c(2007, 2007, 2007, 2007, 2007, 2007),
HBR = c("S92000003", "S92000003", "S92000003", "S92000003",
"S92000003", "S92000003"), HBRQF = c("d", "d", "d", "d",
"d", "d"), CA = c("S92000003", "S92000003", "S92000003",
"S92000003", "S92000003", "S92000003"), CAQF = c("d", "d",
"d", "d", "d", "d"), AgeGroup = c("All", "All", "All", "All",
"All", "All"), AgeGroupQF = c("d", "d", "d", "d", "d", "d"
), Sex = c("All", "All", "All", "All", "All", "All"), SexQF = c("d",
"d", "d", "d", "d", "d"), InjuryLocation = c("All", "All",
"All", "All", "All", "All"), InjuryLocationQF = c("d", "d",
"d", "d", "d", "d"), InjuryType = c("All", "Land transport accidents",
"Poisoning", "Falls", "Struck by, against", "Crushing"),
InjuryTypeQF = c("d", NA, NA, NA, NA, NA), NumberOfDeaths = c(1289,
294, 63, 658, 5, 1)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))

siddharthprabhu · August 7, 2020, 4:31am

Thank you. Can you similarly also provide the output of dput(head(hb_join))? Then we can attempt the join and see what's going wrong.

vinaychuri · August 7, 2020, 4:57am

Pls check the column names since its likely that the column names may have been updated to hbr.x,hbr.y post the "join" function if the same column names exists in source & destination.

airlie · August 7, 2020, 9:23am

Thank you so much for your help. I've gone back to original data in case as Vinay suggested i changed column names and caused more confusion, but to no avail.

(head(hb_join))

HB

HBName

HBDateEnacted

HBDateArchived

Country

1 S08000015 NHS Ayrshire and Arran 20140401 NA S92000003
2 S08000016 NHS Borders 20140401 NA S92000003
3 S08000017 NHS Dumfries and Galloway 20140401 NA S92000003
4 S08000018 NHS Fife 20140401 20180201 S92000003
5 S08000019 NHS Forth Valley 20140401 NA S92000003
6 S08000020 NHS Grampian 20140401 NA S92000003
6 rows

airlie · August 7, 2020, 9:34am

Sorry i realised i used the wrong file!
Correct as below with column names renamed in hb-clean to show the same as Raw_deaths - HBR and CA columns

(head(hb_clean))

HBR

HBName

HBDateEnacted

HBDateArchived

CA

1 S08000015 NHS Ayrshire and Arran 20140401 NA S92000003
2 S08000016 NHS Borders 20140401 NA S92000003
3 S08000017 NHS Dumfries and Galloway 20140401 NA S92000003
4 S08000018 NHS Fife 20140401 20180201 S92000003
5 S08000019 NHS Forth Valley 20140401 NA S92000003
6 S08000020 NHS Grampian 20140401 NA S92000003
6 rows

siddharthprabhu · August 7, 2020, 9:42am

Please share the output of dput(head(hb_join)), not head(hb_join).

I would suggest that you don't manually modify your data before import as it's better to handle it in the script. It's also getting confusing for me as to what your original data looks like.

airlie · August 7, 2020, 9:55am

I couldn’t manage to get the column names to copy with the rest of the info so hope this image lets you see. This is original files with no changes
Thank you so much for helping me

siddharthprabhu · August 7, 2020, 10:24am

Close enough. I'll use the names as shown in your screenshot for the tibbles.

I'll demonstrate the syntax by performing a left_join(). You can swap it for inner_join() or any other join depending on your requirements.

library(dplyr, warn.conflicts = FALSE)

hb_lookup_2020 <- tribble(
  ~HB, ~HBName, ~HBDateEnacted, ~HBDateArchived, ~Country,
  "S08000015", "NHS Ayrshire and Arran", 20140401, NA, "S92000003",
  "S08000016", "NHS Borders", 20140401, NA, "S92000003",
  "S08000017", "NHS Dumfries and Galloway", 20140401, NA, "S92000003",
  "S08000018", "NHS Fife", 20140401, 20180201, "S92000003",
  "S08000019", "NHS Forth Valley", 20140401, NA, "S92000003",
  "S08000020", "NHS Grampian", 20140401, NA, "S92000003"
)

ui_deaths_2020 <- structure(list(
  Year = c(2007, 2007, 2007, 2007, 2007, 2007),
  HBR = c("S92000003", "S92000003", "S92000003", "S92000003", "S92000003", "S92000003"),
  HBRQF = c("d", "d", "d", "d", "d", "d"),
  CA = c("S92000003", "S92000003", "S92000003", "S92000003", "S92000003", "S92000003"),
  CAQF = c("d", "d", "d", "d", "d", "d"),
  AgeGroup = c("All", "All", "All", "All", "All", "All"),
  AgeGroupQF = c("d", "d", "d", "d", "d", "d"),
  Sex = c("All", "All", "All", "All", "All", "All"),
  SexQF = c("d", "d", "d", "d", "d", "d"), InjuryLocation = c("All", "All", "All", "All", "All", "All"),
  InjuryLocationQF = c("d", "d", "d", "d", "d", "d"),
  InjuryType = c("All", "Land transport accidents", "Poisoning", "Falls", "Struck by, against", "Crushing"),
  InjuryTypeQF = c("d", NA, NA, NA, NA, NA),
  NumberOfDeaths = c(1289, 294, 63, 658, 5, 1)
),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame")
)

hb_lookup_2020 %>% 
  left_join(ui_deaths_2020, by = c("HB" = "HBR", "Country" = "CA"))
#> # A tibble: 6 x 17
#>   HB    HBName HBDateEnacted HBDateArchived Country  Year HBRQF CAQF  AgeGroup
#>   <chr> <chr>          <dbl>          <dbl> <chr>   <dbl> <chr> <chr> <chr>   
#> 1 S080~ NHS A~      20140401             NA S92000~    NA <NA>  <NA>  <NA>    
#> 2 S080~ NHS B~      20140401             NA S92000~    NA <NA>  <NA>  <NA>    
#> 3 S080~ NHS D~      20140401             NA S92000~    NA <NA>  <NA>  <NA>    
#> 4 S080~ NHS F~      20140401       20180201 S92000~    NA <NA>  <NA>  <NA>    
#> 5 S080~ NHS F~      20140401             NA S92000~    NA <NA>  <NA>  <NA>    
#> 6 S080~ NHS G~      20140401             NA S92000~    NA <NA>  <NA>  <NA>    
#> # ... with 8 more variables: AgeGroupQF <chr>, Sex <chr>, SexQF <chr>,
#> #   InjuryLocation <chr>, InjuryLocationQF <chr>, InjuryType <chr>,
#> #   InjuryTypeQF <chr>, NumberOfDeaths <dbl>

^{Created on 2020-08-07 by the reprex package (v0.3.0)}

Note: When you perform a join only the key columns from the left table are retained by default i.e. HB and Country from hb_lookup_2020 in this case (which I believe is what you want).

You can follow this up with a rename() statement if you wish to change the names. For example, if you prefer the names of the keys in the right table, you could do the following.

hb_lookup_2020 %>% 
  left_join(ui_deaths_2020, by = c("HB" = "HBR", "Country" = "CA")) %>% 
  rename(HBR = HB, CA = Country)
# A tibble: 6 x 17
  HBR   HBName HBDateEnacted HBDateArchived CA     Year HBRQF CAQF  AgeGroup AgeGroupQF Sex   SexQF InjuryLocation InjuryLocationQF
  <chr> <chr>          <dbl>          <dbl> <chr> <dbl> <chr> <chr> <chr>    <chr>      <chr> <chr> <chr>          <chr>           
1 S080~ NHS A~      20140401             NA S920~    NA NA    NA    NA       NA         NA    NA    NA             NA              
2 S080~ NHS B~      20140401             NA S920~    NA NA    NA    NA       NA         NA    NA    NA             NA              
3 S080~ NHS D~      20140401             NA S920~    NA NA    NA    NA       NA         NA    NA    NA             NA              
4 S080~ NHS F~      20140401       20180201 S920~    NA NA    NA    NA       NA         NA    NA    NA             NA              
5 S080~ NHS F~      20140401             NA S920~    NA NA    NA    NA       NA         NA    NA    NA             NA              
6 S080~ NHS G~      20140401             NA S920~    NA NA    NA    NA       NA         NA    NA    NA             NA              
# ... with 3 more variables: InjuryType <chr>, InjuryTypeQF <chr>, NumberOfDeaths <dbl>

Hope that helps.

airlie · August 7, 2020, 1:58pm

Thank you so much for your help

system · August 14, 2020, 1:58pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.