How to merge two datasets together when one set has multiple instances of something I want to merge

Hello!

I have a project that I am working on (for my Gov 1005 Data Science class) where I am trying to map the population density of Korean individuals in various counties of the U.S. I have the county lat long data from map_data(county), and I have the population data by county/state in two separate files. But the county lat long data I have has several occurrences of counties and states because it needs to put every single coordinate on the map to make the outline of the shape.

My population data on the other hand only has one instance per county.

I've tried using all the various joins in r, but I can't seem to make it correctly join together the way I need it to! Right now, it only joins together baltimore county in Maryland and a random county in Nevada. How do I fix this?

Thank you so much!
My code is below:

#Want to change maps to show by county how many koreans live in each county
population_korea<- read_csv("data/Copy of DEC_10_SF1_PCT7_with_ann.csv") %>% 
  clean_names()
#> Error in read_csv("data/Copy of DEC_10_SF1_PCT7_with_ann.csv") %>% clean_names(): could not find function "%>%"


#HAS EVERTHING>> including alaska and hawaii
population_korea<- population_korea%>% 
  dplyr::select(geo_display_label,d012)
#> Error in population_korea %>% dplyr::select(geo_display_label, d012): could not find function "%>%"

population_korea<- population_korea %>% 
  separate(geo_display_label, into = c("subregion", "region"), sep = ", ") %>% 
  #split the geography column into county and state
  na.exclude() %>% 
  filter(region != c("alaska", "hawaii"))#county doesn't have the shape files
#> Error in population_korea %>% separate(geo_display_label, into = c("subregion", : could not find function "%>%"

population_korea <- population_korea[c(2,1,3)]
#> Error in eval(expr, envir, enclos): object 'population_korea' not found
population_korea$subregion <- iconv((population_korea$subregion), to='ASCII//TRANSLIT')
#> Error in iconv((population_korea$subregion), to = "ASCII//TRANSLIT"): object 'population_korea' not found
#Take out all the accents

population_korea$region <- tolower(population_korea$region)#change to all lowercase
#> Error in tolower(population_korea$region): object 'population_korea' not found
population_korea$subregion <- tolower(population_korea$subregion)#change to all lowercase
#> Error in tolower(population_korea$subregion): object 'population_korea' not found


stopwords = c("county")
x  = population_korea$subregion        #Company column data
#> Error in eval(expr, envir, enclos): object 'population_korea' not found
x  =  removeWords(x,stopwords)     #Remove stopwords
#> Error in removeWords(x, stopwords): could not find function "removeWords"

population_korea$subregion <- x  
#> Error in eval(expr, envir, enclos): object 'x' not found

county<- map_data("county") %>% 
  mutate(d012 = "0") %>% 
  group_by(subregion) %>% 
  merge(population_korea, c("region","subregion"))
#> Error in map_data("county") %>% mutate(d012 = "0") %>% group_by(subregion) %>% : could not find function "%>%"

Created on 2019-10-30 by the reprex package (v0.3.0)

Hi @gkim65!

Preliminarily, see the Homework Policy -- I'm guessing from milestone-4.Rmd that this might be such.

Second, a reprex with data is really, really. Helpful. It's only by good luck that I found DEC_10_SF1_PCT7_with_ann.csv. And library(janitor) is needed to access clean_names. Not to mention dplyr to give you %>%

OK, enough preaching.

Your problem will be more tractable if you focus on the three variables needed to calculate population density by county

  • identifier for county
  • its area (or total population if you mean percentage of population classified as Korean)
  • the Korean population

99%+ plus of your population_korea data frame isn't needed for that. It has data for all population categories, a huge range of demographic characteristics, other than population, and the long/lat isn't needed unless you plan to do mapping.

population_korea$GEO.display.label contains county and state names. There's your identifier.

One of the HDxx-Sxxx contains total population by county/state and one contains the Korean population by county/state.

To figure out which, you'll need to do some digging. See the asc and tidycensus for resources to track those down.

Once you have those,

my_reduced_df <- population_korea %>% select(GEO.display.label, HDxx-Sxxx, HDyy-Syy)

If you do need mapping, the simplest way is an sf data frame with FIPS codes for county/states. You'll need to go back and line up your county/states in population_korea and then do an inner_join.

1 Like

Hello @technocrat!

Thank you for taking the time to answer my question. I already had filtered all of my data from the .csv file, so it looks like:

I had the column names match region and subregion similar to the county map data that I want to merge it with; is it possible to merge these two together, so that for every row for each county (baltimore city for example has 15 rows that I need to apply the d012(the population data) 2404 to)?

Thank you!

Can you post an reprex of the poulation_korea data_frame that you're working with? I didn't see in your example how it ended up. And are you eventually mapping?