combining datasets with different countries and rows

dplyr
#1

Hi all,

I want to combine two datasets and I tried to do this with merge, join, c/r bind. However, all give the same problem. The values from dataset 2 will be all NAs.

dat1
Country.Name year GDP enrolment Expenditure
1 AFGHANISTAN YR1997
2 AFGHANISTAN YR1998 1046338
3 AFGHANISTAN YR1999
4 AFGHANISTAN YR2000 749360
5 AFGHANISTAN YR2001 773623
6 AFGHANISTAN YR2002 184.494712122049 2667629

dat2
Country.Name year year.inschool
1 AFGHANISTAN YR1997 4.9
194 AFGHANISTAN YR1998 5.2
387 AFGHANISTAN YR1999 5.5
580 AFGHANISTAN YR2000 5.9
773 AFGHANISTAN YR2001 6.2
966 AFGHANISTAN YR2002 6.5

The problem is that dat1 consists of over +- 5500 rows and dat2 of +- 4000 rows. A second problem is that not all countries in dat1 are in dat2 and some countries have different names such as The Bahamas and Bahamas. Can anyone help me, so that the values of dat2 (year.inschool) can be added tot dat1 in that the values are available instead of making them all NA.

Thanks

0 Likes

#2

I'm not sure, but are you looking for something like this?

(dat1 <- data.frame(stringsAsFactors=FALSE,
                    Country.Name = c("AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN"),
                    year = c("YR1997", "YR1998", "YR1999", "YR2000", "YR2001", "YR2002"),
                    GDP = c(NA, 1046338, NA, 749360, 773623, 184.494712122049),
                    enrolment = c(NA, NA, NA, NA, NA, 2667629),
                    Expenditure = c(NA, NA, NA, NA, NA, NA)))
#>   Country.Name   year          GDP enrolment Expenditure
#> 1  AFGHANISTAN YR1997           NA        NA          NA
#> 2  AFGHANISTAN YR1998 1046338.0000        NA          NA
#> 3  AFGHANISTAN YR1999           NA        NA          NA
#> 4  AFGHANISTAN YR2000  749360.0000        NA          NA
#> 5  AFGHANISTAN YR2001  773623.0000        NA          NA
#> 6  AFGHANISTAN YR2002     184.4947   2667629          NA

(dat2 <- data.frame(stringsAsFactors=FALSE,
                    Country.Name = c("AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN"),
                    year = c("YR1997", "YR1998", "YR1999", "YR2000", "YR2001", "YR2002"),
                    year.inschool = c(4.9, 5.2, 5.5, 5.9, 6.2, 6.5)))
#>   Country.Name   year year.inschool
#> 1  AFGHANISTAN YR1997           4.9
#> 2  AFGHANISTAN YR1998           5.2
#> 3  AFGHANISTAN YR1999           5.5
#> 4  AFGHANISTAN YR2000           5.9
#> 5  AFGHANISTAN YR2001           6.2
#> 6  AFGHANISTAN YR2002           6.5

(dat <- merge(x = dat1,
              y = dat2,
              all = TRUE))
#>   Country.Name   year          GDP enrolment Expenditure year.inschool
#> 1  AFGHANISTAN YR1997           NA        NA          NA           4.9
#> 2  AFGHANISTAN YR1998 1046338.0000        NA          NA           5.2
#> 3  AFGHANISTAN YR1999           NA        NA          NA           5.5
#> 4  AFGHANISTAN YR2000  749360.0000        NA          NA           5.9
#> 5  AFGHANISTAN YR2001  773623.0000        NA          NA           6.2
#> 6  AFGHANISTAN YR2002     184.4947   2667629          NA           6.5

Created on 2019-03-11 by the reprex package (v0.2.1)

0 Likes

#3

The countrycode package is designed to solve this annoying problem :slight_smile:

1 Like

#4

HI Yarnabrina,

Thank you for your response. It is something like this but when I do this the following thing happens:
Country.Name year GDP primary_school_enrolment Government_Ependiture YR_schooling
1 AFGHANISTAN YR1997 4.9
2 AFGHANISTAN YR1998 5.2
3 AFGHANISTAN YR1999 5.5
4 AFGHANISTAN YR2000 5.9

The other variables becomes NA.

0 Likes

#5

Thanks! I will check it out.

0 Likes

#6

I can't think of any reason. Does this happen with my code too? Do you still get the NA's, while I get expected output?

Can you please post your code and sample data (the observations corresponding to which you have observations on all the columns) in form of a reproducible example? Noting the exact errors you're facing, some of us may get an idea about the problem.

If you don't know how to make a reprex, here's a great post:

0 Likes

#7

Hi Yarnabrina,

Thank you for the tip to make a minimal dataset. Very helpful for debugging. The problem is fixed:) There was a space before the names in one of the datasets.

0 Likes

#8

If your question's been answered (even if by you), would you mind choosing a solution? (See FAQ below for how).

Having questions checked as resolved makes it a bit easier to navigate the site visually and see which threads still need help.

Thanks

0 Likes

closed #9

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.

0 Likes