Copying values from row with matching ID to row with NA's

Hendrina · November 22, 2021, 1:56pm

Hello, I am trying to copy data from one row to another (or several) to replace NA's based on matching ID's. I have part of the dataset presented below.

data.frame(
          stringsAsFactors = FALSE,
               check.names = FALSE,
          `Date of survey` = c("20/10/21",
                               "20/10/21","19-10-2021","19-10-2021","19-10-2021",
                               "19-10-2021","19-10-2021","19-10-2021",
                               "19-10-2021","19-10-2021","19-10-2021","19-10-2021",
                               "19-10-2021","19-10-2021"),
          `Catchment area` = c("Mndoliro",
                               "Mndoliro","Mndoliro","Mndoliro","Mndoliro",
                               "Mndoliro","Mndoliro","Mndoliro","Mndoliro",
                               "Mndoliro","Mndoliro","Mndoliro","Mndoliro",
                               "Mndoliro"),
            `Village name` = c("Chezi 1",
                               "Chezi 1","Chezi 1","Chezi 1","Chezi 1","Chezi 1",
                               "Chezi 1","Chezi 1","Chezi 1","Chezi 1",
                               "Chezi 1","Chezi 1","Chezi 1","Chezi 1"),
                     Chief = c("Chezi 1",
                               "Chezi 1","Chezi1","chezi1",NA,"Chezi1",NA,NA,
                               "Chezi1","Chezi1","Chezi1","Chezi1",NA,NA),
  `# of houses in village` = c(187,NA,187,
                               187,NA,187,NA,NA,187,187,187,187,NA,NA),
                `CDS code` = c("CDS/A/XXX",
                               "CDS/A/XXX","CDS/A/YYY","CDS/A/001","CDS/A/001",
                               "CDS/A/002","CDS/A/002","CDS/A/002",
                               "CDS/A/003","CDS/A/004","CDS/A/005","CDS/A/006",
                               "CDS/A/006","CDS/A/006"),
     `Name of participant` = c("James dandy",
                               NA,"Cathy leggo","Kamy limomo",NA,
                               "Anny Chikupepe",NA,NA,"F Chauwadidi","G Chauwadidi",
                               "Lovely acar","Jose Chipper",NA,NA),
           `Head of house` = c("Yes",NA,
                               "Yes","No",NA,"No",NA,NA,"Yes","No","No",
                               "No",NA,NA),
       `# of people in HH` = c(4, NA, 3, 6, NA, 6, NA, NA, 2, 2, 5, 6, NA, NA)
)
#>    Date of survey Catchment area Village name   Chief # of houses in village
#> 1        20/10/21       Mndoliro      Chezi 1 Chezi 1                    187
#> 2        20/10/21       Mndoliro      Chezi 1 Chezi 1                     NA
#> 3      19-10-2021       Mndoliro      Chezi 1  Chezi1                    187
#> 4      19-10-2021       Mndoliro      Chezi 1  chezi1                    187
#> 5      19-10-2021       Mndoliro      Chezi 1    <NA>                     NA
#> 6      19-10-2021       Mndoliro      Chezi 1  Chezi1                    187
#> 7      19-10-2021       Mndoliro      Chezi 1    <NA>                     NA
#> 8      19-10-2021       Mndoliro      Chezi 1    <NA>                     NA
#> 9      19-10-2021       Mndoliro      Chezi 1  Chezi1                    187
#> 10     19-10-2021       Mndoliro      Chezi 1  Chezi1                    187
#> 11     19-10-2021       Mndoliro      Chezi 1  Chezi1                    187
#> 12     19-10-2021       Mndoliro      Chezi 1  Chezi1                    187
#> 13     19-10-2021       Mndoliro      Chezi 1    <NA>                     NA
#> 14     19-10-2021       Mndoliro      Chezi 1    <NA>                     NA
#>     CDS code Name of participant Head of house # of people in HH
#> 1  CDS/A/XXX         James dandy           Yes                 4
#> 2  CDS/A/XXX                <NA>          <NA>                NA
#> 3  CDS/A/YYY         Cathy leggo           Yes                 3
#> 4  CDS/A/001         Kamy limomo            No                 6
#> 5  CDS/A/001                <NA>          <NA>                NA
#> 6  CDS/A/002      Anny Chikupepe            No                 6
#> 7  CDS/A/002                <NA>          <NA>                NA
#> 8  CDS/A/002                <NA>          <NA>                NA
#> 9  CDS/A/003        F Chauwadidi           Yes                 2
#> 10 CDS/A/004        G Chauwadidi            No                 2
#> 11 CDS/A/005         Lovely acar            No                 5
#> 12 CDS/A/006        Jose Chipper            No                 6
#> 13 CDS/A/006                <NA>          <NA>                NA
#> 14 CDS/A/006                <NA>          <NA>                NA

To put it in context, in every household "CDS code" , multiple entries were made to represent each child, but for the second and/or third entry, only the ID was entered and a few other variables. I want to copy the other information from the first entry to the others so that it appears as below. Please note that in the real data the data ID's are mixed up so the "copy from above row" option may not work. I haven't been able to find a better solution

data.frame(
          stringsAsFactors = FALSE,
               check.names = FALSE,
          `Date of survey` = c("20/10/21",
                               "20/10/21","19-10-2021","19-10-2021","19-10-2021",
                               "19-10-2021","19-10-2021","19-10-2021",
                               "19-10-2021","19-10-2021","19-10-2021","19-10-2021",
                               "19-10-2021","19-10-2021"),
          `Catchment area` = c("Mndoliro",
                               "Mndoliro","Mndoliro","Mndoliro","Mndoliro",
                               "Mndoliro","Mndoliro","Mndoliro","Mndoliro",
                               "Mndoliro","Mndoliro","Mndoliro","Mndoliro",
                               "Mndoliro"),
            `Village name` = c("Chezi 1",
                               "Chezi 1","Chezi 1","Chezi 1","Chezi 1","Chezi 1",
                               "Chezi 1","Chezi 1","Chezi 1","Chezi 1",
                               "Chezi 1","Chezi 1","Chezi 1","Chezi 1"),
                     Chief = c("Chezi 1",
                               "Chezi 1","Chezi1","Chezi2","Chezi3","Chezi4",
                               "Chezi5","Chezi6","Chezi7","Chezi8","Chezi9",
                               "Chezi10","Chezi11","Chezi12"),
  `# of houses in village` = c(187,187,187,
                               187,187,187,187,187,187,187,187,187,187,
                               187),
                `CDS code` = c("CDS/A/XXX",
                               "CDS/A/XXX","CDS/A/YYY","CDS/A/001","CDS/A/001",
                               "CDS/A/002","CDS/A/002","CDS/A/002",
                               "CDS/A/003","CDS/A/004","CDS/A/005","CDS/A/006",
                               "CDS/A/006","CDS/A/006"),
     `Name of participant` = c("James dandy",
                               "James dandy","Cathy leggo","Kamy limomo",
                               "Kamy limomo","Anny Chikupepe","Anny Chikupepe",
                               "Anny Chikupepe","F Chauwadidi","G Chauwadidi",
                               "Lovely acar","Jose Chipper","Jose Chipper",
                               "Jose Chipper"),
           `Head of house` = c("Yes","Yes",
                               "Yes","No","No","No","No","No","Yes","No",
                               "No","No","No","No"),
       `# of people in HH` = c(4, 4, 3, 6, 6, 6, 6, 6, 2, 2, 5, 6, 6, 6)
)
#>    Date of survey Catchment area Village name   Chief # of houses in village
#> 1        20/10/21       Mndoliro      Chezi 1 Chezi 1                    187
#> 2        20/10/21       Mndoliro      Chezi 1 Chezi 1                    187
#> 3      19-10-2021       Mndoliro      Chezi 1  Chezi1                    187
#> 4      19-10-2021       Mndoliro      Chezi 1  Chezi2                    187
#> 5      19-10-2021       Mndoliro      Chezi 1  Chezi3                    187
#> 6      19-10-2021       Mndoliro      Chezi 1  Chezi4                    187
#> 7      19-10-2021       Mndoliro      Chezi 1  Chezi5                    187
#> 8      19-10-2021       Mndoliro      Chezi 1  Chezi6                    187
#> 9      19-10-2021       Mndoliro      Chezi 1  Chezi7                    187
#> 10     19-10-2021       Mndoliro      Chezi 1  Chezi8                    187
#> 11     19-10-2021       Mndoliro      Chezi 1  Chezi9                    187
#> 12     19-10-2021       Mndoliro      Chezi 1 Chezi10                    187
#> 13     19-10-2021       Mndoliro      Chezi 1 Chezi11                    187
#> 14     19-10-2021       Mndoliro      Chezi 1 Chezi12                    187
#>     CDS code Name of participant Head of house # of people in HH
#> 1  CDS/A/XXX         James dandy           Yes                 4
#> 2  CDS/A/XXX         James dandy           Yes                 4
#> 3  CDS/A/YYY         Cathy leggo           Yes                 3
#> 4  CDS/A/001         Kamy limomo            No                 6
#> 5  CDS/A/001         Kamy limomo            No                 6
#> 6  CDS/A/002      Anny Chikupepe            No                 6
#> 7  CDS/A/002      Anny Chikupepe            No                 6
#> 8  CDS/A/002      Anny Chikupepe            No                 6
#> 9  CDS/A/003        F Chauwadidi           Yes                 2
#> 10 CDS/A/004        G Chauwadidi            No                 2
#> 11 CDS/A/005         Lovely acar            No                 5
#> 12 CDS/A/006        Jose Chipper            No                 6
#> 13 CDS/A/006        Jose Chipper            No                 6
#> 14 CDS/A/006        Jose Chipper            No                 6

Apologies for the long post

xvalda · November 22, 2021, 2:54pm

Hi @Hendrina , a left_join should work for this.
Just note that "# of people in HH" shows different results than in your second dataframe (it looks like you added 1 person per line).

#Store your dataframe in df
df <- data.frame(
  stringsAsFactors = FALSE,
  check.names = FALSE,
  `Date of survey` = c("20/10/21",
                       "20/10/21","19-10-2021","19-10-2021","19-10-2021",
                       "19-10-2021","19-10-2021","19-10-2021",
                       "19-10-2021","19-10-2021","19-10-2021","19-10-2021",
                       "19-10-2021","19-10-2021"),
  `Catchment area` = c("Mndoliro",
                       "Mndoliro","Mndoliro","Mndoliro","Mndoliro",
                       "Mndoliro","Mndoliro","Mndoliro","Mndoliro",
                       "Mndoliro","Mndoliro","Mndoliro","Mndoliro",
                       "Mndoliro"),
  `Village name` = c("Chezi 1",
                     "Chezi 1","Chezi 1","Chezi 1","Chezi 1","Chezi 1",
                     "Chezi 1","Chezi 1","Chezi 1","Chezi 1",
                     "Chezi 1","Chezi 1","Chezi 1","Chezi 1"),
  Chief = c("Chezi 1",
            "Chezi 1","Chezi1","chezi1",NA,"Chezi1",NA,NA,
            "Chezi1","Chezi1","Chezi1","Chezi1",NA,NA),
  `# of houses in village` = c(187,NA,187,
                               187,NA,187,NA,NA,187,187,187,187,NA,NA),
  `CDS code` = c("CDS/A/XXX",
                 "CDS/A/XXX","CDS/A/YYY","CDS/A/001","CDS/A/001",
                 "CDS/A/002","CDS/A/002","CDS/A/002",
                 "CDS/A/003","CDS/A/004","CDS/A/005","CDS/A/006",
                 "CDS/A/006","CDS/A/006"),
  `Name of participant` = c("James dandy",
                            NA,"Cathy leggo","Kamy limomo",NA,
                            "Anny Chikupepe",NA,NA,"F Chauwadidi","G Chauwadidi",
                            "Lovely acar","Jose Chipper",NA,NA),
  `Head of house` = c("Yes",NA,
                      "Yes","No",NA,"No",NA,NA,"Yes","No","No",
                      "No",NA,NA),
  `# of people in HH` = c(4, NA, 3, 6, NA, 6, NA, NA, 2, 2, 5, 6, NA, NA)
)

And this line should work, using the tiyverse:

suppressPackageStartupMessages(library(tidyverse))
df %>% 
  select(`CDS code`) %>% 
  left_join(drop_na(df)) %>% 
  relocate(`CDS code`, .before = `Name of participant`)

Hope it helps.

Hendrina · November 22, 2021, 7:55pm

Hello, thanks for noticing the error (I will fix it), it occurred because I copied the values using excel.

I tried to run the code below but am having issues installing tidyverse, I keep getting the error below


> suppressPackageStartupMessages(library(tidyverse))
Error in library(tidyverse) : there is no package called ‘tidyverse’

When I try a proper install, I get

install.packages("tidyverse")
Warning in install.packages :
  unable to access index for repository https://cran.rstudio.com/src/contrib:
  cannot open URL 'https://cran.rstudio.com/src/contrib/PACKAGES'
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:

https://cran.rstudio.com/bin/windows/Rtools/
Installing package into �C:/Users/USER/Documents/R/win-library/4.0�
(as �lib� is unspecified)
Warning in install.packages :
  unable to access index for repository https://cran.rstudio.com/src/contrib:
  cannot open URL 'https://cran.rstudio.com/src/contrib/PACKAGES'
Warning in install.packages :
  package ‘tidyverse’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
Warning in install.packages :
  unable to access index for repository https://cran.rstudio.com/bin/windows/contrib/4.0:
  cannot open URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/PACKAGES'

Not sure what's causing this, (I have used tidyverse before)

xvalda · November 23, 2021, 10:00am

There's another post from the community that may help: install.packages unable to access index for repository: Try disabling “secure download method for HTTP”

nirgrahamuk · November 23, 2021, 10:19am

WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:

Rtools link

Hendrina · November 24, 2021, 7:17pm

This worked thanks!!

Hendrina · November 24, 2021, 7:26pm

@xvalda the copying code didn't work, in the sample dataframe it does work, but when I apply it to the main data set (2123 rows and 30 variables) I somehow ended up with 4 extra rows in my data and most of the rows lost all the data i.e., became NA's.

I should clarify that I only want to copy data for selected (specific) variables not all of them. Any other suggestions?

xvalda · November 25, 2021, 3:40pm

Hi @Hendrina
You can try another approach then:

Define columns for which you want to copy values, in my example I excluded "Head of House" since copying previous values may ba assuming things wrongly, but feel free to adapt depending on what makes sense with your variables

cols_to_copy <- c("Chief", "# of houses in village", "Name of participant", "# of people in HH")

And then another method that groups by CDS codes, and look up for a non-na value in each specific group

df %>% 
  group_by(`CDS code`) %>% 
  mutate(across(cols_to_copy, ~ ifelse(is.na(.), .[!is.na(.)], as.character(.)))) %>% 
  ungroup()

I hope it works on the whole dataset now.

xvalda · November 28, 2021, 5:52pm

Hi @Hendrina , I just had second thoughts, you dataset may have CDS codes with only NAs, for which you need something a bit more robust, you can try this one:

#select columns for which you want to copy values
cols_to_copy <- c("Chief", "# of houses in village", "Name of participant", "# of people in HH")

df %>% 
  group_by(`CDS code`) %>% 
  mutate(across(.cols = cols_to_copy, .fns = ~ case_when(
    !is.na(.) ~ ., 
    sum(is.na(.)) - n() == 0 & is.na(.) ~ ., 
    sum(is.na(.)) - n() != 0 & is.na(.) ~ .[1]
  )))  %>% 
  ungroup()

Hendrina · December 10, 2021, 6:36am

Apologies for the delayed response. I will try this approach and give feedback if it works.

xvalda · December 10, 2021, 8:03am

or this alternative that is much shorter, not sure how both solutions would react though on your real data

library(tidyverse)
cols_to_copy <- c("Chief", "# of houses in village", "Name of participant", "# of people in HH")

df %>% 
  group_by(`CDS code`) %>% 
  fill(cols_to_copy)

system · December 31, 2021, 8:04am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.