Issue with bind_rows for lists

Hi,

I am trying to use bind_rows to combine content from a list of dataframes. It usaully works, but this time it provide lots of NAs in rows and also adds several extra columns/variables.

I couldn't make a reprex showing list itself. I could only create it for a dataframe. Please guide me on that as well if you can.

For now, I am showing the screenshot to give an idea with how values are behaving when using bind_rows. List df contains various regions and each dataframe within the list looks like this:

USA <- datapasta::df_paste(df1$USA)
data.frame(
       stringsAsFactors = FALSE,
            check.names = FALSE,
                Country = c("USA", "USA"),
              Awareness = c("UBA", "PBA"),
    `Total (N = 10930)` = c("0.68", "0.96199999999999997"),
             `2020 - 2 (N = 1211)` = c("0.72599999999999998",
                                       "0.96399999999999997")
           )data.frame(
       stringsAsFactors = FALSE,
            check.names = FALSE,
                Country = c("USA", "USA"),
              Awareness = c("UBA", "PBA"),
    `Total (N = 10930)` = c("0.68", "0.96199999999999997"),
                                                `2020 - 2 (N = 1211)` = c("0.72599999999999998","0.96399999999999997",
                                     NA)
         )

Canada <- datapasta::df_paste(df1$Canada)
data.frame(
      stringsAsFactors = FALSE,
           check.names = FALSE,
               Country = c("Canada", "Canada"),
             Awareness = c("UBA", "PBA"),
    `Total (N = 7467)` = c("0.623", "0.95499999999999996"),
                                               `2020 - 2 (N = 842)` = c("0.65600000000000003","0.94699999999999995",
                                    NA)
         )
Columbia <- datapasta::df_paste(df1$Colombia)
data.frame(
      stringsAsFactors = FALSE,
           check.names = FALSE,
               Country = c("Colombia", "Colombia"),
             Awareness = c("UBA", "PBA"),
                                                 `Total (N = 4704)` = c("0.83899999999999997","0.97399999999999998",
                                    NA),
  `2020 - 2 (N = 598)` = c("0.871", "0.97399999999999998")
         )

All the columns in each of the variables are same. When I try to combine them all using
df <- bind_rows(df), it provide lots of NAs in rows and also adds several extra columns/variables. Any help to solve this issue is appreciated.

Thanks!

Is it because the column names don't match? In that case you might see something like:

 col1   col2    col_A   col_B
    1      2        3      NA
    1      2        3      NA
    1      2        3      NA
    1      2       NA       1
    1      2       NA       1
    1      2       NA       1

if col_A is only in the first data frame, and col_B only in the second. Otherwise, what do the additional column names look like?

In your example the different dataframes have different column names. E.g. USA has Total (N = 10930) whereas Canada has Total (N = 7467) in the column names. If that is the case, you may need to rename the columns before binding.

1 Like

I see! N is different and that's what is making it different columns for them. I missed that! Thank you so much!
Actually, I do need the value of N as a separate column, "Sample" and I was trying to first combine all of these dataframes and then using gather and separate to have N value in separate column.

Initially I tried to apply separate using across for all the columns. I do have several columns in my original data vs what you see from reprex. But I am new to using across and still trying to figure out how I can separate N values as new columns. Any suggestions would be helpful!

Thank you!

If it's in the column name, I don't think across() will help. It's not really a dplyr habit to put data in the column names, you're probably better off with a:

str_split(colnames(df)[5], " ")

Or possibly

xx <- colnames(df)
str_split(xx[startsWith(xx, "Total")], " ")

And do some more processing. Also, you can consider avoiding postprocessing with a regex:

xx <- colnames(df)
str_match(xx, "^Total \\(N = ([:digits:]{4})\\)$")

Thank you! I will try these options. The only thing is Each column has different name also apart from N values such as Total(N = 12134), 2020-1(N=11111), 2020-2(N= 13245), 2019-1(N = 4654) and so on. I hope some of these can work on these columns.

Oh, based on your example I thought each data frame had one column Total and one column 2020 - 2, so you would have needed 2 regex to extract the relevant N values. If there are many columns with years, you might need to pivot_longer(), then the data will be inside a column and you can use the tidyr::separate() or extract().

Perfect! Thank you @AlexisW!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.