Readxl Changing my Column Headers

I was testing out using readxl to import data contained in an Excel workbook rather than the XLConnect package. This data set has column headers such as 1.1, 1.2, 1.3, and so on. When I read it into R, some of the headings were changed, such as 1.1 to 1.1000000000000001 and 2.2 to 2.2000000000000002.

Does anyone have ideas as to why this is happening? My code is below:

library(readxl)

# `path` is a stand-in here for the actual path to my workbook
path <- path

# read every sheet in the workbook into a list of data frames
# (renamed from `list`, which masks the base function list())
sheets <- lapply(excel_sheets(path), read_excel, path = path)

x <- sheets[[1]]

Looks like a problem with data types. I believe this might be due to Excel rather than R, but I can't say for sure.

Try changing the cell type of your headings to text.

To do so, highlight the desired cells, right-click, and select Format Cells, then choose Text.

Save the changes and try to read it in again. It may be that Excel is storing the numbers as numbers rather than as strings.

That was a very good idea, but it didn't seem to help. Also, if that were the case, wouldn't it have happened to every numeric header (such as 1.2, 1.3, etc.), not only some of them?

Do you have any other ideas?

I don't think it would necessarily happen to every numeric header. Non-integer values like 1.1 have no exact binary floating-point representation, so they're subject to the wonderful unpredictability of floating point (even though nothing seems to be calculated here); hence the trailing 000000000000001, etc.
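You can see the same noise in plain base R, nothing readxl-specific, once you ask for enough significant digits:

print(1.1, digits = 17)
#> [1] 1.1000000000000001
sprintf("%.17g", 2.2)
#> [1] "2.2000000000000002"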

It's not a good idea to have column names be numbers, so you could either change them in your input file or rename them once read in.

I'm not aware of a more intelligent solution.

I was hoping there was a solution. Despite numeric headers not being a great idea, they match up quite well, since they are the same as the question numbering. Oh well, going through and adding a Q for Question isn't the worst thing in the world.
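In R that could be as simple as the following sketch, assuming all of my headers are numeric with one decimal place (`x` is the data frame from my snippet above):

# prefix each header with "Q", dropping the floating-point noise via round()
names(x) <- paste0("Q", round(as.numeric(names(x)), 1))
# e.g. "1.1000000000000001" becomes "Q1.1"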

More discussion of the problem at https://github.com/tidyverse/readxl/issues/360

Thanks @hadley! There seem to be some good solutions over there!

I could reproduce the issue by making an xlsx file with decimal headers. You could use a regex to fix it up. The example below assumes that any time you have six or more zeroes (or nines, after looking at the linked issue) in a row after a digit or decimal point, you want to discard everything from the zeroes/nines onward.

library(tibble)
library(stringr)
dat <-
  structure(
    list(
      `1.1000000000000001` = c("a", "b", "c"),
      `1.2` = c(1, 2, 3),
      `2.1` = c("d", "e", "f"),
      `2.2000000000000002` = c(4, 5, 6)
    ),
    .Names = c("1.1000000000000001", "1.2", "2.1", "2.2000000000000002"),
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = c(NA, -3L)
  )
dat
#> # A tibble: 3 x 4
#>   `1.1000000000000001` `1.2` `2.1` `2.2000000000000002`
#>                  <chr> <dbl> <chr>                <dbl>
#> 1                    a     1     d                    4
#> 2                    b     2     e                    5
#> 3                    c     3     f                    6
names(dat) <- names(dat) %>%
  str_replace("([.[:digit:]])[09]{6,}(\\d+?)", 
              "\\1")
dat
#> # A tibble: 3 x 4
#>   `1.1` `1.2` `2.1` `2.2`
#>   <chr> <dbl> <chr> <dbl>
#> 1     a     1     d     4
#> 2     b     2     e     5
#> 3     c     3     f     6

The last part of hadley's link should work:

Likewise, we get what we want when the number is preceded by a single quote, which is another way to force text.

This is what I used to do in Excel to force numbers to be text (I'd just forgotten I did so until this jogged my memory). The Format Cells option looks to be less reliable.

This still means editing the original Excel file though.

While not ideal, you could avoid editing the original file with something as silly as

library(readxl)
library(magrittr)

read_excel(...) %>%   # `...` stands in for your path and any other arguments
  set_names(round(as.numeric(names(.)), 1))

This assumes that all of your names follow the x.y format (one digit after the decimal point, given the round(..., 1)).
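If some headers have more than one decimal place, a variant on the same idea (just a sketch) is to let as.character() do the cleanup: it prints doubles with at most 15 significant digits, which is exactly what drops the 17-digit noise.

read_excel(...) %>%
  set_names(as.character(as.numeric(names(.))))
# NB: any genuinely non-numeric header would come back as NA here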

As someone who is usually dealing with files for analysis from a client, who may later send me updated versions of the file with new or corrected data, I generally try to fix issues in code as much as possible. If it gets too crazy (e.g., a "table" that actually represents three different tables intermingled and coded into groups by color), I'll likely ask the client to fix it, or cross my fingers and hope I only have to fix it once. Simple issues like this are nice to handle without editing the data file, though.

These are all great solutions! I was really just asking because I was curious about what was happening behind the scenes. The solution I actually use is to create a vector of new column headers and apply it to each of the surveys I receive. Luckily, the organization I work with uses the exact same survey for ~90-95% of the trainings they run.
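In code that amounts to something like this (a sketch; `new_headers`, `survey`, and the header values are made-up stand-ins for my real ones):

library(readxl)

# hypothetical replacement headers, one per survey column, in sheet order
new_headers <- c("Q1.1", "Q1.2", "Q2.1", "Q2.2")

survey <- read_excel(path)  # `path` as in my first post
names(survey) <- new_headers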