csv import problem

Hello I need to import the csv file available here :

the file does not have header for all of the columns and some rows are so long that for some rows it skip to the next rows. It's a mess.

Thank you for your help.

I tried with read_delim(file, delim = ";") and it reads the file, but I also see the issues you described.

Warning: 6690 parsing failures.
row col   expected     actual                                                        file
  1  -- 29 columns 62 columns '..//Examples//muni-2014-resultats-com-1000-et-plus-t1.txt'
  3  -- 29 columns 51 columns '..//Examples//muni-2014-resultats-com-1000-et-plus-t1.txt'
  4  -- 29 columns 40 columns '..//Examples//muni-2014-resultats-com-1000-et-plus-t1.txt'
  6  -- 29 columns 40 columns '..//Examples//muni-2014-resultats-com-1000-et-plus-t1.txt'
  7  -- 29 columns 40 columns '..//Examples//muni-2014-resultats-com-1000-et-plus-t1.txt'

I think it's possible that some of the values have a ";" character in them that conflicts with the delimiter. If that's the case, I'm not sure if it's possible to even fix it.

1 Like

Thank you a lot, are you sure it is not possible with read. table or read.csv and using some mysterious argument ?
Also, I am now very concern if someone with this background face the same problem than me....

Also when droping the names of the columns using this : read.csv2(file, header=FALSE)
columns are a mess example row 19. I think I cannot reprex this since it doesn't display any error message.

Here's what I recommend! First of all, it appears the delimiter is a semicolon ";" instead of a comma, so any function you use to read the file should specify that.

  1. Read the file with read_delim(file, delim = ";") and take a look at the column names and column values to see if there's anything usable. Based on your understanding of the data, do the values correspond correctly with the column names? When I look at the result, it appears like there is usable information in the 29 columns it successfully reads. Is there critical information missing in columns 30+?

  2. Take a look at the first few rows of raw text in notepad (turn off Word Wrap in the menu) and count the number of semicolons in each. See if you can manually space the values into columns. Try to use your understanding of the data to detect if there are ";" characters that are part of the values. This is just a hypothesis based on my observation that read_delim seems to find a larger number of columns for some rows. For example, if you see a column that should contain names and you see a ";" separating the first name and last names, then you've brought some clarity to what's wrong.

  3. Worst case, contact the owner of the data and explain the issues you're experiencing.

1 Like

First, I really appreciate the time you spend on this, thank you.

When using your function, all the columns are fine and correspond with their names. But yes I need the information that is contained in the missing columns.

When using this one read.csv2(file, header=FALSE) it gave me 62 columns but I'm expecting much more.

Also in the text file I note that (with the word wrap off) it still skip to the next line for some cases (paste this in ctrl+f : "paris 10eme"). I don't know the total number of column that I should get so I don't think I can manually space the values like you suggest. I don't think there is ";" in this file other than those to separate the values.

I don't think the owner can help me with this, they are probably using other software.

There is a pattern to the data, though it is a mess. The columns beyond the last column with a header seem to be in groups of 11, representing more data of the type stored in the 11 columns from Code Nuance to % Voix/Exp. If you repeat those last 11 headers enough times, you should get a well formed data frame. I do not know how many times that is. I will post again if I figure that out. There seem to be more than 1000 columns, so it is a lot of repetitions.

1 Like

Oops, I replied to the wrong post. See my post above.

Yes there is a pattern. the data frame collect data about parties at mayoral election so the group of 11 variables is repeated for each party . Do you think I should then just copy/paste the header in the raw text or should I specify that in R? Also what will happened about the skip of rows ?
1000 columns seems to me a bit too much though.
Thank you for your reply. Much appreciate.

This is as far as I have gotten,

RawText <- readLines("muni.txt")
CountSemi <- stringr::str_count(RawText, pattern = ";")
HeaderReps <- (max(CountSemi)-28)/11 
HeaderText <- "Code Nuance;Sexe;Nom;Prénom;Liste;Sièges / Elu;Sièges Secteur;Sièges CC;Voix;% Voix/Ins;% Voix/Exp;"
RawText[1] <- paste0(RawText[1], paste0(rep(HeaderText, HeaderReps), collapse = ""))

writeLines(RawText, con = "muniFix.txt")
DF <- read.csv2("muniFix.txt")

The file muni.txt is the original file that I renamed to save typing. I count the number of columns by counting semicolons and then calculate how many times the header needs to be repeated. I then paste that number of replicates of the 11-column header to the first row of the file and save the new version. Both the original file and the new version have 9907 rows when I look at them in Notepad++.
The mystery is that the data frame I read in with read.csv2 only has 8738 rows. It has 1107 columns and its row 8701 matches row 9870 of the muniFix.csv file. Those are the rows with the highest number of populated columns. Their last rows also seem to match. I cannot find a reason for the missing rows in the data frame.

1 Like

This is insane are you some kind of R warlord ?
I think it work, I am stun that there is that much columns. I'll reply tomorrow when so I can examine this a little bit more in depth.
I can't explain the mystery too.
But really thank you for this !!

I understand now why there is so much columns it's because the ballot system is different in an island, and so there is 99 candidates in some city ...
Thank you for the light again.

I thought that this would have been the solution, but I think there's still something about the data format I don't understand.

It seemed to me that the first 17 columns are consistent for every row, and the names and properties of people appear in subsequent groups of 11 columns. It's definitely the case that the total number of columns is always 17 + an integer multiplication of 11.

The code runs but it appears those name columns values are misaligned.

If you can, I'd contact the owner of the data and see if he can use a different delimiter between the names.

library(tidyverse)

# read each row of text individually so we can parse out the information manually
election0 <- 
  read_delim(
    "C:\\Users\\st2516\\Downloads\\muni-2014-resultats-com-1000-et-plus-t1.txt", 
    "\n",
    col_names = FALSE) %>%
  setNames("line_text") %>%
  mutate(
    # split by delimiter
    split_text  = strsplit(line_text, ";"),
    # assume the first 17 elements are common
    split_df    = map(split_text, ~.[1:17]),
    # and everything past this is repeating 11
    split_names = map(split_text, ~.[-c(1:17)]),
    columns     = map_dbl(split_text, length),
    # the number of repeating 11 name data elements
    n_names     = (columns - 17)/11)
#> 
#> -- Column specification --------------------------------------------------------
#> cols(
#>   X1 = col_character()
#> )

header <- election0$split_text[[1]]

make_df <- function(split_text, n_rows, col_names) {
  # function to take the split text and create a data frame with n_rows and col_names
  matrix(split_text, nrow = n_rows) %>% 
    as_tibble() %>% 
    setNames(col_names)
}

#
election <- election0[-1, ] %>% # (first row had the header)
  transmute(
    id      = 1:n(),
    # parse the first 17 elements into a one-row data frame
    df      = pmap(list(split_df,          1, list(!!header[1:17])),     make_df),
    # parse the names (repeating 11 elements) with one row per name
    name_df = pmap(list(split_names, n_names, list(!!header[-c(1:17)])), make_df)) %>%
  unnest(df) %>%
  unnest(name_df)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
  
print(head(election, 20))
#> # A tibble: 20 x 29
#>       id `Date de l'expo~ `Code du d<U+653C><U+3E39>par~ `Type de scruti~ `Libell<U+653C><U+3E39> du d<U+653C>~
#>    <int> <chr>            <chr>            <chr>            <chr>           
#>  1     1 25/03/2014 12:5~ 01               LI2              AIN             
#>  2     1 25/03/2014 12:5~ 01               LI2              AIN             
#>  3     1 25/03/2014 12:5~ 01               LI2              AIN             
#>  4     1 25/03/2014 12:5~ 01               LI2              AIN             
#>  5     2 25/03/2014 12:5~ 01               LI2              AIN             
#>  6     3 25/03/2014 12:5~ 01               LI2              AIN             
#>  7     3 25/03/2014 12:5~ 01               LI2              AIN             
#>  8     3 25/03/2014 12:5~ 01               LI2              AIN             
#>  9     4 25/03/2014 12:5~ 01               LI2              AIN             
#> 10     4 25/03/2014 12:5~ 01               LI2              AIN             
#> 11     5 25/03/2014 12:5~ 01               LI2              AIN             
#> 12     6 25/03/2014 12:5~ 01               LI2              AIN             
#> 13     6 25/03/2014 12:5~ 01               LI2              AIN             
#> 14     7 25/03/2014 12:5~ 01               LI2              AIN             
#> 15     7 25/03/2014 12:5~ 01               LI2              AIN             
#> 16     8 25/03/2014 12:5~ 01               LI2              AIN             
#> 17     8 25/03/2014 12:5~ 01               LI2              AIN             
#> 18     9 25/03/2014 12:5~ 01               LI2              AIN             
#> 19    10 25/03/2014 12:5~ 01               LI2              AIN             
#> 20    10 25/03/2014 12:5~ 01               LI2              AIN             
#> # ... with 24 more variables: `Code de la commune` <chr>, `Libell<U+653C><U+3E39> de la
#> #   commune` <chr>, Inscrits <chr>, Abstentions <chr>, `% Abs/Ins` <chr>,
#> #   Votants <chr>, `% Vot/Ins` <chr>, `Blancs et nuls` <chr>, `%
#> #   BlNuls/Ins` <chr>, `% BlNuls/Vot` <chr>, `Exprim<U+653C><U+3E39>s` <chr>, `%
#> #   Exp/Ins` <chr>, `% Exp/Vot` <chr>, `Code Nuance` <chr>, Sexe <chr>,
#> #   Nom <chr>, `Pr<U+653C><U+3E39>nom` <chr>, Liste <chr>, `Si<U+653C><U+3E38>ges / Elu` <chr>,
#> #   `Si<U+653C><U+3E38>ges Secteur` <chr>, `Si<U+653C><U+3E38>ges CC` <chr>, Voix <chr>, `%
#> #   Voix/Ins` <chr>, `% Voix/Exp` <chr>

Created on 2021-06-02 by the reprex package (v1.0.0)

Normally it is suppose to have a group of columns for each party that was running at the election so it's consistent with the observation of having for each row a number of column equal to 17+i * 11 (i : number of parties)

I see now what you meant by parsing manually, yes I can see that for the columns 19 + observations are all over the place...
The code is a bit complicated for me so I can't really understand but is a problem with the encoding the cause of it ?

Well thank you I'll try my best to get those data !

I think I figured it out! You'll have to install tidyverse and janitor packages. I made a silly error with matrix() and didn't set byrow = TRUE.

library(tidyverse)

# read each row of text individually so we can parse out the information manually
election0 <- 
  read_delim(
    "C:\\Users\\st2516\\Downloads\\muni-2014-resultats-com-1000-et-plus-t1.txt", 
    "\n",
    col_names = FALSE) %>%
  setNames("line_text") %>%
  mutate(
    # split by delimiter
    split_text  = strsplit(line_text, ";"),
    # assume the first 17 elements are common
    split_df    = map(split_text, ~.[1:17]),
    # and everything past this is repeating 11
    split_names = map(split_text, ~.[-c(1:17)]),
    columns     = map_dbl(split_text, length),
    # the number of repeating 11 name data elements
    n_names     = (columns - 17)/11)
#> 
#> -- Column specification --------------------------------------------------------
#> cols(
#>   X1 = col_character()
#> )

# file header
header <- election0$split_text[[1]] %>% janitor::make_clean_names()

make_df <- function(split_text, n_rows, col_names) {
  # function to take the split text and create a data frame with n_rows and col_names
  matrix(split_text, nrow = n_rows, byrow = TRUE) %>% 
    as_tibble() %>% 
    setNames(col_names)
}

# parse the first 17 columns (df) separately from the names (in subsequent sets 
# of 11 columns) (name_df)
election <- election0[-1, ] %>% # (first row had the header)
  transmute(
    id      = 1:n(),
    df      = pmap(list(split_df,          1, list(!!header[1:17])),     make_df),
    name_df = pmap(list(split_names, n_names, list(!!header[-c(1:17)])), make_df)) %>%
  unnest(df) %>%
  unnest(name_df)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
  
glimpse(election, width = 80)
#> Rows: 22,899
#> Columns: 29
#> $ id                                   <int> 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, ...
#> $ date_de_lexport                      <chr> "25/03/2014 12:50:21", "25/03/...
#> $ code_du_d_gan_liepartement           <chr> "01", "01", "01", "01", "01", ...
#> $ type_de_scrutin                      <chr> "LI2", "LI2", "LI2", "LI2", "L...
#> $ libell_gan_lie_du_d_gan_liepartement <chr> "AIN", "AIN", "AIN", "AIN", "A...
#> $ code_de_la_commune                   <chr> "004", "004", "004", "004", "0...
#> $ libell_gan_lie_de_la_commune         <chr> "Amb<U+653C><U+3E39>rieu-en-Bugey", "Amb<U+653C><U+3E39>ri...
#> $ inscrits                             <chr> "00008198", "00008198", "00008...
#> $ abstentions                          <chr> "00003422", "00003422", "00003...
#> $ percent_abs_ins                      <chr> "41,74", "41,74", "41,74", "41...
#> $ votants                              <chr> "00004776", "00004776", "00004...
#> $ percent_vot_ins                      <chr> "58,26", "58,26", "58,26", "58...
#> $ blancs_et_nuls                       <chr> "00000191", "00000191", "00000...
#> $ percent_bl_nuls_ins                  <chr> "2,33", "2,33", "2,33", "2,33"...
#> $ percent_bl_nuls_vot                  <chr> "4,00", "4,00", "4,00", "4,00"...
#> $ exprim_gan_lies                      <chr> "00004585", "00004585", "00004...
#> $ percent_exp_ins                      <chr> "55,93", "55,93", "55,93", "55...
#> $ percent_exp_vot                      <chr> "96,00", "96,00", "96,00", "96...
#> $ code_nuance                          <chr> "LDVG", "LDVG", "LUMP", "LDVD"...
#> $ sexe                                 <chr> "F", "F", "M", "M", "M", "F", ...
#> $ nom                                  <chr> "EXPOSITO", "PIDOUX", "FORTIN"...
#> $ pr_gan_lienom                        <chr> "Josiane", "Catherine", "Chris...
#> $ liste                                <chr> "AMBERIEU AMBITION", "VIVONS N...
#> $ si_gan_houges_elu                    <chr> "0", "0", "0", "0", "19", "0",...
#> $ si_gan_houges_secteur                <chr> "0", "0", "0", "0", "0", "0", ...
#> $ si_gan_houges_cc                     <chr> "0", "0", "0", "0", "2", "0", ...
#> $ voix                                 <chr> "00000954", "00000822", "00001...
#> $ percent_voix_ins                     <chr> "11,64", "10,03", "16,87", "17...
#> $ percent_voix_exp                     <chr> "20,81", "17,93", "30,16", "31...

Created on 2021-06-02 by the reprex package (v1.0.0)

1 Like

Thank you so much this looks great and exploitable. Now I have to adapt this for some other dataset but I think I can figure it out! Have a great day to the both of you !

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.