How to read a file with an uneven number of columns in R

snowball · September 26, 2019, 1:38pm

Hi all,

I'm having trouble reading a txt file in R. For example, the first row has 10112 and 1, the second row starts with the number 1 followed by 3 ratios, and the next row has 8 numbers. The whole file follows this pattern. The rows that start with 10112, 10114, 10115, etc. carry the ID of each unit. I want to do some further calculations for each unit, but I don't know whether R can read such txt files. How can I read the file so that the rows after 10112 belong to unit 10112, the rows after 10114 belong to unit 10114, and so on? I used read.table() and got the error below. Thanks for your help.

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 8 elements

10112 1
1 0.1 0.6 0.3
7.07 2.01 0.26 0.13 4.68 0.56 0.96 1.28
2 0.20 0.20 0.60
9.49 8.51 9.67 7.92 10.19 9.14 8.96 8.64
10114 1
1 0.20 0.30 0.50
9.78 8.64 8.33 7.60 10.57 7.16 9.05 8.58
2 0.30 0.40 0.30
4.95 5.91 4.01 3.82 5.94 4.41 3.53 5.80
3 0.10 0.20 0.70
1.40 0.67 5.22 0.96 1.23 2.52 1.36 4.81
10115 1
1 0.20 0.4 0.4
10.06 10.47 8.29 9.54 11.11 9.22 9.00 9.89
2 0.30 0.50 0.20
6.14 3.20 0.90 4.72 5.06 4.22 1.29 2.38
10118 1
1 0.20 0.60 0.20
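
For context on the error: read.table() works out the number of columns from the first five lines of input, so here it expects 8 fields and then fails on line 1, which has only 2. The sketch below is only an illustration, assuming the first eight lines of the posted file are held in a string called txt (a made-up name). It uses count.fields() to show the uneven field counts, and fill = TRUE with explicit col.names to pad short rows with NA. Padding does not recover the unit structure; it only gets the file past the error.

```r
# First eight lines of the posted file, inlined so the example is self-contained
txt <- "10112 1
1 0.1 0.6 0.3
7.07 2.01 0.26 0.13 4.68 0.56 0.96 1.28
2 0.20 0.20 0.60
9.49 8.51 9.67 7.92 10.19 9.14 8.96 8.64
10114 1
1 0.20 0.30 0.50
9.78 8.64 8.33 7.60 10.57 7.16 9.05 8.58"

# Number of whitespace-separated fields on each line
con <- textConnection(txt)
n_fields <- count.fields(con)
close(con)
n_fields

# fill = TRUE pads short rows with NA; col.names fixes the width at 8 columns
padded <- read.table(text = txt, header = FALSE, fill = TRUE,
                     col.names = paste0("V", 1:8))
padded
```

For a real file, replace text = txt with the file path (and pass the path to count.fields() directly).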

Yarnabrina · September 26, 2019, 2:28pm

Welcome to the community!

Whenever you have a programming question, search Stack Overflow first. It's just great. I've provided an answer below based on a sample text file that I generated, but it is almost a copy-paste of the accepted answer on this thread.

sample_text <- "3 -0.6693227 -0.7873222 0.4245348 1.2967424 0.3274851 -0.1359788 -1.6941693 -0.4211109 0.5652219 -0.9217692
9 -0.1666452 -2.4127623 -0.3542472 -0.3218811 0.1158014 0.4872717
6 -0.22941872 1.08753826 1.19390226 1.13897886 1.00118742 0.67274102 -0.03330435 -0.62604317 1.39810065 0.72696443
10 -1.23715047 -0.03268193 -2.10548252 -1.24928972 -1.77623106 -0.49554881
2 2.3935846  0.3702620 -0.8196183
7 -1.358423 1.751524
8 -0.2161822 0.5810251 0.5377655 -0.6070761 0.6760455 0.3222419 0.3980245 -0.9955221 .7672507 -0.7689794
1 -0.9068758
5 -0.5313554 0.2262915 -1.0024394 1.3053317 -0.6471348 0.7764412
4 -0.0239177"

read_the_text <- scan(text = sample_text, # if your data is in a text file, use the file argument
                      what = character(),
                      sep = "\n")

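# note: splitting on a single space yields an empty string wherever two spaces
# are adjacent, which as.numeric() turns into NA (see $`2` in the output below);
# split = " +" would avoid that
split_each_line_by_spaces <- strsplit(x = read_the_text,
                                      split = " ")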

get_element_names <- lapply(X = split_each_line_by_spaces,
                            FUN = `[[`,
                            i = 1)

get_element_values <- lapply(X = split_each_line_by_spaces,
                             FUN = `[`,
                             i = (-1))

required_result_as_character <- setNames(object = get_element_values,
                                         nm = get_element_names)

required_result <- lapply(X = required_result_as_character,
                          FUN = as.numeric)

required_result
#> $`3`
#>  [1] -0.6693227 -0.7873222  0.4245348  1.2967424  0.3274851 -0.1359788
#>  [7] -1.6941693 -0.4211109  0.5652219 -0.9217692
#> 
#> $`9`
#> [1] -0.1666452 -2.4127623 -0.3542472 -0.3218811  0.1158014  0.4872717
#> 
#> $`6`
#>  [1] -0.22941872  1.08753826  1.19390226  1.13897886  1.00118742
#>  [6]  0.67274102 -0.03330435 -0.62604317  1.39810065  0.72696443
#> 
#> $`10`
#> [1] -1.23715047 -0.03268193 -2.10548252 -1.24928972 -1.77623106 -0.49554881
#> 
#> $`2`
#> [1]  2.3935846         NA  0.3702620 -0.8196183
#> 
#> $`7`
#> [1] -1.358423  1.751524
#> 
#> $`8`
#>  [1] -0.2161822  0.5810251  0.5377655 -0.6070761  0.6760455  0.3222419
#>  [7]  0.3980245 -0.9955221  0.7672507 -0.7689794
#> 
#> $`1`
#> [1] -0.9068758
#> 
#> $`5`
#> [1] -0.5313554  0.2262915 -1.0024394  1.3053317 -0.6471348  0.7764412
#> 
#> $`4`
#> [1] -0.0239177

Created on 2019-09-26 by the reprex package (v0.3.0)

Hope this helps.
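
A note on the NA under $`2` in the output above: that input line has two consecutive spaces, so splitting on a single space produces an empty string, and as.numeric("") is NA. A minimal sketch of two possible ways around it, using only that one line (the variable names are made up for illustration):

```r
line <- "2 2.3935846  0.3702620 -0.8196183"

# Splitting on exactly one space leaves an empty token between the two spaces
tokens_single <- strsplit(line, split = " ")[[1]]
tokens_single

# Splitting on runs of whitespace avoids the empty token
tokens_run <- strsplit(trimws(line), split = "[[:space:]]+")[[1]]
values <- as.numeric(tokens_run[-1])   # drop the leading name, keep the values
values

# scan() treats any run of whitespace as one separator by default
via_scan <- scan(text = line, quiet = TRUE)
via_scan
```

Either variant could replace the strsplit() call in the answer; trimws() is there only to guard against leading or trailing blanks.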

snowball · September 27, 2019, 2:24am

Thanks for your reply. I looked at the original thread, but I don't think it applies directly to my case. In my case, the numbers 10112, 10114, 10115, etc. each represent a unit. In the lines that follow, for example after 10112, 1 is group 1 and 2 is group 2, and the line with 8 numbers after each group line holds the attributes of that group. The number of groups can differ between units. For example, unit 10114 has 3 groups: group 1, group 2 and group 3, each followed by its attributes.
I want to extract all lines for each unit, no matter how many groups there are. A unit may also have no groups at all. How can I do this? Thanks.

sample <- "10112 1
1 0.1 0.6 0.3
7.07 2.01 0.26 0.13 4.68 0.56 0.96 1.28
2 0.2 0.2 0.6
9.49 8.51 9.67 7.92 10.19 9.14 8.96 8.64
10114 1
1 0.2 0.3 0.5
9.78 8.64 8.33 7.6 10.57 7.16 9.05 8.58
2 0.3 0.4 0.3
4.95 5.91 4.01 3.82 5.94 4.41 3.53 5.8
3 0.1 0.2 0.7
1.4 0.67 5.22 0.96 1.23 2.52 1.36 4.81
10115 1
1 0.2 0.4 0.4
10.06 10.47 8.29 9.54 11.11 9.22 9 9.89
2 0.3 0.5 0.2
6.14 3.2 0.9 4.72 5.06 4.22 1.29 2.38"

readin <- scan(text = sample, # file = "path/sample.txt",
               what = character(),
               sep = "\n")

split.line <- strsplit(x = readin, split = " ")
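
One possible way to group the lines by unit, sketched under the assumption that a unit header is exactly a line with two fields (as in the sample) and that every other line belongs to the most recent header. The names sample_txt, is_header and units are made up for this illustration, and only the first two units of the sample are inlined.

```r
sample_txt <- "10112 1
1 0.1 0.6 0.3
7.07 2.01 0.26 0.13 4.68 0.56 0.96 1.28
2 0.2 0.2 0.6
9.49 8.51 9.67 7.92 10.19 9.14 8.96 8.64
10114 1
1 0.2 0.3 0.5
9.78 8.64 8.33 7.6 10.57 7.16 9.05 8.58
2 0.3 0.4 0.3
4.95 5.91 4.01 3.82 5.94 4.41 3.53 5.8
3 0.1 0.2 0.7
1.4 0.67 5.22 0.96 1.23 2.52 1.36 4.81"

lines  <- strsplit(sample_txt, "\n")[[1]]   # for a file: readLines("path/sample.txt")
fields <- strsplit(trimws(lines), "[[:space:]]+")

# Assumption: header lines are the ones with exactly two fields
is_header <- lengths(fields) == 2
unit_id   <- vapply(fields[is_header], `[[`, character(1), 1)

# cumsum() numbers each line by the header it follows; the factor levels keep
# units that have no group lines at all (they become empty list entries)
grp   <- factor(cumsum(is_header)[!is_header], levels = seq_along(unit_id))
units <- split(lapply(fields[!is_header], as.numeric), grp)
names(units) <- unit_id

str(units, max.level = 1)
```

Each element of units is then a list of numeric vectors, one per line of that unit, so per-unit calculations can be done with lapply(units, ...).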

snowball · September 27, 2019, 6:28am

Or maybe there is another way to look at this. Start from the first row: if the 2nd number on row 1 is 2, then the 2 lines after "10112 2" belong to this unit. In the 4th row the 2nd number is 5, so the 5 rows after "10114 5" belong to that unit. Next, in the row "10115 6", the 2nd number is 6, so the following 6 rows belong to that unit, and so on. Is it possible to do this in R? Thanks for your help.

sample2 <- "10112 2
1 0.1 0.6 0.3
7.07 2.01 0.26 0.13 4.68 0.56 0.96 1.28
10114 5
1 0.2 0.3 0.5
9.78 8.64 8.33 7.6 10.57 7.16 9.05 8.58
2 0.3 0.4 0.3
4.95 5.91 4.01 3.82 5.94 4.41 3.53 5.8
1.4 0.67 5.22 0.96 1.23 2.52 1.36 4.81
10115 6
1 0.2 0.4 0.4
10.06 10.47 8.29 9.54 11.11 9.22 9 9.89
2 0.3 0.5 0.2
6.14 3.2 0.9 4.72 5.06 4.22 1.29 2.38
2 0.2 0.2 0.6
9.49 8.51 9.67 7.92 10.19 9.14 8.96 8.64"
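
The count-based reading described above can be sketched with a simple loop: read a header, take the number of rows it announces, then jump to the next header. This is only an illustration under the assumption that every header has exactly two numbers and that the announced count is correct; lines, units and i are made-up names.

```r
sample2 <- "10112 2
1 0.1 0.6 0.3
7.07 2.01 0.26 0.13 4.68 0.56 0.96 1.28
10114 5
1 0.2 0.3 0.5
9.78 8.64 8.33 7.6 10.57 7.16 9.05 8.58
2 0.3 0.4 0.3
4.95 5.91 4.01 3.82 5.94 4.41 3.53 5.8
1.4 0.67 5.22 0.96 1.23 2.52 1.36 4.81
10115 6
1 0.2 0.4 0.4
10.06 10.47 8.29 9.54 11.11 9.22 9 9.89
2 0.3 0.5 0.2
6.14 3.2 0.9 4.72 5.06 4.22 1.29 2.38
2 0.2 0.2 0.6
9.49 8.51 9.67 7.92 10.19 9.14 8.96 8.64"

lines <- strsplit(sample2, "\n")[[1]]   # for a file: readLines("path/sample2.txt")
units <- list()
i <- 1
while (i <= length(lines)) {
  header <- scan(text = lines[i], quiet = TRUE)   # unit id and row count
  stopifnot(length(header) == 2)                  # assumed header layout
  n <- header[2]
  stopifnot(i + n <= length(lines))               # the announced rows must exist
  rows <- lines[i + seq_len(n)]                   # the n rows after the header
  units[[as.character(header[1])]] <-
    lapply(rows, function(r) scan(text = r, quiet = TRUE))
  i <- i + n + 1                                  # move to the next header
}

lengths(units)
```

The stopifnot() checks are there so that a wrong count in the file fails loudly instead of silently shifting every later unit.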

andresrcs · September 27, 2019, 2:16pm

I still don't understand your data structure, but maybe this takes us a step closer.

library(tidyverse)

sample2 <- "10112 2
1 0.1 0.6 0.3
7.07 2.01 0.26 0.13 4.68 0.56 0.96 1.28
10114 5
1 0.2 0.3 0.5
9.78 8.64 8.33 7.6 10.57 7.16 9.05 8.58
2 0.3 0.4 0.3
4.95 5.91 4.01 3.82 5.94 4.41 3.53 5.8
1.4 0.67 5.22 0.96 1.23 2.52 1.36 4.81
10115 6
1 0.2 0.4 0.4
10.06 10.47 8.29 9.54 11.11 9.22 9 9.89
2 0.3 0.5 0.2
6.14 3.2 0.9 4.72 5.06 4.22 1.29 2.38
2 0.2 0.2 0.6
9.49 8.51 9.67 7.92 10.19 9.14 8.96 8.64
"

fil <- tempfile("temp")
cat(sample2, file = fil)
values <- readLines(fil) 
df <- as.data.frame(values, stringsAsFactors = FALSE)

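df %>% 
    # flag the header lines ("<5-digit unit id> <count>") and keep only the id;
    # "\\s\\d+$" also strips counts with more than one digit
    mutate(unit = if_else(str_detect(values, "\\d{5}\\s\\d"), values, NA_character_),
           unit = str_remove(unit, "\\s\\d+$")) %>% 
    # carry each unit id down to the lines that follow it
    fill(unit, .direction = "down") %>% 
    # drop the header lines themselves
    filter(!str_detect(values, "\\d{5}\\s\\d")) %>% 
    # one row per number
    separate_rows(values, sep = "\\s", convert = TRUE)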
#>    values  unit
#> 1    1.00 10112
#> 2    0.10 10112
#> 3    0.60 10112
#> 4    0.30 10112
#> 5    7.07 10112
#> 6    2.01 10112
#> 7    0.26 10112
#> 8    0.13 10112
#> 9    4.68 10112
#> 10   0.56 10112
#> 11   0.96 10112
#> 12   1.28 10112
#> 13   1.00 10114
#> 14   0.20 10114
#> 15   0.30 10114
#> 16   0.50 10114
#> 17   9.78 10114
#> 18   8.64 10114
#> 19   8.33 10114
#> 20   7.60 10114
#> 21  10.57 10114
#> 22   7.16 10114
#> 23   9.05 10114
#> 24   8.58 10114
#> 25   2.00 10114
#> 26   0.30 10114
#> 27   0.40 10114
#> 28   0.30 10114
#> 29   4.95 10114
#> 30   5.91 10114
#> 31   4.01 10114
#> 32   3.82 10114
#> 33   5.94 10114
#> 34   4.41 10114
#> 35   3.53 10114
#> 36   5.80 10114
#> 37   1.40 10114
#> 38   0.67 10114
#> 39   5.22 10114
#> 40   0.96 10114
#> 41   1.23 10114
#> 42   2.52 10114
#> 43   1.36 10114
#> 44   4.81 10114
#> 45   1.00 10115
#> 46   0.20 10115
#> 47   0.40 10115
#> 48   0.40 10115
#> 49  10.06 10115
#> 50  10.47 10115
#> 51   8.29 10115
#> 52   9.54 10115
#> 53  11.11 10115
#> 54   9.22 10115
#> 55   9.00 10115
#> 56   9.89 10115
#> 57   2.00 10115
#> 58   0.30 10115
#> 59   0.50 10115
#> 60   0.20 10115
#> 61   6.14 10115
#> 62   3.20 10115
#> 63   0.90 10115
#> 64   4.72 10115
#> 65   5.06 10115
#> 66   4.22 10115
#> 67   1.29 10115
#> 68   2.38 10115
#> 69   2.00 10115
#> 70   0.20 10115
#> 71   0.20 10115
#> 72   0.60 10115
#> 73   9.49 10115
#> 74   8.51 10115
#> 75   9.67 10115
#> 76   7.92 10115
#> 77  10.19 10115
#> 78   9.14 10115
#> 79   8.96 10115
#> 80   8.64 10115

snowball · September 27, 2019, 2:46pm

This is not what I want to get, but thanks. I think Yarnabrina's answer could give me some hints.
I have adjusted my approach. If I read sample2 correctly using scan(), I could then change 10112, 10114, 10115 to other corresponding values and keep the remaining values in the file unchanged. The values 10112, 10114, 10115 could be put into a list. I will think about it. Thanks for your help.

andresrcs · September 27, 2019, 3:06pm

A small sample of your desired output would be useful for understanding your problem. Could you provide it?

snowball · September 27, 2019, 10:38pm

I want to do something with each whole unit. For example: copy each unit into two new units and change the unit code of each copy, as shown below. The code has to recognize the unit code first. The unit code is the first column in the rows with only two numbers, such as 10112, 10114, 10115 in sample2. Thanks for any suggestions.

sample2 <- "10112 2
1 0.1 0.6 0.3
7.07 2.01 0.26 0.13 4.68 0.56 0.96 1.28
10114 5
1 0.2 0.3 0.5
9.78 8.64 8.33 7.6 10.57 7.16 9.05 8.58
2 0.3 0.4 0.3
4.95 5.91 4.01 3.82 5.94 4.41 3.53 5.8
1.4 0.67 5.22 0.96 1.23 2.52 1.36 4.81
10115 6
1 0.2 0.4 0.4
10.06 10.47 8.29 9.54 11.11 9.22 9 9.89
2 0.3 0.5 0.2
6.14 3.2 0.9 4.72 5.06 4.22 1.29 2.38
2 0.2 0.2 0.6
9.49 8.51 9.67 7.92 10.19 9.14 8.96 8.64"

desired_output <- "101121 2
1 0.1 0.6 0.3
7.07 2.01 0.26 0.13 4.68 0.56 0.96 1.28
101122 2
1 0.1 0.6 0.3
7.07 2.01 0.26 0.13 4.68 0.56 0.96 1.28
101141 5
1 0.2 0.3 0.5
9.78 8.64 8.33 7.6 10.57 7.16 9.05 8.58
2 0.3 0.4 0.3
4.95 5.91 4.01 3.82 5.94 4.41 3.53 5.8
1.4 0.67 5.22 0.96 1.23 2.52 1.36 4.81
101142 5
1 0.2 0.3 0.5
9.78 8.64 8.33 7.6 10.57 7.16 9.05 8.58
2 0.3 0.4 0.3
4.95 5.91 4.01 3.82 5.94 4.41 3.53 5.8
1.4 0.67 5.22 0.96 1.23 2.52 1.36 4.81
101151 6
1 0.2 0.4 0.4
10.06 10.47 8.29 9.54 11.11 9.22 9 9.89
2 0.3 0.5 0.2
6.14 3.2 0.9 4.72 5.06 4.22 1.29 2.38
2 0.2 0.2 0.6
9.49 8.51 9.67 7.92 10.19 9.14 8.96 8.64
101152 6
1 0.2 0.4 0.4
10.06 10.47 8.29 9.54 11.11 9.22 9 9.89
2 0.3 0.5 0.2
6.14 3.2 0.9 4.72 5.06 4.22 1.29 2.38
2 0.2 0.2 0.6
9.49 8.51 9.67 7.92 10.19 9.14 8.96 8.64"
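
A rough sketch of how the desired output above could be produced, assuming again that a header is a line with exactly two fields and that each copy's code is the original code with 1 or 2 appended (as in desired_output). recode_block() and out_file are hypothetical names, and only the first two units of sample2 are inlined to keep the example short.

```r
sample2 <- "10112 2
1 0.1 0.6 0.3
7.07 2.01 0.26 0.13 4.68 0.56 0.96 1.28
10114 5
1 0.2 0.3 0.5
9.78 8.64 8.33 7.6 10.57 7.16 9.05 8.58
2 0.3 0.4 0.3
4.95 5.91 4.01 3.82 5.94 4.41 3.53 5.8
1.4 0.67 5.22 0.96 1.23 2.52 1.36 4.81"

lines <- strsplit(sample2, "\n")[[1]]   # for a file: readLines("path/sample2.txt")

# Assumption: header lines are the ones with exactly two fields
is_header <- lengths(strsplit(trimws(lines), "[[:space:]]+")) == 2

# One block per unit: the header line plus every line up to the next header
blocks <- split(lines, cumsum(is_header))

# Hypothetical helper: append a suffix to the unit code in the header line
recode_block <- function(block, suffix) {
  header <- strsplit(trimws(block[1]), "[[:space:]]+")[[1]]
  block[1] <- paste(paste0(header[1], suffix), header[2])
  block
}

# Two recoded copies of every unit, in the original order
out <- unlist(lapply(blocks, function(b) {
  c(recode_block(b, 1), recode_block(b, 2))
}), use.names = FALSE)

# Write the result as plain text, one line per row, like the input file
out_file <- tempfile(fileext = ".txt")
writeLines(out, out_file)
cat(out, sep = "\n")
```

The data lines are never parsed as numbers here, so they are written back exactly as they were read.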

andresrcs · September 27, 2019, 11:50pm

I'm confused. In your original post you asked how to read the data into R as rows with a "unit" identifier (which is what my previous example does), but now it seems you want to manipulate a character string, since your desired output is not a data frame with rows but another character string. If that is what you actually want, I recommend opening a new topic for it, since the title of this one will be misleading for anyone trying to help.

snowball · September 28, 2019, 12:39am

I thought describing the question this way might be clearer, but I have opened another topic here: "How to manipulate character string for conditional rows". Thanks.

snowball · September 28, 2019, 3:40am

Thanks for your solution. I want a list that is displayed as a string. I do not understand what this means?

In fact, my question is how to read in this file and then copy the units and change all unit IDs to other values, but that question is in a new post now. I also want to write the result to a txt file in the same format as the initial file. Thanks for your help.
