What is the best way to copy/paste table/dataframe from internet into R without too much cleaning ?

Hi,
There is for example a table over there:

obraz

https://stackoverflow.com/questions/61112899/how-to-restructure-dataframe-having-multiple-variable-by-keeping-some-variable-s

I have tried to use datapasta, but everything landed in one column.
A function read.table(textConnection(.......) says it is not allowed to use the same rownames.

How do I do it without too much hassle and cleaning etc. Sometimes we do not have an access to dataframe creation code, but a topic is interesting enough in order to follow it.

2 Likes

My favorite way is by using read.table(text = "copy-paste-here", header =TRUE) as follows:

dat = read.table(text ="seq      count  percentage   Marking     count     Percentage     batch_no   count    Percentage
FRD      1      12.50%       S1          2         25.00%         6          1        12.50%
FHL      1      12.50%       S2          1         12.50%         7          2        25.00%
ABC      2      25.00%       S3          1         12.50%         8          2        25.00%
DEF      1      12.50%       Hold        2         25.00%         9          1        12.50%
XYZ      1      12.50%       NA          1         12.50%         NA         1        12.50%
ZZZ      1      12.50%       (Blank)     1         12.50%         (Blank)    1        12.50%
FRD      1      12.50%         -         -           -             -         -           -
NA       1      12.50%         -         -           -             -         -           -
(Blank)  0      0.00%          -         -           -             -         -           -
Total    8      112.50%        -         8         100.00%         -         8         100.00%",   header =TRUE)

I tried this on RStudio as well as on Run R code online and it works well. It automatically append numbers to duplicate column names.

Here is the result:

dat
       seq count percentage Marking count.1 Percentage batch_no count.2 Percentage.1
1      FRD     1     12.50%      S1       2     25.00%        6       1       12.50%
2      FHL     1     12.50%      S2       1     12.50%        7       2       25.00%
3      ABC     2     25.00%      S3       1     12.50%        8       2       25.00%
4      DEF     1     12.50%    Hold       2     25.00%        9       1       12.50%
5      XYZ     1     12.50%    <NA>       1     12.50%     <NA>       1       12.50%
6      ZZZ     1     12.50% (Blank)       1     12.50%  (Blank)       1       12.50%
7      FRD     1     12.50%       -       -          -        -       -            -
8     <NA>     1     12.50%       -       -          -        -       -            -
9  (Blank)     0      0.00%       -       -          -        -       -            -
10   Total     8    112.50%       -       8    100.00%        -       8      100.00%

Check the structure:

str(dat)
'data.frame':	10 obs. of  9 variables:
 $ seq         : chr  "FRD" "FHL" "ABC" "DEF" ...
 $ count       : int  1 1 2 1 1 1 1 1 0 8
 $ percentage  : chr  "12.50%" "12.50%" "25.00%" "12.50%" ...
 $ Marking     : chr  "S1" "S2" "S3" "Hold" ...
 $ count.1     : chr  "2" "1" "1" "2" ...
 $ Percentage  : chr  "25.00%" "12.50%" "12.50%" "25.00%" ...
 $ batch_no    : chr  "6" "7" "8" "9" ...
 $ count.2     : chr  "1" "2" "2" "1" ...
 $ Percentage.1: chr  "12.50%" "25.00%" "25.00%" "12.50%" ...
1 Like

Can you please describe what did you do step-by-step ?
I tried your code and I neither was able to replicate table from SO nor your table.
I mean that table:
obraz

I think there is no "best" method that will work in any situation since tables can be messy in a lot of ways but I get very good results with little to no extra work by using this package

For example on the SO topic you have linked, I only have to use a multi cursor to bind the timestamp object as a single string, just a couple of seconds of extra work.

library(read.so)

read_so("uid     Date                  batch_no       marking       seq
K-1     16/03/2020_12:11:33  7              S1            FRD
K-1     16/03/2020_12:11:33  7              S1            FHL
K-2     16/03/2020_12:11:33  8              SE_hold1      ABC
K-3     16/03/2020_12:11:33  9              SD_hold2      DEF
K-4     16/03/2020_12:11:33  8              S1            XYZ
K-5     16/03/2020_12:11:33                 NA            ABC
K-6     16/03/2020_12:11:33  7                            ZZZ
K-7     16/03/2020_12:11:33  NA             S2            NA
K-8     16/03/2020_12:11:33  6              S3            FRD")
#> Warning: 2 parsing failures.
#> row col  expected    actual         file
#>   6  -- 5 columns 4 columns literal data
#>   7  -- 5 columns 4 columns literal data
#> # A tibble: 9 × 5
#>   uid   Date                batch_no marking  seq  
#>   <chr> <chr>                  <dbl> <chr>    <chr>
#> 1 K-1   16/03/2020_12:11:33        7 S1       FRD  
#> 2 K-1   16/03/2020_12:11:33        7 S1       FHL  
#> 3 K-2   16/03/2020_12:11:33        8 SE_hold1 ABC  
#> 4 K-3   16/03/2020_12:11:33        9 SD_hold2 DEF  
#> 5 K-4   16/03/2020_12:11:33        8 S1       XYZ  
#> 6 K-5   16/03/2020_12:11:33       NA ABC      <NA> 
#> 7 K-6   16/03/2020_12:11:33        7 ZZZ      <NA> 
#> 8 K-7   16/03/2020_12:11:33       NA S2       <NA> 
#> 9 K-8   16/03/2020_12:11:33        6 S3       FRD

Created on 2022-06-18 by the reprex package (v2.0.1)

I get this error msg from read.table(text = "copy-paste-here", header =TRUE)

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 6 did not have 6 elements
Calls: read.table -> scan
Execution halted

Yes, exactly, this table is difficult to reading-in as has got an empty cells.

I tried data.table::fread() function, but it does not recognise Date column properly meaning it separates Date column into two columns. In one there is a date and time is placed in the other. Maybe somebody knows how to tweak fread() to work properly. Another thing is fread() stops at empty lines and only part of dataframe is read in.

Thank you andresrcs,

I didn't know library(read.so ) but I would like to show another example which I do not fully understand.
What is happening here:
Example from this link:
https://stackoverflow.com/questions/9341865/alternatives-to-statsreshape

library(read.so)
library(stats)

tmp <- read_so(
 "id val val2 cat
1   1   14   a
1   2   13   b
2   3   12   b
2   4   11   a")

tmp2 <- tmp

tmp2$t <- ave(tmp2$val, tmp2$id, FUN=seq_along)

stats::reshape(tmp2, idvar= "id", timevar = "t", direction = "wide")
#> # A tibble: 2 × 4
#>      id `val.c(1, 2)` `val2.c(1, 2)` `cat.c(1, 2)`
#>   <dbl>         <dbl>          <dbl> <chr>        
#> 1     1            NA             NA <NA>         
#> 2     2            NA             NA <NA>

but when I convert tmp2 to data.table object:

``` r
library(data.table)
library(read.so)
library(stats)


tmp <- read_so(
 "id val val2 cat
1   1   14   a
1   2   13   b
2   3   12   b
2   4   11   a")

tmp2 <- tmp

tmp2$t <- ave(tmp2$val, tmp2$id, FUN=seq_along)

tmp2dt <- setDT(tmp2)

class(tmp2)
#> [1] "data.table" "data.frame"

stats::reshape(tmp2dt, idvar= "id", timevar = "t", direction = "wide")
#>    id val.1 val2.1 cat.1 val.2 val2.2 cat.2
#> 1:  1     1     14     a     2     13     b
#> 2:  2     3     12     b     4     11     a

Created on 2022-06-19 by the reprex package (v2.0.1)

which gives us completely different results:

Additionally what is interesting here, when I create tmp2, its class is as follows:

class(tmp2)
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 

then when I convert it to data.table object named tmp2dt suddenly tmp2 becomes data.table object as well (allowing for correct calculations with tmp2 as well):

tmp2dt <- setDT(tmp2)

>class(tmp2)
[1] "data.table" "data.frame"

so I would like to kindly ask why is that ?

Reading the documentation of read.so I discovered that the read_so() function returns a tibble object (I wasn't aware of this because I rarely use base R functions for reshaping) which apparently is not compatible with stats::reshape() but the read.so() function returns a data frame object so there is no problem with reshape().

library(read.so)
library(stats)

tmp <- read.so(
    "id val val2 cat
1   1   14   a
1   2   13   b
2   3   12   b
2   4   11   a")

class(tmp)
#> [1] "data.frame"

tmp2 <- tmp
tmp2$t <- ave(tmp2$val, tmp2$id, FUN=seq_along)

stats::reshape(tmp2,
               idvar= "id",
               timevar = "t",
               direction = "wide")
#>   id val.1 val2.1 cat.1 val.2 val2.2 cat.2
#> 1  1     1     14     a     2     13     b
#> 3  2     3     12     b     4     11     a

Created on 2022-06-19 by the reprex package (v2.0.1)

Thank you Andresrcs and All for interesting discussion.

maybe it's not relevant anymore, but the datapasta package was designed specifically for handling data copy/pasted from websites GitHub - MilesMcBain/datapasta: On top of spaghetti, all covered in cheese.....

Also, you can convert from tibbles to dataframes with as.data.frame so you would have been able to use read_so too :slight_smile:

2 Likes

Thank you aghaynes,
I tried datapasta and it turned out to work poorly especially when tables are getting a bit complicated.

the table in the original example is a fixed width format, so

readr::read_fwf(
"uid     Date                  batch_no       marking       seq
K-1     16/03/2020  12:11:33  7              S1            FRD
K-1     16/03/2020  12:11:33  7              S1            FHL
K-2     16/03/2020  12:11:33  8              SE_hold1      ABC
K-3     16/03/2020  12:11:33  9              SD_hold2      DEF
K-4     16/03/2020  12:11:33  8              S1            XYZ
K-5     16/03/2020  12:11:33                 NA            ABC
K-6     16/03/2020  12:11:33  7                            ZZZ
K-7     16/03/2020  12:11:33  NA             S2            NA
K-8     16/03/2020  12:11:33  6              S3            FRD")

would work.
but it does underline why on this forum we emphasise the reprex guide, and suggest people share their datasets via datapasta or dput, rather than arbitrarily showing 'images' of their data.

1 Like

Thank you,
I didn't know, it works perfectly.

readr::read_fwf()
1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.