Table with a lot of special signs reorder HELP for a desperate

Hey,
i am new here and i am just a R-Studio basics user. But I am very impressed by the possibilitys R Studio got!

I hope somebody can help me with the following task.

My table looks like:
.txt

H Ti ID mer Z L MT Mik F E Hst F_E e R xy
27 Christina 3 587 54572 3 5 37,9 5,99 3,58 250 1,7** 47,1 5
20 3 598 27377 2 7 38,7 3,47 3,45 146 1 36,5 4
192 Betina 3 608 33915 1 8 24,6 6,19 3,5 276 1,8** 31 5
2 Antje 3 608 33881 1 10 32,5 4,18 3,31 126 1,3 33 4
65 Soleika 3 608 33887 1 11 32,1 4,29 3,08 140 1,4 32,6 4
179 Carmen 3 587 54567 3 11 41,3 4,53 3,15 125 1,4 43,2 4
17 Gundula 3 598 27413 2 13 44,1 3,68 3,42 211 1,1 42,6 5
71 Annika 3 598 27454 1 14 30,7 5,23 3,22 255 1,6** 34,8 5
89 Erna 3 608 33888 1 16 35,8 4,03 3,16 111 1,3 35,4 4
152 Erle 3 598 27408 2 19 55 3,33 2,97 145 1,1 49,3 1
136 Emily 3 587 54569 3 23 56 4,03 3,22 200 1,3 55,6 5
56 Paola 3 587 54607 3 24 46,1 4,04 3,01 238 1,3 45,2 2
121 3 598 27376 2 25 35,9 3,67 3,67 210 1 35,2 5
98 3 587 54558 3 32 40,9 3,3 3,09 153 1,1 36,8 5
182 Anja 3 608 33892 1 35 ?39,0 3,41 2,98 212 1,1 35,3 2
93 Paola 3 587 54576 3 56 59,2 2,6 2,95 156 0,9 * 48 2

There are following basics and problems by coloumn:

H. these is ok
Ti,. could be a name or a n.a but n.a is not recognized when saved as .txt so values are moved from the following coloumn "ID" to these coloumn
ID. is always a "3" and i thing a good argument for coding??
mer. always a count of three numbers but not the same in each line
Z. always a count of five numbers but not the same in each line
L. always one number but not the same in each line (sometimes it could go up to two numbers - very rare)
MT. a count from one up to three numbers but not the same in each line (these increases from the first to the last line - here 5 (first line H=27) .... 56 (last line H=93)
Mik. a count of up to three numbers. One or two in front and one after a "," and sometimes there is an aditional "↑" or "↓" in the front of the number (here it is a "?" not the "↑" - e.g. line H=182)
F. a count of three numbers. One in front and one or tow after a ","
E. a count of three numbers. One in front and one or tow after a ","
Hst. a count of three numbers sometimes aditional "" after the count
F_E a count of tow numbers separated by "," one in front one after the "," - sometimes aditional " *" (recognize the space in front of *) or "
" (without a space) after the count
e. a count of up to three numbers. One or two in front and one after a ","
R. a singel number
xy. contains the number belonging to coloumn R which are moved through saving as .txt and would not exist in the original .pdf

In the end all "↓" or "↑" (here"?") or " *" or "**" should be removed and the counts should be in the correct coloumn.... as:

H Ti ID mer Z L MT Mik F E Hst F_E e R xy
27 Christina 3 587 54572 3 5 37,9 5,99 3,58 250 1,7 47,1 5
20 Christinaa 3 598 27377 2 7 38,7 3,47 3,45 146 1 36,5 4
192 Betina 3 608 33915 1 8 24,6 6,19 3,5 276 1,8 31 5
2 Antje 3 608 33881 1 10 32,5 4,18 3,31 126 1,3 33 4
65 Soleika 3 608 33887 1 11 32,1 4,29 3,08 140 1,4 32,6 4
179 Carmen 3 587 54567 3 11 41,3 4,53 3,15 125 1,4 43,2 4
17 Gundula 3 598 27413 2 13 44,1 3,68 3,42 211 1,1 42,6 5
71 Annika 3 598 27454 1 14 30,7 5,23 3,22 255 1,6 34,8 5
89 Erna 3 608 33888 1 16 35,8 4,03 3,16 111 1,3 35,4 4
152 Erle 3 598 27408 2 19 55 3,33 2,97 145 1,1 49,3 1
136 Emily 3 587 54569 3 23 56 4,03 3,22 200 1,3 55,6 5
56 Paola 3 587 54607 3 24 46,1 4,04 3,01 238 1,3 45,2 2
121 Christinaaa 3 598 27376 2 25 35,9 3,67 3,67 210 1 35,2 5
98 Christinaaaa 3 587 54558 3 32 40,9 3,3 3,09 153 1,1 36,8 5
182 Anja 3 608 33892 1 35 39,0 3,41 2,98 212 1,1 35,3 2
93 Paola 3 587 54576 3 56 59,2 2,6 2,95 156 0,9 48 2

all "n.a" Ti are filled with "Christina" plus an aditional "a" retrospective the last "Christina" up in the table

If there is anyone who can help... please !
I am a newb................

Best regards!

If I was you I would go back to the data source from which the txt was extracted and extract it in a structured way, like CSV for example.

ok, i did it with csv but i got the same structure... same problems

i need some command for:

each line:
a. take all values in line from "3" on in coloumn 2 or 3 and move them so that they start with "3" in coloumn 3
b. remove all "?" and all " *" and all "**"
c. and the last step would be fill up the emergine gaps with the values followig in the line.......

that would solve it, but i am not able to program that... at least i wasnt for 8 hours or more........ :wink:

Hey,
the original is a pdf. I already tried to read the table from pdf directly, but I was not able to do it in a good wy. So I copied it to excel and did "text to coloumn". After that I saved it as txt or csv.
The document consists of 10 pages with different tables. I copied a page of a table witch is most important and similar to my own. That style is the basis of my one (the version page up) but I changed the coloumn names. But I dont need to change them.

It would be so helpful, if there is a way to sort it - if its read as pdf, csv or txt

BY NOW I CANT UPLOAD CAUSE I AM NEW... I WILL DO TAHT SOON

Best regards

May you can use this link:


Best regards!

I believe the shared PDF does not contain text, but is entirely image based.
Had you used OCR to 'extract' the text? Im confused on this point...

Anyway, the way you represented example data to this forum is copy and pasteable, so i pasted it to a local file for myself. the raw text once copied appeared tab delimited, so i globally replaced tabs with semi-colons. and resaved the file. I could probably have kept it with tabs and picked a read function which recognises tab delimiters, but i didnt.

once i have semi-colon delimited file I used this code:

library(tidyverse)
almost_csv<- read.delim("forforum_notquite.csv",
                        sep = ";",
                        stringsAsFactors = FALSE,
                        colClasses = "character") %>% 
  as_tibble()


#realign TI/and ID with 3 rule
almost_csv2<- as.matrix(almost_csv )

for (r in 1:nrow(almost_csv2)){
  
  if(almost_csv2[r,2] =='3' )
  {
    almost_csv2[r,3:15] <- almost_csv2[r,2:14]
    almost_csv2[r,2] <- NA_character_
  }
}

#remove all asterix ,and turn all commas to decimals, then H,ID,mer,Z,L,MT,R to integer, the rest to double
almost_csv3 <- as_tibble(almost_csv2) %>% 
  transmute_all(~stringr::str_replace_all(string=.,pattern ="\\*" ,replacement = "")) %>%
  transmute_all(~stringr::str_replace_all(string=.,pattern ="\\," ,replacement = ".")) %>%
  mutate_at(.vars = c("H","ID","mer","Z","L","MT","R","xy"),
            .funs = as.integer) %>%
  mutate_at(.vars = c("Mik", "F",     "E",     "Hst",   "F_E",   "e"     ),
            .funs = as.double)
> almost_csv3
# A tibble: 16 x 15
       H Ti           ID   mer     Z     L    MT   Mik     F     E   Hst   F_E     e     R    xy
   <int> <chr>     <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
 1    27 Christina     3   587 54572     3     5  37.9  5.99  3.58   250   1.7  47.1     5    NA
 2    20 NA            3   598 27377     2     7  38.7  3.47  3.45   146   1    36.5     4    NA
 3   192 Betina        3   608 33915     1     8  24.6  6.19  3.5    276   1.8  31       5    NA
 4     2 Antje         3   608 33881     1    10  32.5  4.18  3.31   126   1.3  33       4    NA
 5    65 Soleika       3   608 33887     1    11  32.1  4.29  3.08   140   1.4  32.6     4    NA
 6   179 Carmen        3   587 54567     3    11  41.3  4.53  3.15   125   1.4  43.2     4    NA
 7    17 Gundula       3   598 27413     2    13  44.1  3.68  3.42   211   1.1  42.6     5    NA
 8    71 Annika        3   598 27454     1    14  30.7  5.23  3.22   255   1.6  34.8     5    NA
 9    89 Erna          3   608 33888     1    16  35.8  4.03  3.16   111   1.3  35.4     4    NA
10   152 Erle          3   598 27408     2    19  55    3.33  2.97   145   1.1  49.3     1    NA
11   136 Emily         3   587 54569     3    23  56    4.03  3.22   200   1.3  55.6     5    NA
12    56 Paola         3   587 54607     3    24  46.1  4.04  3.01   238   1.3  45.2     2    NA
13   121 NA            3   598 27376     2    25  35.9  3.67  3.67   210   1    35.2     5    NA
14    98 NA            3   587 54558     3    32  40.9  3.3   3.09   153   1.1  36.8     5    NA
15   182 Anja          3   608 33892     1    35  NA    3.41  2.98   212   1.1  35.3     2    NA
16    93 Paola         3   587 54576     3    56  59.2  2.6   2.95   156   0.9  NA      48     2

Hey nirgrahamuk,
What a great function! Thanks! looks quite good... but line 15 and 16 are still wrong.

182 Anja 3 608 33892 1 35 39,0 3,41 2,98 212 1,1 35,3 2
93 Paola 3 587 54576 3 56 59,2 2,6 2,95 156 0,9 48 2

Somehow in lien 15 the 39 Mik are gone. And line 16 the numbers from the last two coloumns should be one to the left... so coloumn "xy" ist only NA ... "xy" can be removed after transfomation...

library(tidyverse)
almost_csv<- read.delim("forforum_notquite.csv",
                        sep = ";",
                        stringsAsFactors = FALSE,
                        colClasses = "character") %>% 
  as_tibble()


#realign TI/and ID with 3 rule
almost_csv2<- as.matrix(almost_csv )

for (r in 1:nrow(almost_csv2)){
  
  if(almost_csv2[r,2] =='3' )
  {
    almost_csv2[r,3:15] <- almost_csv2[r,2:14]
    almost_csv2[r,2] <- NA_character_
  }
}

#manual move of r,XY for row 16 to e , R
almost_csv2[16,13:14]  <- almost_csv2[16,14:15] 
#remove all asterix ,and turn all commas to decimals, then H,ID,mer,Z,L,MT,R to integer, the rest to double
almost_csv3 <- as_tibble(almost_csv2) %>% 
  transmute_all(~stringr::str_replace_all(string=.,pattern ="\\?" ,replacement = "")) %>%
  transmute_all(~stringr::str_replace_all(string=.,pattern ="\\*" ,replacement = "")) %>%
  transmute_all(~stringr::str_replace_all(string=.,pattern ="\\," ,replacement = ".")) %>%
  mutate_at(.vars = c("H","ID","mer","Z","L","MT","R","xy"),
            .funs = as.integer) %>%
  mutate_at(.vars = c("Mik", "F",     "E",     "Hst",   "F_E",   "e"     ),
            .funs = as.double) %>% select (-xy)

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

I think the problem begins with the issue about the NAs which by the way must be able to export in your .txt. If you have your original file and can share with us a reproducible example of your data (or full data would be better) maybe I can help you a little more.

maybe upload the document into a public repo on github (for free) and simply link to it ?