Filtering dataframe to relocate row values

olmedoalonso · April 30, 2021, 2:11pm

I am new in R and in this forum. I would appreciate anyone´s help. I am trying to manipulate a dataframe to reorder the values that comes from a PDF table. I did the PDF extract tables and it looks as follows.

#Extracting the PDF table
pathway <- file.path("./", "0005-LLANO_BONITO_sept_2020.pdf")
llb_sept_20 <- extract_tables(pathway)
llb_sept_20 <- do.call(rbind, llb_sept_20)
llb_sept_20 <- as.data.frame(llb_sept_20)

#Cleaning the dataframe
llb_sept_20_a <- llb_sept_20 %>% select(1,2)
llb_sept_20_b <- llb_sept_20 %>% select(3,4)
llb_sept_20_a_1 <- llb_sept_20_a %>% filter(row_number() %% 7 != 3)
llb_sept_20_a_2 <- llb_sept_20_a_1 %>% filter(row_number() %% 6 != 4)

I need to reorder the values in (2:4, V2); (7:9, V2); (12:14, V2) to be part of the V1 (#ID) look like this:

I would appreciate your help.

FJCC · April 30, 2021, 3:25pm

Since I do not have your data, I invented a small data set.. The fill() function fills in missing values with the most recent non-missing value so that every row in V1 has a value. The group_by() and Mutate() functions work together to make a new column named ColName that stores the names of the columns that will be made in the next step. You might want to make names that are more informative than V2, V3, etc. The pivot_wider() function pivots the data so there is one row for each value of V1.

library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
DF <- data.frame(V1 = c("A", NA,NA,NA,"B",NA,NA,NA,"C",NA,NA, NA),
                 V2 = c("AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH", "II", "JJ", "KK", "LL"))
DF
#>      V1 V2
#> 1     A AA
#> 2  <NA> BB
#> 3  <NA> CC
#> 4  <NA> DD
#> 5     B EE
#> 6  <NA> FF
#> 7  <NA> GG
#> 8  <NA> HH
#> 9     C II
#> 10 <NA> JJ
#> 11 <NA> KK
#> 12 <NA> LL

DFpivot <- DF %>% fill(V1) %>%
  group_by(V1) %>% 
  mutate(ColName = paste0("V",row_number() + 1)) %>% 
  pivot_wider(names_from = ColName, values_from = V2)

DFpivot
#> # A tibble: 3 x 5
#> # Groups:   V1 [3]
#>   V1    V2    V3    V4    V5   
#>   <chr> <chr> <chr> <chr> <chr>
#> 1 A     AA    BB    CC    DD   
#> 2 B     EE    FF    GG    HH   
#> 3 C     II    JJ    KK    LL

^{Created on 2021-04-30 by the reprex package (v0.3.0)}

olmedoalonso · April 30, 2021, 5:01pm

Thanks very much. I will check it and let you know how I did.

olmedoalonso · April 30, 2021, 6:06pm

It worked just amazingly well!!! Thanks very much for your quick support!!! God bless you!!!

olmedoalonso · May 1, 2021, 2:02am

I tried to do the same with a different set of values and did not get the expected results. I modified the code in R to the new variables but it did not work. Let me show you...

This is the expected result:

I used the following code (just modified the variables):

llb_sept_20_b_4 <- llb_sept_20_b_3 %>% fill(V3) %>% group_by(V3) %>% mutate(ColName = paste0("V",row_number() + 1)) %>% pivot_wider(names_from = ColName, values_from = V4)

But I got the following message:

Error: Failed to create output due to bad names.

Choose another strategy with names_repair
Run rlang::last_error() to see where the error occurred.

FJCC · May 1, 2021, 3:20am

The problem is that the code is trying to make duplicate column names. This part of the code,

mutate(ColName = paste0("V",row_number() + 1))

makes the column names V2, V3, V4, V5 if each value of the original V3 spans four rows. You already have a column named V3 and I suppose the function does not allow duplicate column names. The easiest fix is to name the new columns with a letter other than V. Try

llb_sept_20_b_4 <- llb_sept_20_b_3 %>% fill(V3) %>% 
    group_by(V3) %>% 
    mutate(ColName = paste0("X",row_number())) %>% 
    pivot_wider(names_from = ColName, values_from = V4)

Notice that the new column numbering now starts from 1 since I dropped the + 1 from the paste0 function that makes the column names. That was there only to avoid a name conflict with the V1 column in the original data.

olmedoalonso · May 2, 2021, 3:21pm

Thanks for your quick response. Yesterday we had a power outage due to a thunderstorm in my area. So I could not get back to you.

I ran the code and worked fine. The only issue is that it added other columns, from X5 to X28, filled with NA values. Let me show you the image. I appreciate your help.

FJCC · May 2, 2021, 8:57pm

Are those extra columns really NA in all rows? I expect somewhere there are some non-NA values. Find the V3 values of those rows and then inspect the original data for the reason so many rows are assigned to that V3. You may have missing values of V3 so that the fill() function keeps using the previous V3 value until a new one is encountered. You can find where you do not have NA with

which(!is.na(llb_sep_20_b_4$X28))

olmedoalonso · May 2, 2021, 11:49pm

Thanks for the help. You know, as you might have seen in the images, I split a pdf table in half to work it out better. Well, at the end of the last 3rd and 4th columns there were missing ID´s values. That´s why the NA columns from X5 to X28. But I did not spent time with it so I selected the columns I needed to work with. I solved it that way.

Thanks very much for your help.

olmedoalonso · May 3, 2021, 2:33pm

But now I face another issue. I have a similar pdf file that looks like the previous one and when I ran the function extract_tables() and do.call() an error comes up. I attach images of this issue. If you could help me with this. Thanks in advance.

pathway_sjb <- file.path("./", "0004-SAN_JUAN_BAUTISTA_sept_2020.pdf")
sjb_sept_20 <- extract_tables(pathway_sjb)
sjb_sept_20 <- do.call(rbind, sjb_sept_20)

FJCC · May 3, 2021, 3:04pm

This seems to be an entirely different problem having to do with Illegal Reflective Access by RJavaTools. Please start another topic with a descriptive title so others with knowledge about that will see it. I know nothing about RJavaTools.

olmedoalonso · May 3, 2021, 3:08pm

Ok. Thanks very much.

olmedoalonso · May 6, 2021, 3:11am

Sorry I write again. But in this case the error is about the number of columns do not match and send me to see argument 97. Actually this argument is at the end of the data frame and contains 13 rows and 2 columns and the 96 previous arguments have 83 rows and 4 columns. My question is now, how can I balance this difference in the 97 argument with this column difference?

See the image below.

Thanks again for your help...

FJCC · May 6, 2021, 4:12am

Judging from one of your earlier posts, each table in your document consists of two pairs of column. Each pair has No. Cedula and Nombres y Appellidos. I expect the last table only has one of those pairs because that was all that was needed to complete the data. It is a bad idea to have a data frame with columns with duplicate names, so the best solution is probably to rearrange all of the data frames that have 4 columns so that they have two columns. That is, take columns 3 & 4 and append them at the end of columns 1 & 2 so that the new data frame is 166 x 2 instead of 83 x 4. A function to do that would look like

MakeTwoCol <- function(DF) {
  if(ncol(DF) == 4) {
    return(rbind(DF[, 1:2], DF[, 3:4])
  } else {
    return(DF)
  }

I just wrote that without testing it, so it may well not be exactly right. After you run that on all of the 97 data frames, you will have a list where all of the data frames have two columns and you can rbind them.

olmedoalonso · May 6, 2021, 1:53pm

That is correct. I ran it and gave me an error. See the image that follows. Again, I appreciate your help.

FJCC · May 6, 2021, 5:16pm

Try

sjb_sept_20_2col <- lapply(sjb_sept_20, MakeTwoCol)

This uses lapply to run MakeTwoCol on every element of sjb_sep_20

olmedoalonso · May 7, 2021, 5:37am

I got it this time. I finally completed part of the task I am doing. If I need any other help with the other part I will write through here. Thanks again.

system · May 14, 2021, 5:38am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.