Creating a data frame from a PDF

I am trying to clean up data that I pulled directly from a website. It is in PDF format, and when I read it into R the result is extremely messy. I would like to automate the process so that whenever the site updates the data, R reads the file and converts it into a data frame that I can use.

Hi Ryan, welcome!

To help us help you, could you please turn this into a proper reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:


This is probably so fragile that it will not work with a different file. I hope it gives you a start on solving the general case. Note that all of the columns in the final data frame are factors! Most will probably need to be converted to numeric.

library(pdftools)
#> Warning: package 'pdftools' was built under R version 3.5.3
library(stringr)
suppressPackageStartupMessages(library(dplyr))
# download.file() returns a status code, not the data, so the result is not assigned
download.file("http://www.mslc.com/Indiana/Resources/documents/ltcisrpt6.pdf",
              "ltcisrpt6.pdf", mode = "wb")
RevenuePatientDay <- pdf_text("ltcisrpt6.pdf")
RawPage <- str_split(RevenuePatientDay, "\\n") #break into lines
Hdr <- RawPage[[1]][9] #Define col names from the 9th line
Hdr <- str_replace(Hdr, "^ ", "") #remove leading space
Hdr <- str_replace(Hdr, "\\s+$", "") #remove trailing space
Hdr <- str_replace(Hdr, "For\\s+Profit", "For_Profit") #remove space within col name
Hdr <- str_split(Hdr, "\\s+")

Data <- RawPage[[1]][10:length(RawPage[[1]])] #get all rows after header
Data <- str_replace_all(Data, ",", "") #remove , from numbers
Data <- Data[!grepl("Revenues Per Patient Day", Data)] #remove text-only line (safe even if the line is absent)
Data <- str_replace_all(Data, "(\\w)\\s(\\w)", "\\1_\\2") #replace space with _
Data <- str_replace(Data, "\\s+$", "") #remove trailing space
Data <- Data[-length(Data)] #remove empty line at end
ForDF <- str_split(Data, "\\s+") #split each row into its 6 fields
Mat <- matrix(unlist(ForDF), byrow = TRUE, ncol = 6)
dfFinal <- as.data.frame(Mat)
colnames(dfFinal) <- Hdr[[1]]
dfFinal
#>    Number                          Description  State For_Profit
#> 1     142                       Beds_Available     98         99
#> 2     143             Total_Bed_Days_Available  35797      36222
#> 3     144                Medicaid_Patient_Days  16897      13850
#> 4     148                   Total_Patient_Days  26661      23816
#> 5     151                 Occupancy_Percentage 74.48%     65.75%
#> 6     152                 Medicaid_Utilization 63.38%     58.16%
#> 7     153                   Total_Hours_Worked 161533     147805
#> 8     158                     Hours_Worked_PPD   6.06       6.21
#> 9     160            Total_Number_of_Providers    525         27
#> 10    211                Routine_Daily_Service 278.25     281.37
#> 11    231                     Physical_Therapy  25.34      35.64
#> 12    232         Speech_and_Audiology_Therapy   7.76       9.76
#> 13    233                 Occupational_Therapy  24.53      34.42
#> 14    234                  Respiratory_Therapy   2.64       0.05
#> 15    235     Sale_of_Routine_Medical_Supplies   0.69       0.85
#> 16    236 Sale_of_Non-Routine_Medical_Supplies   4.19       1.52
#> 17    237                 X-Ray_and_Laboratory   1.34       2.94
#> 18    238                   Pharmacy_and_Drugs  13.91      11.40
#> 19    239     Parenteral_and_Enteral_Nutrition   0.14       0.00
#> 20    241                              Florist   0.00       0.00
#> 21    242                   Barber/Beauty_Shop   0.29       0.13
#> 22    243                     Vending_Machines   0.02       0.02
#> 23    244                   Personal_Purchases   0.01       0.04
#> 24    245   Meals_Sold_to_Guests_and_Employees   0.18       0.09
#> 25    246                       Activity_Sales   0.00       0.00
#> 26    247                    Investment_Income   0.63       0.11
#> 27    248                        Other_Revenue   3.29       3.33
#> 28    261                       Gross_Revenues 363.22     381.65
#> 29    262                       Less_Bad_Debts  -2.24      -6.00
#> 30    263  Less_Contractual_Charity_Allowances -78.78     -89.71
#> 31    267                Less_Other_Reductions  -0.29      -1.03
#> 32    268                         Net_Revenues 281.91     284.91
#>    Non-Profit Government
#> 1          59         99
#> 2       21598      36124
#> 3        6510      17322
#> 4       18878      27012
#> 5      87.41%     74.77%
#> 6      34.49%     64.13%
#> 7      167659     162145
#> 8        8.88       6.00
#> 9          12        486
#> 10     277.87     278.10
#> 11      37.65      24.62
#> 12       5.61       7.69
#> 13      32.63      23.90
#> 14       1.29       2.79
#> 15       0.57       0.69
#> 16       2.87       4.34
#> 17       2.68       1.24
#> 18      21.23      13.91
#> 19       0.00       0.15
#> 20       0.00       0.00
#> 21       1.05       0.29
#> 22       0.26       0.01
#> 23       0.00       0.01
#> 24       1.71       0.16
#> 25       0.54       0.00
#> 26      15.25       0.40
#> 27       7.67       3.21
#> 28     408.89     361.53
#> 29      -1.96      -2.07
#> 30     -47.57     -78.78
#> 31      -0.01      -0.26
#> 32     359.36     280.43

Created on 2019-05-24 by the reprex package (v0.2.1)
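
As a follow-up on the factor columns, here is a minimal sketch of the conversion, assuming a data frame shaped like dfFinal above. The small stand-in data frame and the helper name to_number are made up for illustration; the values are copied from the output above. Going through as.character() first matters, because as.numeric() applied directly to a factor returns the internal level codes rather than the printed values.

```r
# Stand-in for dfFinal with three rows copied from the output above.
# stringsAsFactors = TRUE mimics the factor columns produced in R 3.5.
dfFinal <- data.frame(
  Number      = c("142", "151", "262"),
  Description = c("Beds_Available", "Occupancy_Percentage", "Less_Bad_Debts"),
  State       = c("98", "74.48%", "-2.24"),
  For_Profit  = c("99", "65.75%", "-6.00"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: factor or character -> numeric.
# Strips "%" and "," before converting; "74.48%" becomes 74.48, not 0.7448.
to_number <- function(x) {
  as.numeric(gsub("[%,]", "", as.character(x)))
}

# Convert every column except the text description
value_cols <- setdiff(names(dfFinal), "Description")
dfFinal[value_cols] <- lapply(dfFinal[value_cols], to_number)
dfFinal$Description <- as.character(dfFinal$Description)

str(dfFinal)
```

Percentages are kept on the 0 to 100 scale here; divide by 100 afterwards if proportions are preferred.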

library(tidyverse)
library(ggplot2)
library(dplyr)
library(RColorBrewer)
library(cronR)
library(miniUI)
library(shiny)
library(shinyFiles)
library(pdftools)
library(tm)
#> Loading required package: NLP
#> 
#> Attaching package: 'NLP'
#> The following object is masked from 'package:ggplot2':
#> 
#>     annotate
library(xlsx)
#> Warning in system("/usr/libexec/java_home", intern = TRUE): running command
#> '/usr/libexec/java_home' had status 1
#> Error: package or namespace load failed for 'xlsx':
#>  .onLoad failed in loadNamespace() for 'rJava', details:
#>   call: dyn.load(file, DLLpath = DLLpath, ...)
#>   error: unable to load shared object '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/rJava/libs/rJava.so':
#>   dlopen(/Library/Frameworks/R.framework/Versions/3.5/Resources/library/rJava/libs/rJava.so, 6): Library not loaded: /Library/Java/JavaVirtualMachines/jdk-11.0.1.jdk/Contents/Home/lib/server/libjvm.dylib
#>   Referenced from: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rJava/libs/rJava.so
#>   Reason: image not found
library(readtext)
library(stringr)
library(plyr)
#> -------------------------------------------------------------------------
#> You have loaded plyr after dplyr - this is likely to cause problems.
#> If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
#> library(plyr); library(dplyr)
#> -------------------------------------------------------------------------
#> 
#> Attaching package: 'plyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     arrange, count, desc, failwith, id, mutate, rename, summarise,
#>     summarize
#> The following object is masked from 'package:purrr':
#> 
#>     compact
library(datapasta)
datapasta::df_paste(download.file("http://www.mslc.com/Indiana/Resources/documents/ltcisrpt6.pdf",
              "ltcisrpt6.pdf", mode = "wb"))
#> Could not format input_table as table. Unexpected class.
datapasta::df_paste(RevenuePatientDay <- pdf_text("ltcisrpt6.pdf"))
#> Could not format input_table as table. Unexpected class.
RevenuePatientDay
#> [1] "Sort By: Organization Type                              Myers and Stauffer LC                                             10/01/18\n                                                                                                                 Quarter:\n                                                           Indiana Medicaid\n                                                                                                                 Date:    12/05/18\n                                                  Long Term Care Information System\n                                                                                                                 Page:    8\n                                                      Statistical Data Per Facility\n Line                                                                               Proprietary       Voluntary\n Number       Description                                       State                For Profit       Non-Profit          Government\n142          Beds Available                                                  98                    99               59                99\n143          Total Bed Days Available                                    35,797               36,222           21,598             36,124\n144          Medicaid Patient Days                                       16,897               13,850             6,510            17,322\n148          Total Patient Days                                          26,661               23,816           18,878             27,012\n151          Occupancy Percentage                                       74.48%               65.75%           87.41%             74.77%\n152          Medicaid Utilization                                       63.38%               58.16%           34.49%             64.13%\n153          Total Hours Worked                                        161,533              147,805          167,659             162,145\n158          Hours 
Worked PPD                                              6.06                  6.21             8.88              6.00\n160          Total Number of Providers                                      525                    27               12               486\n                                                        Revenues Per Patient Day\n211          Routine Daily Service                                       278.25               281.37           277.87             278.10\n231          Physical Therapy                                             25.34                 35.64            37.65             24.62\n232          Speech and Audiology Therapy                                  7.76                  9.76             5.61              7.69\n233          Occupational Therapy                                         24.53                 34.42            32.63             23.90\n234          Respiratory Therapy                                           2.64                  0.05             1.29              2.79\n235          Sale of Routine Medical Supplies                              0.69                  0.85             0.57              0.69\n236          Sale of Non-Routine Medical Supplies                          4.19                  1.52             2.87              4.34\n237          X-Ray and Laboratory                                          1.34                  2.94             2.68              1.24\n238          Pharmacy and Drugs                                           13.91                 11.40            21.23             13.91\n239          Parenteral and Enteral Nutrition                              0.14                  0.00             0.00              0.15\n241          Florist                                                       0.00                  0.00             0.00              0.00\n242          Barber/Beauty Shop                                            0.29                  0.13             1.05              0.29\n243  
        Vending Machines                                              0.02                  0.02             0.26              0.01\n244          Personal Purchases                                            0.01                  0.04             0.00              0.01\n245          Meals Sold to Guests and Employees                            0.18                  0.09             1.71              0.16\n246          Activity Sales                                                0.00                  0.00             0.54              0.00\n247          Investment Income                                             0.63                  0.11            15.25              0.40\n248          Other Revenue                                                 3.29                  3.33             7.67              3.21\n261          Gross Revenues                                              363.22               381.65           408.89             361.53\n262          Less Bad Debts                                               -2.24                 -6.00            -1.96             -2.07\n263          Less Contractual Charity Allowances                         -78.78               -89.71           -47.57             -78.78\n267          Less Other Reductions                                        -0.29                 -1.03            -0.01             -0.26\n268          Net Revenues                                                281.91               284.91           359.36             280.43\n"

Created on 2019-05-24 by the reprex package (v0.2.1)
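
Judging from the raw text above, one possible way to make the parsing depend less on fixed line positions is to select rows by pattern instead of by index. This is only a sketch in base R. It assumes every data row starts with a three-digit line number and that columns are separated by at least two spaces, which holds for the lines shown here but may not hold for other reports. The sample lines are copied from the output above and the column names are typed in by hand.

```r
# Sample lines copied from the pdf_text() output above
raw_lines <- c(
  " Number       Description                                       State                For Profit       Non-Profit          Government",
  "143          Total Bed Days Available                                    35,797               36,222           21,598             36,124",
  "                                                        Revenues Per Patient Day",
  "236          Sale of Non-Routine Medical Supplies                          4.19                  1.52             2.87              4.34"
)

# Keep only lines that begin with a three-digit line number;
# this drops the header and the text-only section title
rows <- raw_lines[grepl("^\\s*[0-9]{3}\\s", raw_lines)]

# Split on runs of two or more spaces, so the single spaces
# inside a description stay part of one field
fields <- strsplit(trimws(rows), "[[:space:]]{2,}")

# Fail loudly if a row does not have the expected six fields
stopifnot(all(lengths(fields) == 6))

parsed <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(parsed) <- c("Number", "Description", "State",
                   "For_Profit", "Non_Profit", "Government")
parsed
```

With the real file, raw_lines would come from strsplit(pdf_text("ltcisrpt6.pdf"), "\n")[[1]]. The columns are still character at this point, so commas and percent signs need to be stripped before converting to numeric.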

It looks like you got an error loading a package:

library(xlsx)
#> Warning in system("/usr/libexec/java_home", intern = TRUE): running command
#> '/usr/libexec/java_home' had status 1
#> Error: package or namespace load failed for 'xlsx':
#>  .onLoad failed in loadNamespace() for 'rJava', details:
#>   call: dyn.load(file, DLLpath = DLLpath, ...)
#>   error: unable to load shared object '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/rJava/libs/rJava.so':
#>   dlopen(/Library/Frameworks/R.framework/Versions/3.5/Resources/library/rJava/libs/rJava.so, 6): Library not loaded: /Library/Java/JavaVirtualMachines/jdk-11.0.1.jdk/Contents/Home/lib/server/libjvm.dylib
#>   Referenced from: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rJava/libs/rJava.so
#>   Reason: image not found

There is a discussion here on a way to solve this (note the replies with instructions to update the JDK):

Or the discussions here:
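
Separately from the Java fix, reading the PDF only needs pdftools, so the xlsx failure should not block the parsing itself. A minimal sketch for checking which packages can actually be loaded before a script relies on them; the helper name missing_packages is made up for illustration, while requireNamespace() is base R and returns FALSE instead of stopping when a package fails to load.

```r
# Hypothetical helper: return the names of packages that cannot be loaded
missing_packages <- function(pkgs) {
  ok <- vapply(pkgs, function(p) {
    suppressWarnings(requireNamespace(p, quietly = TRUE))
  }, logical(1))
  pkgs[!ok]
}

# Check only what the PDF parsing needs
missing_packages(c("pdftools", "stringr"))
```

Loading only the packages a script uses also avoids the plyr/dplyr masking warning shown in the reprex above.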
