Write a data-wrangling function to standardize a dataset

Hi,

I am currently working with a data set and have identified that the data pulled needs some work.

This is the first time I'm trying to write a complex function to handle issues like this, so please bear with me; I would really like to learn how to create functions to solve these kinds of problems.

# Setup Libs
libs <- c("tidyverse",
          "lubridate")
lapply(libs, require, character.only = TRUE)
rm(libs)

# Data set example:
df_a <- tribble(
~controlno, ~sufix, ~status, ~ssrr, ~custname, ~payorname, ~invoicedate,            
#--|--|----           
1268, 1, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
1268, 2, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
1268, 3, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
1268, 4, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
1268, 5, "INV", "SUDU", "CUSTOMER NAME", "CUSTOMER NAME", "1900-01-01 00:00:00.000", 
1268, 6, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
1268, 7, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
1268, 8, "INV", "SUDU", "CUSTOMER NAME", "CUSTOMER NAME", "1900-01-01 00:00:00.000", 
1268, 9, "INV", "SUDU", "CUSTOMER NAME", "CUSTOMER NAME", "1900-01-01 00:00:00.000", 
1268, 10, "INV", "SUDU", "CUSTOMER NAME", "CUSTOMER NAME", "1900-01-01 00:00:00.000",
5069, 1, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME", "1900-01-01 00:00:00.000", 
5069, 2, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME", "1900-01-01 00:00:00.000", 
5069, 3, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME5", "1900-01-01 00:00:00.000", 
5069, 4, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME", "1900-01-01 00:00:00.000", 
5069, 5, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME", "1900-01-01 00:00:00.000", 
5069, 6, "INV", "EGLV", "ANOTHER NAME2", "ANOTHER NAME4", "1900-01-01 00:00:00.000", 
5069, 7, "INV", "EGLV", "ANOTHER NAME2", "ANOTHER NAME4", "1900-01-01 00:00:00.000", 
5069, 8, "INV", "EGLV", "ANOTHER NAME2", "ANOTHER NAME", "1900-01-01 00:00:00.000", 
5069, 9, "INV", "EGLV", "ANOTHER NAME3", "ANOTHER NAME", "1900-01-01 00:00:00.000", 
)

While investigating the larger data set I identified the example above: `custname` & `payorname` are inconsistent. I do know that each controlno should have the same names for these observations.

# For the group `controlno` == 1268, `custname` & `payorname` are not consistent.
df_a %>% 
  select(controlno, sufix, status, ssrr, custname, payorname, invoicedate) %>% 
  filter(controlno == 1268)

Since I had no idea how to start, I believed either a for loop or case_when() could work. I opted for the for loop because I thought I would need to output the controlno values that had a conflict (e.g. many different names) and decide how to handle them.

Unfortunately I didn't get very far. Here is what I've got; I kept the notes from my testing script to give insight into my considerations:

# How to create a function regarding this issue?
uniq_cntr <- unique(df_a$controlno) # All unique controls the function has to go through.
# No idea how to start; maybe case_when() would be another alternative?
# Wonder if creating uniq_cntr is needed, or if `unique(df_a$controlno)` should go straight in the for loop...
parse_name <- function(df_a, controlno, custname, payorname) {
  for (cntr in unique(df_a$controlno)) {
    # Not sure how to filter each unique `controlno` and check the names (most likely either NULL or 1 name).
    # If there are two or more distinct names, pick the one that shows up most often and output the
    # `controlno` where this occurred; if all names match within a column, move on to the next control.
  }
}

The idea of the function would be to provide the data and specify all the columns (observations) to process. The data set will always have the control column, but I might want to pass just one column, say just custname or just payorname, or I might have to parse more than these two columns at once; all I would have to do is specify them in the function.

# How I believe the final function would work,
# where `controlno` is a required argument plus at least 1 name column:
df_a %>% parse_name(controlno, custname, payorname)

I would expect the output to be something like:

 controlno sufix status ssrr  custname      payorname     invoicedate            
       <dbl> <dbl> <chr>  <chr> <chr>         <chr>         <chr>                  
 1      1268     1 INV    SUDU  CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
 2      1268     2 INV    SUDU  CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
 3      1268     3 INV    SUDU  CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
 4      1268     4 INV    SUDU  CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
 5      1268     5 INV    SUDU  CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
 6      1268     6 INV    SUDU  CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
 7      1268     7 INV    SUDU  CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
 8      1268     8 INV    SUDU  CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
 9      1268     9 INV    SUDU  CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
10      1268    10 INV    SUDU  CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
11      5069     1 INV    EGLV  ANOTHER NAME  ANOTHER NAME  1900-01-01 00:00:00.000
12      5069     2 INV    EGLV  ANOTHER NAME  ANOTHER NAME  1900-01-01 00:00:00.000
13      5069     3 INV    EGLV  ANOTHER NAME  ANOTHER NAME  1900-01-01 00:00:00.000
14      5069     4 INV    EGLV  ANOTHER NAME  ANOTHER NAME  1900-01-01 00:00:00.000
15      5069     5 INV    EGLV  ANOTHER NAME  ANOTHER NAME  1900-01-01 00:00:00.000
16      5069     6 INV    EGLV  ANOTHER NAME  ANOTHER NAME  1900-01-01 00:00:00.000
17      5069     7 INV    EGLV  ANOTHER NAME  ANOTHER NAME  1900-01-01 00:00:00.000
18      5069     8 INV    EGLV  ANOTHER NAME  ANOTHER NAME  1900-01-01 00:00:00.000
19      5069     9 INV    EGLV  ANOTHER NAME  ANOTHER NAME  1900-01-01 00:00:00.000

With some sort of output or warning listing the controlno numbers that had conflicts (more than one name per observation). The ones that had one customer name plus NULL don't need to be in that output/warning, so controlno = 1268 shouldn't appear, but controlno = 5069 should, with something like:

# A tibble: 3 x 3
  controlno custname      payorname    
      <dbl> <chr>         <chr>        
1      5069 ANOTHER NAME  ANOTHER NAME 
2      5069 ANOTHER NAME2 ANOTHER NAME2
3      5069 ANOTHER NAME3 ANOTHER NAME5

Any assistance is welcome.
Thank you for your time,
LF.

Does this do what you want?

library(tidyverse)

df_a <- tribble(~ controlno, ~ sufix, ~ status, ~ ssrr, ~ custname, ~ payorname, ~ invoicedate,
                1268, 1, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
                1268, 2, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
                1268, 3, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
                1268, 4, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
                1268, 5, "INV", "SUDU", "CUSTOMER NAME", "CUSTOMER NAME", "1900-01-01 00:00:00.000", 
                1268, 6, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
                1268, 7, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
                1268, 8, "INV", "SUDU", "CUSTOMER NAME", "CUSTOMER NAME", "1900-01-01 00:00:00.000", 
                1268, 9, "INV", "SUDU", "CUSTOMER NAME", "CUSTOMER NAME", "1900-01-01 00:00:00.000", 
                1268, 10, "INV", "SUDU", "CUSTOMER NAME", "CUSTOMER NAME", "1900-01-01 00:00:00.000",
                5069, 1, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME", "1900-01-01 00:00:00.000", 
                5069, 2, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME", "1900-01-01 00:00:00.000", 
                5069, 3, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME5", "1900-01-01 00:00:00.000", 
                5069, 4, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME", "1900-01-01 00:00:00.000", 
                5069, 5, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME", "1900-01-01 00:00:00.000", 
                5069, 6, "INV", "EGLV", "ANOTHER NAME2", "ANOTHER NAME4", "1900-01-01 00:00:00.000", 
                5069, 7, "INV", "EGLV", "ANOTHER NAME2", "ANOTHER NAME4", "1900-01-01 00:00:00.000", 
                5069, 8, "INV", "EGLV", "ANOTHER NAME2", "ANOTHER NAME", "1900-01-01 00:00:00.000", 
                5069, 9, "INV", "EGLV", "ANOTHER NAME3", "ANOTHER NAME", "1900-01-01 00:00:00.000")

df_a %>% 
  group_by(controlno, custname) %>% 
  mutate(unique_payorname = n_distinct(payorname)) %>% 
  ungroup() %>% 
  filter(unique_payorname > 1) %>% 
  distinct(controlno, custname, payorname)
#> # A tibble: 4 x 3
#>   controlno custname      payorname    
#>       <dbl> <chr>         <chr>        
#> 1      5069 ANOTHER NAME  ANOTHER NAME 
#> 2      5069 ANOTHER NAME  ANOTHER NAME5
#> 3      5069 ANOTHER NAME2 ANOTHER NAME4
#> 4      5069 ANOTHER NAME2 ANOTHER NAME

Created on 2020-06-13 by the reprex package (v0.3.0)


Firstly,

Thank you for your reply; this is definitely a step in the right direction. The code you provided does output what needs to be changed, but it doesn't address the part where I create a function that checks these unique names, picks the most common one, and replaces it or creates a new data frame with the "clean" structure.

I will continue to tackle this.

Thank you,
LF.

I'll help you get started with a basic function. You supply 3 arguments to this function: the data frame and the 2 columns to compare. You can use this as a framework to build upon.

compare_cols <- function(data, col1_name, col2_name) {
  data %>% 
    group_by(controlno, {{ col1_name }}) %>% 
    mutate(unique_vals = n_distinct({{ col2_name }})) %>%
    ungroup() %>%
    filter(unique_vals > 1) %>%
    distinct(controlno, {{ col1_name }}, {{ col2_name }})
}

compare_cols(df_a, custname, payorname)
# A tibble: 4 x 3
  controlno custname      payorname    
      <dbl> <chr>         <chr>        
1      5069 ANOTHER NAME  ANOTHER NAME 
2      5069 ANOTHER NAME  ANOTHER NAME5
3      5069 ANOTHER NAME2 ANOTHER NAME4
4      5069 ANOTHER NAME2 ANOTHER NAME
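
The same helper works for other column pairs without modification; for instance, swapping the two column arguments checks the relationship in the other direction (a usage sketch, assuming the compare_cols() and df_a defined above):

```r
# Which payorname values are paired with more than one custname?
compare_cols(df_a, payorname, custname)
```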

Thank you so much for your input.

I do have a few questions about the creation of the function (which is my main goal), but I am in a time crunch and need to find a solution for this. I believe using dplyr directly can do it, but I need your assistance. I will come back to the function once I get this squared away.

df_comb is the whole dataset I'm working on. I've used what you presented to add that unique column, and I wanted to take control 1268 as an example.

> df_comb %>% 
+   group_by(controlno, custname) %>% 
+   mutate(unique_payorname = n_distinct(payorname)) %>% 
+   select(controlno, sufix, status, ssrr, custname, payorname, invoicedate, unique_payorname) %>% 
+   filter(status != "CX", controlno == 1268)
# A tibble: 20 x 8
# Groups:   controlno, custname [3]
   controlno sufix status ssrr  custname    payorname   invoicedate             unique_payorname
       <dbl> <dbl> <chr>  <chr> <chr>       <chr>       <chr>                              <int>
 1      1268     1 INV    SUDU  NULL        NULL        1900-01-01 00:00:00.000                1
 2      1268     2 INV    SUDU  NULL        NULL        1900-01-01 00:00:00.000                1
 3      1268     3 INV    SUDU  NULL        NULL        1900-01-01 00:00:00.000                1
 4      1268     4 INV    SUDU  NULL        NULL        1900-01-01 00:00:00.000                1
 5      1268     5 INV    SUDU  COMP_NAME  COMP_NAME    1900-01-01 00:00:00.000                1
 6      1268     6 INV    SUDU  NULL        NULL        1900-01-01 00:00:00.000                1
 7      1268     7 INV    SUDU  NULL        NULL        1900-01-01 00:00:00.000                1
 8      1268     8 INV    SUDU  COMP_NAME  COMP_NAME    1900-01-01 00:00:00.000                1
 9      1268     9 INV    SUDU  COMP_NAME  COMP_NAME    1900-01-01 00:00:00.000                1
10      1268    10 INV    SUDU  NULL        NULL        1900-01-01 00:00:00.000                1
11      1268    11 INV    SUDU  NULL        NULL        1900-01-01 00:00:00.000                1
12      1268    12 INV    SUDU  NULL        NULL        1900-01-01 00:00:00.000                1
13      1268    13 INV    SUDU  NA          NA          1900-01-01 00:00:00.000                1
14      1268    14 INV    SUDU  COMP_NAME  COMP_NAME    1900-01-01 00:00:00.000                1
15      1268    15 INV    SUDU  NULL        NULL        1900-01-01 00:00:00.000                1
16      1268    16 INV    SUDU  NULL        NULL        1900-01-01 00:00:00.000                1
17      1268    17 INV    SUDU  NULL        NULL        1900-01-01 00:00:00.000                1
18      1268    18 INV    SUDU  COMP_NAME  COMP_NAME    1900-01-01 00:00:00.000                1
19      1268    19 INV    SUDU  COMP_NAME  COMP_NAME    1900-01-01 00:00:00.000                1
20      1268    20 INV    SUDU  COMP_NAME  COMP_NAME    1900-01-01 00:00:00.000                1

How do I mutate custname & payorname so they all have "COMP_NAME" in their fields? custname and payorname need to be addressed separately, since they can hold different names. And I would have to do this for every controlno in the data set, replacing these NULL or NA values with "COMP_NAME" for each filtered controlno.

The thought routine would be:

  • Determine all the different controlno values in the data set
  • For each controlno, check the corresponding custname; if it is NULL or NA, find the rows that actually contain the company name (for example rows 5, 8, 9, ... above)
  • Change all the NULL and NA values to the company name for that controlno
  • Repeat the above process for payorname

I can't figure out a way to apply this filter, identify what the company name is and what needs to be corrected, and do so.
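
For the filling step itself, here is a rough sketch (fill_group_names is a hypothetical name; it assumes the real name appears at least once per controlno, and it takes the first non-missing value rather than the most common one):

```r
library(dplyr)

# Hypothetical helper: within each controlno, replace the text "NULL" (and NA)
# in the given columns with the first real value found for that group.
# Note: across() requires dplyr >= 1.0.0.
fill_group_names <- function(data, cols) {
  data %>%
    mutate(across(all_of(cols), ~ na_if(.x, "NULL"))) %>%               # "NULL" text -> NA
    group_by(controlno) %>%
    mutate(across(all_of(cols), ~ coalesce(.x, first(na.omit(.x))))) %>% # fill from the group
    ungroup()
}

# e.g. df_comb %>% fill_group_names(c("custname", "payorname"))
```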

I would greatly appreciate your assistance and once I get around this I do have a few questions about the function.

Thank you for your time,
LF

I would approach it this way:

  • Select the names to keep for each controlno (in my example the criterion is the most common combination).
  • Join the result back with the original dataset.

Also, I would recommend you stay away from loops when working with R; a loop is usually not the best possible solution (with some exceptions).

library(tidyverse)

df_a <- tribble(~ controlno, ~ sufix, ~ status, ~ ssrr, ~ custname, ~ payorname, ~ invoicedate,
                1268, 1, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
                1268, 2, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
                1268, 3, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
                1268, 4, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
                1268, 5, "INV", "SUDU", "CUSTOMER NAME", "CUSTOMER NAME", "1900-01-01 00:00:00.000", 
                1268, 6, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
                1268, 7, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000", 
                1268, 8, "INV", "SUDU", "CUSTOMER NAME", "CUSTOMER NAME", "1900-01-01 00:00:00.000", 
                1268, 9, "INV", "SUDU", "CUSTOMER NAME", "CUSTOMER NAME", "1900-01-01 00:00:00.000", 
                1268, 10, "INV", "SUDU", "CUSTOMER NAME", "CUSTOMER NAME", "1900-01-01 00:00:00.000",
                5069, 1, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME", "1900-01-01 00:00:00.000", 
                5069, 2, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME", "1900-01-01 00:00:00.000", 
                5069, 3, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME5", "1900-01-01 00:00:00.000", 
                5069, 4, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME", "1900-01-01 00:00:00.000", 
                5069, 5, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME", "1900-01-01 00:00:00.000", 
                5069, 6, "INV", "EGLV", "ANOTHER NAME2", "ANOTHER NAME4", "1900-01-01 00:00:00.000", 
                5069, 7, "INV", "EGLV", "ANOTHER NAME2", "ANOTHER NAME4", "1900-01-01 00:00:00.000", 
                5069, 8, "INV", "EGLV", "ANOTHER NAME2", "ANOTHER NAME", "1900-01-01 00:00:00.000", 
                5069, 9, "INV", "EGLV", "ANOTHER NAME3", "ANOTHER NAME", "1900-01-01 00:00:00.000")

# Select names to keep
names <- df_a %>% 
    mutate(across(everything(), ~ na_if(., "NULL"))) %>%
    drop_na(custname, payorname) %>% 
    group_by(controlno) %>% 
    count(custname, payorname) %>% 
    filter(n == max(n)) %>% 
    select(-n)

# Join back
df_a %>% 
    select(-custname, -payorname) %>% 
    left_join(names, by = "controlno")
#> # A tibble: 19 x 7
#>    controlno sufix status ssrr  invoicedate            custname     payorname   
#>        <dbl> <dbl> <chr>  <chr> <chr>                  <chr>        <chr>       
#>  1      1268     1 INV    SUDU  1900-01-01 00:00:00.0… CUSTOMER NA… CUSTOMER NA…
#>  2      1268     2 INV    SUDU  1900-01-01 00:00:00.0… CUSTOMER NA… CUSTOMER NA…
#>  3      1268     3 INV    SUDU  1900-01-01 00:00:00.0… CUSTOMER NA… CUSTOMER NA…
#>  4      1268     4 INV    SUDU  1900-01-01 00:00:00.0… CUSTOMER NA… CUSTOMER NA…
#>  5      1268     5 INV    SUDU  1900-01-01 00:00:00.0… CUSTOMER NA… CUSTOMER NA…
#>  6      1268     6 INV    SUDU  1900-01-01 00:00:00.0… CUSTOMER NA… CUSTOMER NA…
#>  7      1268     7 INV    SUDU  1900-01-01 00:00:00.0… CUSTOMER NA… CUSTOMER NA…
#>  8      1268     8 INV    SUDU  1900-01-01 00:00:00.0… CUSTOMER NA… CUSTOMER NA…
#>  9      1268     9 INV    SUDU  1900-01-01 00:00:00.0… CUSTOMER NA… CUSTOMER NA…
#> 10      1268    10 INV    SUDU  1900-01-01 00:00:00.0… CUSTOMER NA… CUSTOMER NA…
#> 11      5069     1 INV    EGLV  1900-01-01 00:00:00.0… ANOTHER NAME ANOTHER NAME
#> 12      5069     2 INV    EGLV  1900-01-01 00:00:00.0… ANOTHER NAME ANOTHER NAME
#> 13      5069     3 INV    EGLV  1900-01-01 00:00:00.0… ANOTHER NAME ANOTHER NAME
#> 14      5069     4 INV    EGLV  1900-01-01 00:00:00.0… ANOTHER NAME ANOTHER NAME
#> 15      5069     5 INV    EGLV  1900-01-01 00:00:00.0… ANOTHER NAME ANOTHER NAME
#> 16      5069     6 INV    EGLV  1900-01-01 00:00:00.0… ANOTHER NAME ANOTHER NAME
#> 17      5069     7 INV    EGLV  1900-01-01 00:00:00.0… ANOTHER NAME ANOTHER NAME
#> 18      5069     8 INV    EGLV  1900-01-01 00:00:00.0… ANOTHER NAME ANOTHER NAME
#> 19      5069     9 INV    EGLV  1900-01-01 00:00:00.0… ANOTHER NAME ANOTHER NAME
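
One caveat worth noting about the `filter(n == max(n))` step: it keeps every combination tied for most common, so a controlno with a tie would end up with duplicated rows after the join. If exactly one row per controlno is required, `slice_max()` with `with_ties = FALSE` (available since dplyr 1.0.0) breaks ties arbitrarily. A sketch, assuming the df_a defined above:

```r
library(tidyverse)

# Keep exactly one name combination per controlno, breaking ties arbitrarily
# instead of keeping all tied combinations.
names_keep <- df_a %>%
  mutate(across(where(is.character), ~ na_if(., "NULL"))) %>%
  drop_na(custname, payorname) %>%
  count(controlno, custname, payorname) %>%
  group_by(controlno) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-n)
```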

Dear @andresrcs thank you!

I'm having an issue: I'm getting this error:

Error in across(everything(), ~na_if(., "NULL")) : 
  could not find function "across"

So I thought it might be that we were using different versions of tidyverse. I tried install.packages("tidyverse") and ran the code again: same error. After some quick research I thought the issue might be a version of dplyr I did not have, so I went ahead and ran:

> remotes::install_github("tidyverse/dplyr")
Downloading GitHub repo tidyverse/dplyr@master
trying URL 'https://cran.rstudio.com/bin/windows/Rtools/Rtools35.exe'
Content type 'application/x-msdownload' length 108622512 bytes (103.6 MB)
downloaded 103.6 MB

Error: Could not find tools necessary to compile a package
> remotes::install_github("tidyverse/dplyr")
Downloading GitHub repo tidyverse/dplyr@master
These packages have more recent versions available.
Which would you like to update?

1:   R6 (2.3.0 -> 2.4.0) [CRAN]

Enter one or more numbers separated by spaces, or an empty line to cancel
1: 2.4.0
1: 1
Invalid input, please try again
1: R6
1: 
Invalid input, please try again
1:  
Installing 1 packages: vctrs
Installing package into ‘C:/Users/luisf/Documents/R/win-library/3.5’
(as ‘lib’ is unspecified)
trying URL 'https://mran.microsoft.com/snapshot/2019-04-15/bin/windows/contrib/3.5/vctrs_0.1.0.zip'
Content type 'application/zip' length 617318 bytes (602 KB)
downloaded 602 KB

package ‘vctrs’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\luisf\AppData\Local\Temp\RtmpaSHEmp\downloaded_packages
√  checking for file 'C:\Users\luisf\AppData\Local\Temp\RtmpaSHEmp\remotes471c5dea44c3\tidyverse-dplyr-6380011/DESCRIPTION' (752ms)
-  preparing 'dplyr': (1.1s)
√  checking DESCRIPTION meta-information ... 
-  cleaning src
-  installing the package to process help pages (393ms)
         -----------------------------------
   ERROR: dependency 'lifecycle' is not available for package 'dplyr'
-  removing 'C:/Users/luisf/AppData/Local/Temp/Rtmp6Zoyrn/Rinst473c9a1339d/dplyr'
   In R CMD INSTALL
         -----------------------------------
   ERROR: package installation failed
Error in (function (command = NULL, args = character(), error_on_status = TRUE,  : 
  System command error

Since I had no clue what I was doing, I restarted the session.

Any idea what is happening?

Best regards,
LF

Yes, you simply have to update dplyr to the latest version from CRAN (no need to install from GitHub):

install.packages("dplyr")

@andresrcs I really have no clue what is going on.
I began reinstalling all packages:

install.packages("tidyverse")
install.packages("lubridate")
install.packages("formattable")
install.packages("dplyr")
install.packages("ggplot2")
library(dplyr)
library(ggplot2)
library(tidyverse)
Restarting R session...

Microsoft R Open 3.5.3
The enhanced R distribution from Microsoft
Microsoft packages Copyright (C) 2019 Microsoft Corporation

Using the Intel MKL for parallel mathematical computing (using 4 cores).

Default CRAN mirror snapshot taken on 2019-04-15.
See: https://mran.microsoft.com/.

> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> library(ggplot2)
> library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v tibble  2.1.1     v purrr   0.3.2
v tidyr   0.8.3     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
> df_ordr <- read_csv("IMORDER_10K.csv")
Parsed with column specification:
cols(
  .default = col_character(),
  controlno = col_double(),
  sufix = col_double(),
  assignname = col_logical(),
  DlStop = col_double(),
  EDI210Sent = col_double(),
  RELOAD = col_double(),
  EDIOrder = col_double(),
  REZ1 = col_double(),
  PODMT = col_double(),
  RailBilled = col_double(),
  Rated = col_double(),
  IsChassisSplit = col_double(),
  ShipperOwned = col_double(),
  FumigationRequire = col_double(),
  ClearPortalBillLines = col_double(),
  FLEXEDHCST = col_double(),
  BondRequired = col_double(),
  OFFHIRE = col_double()
)
See spec(...) for full column specifications.
Warning: 491 parsing failures.
row col    expected      actual              file
520  -- 124 columns 125 columns 'IMORDER_10K.csv'
625  -- 124 columns 125 columns 'IMORDER_10K.csv'
673  -- 124 columns 125 columns 'IMORDER_10K.csv'
699  -- 124 columns 125 columns 'IMORDER_10K.csv'
700  -- 124 columns 125 columns 'IMORDER_10K.csv'
... ... ........... ........... .................
See problems(...) for more details.

> df_cntr <- read_csv("IMCONTAIN_10K.csv")
Parsed with column specification:
cols(
  .default = col_character(),
  controlno = col_double(),
  suffix = col_double(),
  containerno = col_double(),
  ReloadPromised = col_double()
)
See spec(...) for full column specifications.
Warning: 2370 parsing failures.
row col   expected     actual                file
 35  -- 97 columns 99 columns 'IMCONTAIN_10K.csv'
 36  -- 97 columns 99 columns 'IMCONTAIN_10K.csv'
 37  -- 97 columns 99 columns 'IMCONTAIN_10K.csv'
 38  -- 97 columns 99 columns 'IMCONTAIN_10K.csv'
 59  -- 97 columns 99 columns 'IMCONTAIN_10K.csv'
... ... .......... .......... ...................
See problems(...) for more details.

> df_date <- read_csv("IMDATES_10K.csv")
Parsed with column specification:
cols(
  .default = col_character(),
  controlno = col_double(),
  sufix = col_double(),
  entred = col_datetime(format = ""),
  appt = col_datetime(format = ""),
  release = col_datetime(format = ""),
  ingate = col_datetime(format = ""),
  outgate = col_datetime(format = ""),
  cutoff = col_datetime(format = ""),
  actualoutdate = col_datetime(format = ""),
  actualingate = col_datetime(format = ""),
  ArrivalDate = col_datetime(format = ""),
  DepartureDate = col_datetime(format = ""),
  lxDte = col_datetime(format = ""),
  MaxActDttm = col_datetime(format = ""),
  TruckETA = col_datetime(format = ""),
  DriverETA = col_datetime(format = ""),
  ActionDate = col_datetime(format = ""),
  IxDatesApproved = col_double()
)
See spec(...) for full column specifications.
> #Orders
> df_ordr_clr <- df_ordr %>% select(`controlno`,`sufix`,`status`,`class`,`ssrr`,`billed`, `billedamt`, `custname`,
+                                   `payorname`, `loadtype`,`terminal`,`haulclass`,`invoicedate`) %>% drop_na(controlno)
> # Containers
> df_cntr_clr <- df_cntr %>%  mutate(`sufix` = `suffix`) %>% 
+   select(`controlno`, `sufix`,`eqptno`, `puzone`, `delzone`, `rtnzone`, `DateCreated`,
+          `stat`,`iobound`,`containerpkid`,`containerpkloc`,`containerpkstate`,`containerpkcustname`,
+          `containerdelid`,`containerdelloc`,`containerdelstate`,`containerdelcustname`,
+          `containerrtid`,`containerrtloc`,`containerrtstate`,`containerrtcustname`) %>% drop_na(controlno)
> 
> #Dates
> df_date_clr <- df_date %>% select( `controlno`, `sufix`, `entred`,`ingate`, `outgate`, `cutoff`,
+                                    `actualoutdate`, `actualingate`, `ArrivalDate`) %>% drop_na(`controlno`)
> 
> #### combined dataframes ####
> df_comb <- df_ordr_clr %>% left_join(df_cntr_clr, by = c("controlno","sufix")) %>% 
+   left_join(df_date_clr, by = c("controlno","sufix")) %>% 
+   mutate("DateCreated" = as.Date(`DateCreated`),
+          "entred" = as.Date(`entred`),
+          "ingate" = as.Date(`ingate`),
+          "outgate"  = as.Date(`outgate`),
+          "cutoff"  = as.Date(`cutoff`),
+          "actualoutdate"  = as.Date(`actualoutdate`),
+          "actualingate"  = as.Date(`actualingate`),
+          "ArrivalDate"  = as.Date(`ArrivalDate`),
+   ) %>% filter(`status` != "CX")
> df_comb %>% 
+   mutate(across(everything(), ~ na_if(., "NULL"))) %>%
+   drop_na(custname, payorname) %>% 
+   group_by(controlno) %>% 
+   count(custname, payorname) %>% 
+   filter(n == max(n)) %>% 
+   select(-n)
Error in across(everything(), ~na_if(., "NULL")) : 
  could not find function "across"

This doesn't seem to be working. Should I reinstall RStudio? I still don't get the across() function.

Any ideas?
Ty.
LF.

The package library is independent from RStudio, so reinstalling it would have no effect.

I think the problem is that you are using Microsoft R Open instead of a regular R installation; most likely dplyr has not been updated yet in the Microsoft repositories. I have never used the Microsoft version, so I can't help you any further in that regard.
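
For reference, across() was added in dplyr 1.0.0 (released May 2020), and the MRAN snapshot shown above (2019-04-15) ships a dplyr 0.8.x, which would explain the "could not find function" error. A quick way to check which version a session has actually loaded:

```r
# Print the version of dplyr the current session is using;
# across() exists only from dplyr 1.0.0 onwards.
packageVersion("dplyr")
```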

Firstly, thank you for your feedback and help!

I wish I had been able to fully integrate what I want to do into a function. For now I opted to use the practical approach @andresrcs suggested to solve my problem in the short term.

On the other hand, since my goal was to get better at writing my own functions: I do not understand why you used {{ }} when specifying the column variables.

Thank you for your time.
LF.

Take a look at this article
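
In short (a minimal sketch with hypothetical names): inside a function, dplyr verbs would otherwise look for a column literally named col rather than the column the caller passed in. Embracing the argument with {{ }} (rlang's "curly-curly" operator) forwards the caller's column:

```r
library(dplyr)

df <- tibble(x = c("a", "a", "b"))

# Without {{ }}, summarise() would look for a column literally named "col".
count_distinct <- function(data, col) {
  data %>% summarise(n = n_distinct({{ col }}))
}

count_distinct(df, x)   # n = 2
```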


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.