Hi,
I am currently working with a data set which I identified that the data pulled needs some work.
This is the first time I'm trying to make a complex function that would handle issues for me so please bear with me, because I would really like to learn how to create functions to solve these kind of issues.
# Setup Libs
libs <- c("tidyverse",
"lubridate")
lapply(libs, require, character.only = TRUE)
rm(libs)
# Data set example:
df_a <- tribble(
~controlno, ~sufix, ~status, ~ssrr, ~custname, ~payorname, ~invoicedate,
#--|--|----
1268, 1, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000",
1268, 2, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000",
1268, 3, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000",
1268, 4, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000",
1268, 5, "INV", "SUDU", "CUSTOMER NAME", "CUSTOMER NAME", "1900-01-01 00:00:00.000",
1268, 6, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000",
1268, 7, "INV", "SUDU", "NULL", "NULL", "1900-01-01 00:00:00.000",
1268, 8, "INV", "SUDU", "CUSTOMER NAME", "CUSTOMER NAME", "1900-01-01 00:00:00.000",
1268, 9, "INV", "SUDU", "CUSTOMER NAME", "CUSTOMER NAME", "1900-01-01 00:00:00.000",
1268, 10, "INV", "SUDU", "CUSTOMER NAME", "CUSTOMER NAME", "1900-01-01 00:00:00.000",
5069, 1, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME", "1900-01-01 00:00:00.000",
5069, 2, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME", "1900-01-01 00:00:00.000",
5069, 3, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME5", "1900-01-01 00:00:00.000",
5069, 4, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME", "1900-01-01 00:00:00.000",
5069, 5, "INV", "EGLV", "ANOTHER NAME", "ANOTHER NAME", "1900-01-01 00:00:00.000",
5069, 6, "INV", "EGLV", "ANOTHER NAME2", "ANOTHER NAME4", "1900-01-01 00:00:00.000",
5069, 7, "INV", "EGLV", "ANOTHER NAME2", "ANOTHER NAME4", "1900-01-01 00:00:00.000",
5069, 8, "INV", "EGLV", "ANOTHER NAME2", "ANOTHER NAME", "1900-01-01 00:00:00.000",
5069, 9, "INV", "EGLV", "ANOTHER NAME3", "ANOTHER NAME", "1900-01-01 00:00:00.000",
)
While investigating the larger data set I identified the above example. custname
& payorname
inconsistancy. I do know that each controlno
should have same names for these observations.
# The Group `controlno` = 1268, `custname` & `payorname` not consistent.
df_a %>% select(`controlno`, `sufix`, `status`, `ssrr`,`custname`, `payorname`,`invoicedate`,) %>%
filter(`controlno` == "1268")
Since I had no idea how to start but believed that a for loop
or case_when()
I opted for the for loop
because I believe that I would need to output controlno
which had a conflict like many different names and how to handle.
Unfortunately I didn't get very far, here is what I've got, I kept the notes of my testing script for insight of my considerations:
# How to create a function regarding this issue?
uniq_cntr <- unique(df_a$controlno) # A list of all unique controls that has to go through.
# No idea how to start, maybe use case_when() would be another alternative?
# Wonder if creating the uniq_cntr is needed or used `unique(df_a$controlno)` in the for loop...
parse_name <- function(df_a,`controlno`,`custname`, `payorname` ){
for(uniq_cntr in df_a$controlno){
if(df_a$controlno == uniq_cntr ){
# Not sure how to "filter each unique `controlno` and check the names (most likely either NULL or 1 name),
# if there are two or more distinct name, pick the one that shows up most and ouput `controlno` which
# this occurred, if all names match within a column, move-on to next cntrol
}}}
The idea of the function would be to provide the data and specify all obs (columns) which to process based on the data set always having the control
but I migth want to pass 1 col of obs, say just custname
or just payorname
or maybe I might have to parse more than these 2 obs at once, all I would have to do is specify them in the function.
# How I believe the final function would work as:
# Where `controlno` is a necessary argument and at least 1 obs col.
df_a %>% parse_name(`controlno`,`custname`, `payorname`)
I would exepect the output to be something like:
controlno sufix status ssrr custname payorname invoicedate
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 1268 1 INV SUDU CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
2 1268 2 INV SUDU CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
3 1268 3 INV SUDU CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
4 1268 4 INV SUDU CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
5 1268 5 INV SUDU CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
6 1268 6 INV SUDU CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
7 1268 7 INV SUDU CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
8 1268 8 INV SUDU CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
9 1268 9 INV SUDU CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
10 1268 10 INV SUDU CUSTOMER NAME CUSTOMER NAME 1900-01-01 00:00:00.000
11 5069 1 INV EGLV ANOTHER NAME ANOTHER NAME 1900-01-01 00:00:00.000
12 5069 2 INV EGLV ANOTHER NAME ANOTHER NAME 1900-01-01 00:00:00.000
13 5069 3 INV EGLV ANOTHER NAME ANOTHER NAME 1900-01-01 00:00:00.000
14 5069 4 INV EGLV ANOTHER NAME ANOTHER NAME 1900-01-01 00:00:00.000
15 5069 5 INV EGLV ANOTHER NAME ANOTHER NAME 1900-01-01 00:00:00.000
16 5069 6 INV EGLV ANOTHER NAME ANOTHER NAME 1900-01-01 00:00:00.000
17 5069 7 INV EGLV ANOTHER NAME ANOTHER NAME 1900-01-01 00:00:00.000
18 5069 8 INV EGLV ANOTHER NAME ANOTHER NAME 1900-01-01 00:00:00.000
19 5069 9 INV EGLV ANOTHER NAME ANOTHER NAME 1900-01-01 00:00:00.000
With some sort of output or warning of the controlno
numbers that had conflicts (more than 1 name for each observation), the ones that had 1 customer name and NULL don't need to be on this output / warning,
So for controlno
= 1268 it shoiuldn't need to appear in this output but controlno
= 5069 something like :
# A tibble: 3 x 3
controlno custname payorname
<dbl> <chr> <chr>
1 5069 ANOTHER NAME ANOTHER NAME
2 5069 ANOTHER NAME2 ANOTHER NAME2
3 5069 ANOTHER NAME3 ANOTHER NAME5
Any assistance is welcome.
Thank you for you time,
LF.