Extracting two types of data from a variable, splitting it into two other variables, and choosing one given a condition.

Hi there,

My dataset has a column that contains mixed data types: characters and dates.
I will put things into context first, so please bear with me; the question is at the bottom.

> df_all$`Job Revenue Recognition Date` 
   [1] NA                             "ARV 02-Apr-17"                "ARV 04-Apr-17"                NA                             "ARV 29-Mar-17, DEP 08-Mar-17" "ARV 29-Mar-17, DEP 08-Mar-17"
   [7] "ARV 10-Apr-17"                "ARV 07-Apr-17"                "ARV 30-Mar-17"                "ARV 28-Mar-17"                "ARV 03-Apr-17"                "ARV 09-Apr-17"               
  [13] "ARV 05-Apr-17"                "ARV 10-Apr-17"                "ARV 11-Apr-17"                "ARV 26-Mar-17"                "ARV 06-Apr-17"                "ARV 26-Mar-17"               
  [19] "ARV 22-Mar-17, DEP 05-Mar-17"

In order to fix that, I came up with the following solution while selecting the columns of interest:

# df_all_og is just the original dataset.
df_all <- df_all_og %>%
  select(
    `Shipment ID`,
    Trans,
    Mode:`House Ref`,
    `Goods Description`:`Destination ETA`,
    Added:Direction,
    starts_with("Total"),
    `Job Revenue Recognition Date`
  ) %>%
  # Split "ARV <date>, DEP <date>" into an ARV part and a DEP part,
  # then split each part into its type label and its date.
  separate(`Job Revenue Recognition Date`, into = c("ARV", "DEP"),
           sep = ", ", remove = FALSE) %>%
  separate(ARV, into = c("ARV.type", "ARV.date"), sep = " ") %>%
  separate(DEP, into = c("DEP.type", "DEP.date"), sep = " ")

This breaks that single column into four columns, because the data contained could be any of:

  • ARV

  • DEP

  • ARV DATE

  • DEP DATE

$ ARV.type      <chr> NA, "ARV", "ARV", NA, "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", NA, "ARV", "ARV", "ARV"...
$ ARV.date      <date> NA, 2017-04-02, 2017-04-04, NA, 2017-03-29, 2017-03-29, 2017-04-10, 2017-04-07, 2017-03-30, 2017-03-28, 2017-04-03, 2017-04-09, 2017-04-05, 2017-04-1...
$ DEP.type      <chr> NA, NA, NA, NA, "DEP", "DEP", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "DEP", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "DEP", NA, NA, NA, NA, NA...
$ DEP.date      <date> NA, NA, NA, NA, 2017-03-08, 2017-03-08, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2017-03-05, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2017-03-0...

Now I had to get the largest (newest) date out of ARV.date and DEP.date and store it in another column.

So I set the dataset aside, replicated it and called the copy x, to play around and find a solution (which has an issue):

# Searching for a possible solution for date comparison.
>    for (i in seq_along(x$ARV.date)){
+         if(is.na(x$ARV.date[i]) | is.na(x$DEP.date[i])){
+               
+               x$Job.Recog.Date[i] <- NA
+               
+         }else if(!is.na(x$ARV.date[i]) & !is.na(x$DEP.date[i])){
+               
+               x$Job.Recog.Date[i] <- ymd(max(c(x$ARV.date[i],x$DEP.date[i])))
+             
+           }
+    }
Warning message:
Unknown or uninitialised column: 'Job.Recog.Date'. 
> glimpse(x)
Observations: 43,856
Variables: 11
$ `Origin ETD`         <date> 2017-01-16, 2017-03-02, 2017-03-04, 2017-02-09, 2017-03-08, 2017-03-08, 2017-03-08, 2017-03-08, 2017-03-01, 2017-02-15, 2017-03-06, 2017-03-07, 2017-03-06, 2017-03-17, 2017-03-16...
$ `Destination ETA`    <date> 2017-02-21, 2017-04-02, 2017-04-04, 2017-04-20, 2017-03-29, 2017-03-29, 2017-04-10, 2017-04-07, 2017-04-01, 2017-03-28, 2017-04-03, 2017-04-09, 2017-04-05, 2017-04-10, 2017-04-16...
$ Added                <date> 2016-12-27, 2017-02-08, 2017-02-08, 2017-02-08, 2017-02-09, 2017-02-09, 2017-02-09, 2017-02-09, 2017-02-10, 2017-02-10, 2017-02-13, 2017-02-13, 2017-02-13, 2017-02-13, 2017-02-13...
$ `Job Opened`         <date> 2017-09-18, 2017-03-27, 2017-03-15, 2017-04-12, 2017-03-23, 2017-03-23, 2017-03-22, 2017-03-22, 2017-03-24, 2017-03-06, 2017-03-16, 2017-03-06, 2017-03-13, 2017-03-20, 2017-04-03...
$ `ETD First Load`     <date> NA, 2017-03-02, 2017-03-09, 2017-03-22, 2017-03-08, 2017-03-08, 2017-03-08, 2017-03-08, 2017-03-01, 2017-03-06, 2017-03-06, 2017-03-07, 2017-03-13, 2017-03-19, 2017-03-16, 2017-0...
$ `ETA Last Discharge` <date> NA, 2017-04-02, 2017-03-28, 2017-04-20, 2017-03-29, 2017-03-29, 2017-04-10, 2017-04-07, 2017-04-01, 2017-03-28, 2017-04-03, 2017-04-09, 2017-04-03, 2017-04-09, 2017-04-11, 2017-0...
$ `ETD Load`           <date> NA, 2017-03-02, 2017-03-16, 2017-03-22, 2017-03-08, 2017-03-08, 2017-03-16, 2017-03-16, 2017-03-01, 2017-03-06, 2017-03-06, 2017-03-07, 2017-03-13, 2017-03-19, 2017-03-16, 2017-0...
$ `ETA Discharge`      <date> NA, 2017-03-31, 2017-03-28, 2017-04-10, 2017-03-29, 2017-03-29, 2017-03-28, 2017-03-28, 2017-03-23, 2017-03-21, 2017-04-03, 2017-04-09, 2017-03-28, 2017-04-01, 2017-04-07, 2017-0...
$ ARV.date             <date> NA, 2017-04-02, 2017-04-04, NA, 2017-03-29, 2017-03-29, 2017-04-10, 2017-04-07, 2017-03-30, 2017-03-28, 2017-04-03, 2017-04-09, 2017-04-05, 2017-04-10, 2017-04-11, 2017-03-26, 20...
$ DEP.date             <date> NA, NA, NA, NA, 2017-03-08, 2017-03-08, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2017-03-05, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2017-03-02, NA, NA, NA, NA, NA, NA, NA...
$ Job.Recog.Date       <dbl> NA, NA, NA, NA, 17254, 17254, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 17247, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 17261, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

Note that in the newly added $Job.Recog.Date column a date was expected, yet I got a number.

I've also noticed that the loop substantially increased the time it took to run the code; nothing drastic, but noticeable when everything else was basically instantaneous. That leads me to believe the loop is a possible method (even if I corrected it so that the output gave me a proper date), but not an efficient one.

My question is: how should I have approached this problem to avoid the for loop? With the loop I would still have to wrap it in a function and figure out how to work with the column names generated by separate() inside the pipe, in order to pick which of the two dates is needed. Can this all be done while piping?

Side question: do people mind long, descriptive questions, or should I just put the code there with minimal context? I ask because I find it a lot easier to help others when I can clearly understand the goal. Not to mention that for beginners who come across this later, I guess it would make the thread more digestible.

Like always, thanks for your time and patience.
LF.

Algorithms + Data Structures = Programs. Niklaus Wirth

Without some representative data to go with the code and problem statement (which are otherwise fine), it's hard to say more than this:

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
as_date(17254)
#> [1] "2017-03-29"

Created on 2020-03-27 by the reprex package (v0.3.0)
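
My guess at what happened in your loop: a Date is stored internally as the number of days since 1970-01-01, and assigning it element by element into a column that doesn't yet carry the Date class keeps only that underlying number. A minimal sketch of the mechanism I suspect (not your data):

library(lubridate)

d <- ymd("2017-03-29")
unclass(d)  # 17254 -- days since 1970-01-01

y <- NA     # stand-in for an uninitialised column
y[1] <- d   # element-wise assignment drops the Date class
y           # 17254
class(y)    # "numeric"
as_date(y)  # "2017-03-29" -- the class can be restored afterwards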


The best way to ask coding-related questions is with a minimal REPRoducible EXample (reprex); can you please provide one?


Hey mate! Thanks for the assist!

Gotcha! Makes total sense; from now on I will adhere to the standard.
Thanks for that!

library(tidyverse)
library(lubridate) # needed below for ymd()
df_all <- tibble(
  "Shipment ID" = c("S00001009", "S00001033", "S00001034", "S00001036"),
  "Trans" = c("SEA", "SEA", "SEA", "SEA"),
  "Mode" = c("FCL", "FCL" ,"FCL" ,"BCN"),
  "Job Revenue Recognition Date" = c("ARV 05-Apr-17, DEP 02-Mar-17", "ARV 10-Apr-17", NA, "ARV 22-Mar-17, DEP 05-Mar-17")
)

df_all <- df_all %>% separate(`Job Revenue Recognition Date`, into = c("ARV", "DEP"),
                    sep = ", ", remove = FALSE) %>% 
  separate(`ARV`,into = c("ARV.type", "ARV.date"), sep= " ") %>% 
  separate(`DEP`,into = c("DEP.type", "DEP.date"), sep= " ") 
  
# figuring loop 
x  <- df_all

                                                                                                           
for (i in seq_len(nrow(x))){  # iterate over every row
  if(is.na(x$ARV.date[i]) | is.na(x$DEP.date[i])){
    
    x$Job.Recog.Date[i] <- NA
    
  }else if(!is.na(x$ARV.date[i]) & !is.na(x$DEP.date[i])){
    
    x$Job.Recog.Date[i] <- ymd(max(c(x$ARV.date[i],x$DEP.date[i])))
    
  }
}
# Not the expected output: a number is printed rather than a date
x$Job.Recog.Date 
glimpse(x)

 
# The code works separately but fails when I try to 
# do everything in one go, producing an empty value.

df2 <- df_all %>% separate(`Job Revenue Recognition Date`, into = c("ARV", "DEP"),
                    sep = ", ", remove = FALSE) %>% 
  separate(`ARV`,into = c("ARV.type", "ARV.date"), sep= " ") %>% 
  separate(`DEP`,into = c("DEP.type", "DEP.date"), sep= " ") %>% 
                                                  for (i in seq_along(nrow(df_all))){
                                                    if(is.na(ARV.date[i]) | is.na(DEP.date[i])){
                                                      
                                                      Job.Recog.Date[i] <- NA
                                                      
                                                    }else if(!is.na(ARV.date[i]) & !is.na(DEP.date[i])){
                                                      
                                                      Job.Recog.Date[i] <- ymd(max(c(ARV.date[i],DEP.date[i])))
                                                      
                                                    }
                                                  }

The context was provided in the crappy first post. :smiley:
Once again, thank you for your time and assistance!
LF


Yep @andresrcs had to draw it out for me mate! :man_facepalming:

But I'm learning! I hope :nerd_face:


Your instinct was right: a pipe chain is a better way to do this. Anyone who came to R from a procedural/imperative background, as I did, reaches for familiar concepts. After a while the realization dawns that although for loops are part of the syntax, and are much used under the hood, R presents to users mainly as a functional language. Think school algebra writ large, f(x) = y, with the additional feature that in R everything is an object, and the result of one function call can serve as the argument to another: g(f(x)) = y.
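
A trivial illustration of that composition idea, with made-up f() and g() that have nothing to do with your data:

library(magrittr)  # %>% is also attached by dplyr / the tidyverse

f <- function(x) x * 2
g <- function(x) x + 1

g(f(3))            # 7
3 %>% f() %>% g()  # 7 -- the pipe is just g(f(x)) read left to right

Applying the same idea to your reprex: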

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(tidyr)) 

df_all <- tibble(
  "Shipment ID" = c("S00001009", "S00001033", "S00001034", "S00001036"),
  "Trans" = c("SEA", "SEA", "SEA", "SEA"),
  "Mode" = c("FCL", "FCL", "FCL", "BCN"),
  "Job Revenue Recognition Date" = c("ARV 05-Apr-17, DEP 02-Mar-17", "ARV 10-Apr-17", NA, "ARV 22-Mar-17, DEP 05-Mar-17")
)

df_all %>%
  separate(`Job Revenue Recognition Date`,
    into = c("ARV", "DEP"),
    sep = ", ", remove = TRUE
  ) %>%
  separate(`ARV`, into = c("ARV.type", "ARV.date"), sep = " ") %>%
  separate(`DEP`, into = c("DEP.type", "DEP.date"), sep = " ") %>%
  select(-ARV.type, -DEP.type) %>%
  mutate(ARV.date = dmy(ARV.date)) %>%
  mutate(DEP.date = dmy(DEP.date)) %>%
  mutate(Job.Recog.Date = ifelse(is.na(ARV.date) | is.na(DEP.date), NA,
    ifelse(ARV.date > DEP.date, ARV.date, DEP.date)
  )) %>%
  mutate(Job.Recog.Date = as_date(Job.Recog.Date))
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [2].
#> # A tibble: 4 x 6
#>   `Shipment ID` Trans Mode  ARV.date   DEP.date   Job.Recog.Date
#>   <chr>         <chr> <chr> <date>     <date>     <date>        
#> 1 S00001009     SEA   FCL   2017-04-05 2017-03-02 2017-04-05    
#> 2 S00001033     SEA   FCL   2017-04-10 NA         NA            
#> 3 S00001034     SEA   FCL   NA         NA         NA            
#> 4 S00001036     SEA   BCN   2017-03-22 2017-03-05 2017-03-22

Created on 2020-03-27 by the reprex package (v0.3.0)

I'm not sure I followed the logic for record 2, but if you wanted 2017-04-10 to be placed there, another ifelse inside mutate(Job.Recog.Date ...) should fix it.

As always, thank you mate.

I come from a C++ / VBA background, so I see what you mean about how R handles things. I guess that with time I'll retrain my thought process.

Well pointed out regarding record 2; you did mention that another mutate(Job.Recog.Date ...) would be necessary. My dumb ass thought another ifelse() would take care of that. Obviously I was incorrect.

df_all2 %>%
  separate(`Job Revenue Recognition Date`,
           into = c("ARV", "DEP"),
           sep = ", ", remove = TRUE
  ) %>%
  separate(`ARV`, into = c("ARV.type", "ARV.date"), sep = " ") %>%
  separate(`DEP`, into = c("DEP.type", "DEP.date"), sep = " ") %>%
  select(-ARV.type, -DEP.type) %>%
  mutate(ARV.date = dmy(ARV.date)) %>%
  mutate(DEP.date = dmy(DEP.date)) %>%
  mutate(Job.Recog.Date = ifelse(is.na(ARV.date) | is.na(DEP.date), NA,
                                 ifelse(!is.na(ARV.date) & is.na(DEP.date), ARV.date,
                                 ifelse(is.na(ARV.date) & !is.na(DEP.date), DEP.date,        
                                 ifelse(ARV.date > DEP.date, ARV.date, DEP.date))
  ))) %>%
  mutate(Job.Recog.Date = as_date(Job.Recog.Date))

Based on what you assisted me with before, i.e. ifelse(!is.na(ARV.date) & is.na(DEP.date), ARV.date, ...), I figured I could "daisy-chain" these ifelse() calls. So I decided to take a few jabs at it; I now have a black eye. :man_facepalming:

For record 2, !is.na(ARV.date) returns TRUE and is.na(DEP.date) also returns TRUE, so ARV.date should have been returned, but it wasn't.

Then I thought of another case (not present in the example, but it exists in the dataset): what if ARV.date is NA but DEP.date has a date? Then DEP.date should be returned.

Why did it not work? It probably has to do with why you mentioned that another mutate() would be necessary.

Could case_when() be utilized instead of ifelse()?
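
Something along these lines is what I have in mind (an untested sketch on toy data, with the dates already parsed):

library(dplyr)

toy <- tibble(
  ARV.date = as.Date(c("2017-04-05", "2017-04-10", NA)),
  DEP.date = as.Date(c("2017-03-02", NA, NA))
)

# case_when() returns the value of the first condition that is TRUE,
# so the NA cases can be handled before the date comparison.
toy %>%
  mutate(Job.Recog.Date = case_when(
    is.na(ARV.date) & is.na(DEP.date) ~ as.Date(NA),
    is.na(DEP.date)                   ~ ARV.date,
    is.na(ARV.date)                   ~ DEP.date,
    ARV.date >= DEP.date              ~ ARV.date,
    TRUE                              ~ DEP.date
  ))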

Once again, the assistance and mentorship are much appreciated.

Best regards,
LF.

Don't beat yourself up, @luisferlante! Trying to make things too complicated happened to me too, until I put together what I had learned from a dip into Haskell with the realization that the classic UNIX™️ command-line utilities are functions that operate on stdin, take optional arguments, and produce stdout.

The other thing that I dredged out of the archives of memory (which now seem to be stored on 8-track ASCII tape) is the admonition against premature optimization and the principles that code is primarily for humans, secondarily for machines (Knuth) and, because debugging is twice as hard as coding, if you code as cleverly as possible you will never be able to debug it (Kernighan).

That's why I prefer to take things in bite-size chunks, getting one function to work before piping it on to another, even if it's operating on the same object.

  mutate(Job.Recog.Date = ifelse(is.na(ARV.date) | is.na(DEP.date), NA,
    ifelse(ARV.date > DEP.date, ARV.date, DEP.date)

is already Byzantine enough, so what I would do is introduce a second mutate() to take care of the missing DEP.date.

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(tidyr)) 

df_all <- tibble(
  "Shipment ID" = c("S00001009", "S00001033", "S00001034", "S00001036"),
  "Trans" = c("SEA", "SEA", "SEA", "SEA"),
  "Mode" = c("FCL", "FCL", "FCL", "BCN"),
  "Job Revenue Recognition Date" = c("ARV 05-Apr-17, DEP 02-Mar-17", "ARV 10-Apr-17", NA, "ARV 22-Mar-17, DEP 05-Mar-17")
)

df_all %>%
  separate(`Job Revenue Recognition Date`,
    into = c("ARV", "DEP"),
    sep = ", ", remove = TRUE
  ) %>%
  separate(`ARV`, into = c("ARV.type", "ARV.date"), sep = " ") %>%
  separate(`DEP`, into = c("DEP.type", "DEP.date"), sep = " ") %>%
  select(-ARV.type, -DEP.type) %>%
  mutate(ARV.date = dmy(ARV.date)) %>%
  mutate(DEP.date = dmy(DEP.date)) %>%
  mutate(Job.Recog.Date = ifelse(is.na(ARV.date) | is.na(DEP.date), NA,
    ifelse(ARV.date > DEP.date, ARV.date, DEP.date)
  )) %>%
  mutate(Job.Recog.Date = ifelse(is.na(DEP.date) & !is.na(ARV.date),ARV.date, Job.Recog.Date)) %>% 
  mutate(Job.Recog.Date = as_date(Job.Recog.Date))
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [2].
#> # A tibble: 4 x 6
#>   `Shipment ID` Trans Mode  ARV.date   DEP.date   Job.Recog.Date
#>   <chr>         <chr> <chr> <date>     <date>     <date>        
#> 1 S00001009     SEA   FCL   2017-04-05 2017-03-02 2017-04-05    
#> 2 S00001033     SEA   FCL   2017-04-10 NA         2017-04-10    
#> 3 S00001034     SEA   FCL   NA         NA         NA            
#> 4 S00001036     SEA   BCN   2017-03-22 2017-03-05 2017-03-22

Created on 2020-03-28 by the reprex package (v0.3.0)

The reverse situation is the same with the is.na() arguments switched.
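
In other words, an extra pass along these lines (a sketch on toy rows, not run against your full data):

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(lubridate))

# Toy rows for the reverse case: ARV.date missing, DEP.date present
toy <- tibble(
  ARV.date       = as.Date(c(NA, "2017-04-05")),
  DEP.date       = as.Date(c("2017-03-05", "2017-03-02")),
  Job.Recog.Date = as.Date(c(NA, "2017-04-05"))
)

toy %>%
  mutate(Job.Recog.Date = ifelse(is.na(ARV.date) & !is.na(DEP.date),
                                 DEP.date, Job.Recog.Date)) %>%
  mutate(Job.Recog.Date = as_date(Job.Recog.Date))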

The tricklette is: do something once; take a second pass to do something different to the same object if conditions are met, otherwise leaving it alone; and then, in a third pass, tinker with the format.

The heritage of scarce computing resources firmly implanted the idea of expression as compact as possible, and it can take a while to re-orient in these modern times of ridiculously cheap, fast CPU, cache, RAM, SSD and highly tuned OSes (well, I don't know about Windows, since it's been so long since I've used it).

Good luck. C'mon back if needed. Start a separate thread if new issues arise that don't directly relate.

Cheers


This seems unnecessarily complex; why not simply do something like this? Am I misunderstanding the problem?

library(tidyverse)
library(lubridate)

df_all <- tibble(
    "Shipment ID" = c("S00001009", "S00001033", "S00001034", "S00001036"),
    "Trans" = c("SEA", "SEA", "SEA", "SEA"),
    "Mode" = c("FCL", "FCL", "FCL", "BCN"),
    "Job Revenue Recognition Date" = c("ARV 05-Apr-17, DEP 02-Mar-17", "ARV 10-Apr-17", NA, "ARV 22-Mar-17, DEP 05-Mar-17")
)

df_all %>%
    separate(`Job Revenue Recognition Date`,
             into = c("ARV_date", "DEP_date"),
             sep = ", ", remove = TRUE
    ) %>%
    mutate_at(vars(ends_with("_date")),
              ~parse_date(str_remove(., "^.{3}\\s"), format = "%d-%b-%y")) %>%
    rowwise() %>% 
    mutate(Job.Recog.Date = max(ARV_date, DEP_date, na.rm = TRUE)) %>% 
    ungroup()
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [2].
#> Warning in max.default(structure(NA_real_, class = "Date"), structure(NA_real_,
#> class = "Date"), : no non-missing arguments to max; returning -Inf
#> # A tibble: 4 x 6
#>   `Shipment ID` Trans Mode  ARV_date   DEP_date   Job.Recog.Date
#>   <chr>         <chr> <chr> <date>     <date>     <date>        
#> 1 S00001009     SEA   FCL   2017-04-05 2017-03-02 2017-04-05    
#> 2 S00001033     SEA   FCL   2017-04-10 NA         2017-04-10    
#> 3 S00001034     SEA   FCL   NA         NA         NA            
#> 4 S00001036     SEA   BCN   2017-03-22 2017-03-05 2017-03-22

@andresrcs, that is exactly what I was trying to do.

Obviously you nailed it. I did think that regexps and string manipulation would be the way to go, but unfortunately I am not comfortable with them.

If you don't mind commenting:

  1. How does R know to look at the column names for ends_with("_date")? It isn't clear to me whether that is the default behaviour.
    mutate_at(vars(ends_with("_date"))
  2. ~ - what is the function of this character? I've only seen it when faceting in ggplot2 or when modelling y ~ x.

str_remove(. - I've seen the dot used before, but I'm not sure why / when to use it.

~parse_date(str_remove(., "^.{3}\\s"), format = "%d-%b-%y")) 

Breaking down the regexp "^.{3}\\s":
^ - starts the string - I'm OK with this
. - any character - I'm OK with this
{3} - Isn't this an exact number of matches? Why 3? This is where I get confused: what's being matched?
\\s - escaping the letter s?

rowwise() - I read the documentation and tried the example, and it seemed to just convert the data into a tibble.

df <- expand.grid(x = 1:3, y = 3:1)
df
df %>% rowwise()
Source: local data frame [9 x 2]
Groups: <by row>

# A tibble: 9 x 2
      x     y
* <int> <int>
1     1     3
2     2     3
3     3     3
4     1     2
5     2     2
6     3     2
7     1     1
8     2     1
9     3     1
  3. ungroup() - What does this do? I've seen it before but never got it. I've tried running your code with and without ungroup(), with no obvious change in the output.

Would really appreciate any feedback so that I can study further.
Best regards,
LF.

Absolutely @technocrat!

I deeply appreciate the assistance and feedback. I love that you understand what I'm trying to accomplish but at the same time guide my line of thought, as opposed to offering a magical solution that I would not be familiar with to begin with.

For the time being I'll stick to your/my approach while I decipher @andresrcs' elegant solution. :nerd_face:


This is a selection helper from tidyselect; ends_with() selects the variables whose names end with "_date".
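
For example, with a couple of made-up columns:

library(dplyr)

toy <- tibble(ARV_date = "05-Apr-17", DEP_date = "02-Mar-17", Mode = "FCL")

toy %>% select(ends_with("_date"))  # keeps ARV_date and DEP_date, drops Mode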

It creates a lambda function. This is a concept common to many programming languages, and the best way I can summarize it is as an anonymous function.

In the context of mutate_at(), the dot represents each one of the selected columns (from the previous argument with ends_with()); be aware that this syntax only works in the context of lambda functions (~).
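
A toy example of the equivalence, using toupper() as a stand-in for the real transformation:

library(dplyr)

toy <- tibble(ARV_date = "arv 05-apr-17", DEP_date = "dep 02-mar-17")

# These two calls do the same thing: the formula (~) is an anonymous function,
# and the dot stands for whichever selected column is currently being processed.
toy %>% mutate_at(vars(ends_with("_date")), ~ toupper(.))
toy %>% mutate_at(vars(ends_with("_date")), function(x) toupper(x))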

^ - the start of the string
. - any character
{3} - exactly three times (this modifies the previous metacharacter, i.e. .)
\\s - a whitespace character
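
Applied to one of your values:

library(stringr)

str_remove("ARV 05-Apr-17", "^.{3}\\s")
#> [1] "05-Apr-17"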

Inside mutate(), max() is a summary function: without any grouping it would return a single maximum over the entire column(s). By using rowwise() the data frame gets grouped by each row, so each column is reduced to a single value per row, and that way I can calculate the maximum of the two dates for each row.
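
A small illustration of the difference, with made-up date columns a and b:

library(dplyr)

toy <- tibble(
  a = as.Date(c("2017-04-05", "2017-04-10")),
  b = as.Date(c("2017-03-02", "2017-04-20"))
)

# Without rowwise(): max() sees the whole columns, so every row gets 2017-04-20
toy %>% mutate(m = max(a, b))

# With rowwise(): max() sees one value of a and one value of b per row
toy %>% rowwise() %>% mutate(m = max(a, b)) %>% ungroup()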

This removes the row-wise grouping introduced in the previous step; otherwise it could cause issues when applying subsequent commands.
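
For instance, if the row-wise grouping were left in place, a later summary-style calculation would still run row by row (toy example):

library(dplyr)

toy <- tibble(a = 1:3, b = 3:1)

# Still rowwise: mean(a) sees a single value per row, so "col_mean" is just a
toy %>% rowwise() %>% mutate(m = max(a, b)) %>% mutate(col_mean = mean(a))

# After ungroup(): mean(a) is computed over the whole column (2 for every row)
toy %>% rowwise() %>% mutate(m = max(a, b)) %>% ungroup() %>% mutate(col_mean = mean(a))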


When you're up to speed, @andresrcs' approach definitely shows the way to go. I tend to be kinda didactic and cut the whole steer up into kabobs.


See the R for Data Science book by the authors of many of the tidyverse packages. Heck, it's even worth buying a copy. Another great resource is the R Cookbook, 2nd Ed., also worth shelling out for the hard copy. Both are exceptionally helpful, the latter surprisingly so. I wrote it up last year.

I didn't know about the R Cookbook -- it looks excellent! Thanks for linking it, @technocrat.


@technocrat I have read the entire book; I just finished it about two weeks ago. I did all the exercises, but having the fluency with the language to execute on my own is different from reading what was done, understanding it ("yeah, that makes sense"), and then doing a similar problem as an exercise.

This is my first attempt at a full project on my "own" (thankfully this community exists). I didn't know about the Cookbook, which I will now get my hands on! :smiley:

Since you mentioned the book: I did notice I got a little lost in the modelling part, not because I did not understand the code but because of how to assess the models. I believe I need to study the actual math of model building. I am a Materials Science engineer and did have advanced statistics, but not modelling, so I have the ability to understand the statistics around it, but what I call the math and processes of modelling aren't as clear.

Any suggestions on where to go for a little more reading on the math concepts, as well as on programming in R for modelling? For example, how to then tweak the formulas to improve a model.


@andresrcs, thank you very much for taking your time with this.

Took a few minutes but I got it!
You da man thank you!


Another great resource is Rafael Irizarry's HarvardX course and [on-line text](https://github.com/rafalab/dsbook) (this is the GitHub page with all the source). He covers the basics of various types of machine learning, which will give you an idea of the major approaches.

There's quite a range of approaches, depending on the application. They sort of break down into the broad categories of classification and prediction. The machine learning techniques strike me (I haven't gone in very deeply) as more algorithmic than mathematical: bootstrap sampling, cross-validation, k-nearest neighbors, k-means, gradient boosting, random forests. Oh, heck:

library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
modelLookup()$model
#>   [1] "ada"                 "ada"                 "ada"                
#>   [4] "AdaBag"              "AdaBag"              "adaboost"           
#>   [7] "adaboost"            "AdaBoost.M1"         "AdaBoost.M1"        
#>  [10] "AdaBoost.M1"         "amdai"               "ANFIS"              
#>  [13] "ANFIS"               "avNNet"              "avNNet"             
#>  [16] "avNNet"              "awnb"                "awtan"              
#>  [19] "awtan"               "bag"                 "bagEarth"           
#>  [22] "bagEarth"            "bagEarthGCV"         "bagFDA"             
#>  [25] "bagFDA"              "bagFDAGCV"           "bam"                
#>  [28] "bam"                 "bartMachine"         "bartMachine"        
#>  [31] "bartMachine"         "bartMachine"         "bartMachine"        
#>  [34] "bayesglm"            "binda"               "blackboost"         
#>  [37] "blackboost"          "blasso"              "blassoAveraged"     
#>  [40] "bridge"              "brnn"                "BstLm"              
#>  [43] "BstLm"               "bstSm"               "bstSm"              
#>  [46] "bstTree"             "bstTree"             "bstTree"            
#>  [49] "C5.0"                "C5.0"                "C5.0"               
#>  [52] "C5.0Cost"            "C5.0Cost"            "C5.0Cost"           
#>  [55] "C5.0Cost"            "C5.0Rules"           "C5.0Tree"           
#>  [58] "cforest"             "chaid"               "chaid"              
#>  [61] "chaid"               "CSimca"              "ctree"              
#>  [64] "ctree2"              "ctree2"              "cubist"             
#>  [67] "cubist"              "dda"                 "dda"                
#>  [70] "deepboost"           "deepboost"           "deepboost"          
#>  [73] "deepboost"           "deepboost"           "DENFIS"             
#>  [76] "DENFIS"              "dnn"                 "dnn"                
#>  [79] "dnn"                 "dnn"                 "dnn"                
#>  [82] "dwdLinear"           "dwdLinear"           "dwdPoly"            
#>  [85] "dwdPoly"             "dwdPoly"             "dwdPoly"            
#>  [88] "dwdRadial"           "dwdRadial"           "dwdRadial"          
#>  [91] "earth"               "earth"               "elm"                
#>  [94] "elm"                 "enet"                "enet"               
#>  [97] "evtree"              "extraTrees"          "extraTrees"         
#> [100] "fda"                 "fda"                 "FH.GBML"            
#> [103] "FH.GBML"             "FH.GBML"             "FIR.DM"             
#> [106] "FIR.DM"              "foba"                "foba"               
#> [109] "FRBCS.CHI"           "FRBCS.CHI"           "FRBCS.W"            
#> [112] "FRBCS.W"             "FS.HGD"              "FS.HGD"             
#> [115] "gam"                 "gam"                 "gamboost"           
#> [118] "gamboost"            "gamLoess"            "gamLoess"           
#> [121] "gamSpline"           "gaussprLinear"       "gaussprPoly"        
#> [124] "gaussprPoly"         "gaussprRadial"       "gbm"                
#> [127] "gbm"                 "gbm"                 "gbm"                
#> [130] "gbm_h2o"             "gbm_h2o"             "gbm_h2o"            
#> [133] "gbm_h2o"             "gbm_h2o"             "gcvEarth"           
#> [136] "GFS.FR.MOGUL"        "GFS.FR.MOGUL"        "GFS.FR.MOGUL"       
#> [139] "GFS.LT.RS"           "GFS.LT.RS"           "GFS.LT.RS"          
#> [142] "GFS.THRIFT"          "GFS.THRIFT"          "GFS.THRIFT"         
#> [145] "glm"                 "glm.nb"              "glmboost"           
#> [148] "glmboost"            "glmnet"              "glmnet"             
#> [151] "glmnet_h2o"          "glmnet_h2o"          "glmStepAIC"         
#> [154] "gpls"                "hda"                 "hda"                
#> [157] "hda"                 "hdda"                "hdda"               
#> [160] "hdrda"               "hdrda"               "hdrda"              
#> [163] "HYFIS"               "HYFIS"               "icr"                
#> [166] "J48"                 "J48"                 "JRip"               
#> [169] "JRip"                "JRip"                "kernelpls"          
#> [172] "kknn"                "kknn"                "kknn"               
#> [175] "knn"                 "krlsPoly"            "krlsPoly"           
#> [178] "krlsRadial"          "krlsRadial"          "lars"               
#> [181] "lars2"               "lasso"               "lda"                
#> [184] "lda2"                "leapBackward"        "leapForward"        
#> [187] "leapSeq"             "Linda"               "lm"                 
#> [190] "lmStepAIC"           "LMT"                 "loclda"             
#> [193] "logicBag"            "logicBag"            "LogitBoost"         
#> [196] "logreg"              "logreg"              "lssvmLinear"        
#> [199] "lssvmPoly"           "lssvmPoly"           "lssvmPoly"          
#> [202] "lssvmRadial"         "lssvmRadial"         "lvq"                
#> [205] "lvq"                 "M5"                  "M5"                 
#> [208] "M5"                  "M5Rules"             "M5Rules"            
#> [211] "manb"                "manb"                "mda"                
#> [214] "Mlda"                "mlp"                 "mlpKerasDecay"      
#> [217] "mlpKerasDecay"       "mlpKerasDecay"       "mlpKerasDecay"      
#> [220] "mlpKerasDecay"       "mlpKerasDecay"       "mlpKerasDecay"      
#> [223] "mlpKerasDecayCost"   "mlpKerasDecayCost"   "mlpKerasDecayCost"  
#> [226] "mlpKerasDecayCost"   "mlpKerasDecayCost"   "mlpKerasDecayCost"  
#> [229] "mlpKerasDecayCost"   "mlpKerasDecayCost"   "mlpKerasDropout"    
#> [232] "mlpKerasDropout"     "mlpKerasDropout"     "mlpKerasDropout"    
#> [235] "mlpKerasDropout"     "mlpKerasDropout"     "mlpKerasDropout"    
#> [238] "mlpKerasDropoutCost" "mlpKerasDropoutCost" "mlpKerasDropoutCost"
#> [241] "mlpKerasDropoutCost" "mlpKerasDropoutCost" "mlpKerasDropoutCost"
#> [244] "mlpKerasDropoutCost" "mlpKerasDropoutCost" "mlpML"              
#> [247] "mlpML"               "mlpML"               "mlpSGD"             
#> [250] "mlpSGD"              "mlpSGD"              "mlpSGD"             
#> [253] "mlpSGD"              "mlpSGD"              "mlpSGD"             
#> [256] "mlpSGD"              "mlpWeightDecay"      "mlpWeightDecay"     
#> [259] "mlpWeightDecayML"    "mlpWeightDecayML"    "mlpWeightDecayML"   
#> [262] "mlpWeightDecayML"    "monmlp"              "monmlp"             
#> [265] "msaenet"             "msaenet"             "msaenet"            
#> [268] "multinom"            "mxnet"               "mxnet"              
#> [271] "mxnet"               "mxnet"               "mxnet"              
#> [274] "mxnet"               "mxnet"               "mxnetAdam"          
#> [277] "mxnetAdam"           "mxnetAdam"           "mxnetAdam"          
#> [280] "mxnetAdam"           "mxnetAdam"           "mxnetAdam"          
#> [283] "mxnetAdam"           "naive_bayes"         "naive_bayes"        
#> [286] "naive_bayes"         "nb"                  "nb"                 
#> [289] "nb"                  "nbDiscrete"          "nbSearch"           
#> [292] "nbSearch"            "nbSearch"            "nbSearch"           
#> [295] "nbSearch"            "neuralnet"           "neuralnet"          
#> [298] "neuralnet"           "nnet"                "nnet"               
#> [301] "nnls"                "nodeHarvest"         "nodeHarvest"        
#> [304] "null"                "OneR"                "ordinalNet"         
#> [307] "ordinalNet"          "ordinalNet"          "ordinalRF"          
#> [310] "ordinalRF"           "ordinalRF"           "ORFlog"             
#> [313] "ORFpls"              "ORFridge"            "ORFsvm"             
#> [316] "ownn"                "pam"                 "parRF"              
#> [319] "PART"                "PART"                "partDSA"            
#> [322] "partDSA"             "pcaNNet"             "pcaNNet"            
#> [325] "pcr"                 "pda"                 "pda2"               
#> [328] "penalized"           "penalized"           "PenalizedLDA"       
#> [331] "PenalizedLDA"        "plr"                 "plr"                
#> [334] "pls"                 "plsRglm"             "plsRglm"            
#> [337] "polr"                "ppr"                 "PRIM"               
#> [340] "PRIM"                "PRIM"                "protoclass"         
#> [343] "protoclass"          "qda"                 "QdaCov"             
#> [346] "qrf"                 "qrnn"                "qrnn"               
#> [349] "qrnn"                "randomGLM"           "ranger"             
#> [352] "ranger"              "ranger"              "rbf"                
#> [355] "rbfDDA"              "Rborist"             "Rborist"            
#> [358] "rda"                 "rda"                 "regLogistic"        
#> [361] "regLogistic"         "regLogistic"         "relaxo"             
#> [364] "relaxo"              "rf"                  "rFerns"             
#> [367] "RFlda"               "rfRules"             "rfRules"            
#> [370] "ridge"               "rlda"                "rlm"                
#> [373] "rlm"                 "rmda"                "rmda"               
#> [376] "rocc"                "rotationForest"      "rotationForest"     
#> [379] "rotationForestCp"    "rotationForestCp"    "rotationForestCp"   
#> [382] "rpart"               "rpart1SE"            "rpart2"             
#> [385] "rpartCost"           "rpartCost"           "rpartScore"         
#> [388] "rpartScore"          "rpartScore"          "rqlasso"            
#> [391] "rqnc"                "rqnc"                "RRF"                
#> [394] "RRF"                 "RRF"                 "RRFglobal"          
#> [397] "RRFglobal"           "rrlda"               "rrlda"              
#> [400] "rrlda"               "RSimca"              "rvmLinear"          
#> [403] "rvmPoly"             "rvmPoly"             "rvmRadial"          
#> [406] "SBC"                 "SBC"                 "SBC"                
#> [409] "sda"                 "sda"                 "sdwd"               
#> [412] "sdwd"                "simpls"              "SLAVE"              
#> [415] "SLAVE"               "SLAVE"               "slda"               
#> [418] "smda"                "smda"                "smda"               
#> [421] "snn"                 "sparseLDA"           "sparseLDA"          
#> [424] "spikeslab"           "spls"                "spls"               
#> [427] "spls"                "stepLDA"             "stepLDA"            
#> [430] "stepQDA"             "stepQDA"             "superpc"            
#> [433] "superpc"             "svmBoundrangeString" "svmBoundrangeString"
#> [436] "svmExpoString"       "svmExpoString"       "svmLinear"          
#> [439] "svmLinear2"          "svmLinear3"          "svmLinear3"         
#> [442] "svmLinearWeights"    "svmLinearWeights"    "svmLinearWeights2"  
#> [445] "svmLinearWeights2"   "svmLinearWeights2"   "svmPoly"            
#> [448] "svmPoly"             "svmPoly"             "svmRadial"          
#> [451] "svmRadial"           "svmRadialCost"       "svmRadialSigma"     
#> [454] "svmRadialSigma"      "svmRadialWeights"    "svmRadialWeights"   
#> [457] "svmRadialWeights"    "svmSpectrumString"   "svmSpectrumString"  
#> [460] "tan"                 "tan"                 "tanSearch"          
#> [463] "tanSearch"           "tanSearch"           "tanSearch"          
#> [466] "tanSearch"           "treebag"             "vbmpRadial"         
#> [469] "vglmAdjCat"          "vglmAdjCat"          "vglmContRatio"      
#> [472] "vglmContRatio"       "vglmCumulative"      "vglmCumulative"     
#> [475] "widekernelpls"       "WM"                  "WM"                 
#> [478] "wsrf"                "xgbDART"             "xgbDART"            
#> [481] "xgbDART"             "xgbDART"             "xgbDART"            
#> [484] "xgbDART"             "xgbDART"             "xgbDART"            
#> [487] "xgbDART"             "xgbLinear"           "xgbLinear"          
#> [490] "xgbLinear"           "xgbLinear"           "xgbTree"            
#> [493] "xgbTree"             "xgbTree"             "xgbTree"            
#> [496] "xgbTree"             "xgbTree"             "xgbTree"            
#> [499] "xyf"                 "xyf"                 "xyf"                
#> [502] "xyf"

Created on 2020-03-28 by the reprex package (v0.3.0)

