How to add back existing columns to model after building

Hello Friends,
I am trying to build a predictive model using logistic regression. But I keep getting an error: "cannot allocate vector of size 3.1 Gb" I think it's because of the size of the data and my RAM.
A simple solution would be to remove some unnecessary columns that do not add value when developing the model. However, my issue is, how do I remove those unwanted columns, build the model and then add those unwanted columns back to the model.

Below is a simple reprex example to buttress my pain points. In the example below, I want to remove the customer's name and account number, then add them back to the model that I built.

activity <- data.frame(Customer_Name=c("Jane","Bill","Fred","Tina","Joe"),
                       Account_No=c("332","432","556", "884","119"),
                       supply_line=c("York","shark","Aba","kwara","Bethel"),
                      Cons=c("0-2300","2300-4003","4003-1121","1121-3022","3022-1713"),
                      Supply_hrs = c(9,5,8,10,1))

activity2 <- bind_rows(replicate(6,activity, simplify = F))
#> Error in bind_rows(replicate(6, activity, simplify = F)): could not find function "bind_rows"
head(activity2)
#> Error in head(activity2): object 'activity2' not found

load dplyr

suppressPackageStartupMessages({
  library(dplyr)
})
activity <- data.frame(
  Customer_Name = c("Jane", "Bill", "Fred", "Tina", "Joe"),
  Account_No = c("332", "432", "556", "884", "119"),
  supply_line = c("York", "shark", "Aba", "kwara", "Bethel"),
  Cons = c("0-2300", "2300-4003", "4003-1121", "1121-3022", "3022-1713"),
  Supply_hrs = c(9, 5, 8, 10, 1)
)

activity2 <- bind_rows(replicate(6, activity, simplify = F))
head(activity2)
#>   Customer_Name Account_No supply_line      Cons Supply_hrs
#> 1          Jane        332        York    0-2300          9
#> 2          Bill        432       shark 2300-4003          5
#> 3          Fred        556         Aba 4003-1121          8
#> 4          Tina        884       kwara 1121-3022         10
#> 5           Joe        119      Bethel 3022-1713          1
#> 6          Jane        332        York    0-2300          9

Created on 2020-11-14 by the reprex package (v0.3.0.9001)

Thank you Technocrat,

Let's assume here's the dataset:

suppressPackageStartupMessages({
  library(dplyr)
})
activity <- data.frame(
  Customer_Name = c("Jane", "Bill", "Fred", "Tina", "Joe"),
  Account_No = c("332", "432", "556", "884", "119"),
  supply_line = c("York", "shark", "Aba", "kwara", "Bethel"),
  Cons = c("0-2300", "2300-4003", "4003-1121", "1121-3022", "3022-1713"),
  Supply_hrs = c(9, 5, 8, 10, 1)
)

I should have looked beyond the immediate problem with creating activity2 to ask what is meant by "adding back the existing column to model., especially in light of the toy data that doesn't seem at all suited for a logistic regression model.

Also, it doesn't appear that there is a great pickup in object size between giving the whole data frame as an argument and only a subset

fit <- glm(am ~ mpg + cyl + disp, data = mtcars, family = "binomial")
object.size(fit)
#> 226936 bytes
fit <- glm(am ~ mpg, data = mtcars, family = "binomial" )
object.size(fit)
#> 220592 bytes
fit <- glm(am ~ mpg, data = mtcars[,c(1,9)], family = "binomial")
object.size(fit)
#> 217800 bytes

Created on 2020-11-14 by the reprex package (v0.3.0.9001)

Let's check first--can you post your sessionInfo? And it would be helpful to be able to reproduce the cannot allocate error.

The issue is not necessarily be memory‐ it could also be due to a 32-bit installation of R, which only supports up to 3GB address space. If your OS supports it, install the 64-bit version of R.

I called activity toy data only because it seemed designed solely to illuminate the issue of manipulating data frames. Because it has four categorical variables and one numeric variable, it's not obvious how a logistic model is being constructed, since each of the categorical variables have unique values. The Big Book of R has many examples and explanations, but finding the appropriate ones requires careful framing of the question.

The goal is to model a dependent variable Y as a function of X_1 \dots X_n where Y is continuous and \forall X_i is categorical.

For example, with Supply_hrs in the role of Y and the remaining variables X_1 \dots X_4, it becomes clear immediately that a logistic model is inappropriate:

> fit <- glm(Supply_hrs ~ ., data = activity, family = "binomial")
Error in eval(family$initialize) : y values must be 0 <= y <= 1`

By contrast, an OLS model will run, although the data characteristics limit its usefulness (all values are unique)

suppressPackageStartupMessages({
  library(dplyr)
})
activity <- data.frame(
  Customer_Name = c("Jane", "Bill", "Fred", "Tina", "Joe"),
  Account_No = c("332", "432", "556", "884", "119"),
  supply_line = c("York", "shark", "Aba", "kwara", "Bethel"),
  Cons = c("0-2300", "2300-4003", "4003-1121", "1121-3022", "3022-1713"),
  Supply_hrs = c(9, 5, 8, 10, 1)
)

fit <- lm(Supply_hrs ~ ., data = activity)
summary(fit)
#> 
#> Call:
#> lm(formula = Supply_hrs ~ ., data = activity)
#> 
#> Residuals:
#> ALL 5 residuals are 0: no residual degrees of freedom!
#> 
#> Coefficients: (12 not defined because of singularities)
#>                   Estimate Std. Error t value Pr(>|t|)
#> (Intercept)              5         NA      NA       NA
#> Customer_NameFred        3         NA      NA       NA
#> Customer_NameJane        4         NA      NA       NA
#> Customer_NameJoe        -4         NA      NA       NA
#> Customer_NameTina        5         NA      NA       NA
#> Account_No332           NA         NA      NA       NA
#> Account_No432           NA         NA      NA       NA
#> Account_No556           NA         NA      NA       NA
#> Account_No884           NA         NA      NA       NA
#> supply_lineBethel       NA         NA      NA       NA
#> supply_linekwara        NA         NA      NA       NA
#> supply_lineshark        NA         NA      NA       NA
#> supply_lineYork         NA         NA      NA       NA
#> Cons1121-3022           NA         NA      NA       NA
#> Cons2300-4003           NA         NA      NA       NA
#> Cons3022-1713           NA         NA      NA       NA
#> Cons4003-1121           NA         NA      NA       NA
#> 
#> Residual standard error: NaN on 0 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:    NaN 
#> F-statistic:   NaN on 4 and 0 DF,  p-value: NA

Created on 2020-11-14 by the reprex package (v0.3.0.9001)

Thank you Technocrat, you have been most helpful. Although the dataset which I shared here is not truly a pure representation of the original dataset for privacy reasons. I expected that anyone could simply look at it and know my pain points. I use a 64GB version of R.

After so much trial, this is the error I am getting:

 # remove unwanted columns
model_input_df <- ml[, c(-1, -2,-3,-4,-5,-6,-7)]
glimpse(model_input_df)
#Preliminary casting to the appropriate data type.
model_input_df$Status <- as.factor(model_input_df$Status)
model_input_df$Feeder <- as.character(model_input_df$Feeder)
model_input_df$group_cons <- as.factor(model_input_df$group_cons)
#...........................................................................
#...........................................................................
  #BUILDING THE MACHINE LEARNING MODEL/partitionin the data
  intrain<- createDataPartition(model_input_df$Status,p=0.75,list=FALSE)
function "createDataPartition"
  set.seed(2017)
  training<- model_input_df[intrain,]
  testing<- model_input_df[-intrain,]
#............................................................................
#Confirm the splitting is correct:
  dim(training); dim(testing)
  LogModel <- glm(Status ~ .,data=training,family=binomial, maxit=100)
  print(summary(LogModel))
#...............................................................................
  colnames(model_input_df)
  LogModel <- c(1, 2, 3, 4, 5,6,7,8,9)
  # binding them together using rbind function of Base R 
  final_df <- rbind(ml[, c(-1, -2,-3,-4,-5,-6,-7)], "pred_values" = LogModel)
  head(final_df)

I get the error message:
Warning messages:
1: In [<-.factor(*tmp*, ri, value = 6) :
invalid factor level, NA generated
2: In [<-.factor(*tmp*, ri, value = 7) :
invalid factor level, NA generated
3: In [<-.factor(*tmp*, ri, value = 8) :
invalid factor level, NA generated
4: In [<-.factor(*tmp*, ri, value = 9) :
invalid factor level, NA generated

Alas, it can be very difficult to reverse engineer a problem without a reprex. It doesn't have to be all the data or even the same data, so long as the structure is the same. There are packages, such as {charalton} to generate fake data to substitute for the missing object dl.

If I substitute mtcars for dl, the next problem is createDataPartition, which is not in the namespace. Since I don't recognize the function off the top of my head, I'd have to go hunting for it, since the line

function "createDataPartition"

is malformed.

Looking at

  LogModel <- glm(Status ~ .,data=training,family=binomial, maxit=100)

I have to assume that Status is binary. Then I have to wonder what, after running glm and assigning the result to LogModel

  colnames(model_input_df)

is supposed to do, since glm objects don't have columns, and the return will be NULL. Then I have to wonder why LogModel is then overwritten by

  LogModel <- c(1, 2, 3, 4, 5,6,7,8,9)

which replaces the fitted model with a vector.

The creation of final_df makes syntactic sense

> final_df <- rbind(mtcars[, c(-1, -2,-3,-4,-5,-6,-7)], "pred_values" = LogModel)
> tail(final_df)
               vs am gear carb
Lotus Europa    1  1    5    2
Ford Pantera L  0  1    5    4
Ferrari Dino    0  1    5    6
Maserati Bora   0  1    5    8
Volvo 142E      1  1    4    2
pred_values     1  2    3    4

but only works because everything in the mtcars and LogModel is of the same class, numeric. The error message indicates that `ml isn't.

> is.factor(mtcars$mpg)
[1] FALSE

The reason for appending pred_value as a row to ml is doubly unclear because

  1. It's an arbitrary numeric, not the results of any model fit
  2. It's unclear how the augmented ml object is being used.

Then, if pred_value is supposed to be a predicted value of status in a logistic model, you would need to dig out the log likelihood. (See my post here, which is based on the standard text.) If, on the other hand, it's supposed to be the estimates of the independent variables, those need to be extracted from the model output. In either event, I don't think that I've ever seen either presented in the same table as the source data.

All of which is to say

  1. Try to make it as clear as possible what the goal is.
  2. Try to make it as easy as possible to help progress toward the goal.

Thank you so much @technocrat, you have been so helpful. I have rewritten the code and I think it makes sense now. My goal is to see those customers whose status are either 1 or 0. Since my RAM wasn't enough to perform the assigned task, I had to resort to removing some columns that slowed down my machine.

"Then, if pred_value is supposed to be a predicted value of status in a logistic model, you would need to dig out the log likelihood. (See my post here , which is based on the standard text.) If, on the other hand, it's supposed to be the estimates of the independent variables, those need to be extracted from the model output."

can you buttress further on the log-likelihood part and the estimates?

Below is my new codebase:

#remove unwanted columns
model_input_df <- ml[, c(-1,-2,-3,-4,-5,-6,-7,-9)]
glimpse(model_input_df)
#Preliminary casting to the appropriate data type.
model_input_df$Status <- as.factor(model_input_df$Status)
model_input_df$Feeder <- as.character(model_input_df$Feeder)
model_input_df$group_cons <- as.factor(model_input_df$group_cons)

#...........................................................................
#...........................................................................
  #BUILDING THE MACHINE LEARNING MODEL/partitioning the data
  intrain<- createDataPartition(model_input_df$Status,p=0.75,list=FALSE)
  set.seed(2017)
  training<- model_input_df[intrain,]
  testing<- model_input_df[-intrain,]
  
#memory.limit(size = 56000)
#............................................................................
#Confirm the splitting is correct:
dim(training); dim(testing)
#Fitting the Logistic Regression Model:
LogModel <- glm(Status ~ .,data=training,family=binomial, maxit=100)
print(summary(LogModel))
#...............................................................................
#colnames(model_input_df)
#LogModel <- c(1, 2, 3, 4, 5,6,7,8,9)
# binding them together using rbind function of Base R 
#final_df <- rbind(ml[, c(-1, -2,-3,-4,-5,-6,-7)], "pred_values" = LogModel)
#head(final_df)
  
#saveRDS(LogModel, "logmodel.rds")
#..............................................Adding Acc No back.....................

#Feature Analysis:
anova(LogModel, test="Chisq")
head(testing)
#Assessing the predictive ability of the Logistic Regression model
#testing$Status <- as.character(testing$Status)
#testing$Status[testing$Status=="0"] <- "0"
#testing$Status[testing$Status=="1"] <- "1"
  
  fitted.results <- predict(LogModel,newdata=testing,type='response')
  fitted.results <- ifelse(fitted.results > 0.5,1,0)
  misClasificError <- mean(fitted.results != testing$Status)
  print(paste('Logistic Regression Accuracy',1-misClasificError))
  #class(testing$Average.Consumption)

  final_df <- rbind(ml[, c(1,2,3,4,5,6,7,9)],"Pred_values"=fitted.results)

But it throws in a warning message

Warning messages:

1: In `[<-.factor`(`*tmp*`, ri, value = 0) :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = 0) :
  invalid factor level, NA generated
3: In `[<-.factor`(`*tmp*`, ri, value = 0) :
  invalid factor level, NA generated
4: In `[<-.factor`(`*tmp*`, ri, value = 0) :
  invalid factor level, NA generated
5: In `[<-.factor`(`*tmp*`, ri, value = 0) :
  invalid factor level, NA generated
6: In `[<-.factor`(`*tmp*`, ri, value = 0) :
  invalid factor level, NA generated
7: In `[<-.factor`(`*tmp*`, ri, value = 0) :
  invalid factor level, NA generated

I just want to merge the predicted outcome with the list of customers in this case.

Still no ml to test this on. Assuming nrow(ml) == length(fitted.values) try cbind, instead of rbind. If each row is an observation of a customer, don't you want the fitted.result on the same row in its own column?

(Not that I think that fitted.values is likely to be meaningful as a metric for individual customers, but again, without an ml or surrogate, it's impossible to say.)

nrow(ml) 87433
length(fitted.results)21858

@technocrat, There's obviously an issue with my dataset. Can I share the entire dataset with you( with minor modifications for privacy concerns) and my code block so that you can look at it?

Sure, that will open a lot of doors. Do you have a GitHub repo to use? Or I can set up a private one in my account.

Okay.

Here's the Github url to the project.

Hello @technocrat, have you been able to access the repository?

About the simplest example I can give of fitting a model on only a few model variables (i.e. reduced columns training set), but having the predictions as well as all original data on a final dataset.

conceptually do you have a very different workflow from this ?

library(tidyverse)

# just for example train on odd rows of iris which has 150 rows
(train0 <- iris[as.logical(1:150 %% 2), ] %>%
  as_tibble(rownames = "rownum"))
# omit rownum, Sepal.Length and Petal.Length from train

(train1 <- select(
  train0,
  -rownum, -Sepal.Length, -Petal.Length
))

# fit a model on train
my_model <- lm(Petal.Width ~ ., data = train1)
summary(my_model)

# can predict on the full data which has all the variables...

iris2 <- iris %>% as_tibble()
iris2$pred_petal_width <- predict(my_model, newdata = iris2)
iris2

Yes. Sorry for the delay. Next in queue.

Here are the variables in the source data, converted to a data frame:

> colnames(ml)
 [1] "Customer AC No"              "Customer Name"              
 [3] "Customer address"            "Customer tariff as per bill"
 [5] "Date of visit"               "Meter Make"                 
 [7] "Meter No."                   "Duplicate Checker"          
 [9] "Status"                      "Feeder"                     
[11] "Average Consumption"         "Availability"               
[13] "Estimated Connected Load"  

Please confirm that ml[9] status should not be excluded and which field is supposed to serve as group_cons. Note that status has an appreciable portion of unique entries, which are excluded from the partitioning; also there are also a large number of missing entries, which should be.

Ml(9) Status is our predictor variable which is very important for our Model prediction. Group-cons was feature-engineered from average consumption variable just like meter-bypass and tariff-bypass respectively.

The data is somewhat messy with missing values - something we may care less about provided it doesn't matter really.

@technocrat

To what extent can you be of help.. I know you're a very busy person.

The data should be scrubbed along the following lines before progress can be made:

suppressPackageStartupMessages({
  library(caret)
  library(dplyr)
  library(readxl)
  library(stringr)
  library(tidyr)
  library(vctrs)
 })

# substitute your data set
readxl::read_xlsx("/home/roc/projects/Energy-Theft-Detection/etd.xlsx") %>% 
  as.data.frame() -> ml
renames <- c("Customer_AC_No",
            "Customer_Name",
            "Customer_address",
            "Customer_tariff_as_per_bill",
            "Date_of_visit",
            "Meter_Make",
            "Meter_No.",
            "Duplicate_Checker",
            "Status",
            "Feeder",
            "Average_Consumption",
            "Availability",
            "Estimated_Connected_Load")

colnames(ml) <- renames

ml %>% 
  select(Meter_No., Duplicate_Checker,Status,Average_Consumption,Availability,Estimated_Connected_Load) %>%
  filter(Duplicate_Checker == "Unique") %>%
  select(-Duplicate_Checker) %>%
  filter(!is.na(Status)) %>%
  mutate(Status = tolower(Status)) %>%
  mutate(Status = str_replace_all(Status,"by pass","bypass")) %>%    
  mutate(Status = str_replace_all(Status,"faulty","faulty meter")) %>%  
  mutate(Status = str_replace_all(Status,"meter bypass","bypass")) %>% 
  mutate(Status = str_replace_all(Status,"meter meter","meter")) %>% 
  mutate(Status = str_replace_all(Status,";meter bypass","")) %>%
  mutate(Status = str_replace_all(Status,"nill","nil")) %>% 
  mutate(Status = str_replace_all(Status,"nl","nil")) %>%
  mutate(Status = str_replace_all(Status,"unauthorized","illegal")) %>%
  mutate(Status = str_replace_all(Status,";other tampers","")) %>%
  mutate(Status = str_replace_all(Status,", bypass","")) %>%
  mutate(Status = str_replace_all(Status,";indicator bypass","")) %>%
  mutate(Status = str_replace_all(Status,"indicator bypass","bypass")) %>%
  mutate(Status = str_replace_all(Status,";bypass","")) %>%
  mutate(Status = str_replace_all(Status,"consumption","connection")) %>%
  mutate(Status = str_replace_all(Status,"illegalreconnetion","illegal connection")) %>%
  mutate(Status = str_replace_all(Status,"lor","loss of revenue")) %>%
  mutate(Status = str_replace_all(Status,"unauthorized","illegal")) %>%
  mutate(Status = str_replace_all(Status,"bad meter","faulty meter")) %>%
  mutate(Status = str_replace_all(Status,"burnt meter","faulty meter")) %>%
  mutate(Status = str_replace_all(Status,"reconnection","connection")) %>%
  mutate(Status = str_replace_all(Status,"direct","illegal")) %>%
  mutate(Status = str_replace_all(Status,"illega ","illegal")) %>%
  mutate(Status = str_replace_all(Status,"bad meter","faulty meter")) %>%
  mutate(Status = str_replace_all(Status,"illlegal","illegal")) %>%
  mutate(Status = str_replace_all(Status,", other tampers","")) %>%
  mutate(Status = str_replace_all(Status,"; other tampers","")) %>%
  mutate(Status = str_replace_all(Status,"illegalreconnetion","illegal connection")) %>%
  filter(!is.na(Estimated_Connected_Load)) %>%
  filter(!is.na(Meter_No.)) %>%
  filter(Availability > 0) %>%
  mutate(Status = as.factor(Status)) %>%
  filter(Average_Consumption > 0 & Estimated_Connected_Load > 0) %>%
  mutate(Excess = ifelse(Average_Consumption > Availability,1,0)) -> input

sum(nrow(input)) == sum(complete.cases(input))
  
spec = c(train = .6, test = .2, validate = .2)

g = sample(cut(
  seq(nrow(input)), 
  nrow(input)*cumsum(c(0,spec)),
  labels = names(spec)
))

suite = split(input, g)
suite[[1]] -> training
suite[[2]] -> test
suite[[3]] -> validation

summary(test)


Thank you so much @technocrat. I see the data scrubbing huddle you had to go through. Yeah, a few observations though:\

  1. Using input derived from Average_Consumption > Availability,1,0 shouldn't suffice as that wasn't the response variable. The response variable is Status as I have indicated before now.
  2. It still does not answer my question, that is, outputting a list of customers that bypassed based on the result of the test set.

I have cleaned the dataset in my Github page, please revisit my GitHub page for the new dataset. I really need to get off this and move on to building a shiny dashboard based on the results of the model.