How to add back existing columns to model after building

Hello Friends,
I am trying to build a predictive model using logistic regression. But I keep getting an error: "cannot allocate vector of size 3.1 Gb" I think it's because of the size of the data and my RAM.
A simple solution would be to remove some unnecessary columns that do not add value when developing the model. However, my issue is, how do I remove those unwanted columns, build the model and then add those unwanted columns back to the model.

Below is a simple reprex example to buttress my pain points. In the example below, I want to remove the customer's name and account number, then add them back to the model that I built.

activity <- data.frame(Customer_Name=c("Jane","Bill","Fred","Tina","Joe"),
                       Account_No=c("332","432","556", "884","119"),
                       supply_line=c("York","shark","Aba","kwara","Bethel"),
                      Cons=c("0-2300","2300-4003","4003-1121","1121-3022","3022-1713"),
                      Supply_hrs = c(9,5,8,10,1))

activity2 <- bind_rows(replicate(6,activity, simplify = F))
#> Error in bind_rows(replicate(6, activity, simplify = F)): could not find function "bind_rows"
head(activity2)
#> Error in head(activity2): object 'activity2' not found

load dplyr

suppressPackageStartupMessages({
  library(dplyr)
})
activity <- data.frame(
  Customer_Name = c("Jane", "Bill", "Fred", "Tina", "Joe"),
  Account_No = c("332", "432", "556", "884", "119"),
  supply_line = c("York", "shark", "Aba", "kwara", "Bethel"),
  Cons = c("0-2300", "2300-4003", "4003-1121", "1121-3022", "3022-1713"),
  Supply_hrs = c(9, 5, 8, 10, 1)
)

activity2 <- bind_rows(replicate(6, activity, simplify = F))
head(activity2)
#>   Customer_Name Account_No supply_line      Cons Supply_hrs
#> 1          Jane        332        York    0-2300          9
#> 2          Bill        432       shark 2300-4003          5
#> 3          Fred        556         Aba 4003-1121          8
#> 4          Tina        884       kwara 1121-3022         10
#> 5           Joe        119      Bethel 3022-1713          1
#> 6          Jane        332        York    0-2300          9

Created on 2020-11-14 by the reprex package (v0.3.0.9001)

Thank you Technocrat,

Let's assume here's the dataset:

suppressPackageStartupMessages({
  library(dplyr)
})
activity <- data.frame(
  Customer_Name = c("Jane", "Bill", "Fred", "Tina", "Joe"),
  Account_No = c("332", "432", "556", "884", "119"),
  supply_line = c("York", "shark", "Aba", "kwara", "Bethel"),
  Cons = c("0-2300", "2300-4003", "4003-1121", "1121-3022", "3022-1713"),
  Supply_hrs = c(9, 5, 8, 10, 1)
)

Alas, it can be very difficult to reverse engineer a problem without a reprex. It doesn't have to be all the data or even the same data, so long as the structure is the same. There are packages, such as {charalton} to generate fake data to substitute for the missing object dl.

If I substitute mtcars for dl, the next problem is createDataPartition, which is not in the namespace. Since I don't recognize the function off the top of my head, I'd have to go hunting for it, since the line

function "createDataPartition"

is malformed.

Looking at

  LogModel <- glm(Status ~ .,data=training,family=binomial, maxit=100)

I have to assume that Status is binary. Then I have to wonder what, after running glm and assigning the result to LogModel

  colnames(model_input_df)

is supposed to do, since glm objects don't have columns, and the return will be NULL. Then I have to wonder why LogModel is then overwritten by

  LogModel <- c(1, 2, 3, 4, 5,6,7,8,9)

which replaces the fitted model with a vector.

The creation of final_df makes syntactic sense

> final_df <- rbind(mtcars[, c(-1, -2,-3,-4,-5,-6,-7)], "pred_values" = LogModel)
> tail(final_df)
               vs am gear carb
Lotus Europa    1  1    5    2
Ford Pantera L  0  1    5    4
Ferrari Dino    0  1    5    6
Maserati Bora   0  1    5    8
Volvo 142E      1  1    4    2
pred_values     1  2    3    4

but only works because everything in the mtcars and LogModel is of the same class, numeric. The error message indicates that `ml isn't.

> is.factor(mtcars$mpg)
[1] FALSE

The reason for appending pred_value as a row to ml is doubly unclear because

  1. It's an arbitrary numeric, not the results of any model fit
  2. It's unclear how the augmented ml object is being used.

Then, if pred_value is supposed to be a predicted value of status in a logistic model, you would need to dig out the log likelihood. (See my post here, which is based on the standard text.) If, on the other hand, it's supposed to be the estimates of the independent variables, those need to be extracted from the model output. In either event, I don't think that I've ever seen either presented in the same table as the source data.

All of which is to say

  1. Try to make it as clear as possible what the goal is.
  2. Try to make it as easy as possible to help progress toward the goal.

Thank you so much @technocrat, you have been so helpful. I have rewritten the code and I think it makes sense now. My goal is to see those customers whose status are either 1 or 0. Since my RAM wasn't enough to perform the assigned task, I had to resort to removing some columns that slowed down my machine.

"Then, if pred_value is supposed to be a predicted value of status in a logistic model, you would need to dig out the log likelihood. (See my post here , which is based on the standard text.) If, on the other hand, it's supposed to be the estimates of the independent variables, those need to be extracted from the model output."

can you buttress further on the log-likelihood part and the estimates?

Below is my new codebase:

#remove unwanted columns
model_input_df <- ml[, c(-1,-2,-3,-4,-5,-6,-7,-9)]
glimpse(model_input_df)
#Preliminary casting to the appropriate data type.
model_input_df$Status <- as.factor(model_input_df$Status)
model_input_df$Feeder <- as.character(model_input_df$Feeder)
model_input_df$group_cons <- as.factor(model_input_df$group_cons)

#...........................................................................
#...........................................................................
  #BUILDING THE MACHINE LEARNING MODEL/partitioning the data
  intrain<- createDataPartition(model_input_df$Status,p=0.75,list=FALSE)
  set.seed(2017)
  training<- model_input_df[intrain,]
  testing<- model_input_df[-intrain,]
  
#memory.limit(size = 56000)
#............................................................................
#Confirm the splitting is correct:
dim(training); dim(testing)
#Fitting the Logistic Regression Model:
LogModel <- glm(Status ~ .,data=training,family=binomial, maxit=100)
print(summary(LogModel))
#...............................................................................
#colnames(model_input_df)
#LogModel <- c(1, 2, 3, 4, 5,6,7,8,9)
# binding them together using rbind function of Base R 
#final_df <- rbind(ml[, c(-1, -2,-3,-4,-5,-6,-7)], "pred_values" = LogModel)
#head(final_df)
  
#saveRDS(LogModel, "logmodel.rds")
#..............................................Adding Acc No back.....................

#Feature Analysis:
anova(LogModel, test="Chisq")
head(testing)
#Assessing the predictive ability of the Logistic Regression model
#testing$Status <- as.character(testing$Status)
#testing$Status[testing$Status=="0"] <- "0"
#testing$Status[testing$Status=="1"] <- "1"
  
  fitted.results <- predict(LogModel,newdata=testing,type='response')
  fitted.results <- ifelse(fitted.results > 0.5,1,0)
  misClasificError <- mean(fitted.results != testing$Status)
  print(paste('Logistic Regression Accuracy',1-misClasificError))
  #class(testing$Average.Consumption)

  final_df <- rbind(ml[, c(1,2,3,4,5,6,7,9)],"Pred_values"=fitted.results)

But it throws in a warning message

Warning messages:

1: In `[<-.factor`(`*tmp*`, ri, value = 0) :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = 0) :
  invalid factor level, NA generated
3: In `[<-.factor`(`*tmp*`, ri, value = 0) :
  invalid factor level, NA generated
4: In `[<-.factor`(`*tmp*`, ri, value = 0) :
  invalid factor level, NA generated
5: In `[<-.factor`(`*tmp*`, ri, value = 0) :
  invalid factor level, NA generated
6: In `[<-.factor`(`*tmp*`, ri, value = 0) :
  invalid factor level, NA generated
7: In `[<-.factor`(`*tmp*`, ri, value = 0) :
  invalid factor level, NA generated

I just want to merge the predicted outcome with the list of customers in this case.

Still no ml to test this on. Assuming nrow(ml) == length(fitted.values) try cbind, instead of rbind. If each row is an observation of a customer, don't you want the fitted.result on the same row in its own column?

(Not that I think that fitted.values is likely to be meaningful as a metric for individual customers, but again, without an ml or surrogate, it's impossible to say.)

nrow(ml) 87433
length(fitted.results)21858

@technocrat, There's obviously an issue with my dataset. Can I share the entire dataset with you( with minor modifications for privacy concerns) and my code block so that you can look at it?

Sure, that will open a lot of doors. Do you have a GitHub repo to use? Or I can set up a private one in my account.

Okay.

Here's the Github url to the project.

Hello @technocrat, have you been able to access the repository?

Yes. Sorry for the delay. Next in queue.

Here are the variables in the source data, converted to a data frame:

> colnames(ml)
 [1] "Customer AC No"              "Customer Name"              
 [3] "Customer address"            "Customer tariff as per bill"
 [5] "Date of visit"               "Meter Make"                 
 [7] "Meter No."                   "Duplicate Checker"          
 [9] "Status"                      "Feeder"                     
[11] "Average Consumption"         "Availability"               
[13] "Estimated Connected Load"  

Please confirm that ml[9] status should not be excluded and which field is supposed to serve as group_cons. Note that status has an appreciable portion of unique entries, which are excluded from the partitioning; also there are also a large number of missing entries, which should be.

Ml(9) Status is our predictor variable which is very important for our Model prediction. Group-cons was feature-engineered from average consumption variable just like meter-bypass and tariff-bypass respectively.

The data is somewhat messy with missing values - something we may care less about provided it doesn't matter really.

@technocrat

To what extent can you be of help.. I know you're a very busy person.

The data should be scrubbed along the following lines before progress can be made:

suppressPackageStartupMessages({
  library(caret)
  library(dplyr)
  library(readxl)
  library(stringr)
  library(tidyr)
  library(vctrs)
 })

# substitute your data set
readxl::read_xlsx("/home/roc/projects/Energy-Theft-Detection/etd.xlsx") %>% 
  as.data.frame() -> ml
renames <- c("Customer_AC_No",
            "Customer_Name",
            "Customer_address",
            "Customer_tariff_as_per_bill",
            "Date_of_visit",
            "Meter_Make",
            "Meter_No.",
            "Duplicate_Checker",
            "Status",
            "Feeder",
            "Average_Consumption",
            "Availability",
            "Estimated_Connected_Load")

colnames(ml) <- renames

ml %>% 
  select(Meter_No., Duplicate_Checker,Status,Average_Consumption,Availability,Estimated_Connected_Load) %>%
  filter(Duplicate_Checker == "Unique") %>%
  select(-Duplicate_Checker) %>%
  filter(!is.na(Status)) %>%
  mutate(Status = tolower(Status)) %>%
  mutate(Status = str_replace_all(Status,"by pass","bypass")) %>%    
  mutate(Status = str_replace_all(Status,"faulty","faulty meter")) %>%  
  mutate(Status = str_replace_all(Status,"meter bypass","bypass")) %>% 
  mutate(Status = str_replace_all(Status,"meter meter","meter")) %>% 
  mutate(Status = str_replace_all(Status,";meter bypass","")) %>%
  mutate(Status = str_replace_all(Status,"nill","nil")) %>% 
  mutate(Status = str_replace_all(Status,"nl","nil")) %>%
  mutate(Status = str_replace_all(Status,"unauthorized","illegal")) %>%
  mutate(Status = str_replace_all(Status,";other tampers","")) %>%
  mutate(Status = str_replace_all(Status,", bypass","")) %>%
  mutate(Status = str_replace_all(Status,";indicator bypass","")) %>%
  mutate(Status = str_replace_all(Status,"indicator bypass","bypass")) %>%
  mutate(Status = str_replace_all(Status,";bypass","")) %>%
  mutate(Status = str_replace_all(Status,"consumption","connection")) %>%
  mutate(Status = str_replace_all(Status,"illegalreconnetion","illegal connection")) %>%
  mutate(Status = str_replace_all(Status,"lor","loss of revenue")) %>%
  mutate(Status = str_replace_all(Status,"unauthorized","illegal")) %>%
  mutate(Status = str_replace_all(Status,"bad meter","faulty meter")) %>%
  mutate(Status = str_replace_all(Status,"burnt meter","faulty meter")) %>%
  mutate(Status = str_replace_all(Status,"reconnection","connection")) %>%
  mutate(Status = str_replace_all(Status,"direct","illegal")) %>%
  mutate(Status = str_replace_all(Status,"illega ","illegal")) %>%
  mutate(Status = str_replace_all(Status,"bad meter","faulty meter")) %>%
  mutate(Status = str_replace_all(Status,"illlegal","illegal")) %>%
  mutate(Status = str_replace_all(Status,", other tampers","")) %>%
  mutate(Status = str_replace_all(Status,"; other tampers","")) %>%
  mutate(Status = str_replace_all(Status,"illegalreconnetion","illegal connection")) %>%
  filter(!is.na(Estimated_Connected_Load)) %>%
  filter(!is.na(Meter_No.)) %>%
  filter(Availability > 0) %>%
  mutate(Status = as.factor(Status)) %>%
  filter(Average_Consumption > 0 & Estimated_Connected_Load > 0) %>%
  mutate(Excess = ifelse(Average_Consumption > Availability,1,0)) -> input

sum(nrow(input)) == sum(complete.cases(input))
  
spec = c(train = .6, test = .2, validate = .2)

g = sample(cut(
  seq(nrow(input)), 
  nrow(input)*cumsum(c(0,spec)),
  labels = names(spec)
))

suite = split(input, g)
suite[[1]] -> training
suite[[2]] -> test
suite[[3]] -> validation

summary(test)


Thank you so much @technocrat. I see the data scrubbing huddle you had to go through. Yeah, a few observations though:\

  1. Using input derived from Average_Consumption > Availability,1,0 shouldn't suffice as that wasn't the response variable. The response variable is Status as I have indicated before now.
  2. It still does not answer my question, that is, outputting a list of customers that bypassed based on the result of the test set.

I have cleaned the dataset in my Github page, please revisit my GitHub page for the new dataset. I really need to get off this and move on to building a shiny dashboard based on the results of the model.

You want to predict Status from what?

The problem of bypassed is not a prediction problem, it's a classification problem.

Classification or prediction is a matter of definition and intent. In my case, I am trying to first classify existing customers based on their known energy theft history and bypass outcome. But it doesn't stop there, I want to use the classification model to predict the outcome of a new customer.

I understand you're busy and you have a lot on your table, that's why I am so appreciative of your efforts so far. I just need to get this part done. For the Shiny, I already have a template in mind.

We just need to move forward and get this part completed.

Thank you @technocrat... I did it!

Specificity is 65%

Accuracy 87%

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.