Trying to Run Logistic Regression in R: Predicting Customer Conversions

Hello All,

I am trying to set up logistic regression in R for a customer conversion dataset. Can anyone assist with debugging this? Sorry, I am new to R.

Here's what I have so far:

install.packages("tidyverse")
install.packages("tidymodels")
library(tidymodels)

#Read the dataset and convert the variable to a factor
dataset2 <- read_csv("Bike_Trips_2019.csv")
dataset2$y = as.factor(dataset2$y)

#Plot the gender and birth year against the target variable
ggplot(dataset2, aes(gender, fill = y))+
geom_bar(x=gender,y=user_type)
coord_flip()

The error message is: <error/tibble_error_assign_incompatible_size>
Error in $<-:
! Assigned data as.factor(dataset2$y) must be compatible with existing data.
✖ Existing data has 365069 rows.
✖ Assigned data has 0 rows.
ℹ Only vectors of size 1 are recycled.
Caused by error in vectbl_recycle_rhs_rows():
! Can't recycle input of size 0 to size 365069.

Backtrace:
    ▆
 1. ├─base::$<-(*tmp*, y, value = <fct>)
 2. └─tibble:::$<-.tbl_df(*tmp*, y, value = <fct>)
 3.   └─tibble:::tbl_subassign(...)
 4.     └─tibble:::vectbl_recycle_rhs_rows(value, fast_nrow(xo), i_arg = NULL, value_arg, call)
    

View(data2)
View(dataset2)
Error in View : object 'dataset2' not found

It seems the first of the following lines works but the second throws an error.

dataset2 <- read_csv("Bike_Trips_2019.csv")
dataset2$y = as.factor(dataset2$y)

After running the first line what is the result of running

colnames(dataset2)

Is there a column named y?
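
For context, here is a minimal sketch of why that second line can fail when the column is missing. The data frame `toy` and its values are made up for illustration and are not the real bike data. `$` on a column that does not exist returns NULL, and `as.factor(NULL)` is a factor of length 0, which cannot be recycled to the number of rows.

```r
# Hypothetical three-row stand-in for the real data; note there is no column named y.
toy <- data.frame(
  user_type = c("a", "b", "a"),
  gender    = c("Male", "Female", "Male")
)

"y" %in% colnames(toy)               # is there a column called y?
missing_col <- toy$y                 # `$` on a missing column gives NULL
converted <- as.factor(missing_col)  # a factor with no elements
length(converted)                    # compare with nrow(toy)
```

If the length printed on the last line is 0 while the data has rows, the assignment has nothing to fill the column with.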


Nope, the column I am trying to reference is 'user_type'.

I am confused because the tutorial I am following said to write the code that way?

https://www.datacamp.com/tutorial/logistic-regression-R

Wait one moment. After running the first line as you suggested:

dataset2 <- read_csv("Bike_Trips_2019.csv")
Rows: 365069 Columns: 14
── Column specification ────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (7): start_time, end_time, from_station_name, to_station_name, user_type, gender, ride_length
dbl (6): trip_id, bikeid, from_station_id, to_station_id, birthyear, day_of_week
num (1): tripduration

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

colnames(dataset2)
[1] "trip_id" "start_time" "end_time" "bikeid" "tripduration"
[6] "from_station_id" "from_station_name" "to_station_id" "to_station_name" "user_type"
[11] "gender" "birthyear" "ride_length" "day_of_week"


Hi, I see that you replied, but it looks like the system temporarily hid the message. I wanted to give an update: I have since replaced y with the actual column name:

dataset2$user_type = as.factor(dataset2$user_type)

So it now reads:

#Read the dataset and convert the variable to a factor
dataset2 <- read_csv("Bike_Trips_2019.csv")
dataset2$user_type =as.factor(dataset2$user_type)

install.packages("ggplot2")
library(ggplot2)

#Plot the gender and birth year against the target variable
ggplot(dataset2, aes(dataset2$gender, fill = y))+
geom_bar(x=dataset2$gender,y=dataset2$user_type)
coord_flip()

The error message still states that 'y' cannot be found.

Here is a screenshot if that also helps

You wrote ggplot(dataset2, aes(dataset2$gender, fill = y)), but there is no column named y. Did you mean ggplot(dataset2, aes(gender, fill = user_type))?
Note that you should not use syntax like dataset2$gender inside aes(). Just write gender. ggplot() knows to look in the dataset2 data frame because you passed it as the data argument of the function. Try

ggplot(dataset2,  aes(x = gender, y = user_type, fill = user_type))+
geom_bar() +
coord_flip()

I tried that:

#Plot the gender and birth year against the target variable
ggplot(data = dataset2,aes(x = gender, y = user_type, fill = user_type))+
  geom_bar()+
  coord_flip()
Error in geom_bar():
! Problem while computing stat.
ℹ Error occurred in the 1st layer.
Caused by error in setup_params():
! stat_count() must only have an x or y aesthetic.
Run rlang::last_trace() to see where the error occurred.

So I tried this instead:

#Plot the gender and birth year against the target variable
ggplot(data = dataset2,aes( y = user_type, fill = user_type))+
geom_bar()+
coord_flip()

And I did get a plot, but what does the count mean? It is not exactly clear.
Please see the attached screenshot of the graph.

To clarify, I need the graph to show a breakdown of counts (x-axis) by customer attribute (y-axis), with "no" for casual riders and "yes" for membership holders.

It was supposed to look like the graph Data Camp shows here:

I understand that you want to fill by the user_type and you want to plot by gender or some other column. This should get you roughly what you want.

ggplot(data = dataset2,aes( y = gender, fill = user_type))+ geom_bar()

The geom_bar() will count how many rows have each combination of gender and user_type.
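
As a rough illustration of what those counts are, the same tally can be computed by hand. The data frame `toy` below is made up; only the column names follow the thread.

```r
# Made-up rows for illustration only; the real data has 365069 rows.
toy <- data.frame(
  gender    = c("Male", "Female", "Female", "Male", "Male"),
  user_type = c("subscriber", "customer", "subscriber", "subscriber", "customer")
)

# One row per gender/user_type combination; Freq is the number of matching rows,
# which is what the bar lengths in the plot represent.
combo_counts <- as.data.frame(table(gender = toy$gender, user_type = toy$user_type))
combo_counts
```

Each bar segment in the plot corresponds to one row of `combo_counts`.
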
To get the yes/no labels for user_type, run

dataset2$user_type <- factor(dataset2$user_type, levels = c("customer", "subscriber"), labels = c("no","yes"))
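
One caveat with this call, sketched on made-up values: factor() turns every value that is not listed in levels into NA, and the match is case-sensitive. The capitalised spellings below are only an assumption for illustration, so it is worth checking unique(dataset2$user_type) on the freshly read data before converting.

```r
# Hypothetical raw values; the capitalisation is an assumption, not taken from the real file.
raw <- c("Subscriber", "Customer", "Subscriber")

# Levels that do not match the raw spelling exactly: every value becomes NA.
mismatched <- factor(raw, levels = c("customer", "subscriber"), labels = c("no", "yes"))

# Levels written with the spelling actually present in the data.
matched <- factor(raw, levels = c("Customer", "Subscriber"), labels = c("no", "yes"))

sum(is.na(mismatched))   # number of values lost to the mismatch
table(matched)           # counts per label when the levels match
```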

I tried to plug that in to get the graph, but the plot went blank. I have attached a screenshot for you to see.

I even tried using the '$' prefix for gender and user_type, but that isn't working either.

Do you think we have to insert a pipeline?

This will be easier if I have a little data. Please post the output of

dput(head(dataset2, 15))

Sure, here is what it returned:

dput(head(dataset2,15))
structure(list(trip_id = c(21742443, 21742444, 21742445, 21742446,
21742447, 21742448, 21742449, 21742450, 21742451, 21742452, 21742453,
21742454, 21742455, 21742456, 21742457), start_time = c("2019-01-01 0:04:37",
"2019-01-01 0:08:13", "2019-01-01 0:13:23", "2019-01-01 0:13:45",
"2019-01-01 0:14:52", "2019-01-01 0:15:33", "2019-01-01 0:16:06",
"2019-01-01 0:18:41", "2019-01-01 0:18:43", "2019-01-01 0:19:18",
"2019-01-01 0:20:34", "2019-01-01 0:21:52", "2019-01-01 0:23:04",
"2019-01-01 0:23:43", "2019-01-01 0:23:54"), end_time = c("2019-01-01 0:11:07",
"2019-01-01 0:15:34", "2019-01-01 0:27:12", "2019-01-01 0:43:28",
"2019-01-01 0:20:56", "2019-01-01 0:19:09", "2019-01-01 0:19:03",
"2019-01-01 0:20:21", "2019-01-01 0:47:30", "2019-01-01 0:24:54",
"2019-01-01 0:35:20", "2019-01-01 0:32:45", "2019-01-01 0:33:05",
"2019-01-01 0:33:05", "2019-01-01 0:39:00"), bikeid = c(2167,
4386, 1524, 252, 1170, 2437, 2708, 2796, 6205, 3939, 6243, 6300,
3029, 84, 5019), tripduration = c(390, 441, 829, 1783, 364, 216,
177, 100, 1727, 336, 886, 653, 601, 562, 906), from_station_id = c(199,
44, 15, 123, 173, 98, 98, 211, 150, 268, 299, 204, 90, 90, 289
), from_station_name = c("Wabash Ave & Grand Ave", "State St & Randolph St",
"Racine Ave & 18th St", "California Ave & Milwaukee Ave", "Mies van der Rohe Way & Chicago Ave",
"LaSalle St & Washington St", "LaSalle St & Washington St", "St. Clair St & Erie St",
"Fort Dearborn Dr & 31st St", "Lake Shore Dr & North Blvd", "Halsted St & Roscoe St",
"Prairie Ave & Garfield Blvd", "Millennium Park", "Millennium Park",
"Wells St & Concord Ln"), to_station_id = c(84, 624, 644, 176,
35, 49, 49, 142, 148, 141, 295, 420, 255, 255, 324), to_station_name = c("Milwaukee Ave & Grand Ave",
"Dearborn St & Van Buren St ()", "Western Ave & Fillmore St ()",
"Clark St & Elm St", "Streeter Dr & Grand Ave", "Dearborn St & Monroe St",
"Dearborn St & Monroe St", "McClurg Ct & Erie St", "State St & 33rd St",
"Clark St & Lincoln Ave", "Broadway & Argyle St", "Ellis Ave & 55th St",
"Indiana Ave & Roosevelt Rd", "Indiana Ave & Roosevelt Rd", "Stockton Dr & Wrightwood Ave"
), user_type = structure(c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), levels = c("customer", "subscriber"
), class = "factor"), gender = c("Male", "Female", "Female",
"Male", "Male", "Female", "Male", "Male", "Male", "Male", "Male",
"Female", "Male", "Female", "Female"), birthyear = c(1989, 1990,
1994, 1993, 1994, 1983, 1984, 1990, 1995, 1996, 1994, 1994, 1986,
1990, 1989), ride_length = c("0:06:30", "0:07:21", "0:13:49",
"0:29:43", "0:06:04", "0:03:36", "0:02:57", "0:01:40", "0:28:47",
"0:05:36", "0:14:46", "0:10:53", "0:10:01", "0:09:22", "0:15:06"
), day_of_week = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3)), row.names = c(NA, -15L), class = c("tbl_df", "tbl", "data.frame"))

Does that code ask for the 1st 15 rows per variable?

The user_type column has all NA values in the data you posted, so I manually filled it. Here is the plot I get.

dataset2 <- structure(list(
  trip_id = c(21742443, 21742444, 21742445, 21742446, 21742447,
              21742448, 21742449, 21742450, 21742451, 21742452,
              21742453, 21742454, 21742455, 21742456, 21742457),
  start_time = c("2019-01-01 0:04:37", "2019-01-01 0:08:13", "2019-01-01 0:13:23",
                 "2019-01-01 0:13:45", "2019-01-01 0:14:52", "2019-01-01 0:15:33",
                 "2019-01-01 0:16:06", "2019-01-01 0:18:41", "2019-01-01 0:18:43",
                 "2019-01-01 0:19:18", "2019-01-01 0:20:34", "2019-01-01 0:21:52",
                 "2019-01-01 0:23:04", "2019-01-01 0:23:43", "2019-01-01 0:23:54"),
  end_time = c("2019-01-01 0:11:07", "2019-01-01 0:15:34", "2019-01-01 0:27:12",
               "2019-01-01 0:43:28", "2019-01-01 0:20:56", "2019-01-01 0:19:09",
               "2019-01-01 0:19:03", "2019-01-01 0:20:21", "2019-01-01 0:47:30",
               "2019-01-01 0:24:54", "2019-01-01 0:35:20", "2019-01-01 0:32:45",
               "2019-01-01 0:33:05", "2019-01-01 0:33:05", "2019-01-01 0:39:00"),
  bikeid = c(2167, 4386, 1524, 252, 1170, 2437, 2708, 2796, 6205, 3939,
             6243, 6300, 3029, 84, 5019),
  tripduration = c(390, 441, 829, 1783, 364, 216, 177, 100, 1727, 336,
                   886, 653, 601, 562, 906),
  from_station_id = c(199, 44, 15, 123, 173, 98, 98, 211, 150, 268,
                      299, 204, 90, 90, 289),
  from_station_name = c("Wabash Ave & Grand Ave", "State St & Randolph St",
                        "Racine Ave & 18th St", "California Ave & Milwaukee Ave",
                        "Mies van der Rohe Way & Chicago Ave",
                        "LaSalle St & Washington St", "LaSalle St & Washington St",
                        "St. Clair St & Erie St", "Fort Dearborn Dr & 31st St",
                        "Lake Shore Dr & North Blvd", "Halsted St & Roscoe St",
                        "Prairie Ave & Garfield Blvd", "Millennium Park",
                        "Millennium Park", "Wells St & Concord Ln"),
  to_station_id = c(84, 624, 644, 176, 35, 49, 49, 142, 148, 141,
                    295, 420, 255, 255, 324),
  to_station_name = c("Milwaukee Ave & Grand Ave", "Dearborn St & Van Buren St ()",
                      "Western Ave & Fillmore St ()", "Clark St & Elm St",
                      "Streeter Dr & Grand Ave", "Dearborn St & Monroe St",
                      "Dearborn St & Monroe St", "McClurg Ct & Erie St",
                      "State St & 33rd St", "Clark St & Lincoln Ave",
                      "Broadway & Argyle St", "Ellis Ave & 55th St",
                      "Indiana Ave & Roosevelt Rd", "Indiana Ave & Roosevelt Rd",
                      "Stockton Dr & Wrightwood Ave"),
  user_type = structure(c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
                          NA_integer_, NA_integer_, NA_integer_, NA_integer_,
                          NA_integer_, NA_integer_, NA_integer_, NA_integer_,
                          NA_integer_, NA_integer_, NA_integer_),
                        levels = c("customer", "subscriber"), class = "factor"),
  gender = c("Male", "Female", "Female", "Male", "Male", "Female", "Male", "Male",
             "Male", "Male", "Male", "Female", "Male", "Female", "Female"),
  birthyear = c(1989, 1990, 1994, 1993, 1994, 1983, 1984, 1990, 1995, 1996,
                1994, 1994, 1986, 1990, 1989),
  ride_length = c("0:06:30", "0:07:21", "0:13:49", "0:29:43", "0:06:04",
                  "0:03:36", "0:02:57", "0:01:40", "0:28:47", "0:05:36",
                  "0:14:46", "0:10:53", "0:10:01", "0:09:22", "0:15:06"),
  day_of_week = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
), row.names = c(NA, -15L), class = c("tbl_df", "tbl", "data.frame"))

library(ggplot2)
dataset2$user_type <- sample(c("customer","subscriber"),15,replace = TRUE)
dataset2$user_type <- factor(dataset2$user_type, levels = c("customer", "subscriber"), labels = c("no","yes"))
ggplot(data = dataset2,aes( y = gender, fill = user_type))+ geom_bar()

Created on 2024-03-09 with reprex v2.0.2
The image in your last post, where you were getting an error, shows that you set dataset2=aes() within the ggplot() function. That's why you weren't getting a plot.


I tried to copy the format you used for the sample and apply it to my original dataset, and I am still a bit confused.

There seem to be errors even though the code is the same?

You should not use the two lines of code, quoted below, where I manually filled user_type with values. That was done only to allow me to use the data you posted.

dataset2$user_type <- sample(c("customer","subscriber"),15,replace = TRUE)
dataset2$user_type <- factor(dataset2$user_type, levels = c("customer", "subscriber"), labels = c("no","yes"))

Does the user_type column contain values in the full data set, or are they mostly or totally NA? Have you changed that column?

One moment, I can do a summary or head of the dataset.

Thank you for all of your help, I appreciate it.

Oh yes, it looks like all of the user_type values are NA:

summary(dataset2$user_type)
    no    yes   NA's
     0      0 365069

I had suspected they were, and that's why I tried to replace them as you did with the two lines of code. Is that not the correct way to manually enter the values?

Read in the data again and don't try to change that column. We can do that later.
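
A rough sketch of that re-read-and-check step follows. An inline CSV string stands in for Bike_Trips_2019.csv so the snippet runs on its own, and its values are invented. With the real file you would call read_csv("Bike_Trips_2019.csv") as before and inspect dataset2$user_type the same way.

```r
# Invented stand-in for the CSV file, including one empty user_type field.
csv_text <- "trip_id,user_type,gender
1,Subscriber,Male
2,Customer,Female
3,,Male"

# Re-read from scratch; empty fields are treated as NA.
fresh <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)

# Distinct values exactly as stored, with NAs counted rather than hidden.
value_counts <- table(fresh$user_type, useNA = "ifany")
value_counts
n_missing <- sum(is.na(fresh$user_type))
n_missing
```

Looking at the untouched column first shows the exact spellings present, so any later factor() call can use levels that really occur in the data.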