What is an appropriate model for data where the outcome variable is constant for a unit but the features vary?


#1

I'm asking this question here in the hope of getting positive feedback from expert modelers. If this is not an appropriate forum for this question, please let me know. I asked this question almost a year ago on Cross Validated as well but didn't get any replies.
Thank you for your time.

Background

I want to model the lane change duration (outcome variable) given input features such as the speed of the car, the distance from the lead car in the current lane and in the target lane, etc. This is an interesting problem because the outcome variable, i.e. lane change duration, is a constant value for a given lane change maneuver, whereas the features, e.g. the speed of the car, vary continuously during the maneuver.
I have been reviewing various modelling approaches e.g. mixed effects models, etc. but am not sure what modelling approach would be most appropriate in this context.

Example data

The following are sample data for 2 cars (each driven by a different driver). You can see, for instance, that Car1 changed lanes twice: the first maneuver was 5 s long while the second was 8 s long. It is evident that for a given lane change, the LC_duration remains the same but the speed varies. If I want to fit a model on these data that predicts LC_duration based on the speed feature, what would the appropriate modelling technique be?

df <-  structure(list(driver = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Car1", 
    "Car2"), class = "factor"), gender = structure(c(2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L), .Label = c("Female", "Male"), class = "factor"), speed = c(40.5, 
    40.5, 40.7, 41, 41, 38.9, 38.6, 38.8, 39, 39.1, 55, 55, 55, 55.2, 
    55.3, 25, 25.3, 25.6, 25.6, 25.7), age = c(25, 25, 25, 25, 25, 
    25, 25, 25, 25, 25, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30), 
        LC_maneuver = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
        2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), class = "factor", .Label = c("lane_change1", 
        "lane_change2")), status = structure(c(3L, 1L, 1L, 1L, 2L, 
        3L, 1L, 1L, 1L, 2L, 3L, 1L, 1L, 1L, 2L, 3L, 1L, 1L, 1L, 2L
        ), .Label = c("changing lane", "End_of_LC", "Start_of_LC"
        ), class = "factor"), LC_duration = c(5, 5, 5, 5, 5, 8, 8, 
        8, 8, 8, 3, 3, 3, 3, 3, 11, 11, 11, 11, 11)), .Names = c("driver", 
    "gender", "speed", "age", "LC_maneuver", "status", "LC_duration"
    ), row.names = c(NA, -20L), class = "data.frame")

#2

Looking at your data, I would say that what you mean by "constant for a unit" is that the outcome variable has the same value across sets of rows because it was copied onto each of those rows during data prep. In wrapr package notation, your data looks like the following.

library("wrapr")
df <- build_frame(
   "driver", "gender", "speed", "age", "LC_maneuver" , "status"       , "LC_duration" |
   "Car1"  , "Male"  , 40.5   , 25   , "lane_change1", "Start_of_LC"  , 5             |
   "Car1"  , "Male"  , 40.5   , 25   , "lane_change1", "changing lane", 5             |
   "Car1"  , "Male"  , 40.7   , 25   , "lane_change1", "changing lane", 5             |
   "Car1"  , "Male"  , 41     , 25   , "lane_change1", "changing lane", 5             |
   "Car1"  , "Male"  , 41     , 25   , "lane_change1", "End_of_LC"    , 5             |
   "Car1"  , "Male"  , 38.9   , 25   , "lane_change2", "Start_of_LC"  , 8             |
   "Car1"  , "Male"  , 38.6   , 25   , "lane_change2", "changing lane", 8             |
   "Car1"  , "Male"  , 38.8   , 25   , "lane_change2", "changing lane", 8             |
   "Car1"  , "Male"  , 39     , 25   , "lane_change2", "changing lane", 8             |
   "Car1"  , "Male"  , 39.1   , 25   , "lane_change2", "End_of_LC"    , 8             |
   "Car2"  , "Female", 55     , 30   , "lane_change1", "Start_of_LC"  , 3             |
   "Car2"  , "Female", 55     , 30   , "lane_change1", "changing lane", 3             |
   "Car2"  , "Female", 55     , 30   , "lane_change1", "changing lane", 3             |
   "Car2"  , "Female", 55.2   , 30   , "lane_change1", "changing lane", 3             |
   "Car2"  , "Female", 55.3   , 30   , "lane_change1", "End_of_LC"    , 3             |
   "Car2"  , "Female", 25     , 30   , "lane_change2", "Start_of_LC"  , 11            |
   "Car2"  , "Female", 25.3   , 30   , "lane_change2", "changing lane", 11            |
   "Car2"  , "Female", 25.6   , 30   , "lane_change2", "changing lane", 11            |
   "Car2"  , "Female", 25.6   , 30   , "lane_change2", "changing lane", 11            |
   "Car2"  , "Female", 25.7   , 30   , "lane_change2", "End_of_LC"    , 11            )

What I would suggest is to limit to the rows that have "status"=="Start_of_LC" as they seem to have the future outcome you are trying to model and what is known at the beginning of the lane change all in one row. Then you use standard methods. You don't have much data (the example table only represents 4 events) so I show just a simple regression.

lm(LC_duration ~ gender + speed + age + LC_maneuver, subset(df, status == "Start_of_LC" ))

# Call:
#   lm(formula = LC_duration ~ gender + speed + age + LC_maneuver, 
#      data = subset(df, status == "Start_of_LC"))
# 
# Coefficients:
#             (Intercept)               genderMale                    speed                      age  LC_maneuverlane_change2
#                 12.6831                  -0.5528                  -0.1761                       NA                   2.7183

#3

@JohnMount thank you very much for your reply. Your suggestion is intuitive and that is in fact what I tried before. One important thing to mention here is that my original data set is thousands of rows long because there were multiple lane changes by the same driver and there were 50 drivers, each represented by a Car prefix followed by the number assigned in order of testing in a driving simulator. The sample data in my question is for the illustration of data structure only.

Let me explain the problem in more detail so that the goal is more understandable. There are 2 main aspects of this problem:

Lane change maneuver

The beginning and end of each lane change maneuver are indicated by Start_of_LC and End_of_LC in the status column. The LC_duration is calculated as the time difference between these two points. I want to understand the relationship between the LC_duration and other features, e.g. speed, during the lane change maneuver. Some drivers increased speed, but many kept it constant. Similarly, other features, e.g. the distance to the lead vehicle in the current lane and the distance to the lead vehicle in the target lane, also varied as the driver changed lanes. Therefore, in my opinion (maybe naive), I should consider all the data points that belong to a given lane change maneuver. However, the problem is that for a given lane change, the LC_duration variable remains constant. So, I'm confused about what modelling approach can be used.
As you suggested, I tried using only the data at the Start_of_LC. That didn't give many significant variables (in simple linear regression). But regardless, I believe that the variation of features during lane change is worth considering.
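
One way to use the within-maneuver variation while still keeping a single outcome per lane change is to collapse each maneuver to one row of summary features. The following is only a minimal sketch on the sample data from the question: the summary names (speed_start, speed_mean, speed_change) and the per_lc data frame are hypothetical, and it assumes the rows within each maneuver are already in time order. Note that summaries such as the mean speed are only known once the maneuver is over, so they suit explaining LC_duration rather than predicting it at the start.

```r
# Sample data from the question, rebuilt compactly (gender, age, status omitted).
df <- data.frame(
  driver = rep(c("Car1", "Car2"), each = 10),
  LC_maneuver = rep(rep(c("lane_change1", "lane_change2"), each = 5), times = 2),
  speed = c(40.5, 40.5, 40.7, 41, 41, 38.9, 38.6, 38.8, 39, 39.1,
            55, 55, 55, 55.2, 55.3, 25, 25.3, 25.6, 25.6, 25.7),
  LC_duration = rep(c(5, 8, 3, 11), each = 5)
)

# One chunk per (driver, maneuver) pair.
chunks <- split(df, list(df$driver, df$LC_maneuver), drop = TRUE)

# Collapse each chunk to a single row of hypothetical summary features.
per_lc <- do.call(rbind, lapply(chunks, function(d) {
  data.frame(
    driver = d$driver[1],
    LC_maneuver = d$LC_maneuver[1],
    LC_duration = d$LC_duration[1],                 # constant within a maneuver
    speed_start = d$speed[1],                       # assumes time-ordered rows
    speed_mean = mean(d$speed),
    speed_change = d$speed[nrow(d)] - d$speed[1]    # end speed minus start speed
  )
}))
rownames(per_lc) <- NULL
per_lc
```

The resulting table has one row per lane change, so LC_duration is no longer repeated and standard regression methods apply directly.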

Variation due to drivers themselves

No two drivers are alike. As there are 50 drivers in the original data set, there is some variation in the LC_duration both within each driver (a single driver changing lanes multiple times) and between individual drivers. I believe that this variation can be captured as a random effect, as used in mixed effects models. But these models, in my understanding, do not take into account the outcome variable being constant/repeated per unit (lane change).

My request to the community is to kindly direct me towards the modelling approaches that are suitable for this problem. Thank you.


#4

I would then suggest a multi-level modeling package such as lme4 or rstan, using the driver as one of the "right-hand side of the pipe" variables in the formula (in lme4 syntax, the grouping variable written after the |). What you need is probably covered in the lme4 examples and documentation.
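
As a rough illustration of that suggestion, here is a minimal random-intercept sketch with lme4. The data are synthetic and entirely made up (they only mimic the shape described in the thread: 50 drivers, several lane changes each, one row per lane change), and the names events, start_speed, and driver_effect are hypothetical. It is a starting point under those assumptions, not a recommended final model.

```r
library(lme4)

set.seed(1)

# Synthetic stand-in for the real data (all numbers are invented):
# 50 drivers, 6 lane changes each, one row per lane change.
n_drivers <- 50
n_lc <- 6
events <- data.frame(
  driver = factor(rep(paste0("Car", seq_len(n_drivers)), each = n_lc)),
  start_speed = runif(n_drivers * n_lc, min = 25, max = 55)
)

# Each driver gets their own offset, so durations cluster within driver.
driver_effect <- rnorm(n_drivers, mean = 0, sd = 1.5)
events$LC_duration <- 12 - 0.1 * events$start_speed +
  driver_effect[as.integer(events$driver)] +
  rnorm(nrow(events), sd = 1)

# Random intercept per driver: driver sits to the right of the | in the formula.
fit <- lmer(LC_duration ~ start_speed + (1 | driver), data = events)

fixef(fit)    # population-level intercept and speed slope
VarCorr(fit)  # between-driver and residual standard deviations
```

Because each row is one lane change, the outcome is not repeated within a unit; the driver term then accounts for the same driver contributing several lane changes.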


#5

@JohnMount thanks. I'll check them out.