splitting choice experiment (predict prob + assess accuracy)

Hey, I am working with a choice experiment dataset.

I want to split the data (e.g., 80/20) to be able to run prediction and accuracy tests.

The problem is that I can't just split the data arbitrarily, because I have to keep the alternatives from the same choice set together (in my case a choice set has 2 alternatives). When I split the data I get an error (see below), which I think means I have split right down the middle of a choice set.

Tips? Can I specify a particular line to split at?

The error I am getting:
Error in if (abs(x - oldx) < ftol) { :
missing value where TRUE/FALSE needed

What you are looking for is stratified sampling and there are many packages that have this capability.

For example, take a look at the initial_split function in the rsample package. It has a strata argument that you can use to make sure your training and test splits both have roughly the same distribution of the target variable as the original dataset (e.g., if you have 90% of class 0 and 10% of class 1, then both training and testing will have roughly 90% of class 0 and 10% of class 1).
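A minimal sketch of what that looks like, assuming rsample is installed. The toy data frame and its columns (y, x) are made up for illustration and use an 80/20 class mix rather than 90/10:

```r
library(rsample)

set.seed(42)  # make the split reproducible

# Made-up data: 80 rows of class "0" and 20 rows of class "1"
toy <- data.frame(
  y = factor(rep(c("0", "1"), times = c(80, 20))),
  x = rnorm(100)
)

# 80/20 split, stratified on the target variable y
toy_split <- initial_split(toy, prop = 0.8, strata = y)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Class shares in each piece, to compare against the original 0.8 / 0.2
prop.table(table(toy_train$y))
prop.table(table(toy_test$y))
```

Note that strata only balances proportions. It samples individual rows, so on its own it does not keep related rows (such as the two alternatives of one choice set) on the same side of the split.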

Also, your error does not necessarily come from incorrect splitting, but it's difficult to say more without a reproducible example. Here is some info on how to create one:

This doesn't read as a stratification issue to me. It sounds like you have explicit relationships between rows, rather than a need to preserve the proportions of some values.

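library(tidyverse)

set.seed(123) # make the random split reproducible

# Toy data: 10 choice sets with 2 alternatives (rows) each
(paired_df <- tibble(
  id = 1:20,
  choice_set = rep(1:10, each = 2),
  low_or_high = rep(0:1, 10),
  value = runif(20)
))

# Sample whole choice sets rather than individual rows
(choices <- unique(paired_df$choice_set))

(train_i <- sample(choices,
                   size = round(0.8 * length(choices)), # 80 %
                   replace = FALSE))
(test_i <- setdiff(choices, train_i))

# Every row of a choice set lands on the same side of the split
(train_df <- filter(paired_df,
                    choice_set %in% train_i))

(test_df <- filter(paired_df,
                   choice_set %in% test_i))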
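The same idea can be wrapped up and checked in base R. This is only a sketch: split_by_group is a hypothetical helper name, and the toy columns (choice_set, alt, value) are assumptions rather than anything from the real dataset.

```r
# Hypothetical helper: split a data frame by whole groups, not by rows
split_by_group <- function(df, group_col, prop = 0.8) {
  groups  <- unique(df[[group_col]])
  n_train <- round(prop * length(groups))
  # Sample positions rather than values, so a single numeric group id
  # is not misread by sample() as "1:n"
  train_g  <- groups[sample.int(length(groups), n_train)]
  in_train <- df[[group_col]] %in% train_g
  list(train = df[in_train, , drop = FALSE],
       test  = df[!in_train, , drop = FALSE])
}

set.seed(1)

# Made-up data: 10 choice sets with 2 alternatives each
toy <- data.frame(
  choice_set = rep(1:10, each = 2),
  alt        = rep(1:2, times = 10),
  value      = runif(20)
)

parts <- split_by_group(toy, "choice_set", prop = 0.8)

# Sanity checks: no choice set appears on both sides,
# and every choice set still has both of its rows
length(intersect(parts$train$choice_set, parts$test$choice_set)) == 0
all(table(parts$train$choice_set) == 2)
all(table(parts$test$choice_set) == 2)
```

Running checks like these on the real data should show quickly whether a choice set was cut in half by the split.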

Yeah, you are probably correct. Still, @scottcole2, it's probably best to clarify the structure of your data and what you are trying to achieve anyway.

@nirgrahamuk great! I appreciate this! However, I need a small fix ... Two things, but I think the second is the bigger problem:

  1. Your example code defines the choice variable "low_or_high" as 0 or 1 (I accidentally called this "Choice_set" in my previous post). My dataset defines the choice variable ("Choice") as TRUE or FALSE. No big deal, but must I recode this for your example code to work when creating the "train" and "test" data frames?

  2. You are right that the explicit relationship I need to preserve is the choice question itself ("QES", which is equivalent to your "choice_set"). However, your code seems to result in a loss of balance across my 4 blocks for some reason. I have approx. 250 individuals randomly distributed across each block, but this distribution is lost after running your code, as you can see in the output of the 3 "summary" commands. I expected train_df and test_df to roughly mirror the spread in the original data frame (cedata1), but they do not. What do you think happened?

Thanks.
[Screenshot: output of the three summary commands]

This doesn't matter; these variables are just along for the ride and are purely illustrative.

I'll refer you back to the first response on this thread and its request for a reprex. At this point I feel I can't help until I see a representative example, as it's necessary for understanding the issues (in lieu of a perfect explanation).
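In the meantime, here is one thing that might be worth trying, though it is only a guess without seeing the data: if block balance matters, sample whole respondents within each block instead of sampling choice questions directly. A minimal base R sketch, where the toy data and every column name (id, block, QES, alt) are assumptions for illustration:

```r
set.seed(2020)

# Made-up data: 40 respondents in 4 blocks of 10,
# each answering 3 choice questions (QES) with 2 alternatives
n_id <- 40
respondents <- data.frame(id = 1:n_id, block = rep(1:4, each = 10))
toy <- merge(respondents,
             expand.grid(id = 1:n_id, QES = 1:3, alt = 1:2))

# Sample 80% of the respondents separately within each block
ids_by_block <- split(respondents$id, respondents$block)
train_ids <- unlist(lapply(ids_by_block, function(ids) {
  ids[sample.int(length(ids), round(0.8 * length(ids)))]
}))

# All rows of a respondent (and so of each choice set) stay together
train_df <- toy[toy$id %in% train_ids, ]
test_df  <- toy[!toy$id %in% train_ids, ]

# Share of rows per block in the full data and in each piece
prop.table(table(toy$block))
prop.table(table(train_df$block))
prop.table(table(test_df$block))
```

Whether this fits depends on how QES is coded in the real data (for example, whether question ids repeat across respondents or blocks), which is exactly what a reprex would show.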