Creating a loop for validity variables

jmervis · September 26, 2020, 4:52pm

I have the following data from an MTurk study:

data.frame(
Random.ID = c(46392L,91734L,98884L,50989L,92380L,
32805L,85910L,83298L,28722L,60690L),
CRSBCIS = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L),
CRSCAPE = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
CRSCAPE2 = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
CRSCMQ = c(11L, 11L, 11L, 11L, 11L, 11L, 10L, 11L, 11L, 11L),
CRSDemo = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
CRSDPB = c(8L, 8L, 8L, 8L, 8L, 8L, 10L, 8L, 8L, 8L),
CRSDUQ = c(3L, 3L, 3L, 3L, 3L, 3L, 5L, 3L, 3L, 3L),
CRSDUQ2 = c(2L, 2L, 2L, 2L, 2L, 2L, 6L, 2L, 2L, 2L),
CRSGCBS = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
CRSIDI = c(13L, 13L, 13L, 13L, 7L, 13L, 13L, 13L, 13L, 13L),
CRSIDI2 = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
CRSNFC = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L),
CRSTSRQ = c(7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L)
)

Running studies on MTurk requires figuring out which participants are bots or are randomly responding vs. real/good effort data. The first column in this data frame is the participant ID. I need this preserved in a final data frame consisting of an ID column and a validity variable (example at the end), which could be coded as 0,1 or whatever makes it clear which data to toss and which participants to pay for real work vs. which to reject. Once I have this sorted out we are going to open the floodgates and run with hundreds of participants.

The other column variables come from a method of screening out bots/random responders using the Conscientious Responders Scale (if you're curious: https://journals.sagepub.com/doi/pdf/10.1177/2158244014545964)

Each question reads something like "To answer this question, choose 'All of the above'", which is coded as 4 in the case of the second variable in the data frame. Each questionnaire gets one or two of these, depending on length. I need to create a new variable that operationalizes valid responding as >= ~80% correct responses across these variables (columns 2 through 14).

The correct answers to the variables, in order from 2 through 14 are: (4,1,2,1,1,8,3,2,3,13,3,4,7).

An example of my ideal final data frame would look something like this:

data.frame(
Random.ID = c(46392L,91734L,98884L,50989L,92380L,
32805L,85910L,83298L,28722L,60690L),
Valid = c(0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L)
)

I think this can be done by creating a new empty variable and then within a loop checking to see if these variables are answered correctly, adding 1 to that variable if they are, moving on to the next variable and repeating this process for all columns. Then that number would be divided by 13. I'm not new to R, but I have very little experience writing loops and am not sure where to start.
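The loop described above could be sketched roughly as follows. This is a minimal illustration, not a tested solution for the real data set: the toy data frame, the respondent IDs, and the names `key`, `n_correct`, and `prop_correct` are all made up for the example, only three of the thirteen check columns are used, and treating a missing answer as incorrect is an assumption.

```r
# Toy data: three hypothetical respondents and three of the check items.
df <- data.frame(
  Random.ID = c(101L, 102L, 103L),
  CRSBCIS   = c(4L, 4L, 2L),
  CRSCAPE   = c(1L, 1L, 3L),
  CRSCAPE2  = c(2L, 5L, 5L)
)

# Correct answers, named after the columns they belong to.
key <- c(CRSBCIS = 4L, CRSCAPE = 1L, CRSCAPE2 = 2L)

# Start every respondent at zero correct answers.
n_correct <- integer(nrow(df))

# Visit one check column at a time and add 1 wherever the answer matches.
# Assumption: a missing answer (NA) counts as incorrect.
for (item in names(key)) {
  hit <- !is.na(df[[item]]) & df[[item]] == key[[item]]
  n_correct <- n_correct + hit
}

# Proportion correct, then a 0/1 validity flag at the 80% cutoff.
prop_correct <- n_correct / length(key)
result <- data.frame(
  Random.ID = df$Random.ID,
  Valid     = as.integer(prop_correct >= 0.8)
)
result
```

With all 13 check items in `key`, a cutoff of 0.8 works out to at least 11 correct answers, since 10/13 is about 0.77 and 11/13 is about 0.85.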

Thank you in advance for any help!

GreyMerchant · September 26, 2020, 5:37pm

Hello,

I can think of several ways of solving this. I just want you to confirm something before I create a solution. With regard to the quote below: is this the only pattern we are checking for across all respondents?

The correct answers to the variables, in order from 2 through 14 are: (4,1,2,1,1,8,3,2,3,13,3,4,7).

GreyMerchant · September 26, 2020, 5:56pm

Hello,

See below. I made one change to your data frame: I set the first value of CRSCMQ to 1, since otherwise no one would meet that condition. As you will see in this example, one person then meets the condition and everyone else does not.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- data.frame(
  Random.ID = c(46392L,91734L,98884L,50989L,92380L,
                32805L,85910L,83298L,28722L,60690L),
  CRSBCIS = c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4),
  CRSCAPE = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  CRSCAPE2 = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
  CRSCMQ = c(1, 11, 11, 11, 11, 11, 10, 11, 11, 11),
  CRSDemo = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  CRSDPB = c(8, 8, 8, 8, 8, 8, 10, 8, 8, 8),
  CRSDUQ = c(3, 3, 3, 3, 3, 3, 5, 3, 3, 3),
  CRSDUQ2 = c(2, 2, 2, 2, 2, 2, 6, 2, 2, 2),
  CRSGCBS = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3),
  CRSIDI = c(13, 13, 13, 13, 7, 13, 13, 13, 13, 13),
  CRSIDI2 = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3),
  CRSNFC = c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4),
  CRSTSRQ = c(7, 7, 7, 7, 7, 7, 7, 7, 7, 7)
)

df_output <- df %>% mutate(Qualify = case_when(
  CRSBCIS == 4 &
    CRSCAPE == 1 &
    CRSCAPE2 == 2 &
    CRSCMQ == 1 &
    CRSDemo == 1 &
    CRSDPB == 8 &
    CRSDUQ == 3 &
    CRSDUQ2 == 2 &
    CRSGCBS == 3 &
    CRSIDI == 13 &
    CRSIDI2 == 3 &
    CRSNFC == 4 &
    CRSTSRQ == 7 ~ 1,
  TRUE ~ 0)) %>% select(Random.ID,Qualify)


df_output
#>    Random.ID Qualify
#> 1      46392       1
#> 2      91734       0
#> 3      98884       0
#> 4      50989       0
#> 5      92380       0
#> 6      32805       0
#> 7      85910       0
#> 8      83298       0
#> 9      28722       0
#> 10     60690       0

Created on 2020-09-26 by the reprex package (v0.3.0)

jmervis · September 26, 2020, 6:43pm

I ran the code you used here:

df_output <- DF %>% mutate(Qualify = case_when(
CRSBCIS == 4 &
CRSCAPE == 1 &
CRSCAPE2 == 2 &
CRSCMQ == 11 &
CRSDemo == 1 &
CRSDPB == 8 &
CRSDUQ == 3 &
CRSDUQ2 == 2 &
CRSGCBS == 3 &
CRSIDI == 13 &
CRSIDI2 == 3 &
CRSNFC == 4 &
CRSTSRQ == 7 ~ 1,
TRUE ~ 0)) %>% select(Random.ID,Qualify)

I changed the 1 in CMQ back to an 11 because when I ran it with a 1 nothing flagged, but when I changed it to 11 there were two values that flagged. I checked them in the .csv file and it looked like it worked perfectly in identifying who had 100% of the validity measures. I like this because it's a conservative measure.

Can you explain what this does in the code: "CRSTSRQ == 7 ~ 1,
TRUE ~ 0"? Since I'm learning, this would be helpful. Also, it doesn't seem like the code you had here was able to set a threshold at 80%, but maybe I don't need that since I should be able to get enough participants who have 100%.

jmervis · September 26, 2020, 6:43pm

Yes, that is correct!

GreyMerchant · September 26, 2020, 10:03pm

Hello @jmervis,

I retracted that post because it didn't work for your 80% cutoff criterion, but yes, as is it will work for 100%, so it is as conservative as you can get. (I've re-added the post so you can mark it as the solution if you like.)

In terms of what it does: mutate adds a whole new variable, which is a column of values. Inside it, case_when evaluates the condition for each row. Because the comparisons are joined with &, a row only satisfies the condition if every one of them is TRUE: CRSBCIS must be 4, and CRSCAPE must be 1, and so forth. This is a compact way to write the check.

If all of the conditions below are met, it allocates 1 to our Qualify column:

CRSBCIS == 4 &
CRSCAPE == 1 &
CRSCAPE2 == 2 &
CRSCMQ == 11 &
CRSDemo == 1 &
CRSDPB == 8 &
CRSDUQ == 3 &
CRSDUQ2 == 2 &
CRSGCBS == 3 &
CRSIDI == 13 &
CRSIDI2 == 3 &
CRSNFC == 4 &
CRSTSRQ == 7 ~ 1,

If any of the conditions is not met, the row falls through to the catch-all below, which allocates 0:

 TRUE ~ 0
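To make the two-sided formulas concrete, here is a small sketch on a made-up two-item data frame. The IDs and the name `toy_out` are hypothetical, and only two of the thread's columns are used.

```r
library(dplyr)

# Hypothetical two-item example; the column names are borrowed from the thread.
toy <- data.frame(
  Random.ID = c(1L, 2L, 3L),
  CRSBCIS   = c(4, 4, 2),
  CRSCAPE   = c(1, 3, 1)
)

toy_out <- toy %>%
  mutate(Qualify = case_when(
    # Left of ~: a logical test, evaluated for every row.
    # Right of ~: the value assigned to rows where that test is TRUE.
    CRSBCIS == 4 & CRSCAPE == 1 ~ 1,
    # TRUE matches every row not already matched above, so it acts as "else".
    TRUE ~ 0
  ))

toy_out$Qualify
```

Only the first row passes both comparisons, so it gets 1; the other two rows fall through to `TRUE ~ 0`.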

At the end I am simply returning only the ID and Qualify value but you can keep all of the columns if you like. I hope that explains it sufficiently?
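For the 80% cutoff raised earlier in the thread, one possible approach is to count how many checks each respondent passes rather than requiring all of them. The sketch below is a base R variant, not the dplyr code above. It uses a made-up data frame with five of the check columns; the IDs and the names `key`, `answers`, `correct`, and `prop_correct` are hypothetical, and counting a missing answer as incorrect is an assumption.

```r
# Toy data: four hypothetical respondents, five of the check columns.
df <- data.frame(
  Random.ID = c(101L, 102L, 103L, 104L),
  CRSBCIS   = c(4, 4, 4, NA),
  CRSCAPE   = c(1, 1, 3, 1),
  CRSCAPE2  = c(2, 2, 2, 2),
  CRSDemo   = c(1, 1, 5, 1),
  CRSDPB    = c(8, 10, 10, 10)
)

# Correct answers, named after the columns they belong to.
key <- c(CRSBCIS = 4, CRSCAPE = 1, CRSCAPE2 = 2, CRSDemo = 1, CRSDPB = 8)

# Compare every column with its own key value; the result is a logical matrix
# with one row per respondent and one column per check item.
answers <- as.matrix(df[names(key)])
correct <- sweep(answers, 2, key, FUN = "==")

# Assumption: a missing answer (NA) counts as incorrect.
correct[is.na(correct)] <- FALSE

# Share of checks passed per respondent, then the 0/1 flag at 80%.
prop_correct <- rowSums(correct) / length(key)
result <- data.frame(
  Random.ID = df$Random.ID,
  Valid     = as.integer(prop_correct >= 0.8)
)
result
```

Here the first two respondents pass (5 of 5 and 4 of 5 correct) and the last two do not (2 of 5 and 3 of 5). Setting the cutoff to 1 instead of 0.8 should reproduce the all-or-nothing behaviour of the case_when version.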

jmervis · September 26, 2020, 10:45pm

Yes, thank you! I appreciate the help.

GreyMerchant · September 26, 2020, 10:56pm

Great! If you feel the original response was the solution, just mark it as such.
