Is there an R package to help me find duplicate or invalid survey responses?

Hi everyone

I have a dummy dataset I need to do some testing on before our database is moved into production. It is for a survey that we are sending out to 3,000 people, with a gift card for completing it.
Is there an R package that can help me find invalid responses (where the person just picked answers at random to get to the gift card) and identify possible duplicates (people trying to get more than one gift card)? The questions range from Y/N to 5-point scales to free text. I’m pretty sure I could write a SQL query to find some, but I want the highest possible confidence that the list I provide is at least 98% distinct and valid participants.
Thanks and have a great day
Chris

Hi

As far as I know there are no universal methods (and thus no packages) to solve this problem. In a survey, you can't know for sure whether someone just has a different opinion or is choosing responses at random, as there are no (in)correct answers. There are a few things I can think of that might guide you, but they won't guarantee you can separate out the cheaters:

  • If you have it, the time spent on each question or the whole survey can give a clue as people who guess randomly are usually much faster in completing a survey than those who read carefully and consider their answers.
  • Looking at free text answers, very short or empty answers might indicate that the respondent is not taking the survey seriously. You can easily check the length of each text answer and review the shortest ones.
  • For the structured answers (Y/N and 1-5 scales) you can create a sub-dataset and perform something like multiple correspondence analysis (or another dimensionality reduction method) on it. You can then plot the result and see if clusters form. Depending on the type of survey, people with similar answers / opinions will cluster together. Outliers here might be people with unique/extreme opinions, or cheaters whose pattern of answers does not make sense.
  • The clustering/similarity methods can also be used to find (near) duplicates, although it will be difficult to prove that it would be the same person.
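To make the first two checks and the duplicate check concrete, here is a minimal base R sketch. Everything in it is an assumption for illustration: the toy data, the column names (`seconds`, `comment`, `q1` to `q3`), and the cutoffs (30% of the median completion time, fewer than 5 characters) are made up and would need tuning on the real survey. It only catches exact duplicates; near duplicates would need a similarity measure on top of this.

```r
# Hypothetical toy data; column names and values are invented for illustration.
responses <- data.frame(
  id      = 1:6,
  seconds = c(410, 385, 42, 450, 390, 38),   # total completion time
  q1      = c("Y", "N", "Y", "Y", "N", "Y"),
  q2      = c(4, 2, 3, 4, 2, 3),
  q3      = c(5, 1, 3, 5, 1, 3),
  comment = c("The onboarding was clear and helpful.",
              "Too many emails.",
              "",
              "The onboarding was clear and helpful.",
              "Support was slow to reply.",
              "ok"),
  stringsAsFactors = FALSE
)

# 1. Speeders: completed much faster than the typical respondent.
#    The 0.3 multiplier is an arbitrary starting point, not a standard.
speed_cutoff <- 0.3 * median(responses$seconds)
responses$too_fast <- responses$seconds < speed_cutoff

# 2. Very short or empty free-text answers (fewer than 5 characters).
responses$short_text <- nchar(trimws(responses$comment)) < 5

# 3. Exact duplicates across all answer columns (flags every copy,
#    not only the second occurrence).
answer_cols <- c("q1", "q2", "q3", "comment")
responses$dup_answers <-
  duplicated(responses[answer_cols]) |
  duplicated(responses[answer_cols], fromLast = TRUE)

# Rows worth a manual review: any flag raised.
flag_cols <- c("too_fast", "short_text", "dup_answers")
flagged <- responses[rowSums(responses[flag_cols]) > 0, c("id", flag_cols)]
print(flagged)
```

A flag here is only a reason to look at a row by hand, not proof of cheating: a fast, terse respondent can still be genuine.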

Hope this helps a bit
PJ

Thanks PJ, that does actually help, and just means I need to allocate more time than I initially planned to on this project. Appreciate your time and hope you have a great day! (By chance do you know of any books that go into this? I’ve looked but can’t seem to find any).
Chris

Hi,

First of all, I updated one small bit in my answer: multiple correspondence analysis (MCA) is the one that's used for categorical variables and principal component analysis (PCA) is for continuous variables, so since the survey responses are categorical, you should use MCA.

I'm not aware of any particular books, but there are many tutorials online (just google) and videos on YouTube explaining it. R has a package called FactoMineR that implements both MCA and PCA, and there are specific tutorials on the topic.

There will probably be some data cleaning and manipulation involved, so how easy this will be depends on your coding skills. The MCA itself should be easy once you have correctly prepared the dataset. After that, it is a matter of filtering and evaluating the outliers.
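As a rough idea of what that could look like, here is a sketch using `MCA()` from FactoMineR on made-up data. The question names, the random answers, and the 95th-percentile cutoff are all assumptions for illustration; on purely random data like this there is no real structure to find, so it only shows the mechanics.

```r
library(FactoMineR)

set.seed(1)
n <- 60

# Hypothetical structured answers. MCA expects categorical input,
# so every column is converted to a factor (including the 1-5 scales).
survey <- data.frame(
  q1 = factor(sample(c("Y", "N"), n, replace = TRUE)),
  q2 = factor(sample(1:5, n, replace = TRUE)),
  q3 = factor(sample(1:5, n, replace = TRUE)),
  q4 = factor(sample(c("Y", "N"), n, replace = TRUE))
)

# Keep the first two dimensions; graph = FALSE suppresses the default plots.
res <- MCA(survey, ncp = 2, graph = FALSE)
coords <- res$ind$coord   # one row per respondent

# Distance of each respondent from the centre of the cloud.
# The 95th-percentile cutoff is an arbitrary choice for this sketch.
dist_center <- sqrt(rowSums(coords^2))
cutoff <- quantile(dist_center, 0.95)
outliers <- which(dist_center > cutoff)

# Plot respondents and highlight the ones furthest from the centre.
plot(coords[, 1], coords[, 2],
     xlab = "Dimension 1", ylab = "Dimension 2",
     col = ifelse(dist_center > cutoff, "red", "grey40"), pch = 19)

print(survey[outliers, ])
```

The rows printed at the end are candidates for manual review, for the same reason as before: an unusual answer pattern may be a cheater or just someone with an unusual opinion.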

Grtz,
PJ

You are a champion, thank you so much. Although there isn’t an “easy” answer I feel much better about this.
Chris
