Analyse Data blindly

Hello! I'm a student with only a little experience in R. For my student job I am currently working on an employee survey regarding diversity and inclusion. Because of data protection, I would like to analyse the obtained data without being able to actually view the data table. I imagine only knowing the structure of the data table, i.e. the meaning of columns (categorical and numerical variables) and rows (subjects). For t-tests, analyses of variance, linear regression etc., I would still have to clean the data set, though (missing values and outliers etc.). I'd be very grateful for any tips and advice!

Thanks a lot and best of health

D

What could possibly go wrong?

Every R problem can be thought of, with advantage, as the interaction of three objects: an existing object, x; a desired object, y; and a function, f, that will return a value of y given x as an argument. In other words, school algebra: f(x) = y. Any of the objects can be composites.

Usually, we have the benefit that x is, at the beginning, populated, and we can inspect it for properties that must be transformed through one or more fs before we can apply the f or fs that will yield y. Here, you must work backward from y to anticipate the transformations that are needed.

This suggests that y is the place to start—select the questions that you wish to put to the data. Under the principle of lazy evaluation there should be no effort applied to cleansing data that will not be used. So, identify all of the descriptive or test statistics first.

Next, phony up some data to apply to each of the tests (often you can crib this from the help page examples) and write your own functions to call them, such as

get_basics <- function(x) {
  # na.rm (not rm.na) drops missing values before computing
  Mean   <- mean(x, na.rm = TRUE)
  Median <- median(x, na.rm = TRUE)
  return(c(Mean, Median))
}

get_basics(mtcars$mpg)[1] - get_basics(mtcars$mpg)[2]
#> [1] 0.890625

(Not that you'd particularly need this example.)
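Phonying up the data itself can look like the sketch below: the same subject × variable shape you expect, but random content. Every column name, level, and the sample size here is a placeholder, not your real survey.

```r
# Simulated stand-in data: same shape as the survey, random content
set.seed(42)
n <- 100
fake <- data.frame(
  id        = seq_len(n),
  age       = sample(18:65, n, replace = TRUE),
  dept      = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  inclusion = sample(1:5, n, replace = TRUE)   # Likert-style item
)

# Exercise the planned tests against the fake data: form matters, answers don't
t.test(inclusion ~ dept == "A", data = fake)
summary(aov(inclusion ~ dept, data = fake))
```

The point of running the tests here is only to confirm that the inputs you plan to feed them will be accepted.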

From there, you'd note that some functions don't take an na.rm argument, and so you'd add to the data-cleaning task list some way of handling NAs.
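For instance, `cor()` takes a `use` argument rather than `na.rm`, which is one reason a single up-front NA strategy can be simpler than per-function arguments. A minimal sketch:

```r
x <- c(1, 2, NA, 4, 5)
y <- c(2, NA, 6, 8, 10)

# Per-function handling: mean() has na.rm, cor() uses 'use' instead
mean(x, na.rm = TRUE)                # 3
cor(x, y, use = "complete.obs")      # exactly linear pairs, so correlation 1

# Or drop incomplete cases once, up front
d <- data.frame(x, y)
d_clean <- d[complete.cases(d), ]
nrow(d_clean)                        # 3
```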

Note especially that at this point you care about the answers only insofar as the output of f(x) = y is correct in form. Of course, if you slam two random numeric vectors together with sufficiently large n, their correlation is likely to be close to zero most of the time.

With these goals in mind, we can begin the time-consuming part of the exercise, variously called scrubbing, munging, cleaning, rehabilitation, remediation, etc.

First, unless you know otherwise, assume that the otherwise correctly recorded data has passed through a spreadsheet before arriving at R. It will have one or more of the following defects:

  1. Multiline headers
  2. Mixing character and numeric types in the same column
  3. Variables as rows
  4. Illegal or cumbersome variable names
  5. Missing values
  6. Errors in computed values (e.g., division by zero)
  7. Obvious transcription errors (e.g., a 7-figure salary for a job title other than head football coach)
  8. Categorical variables that should be dummied
  9. Unknown

Second, decide which of these you care about. For example, don't spend time curating a variable that won't be used to create y.

Third, take a stand on data imputation. Do it or not.

Fourth, write a workflow to make the transformations you anticipate.
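Bundling the anticipated transformations into a single entry point means the same workflow can later be pointed at the real file sight unseen. The function name, file handling, and cleaning steps below are illustrative only.

```r
# One entry point: raw file in, cleaned analysis table out
clean_survey <- function(path) {
  d <- read.csv(path, stringsAsFactors = FALSE)
  names(d) <- make.names(tolower(names(d)))   # tame the variable names
  d <- d[complete.cases(d), ]                 # or impute, per the stand taken above
  d
}
```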

Fifth, apply to some public data repositories.

Sixth (optional), design a report format.

If done well, you should get an honors paper out of it.

Come back with specific questions as they arise.

Good morning and thanks a lot for your reply. I have to admit, I don't yet understand everything you wrote, but I'll dig into it! Also, thanks for answering follow-ups; I do have a specific question: Is there a way to create a data file that locks or disables any function in R to view the actual data table? I imagine the data being collected on a web page and automatically structured in a pre-specified subject × variable format. Yet I'd like to still be able to manipulate the data, that is, delete cases with outliers or missing values on specific variables. Thanks a lot!

The question isn't very clear in my opinion.

Let's keep it simple.

You have a survey. That is about diversity and inclusion. So let's say it has data that is something like this:

  • Ethnicity
  • Gender
  • Age
  • A measure of inclusiveness? Likert 1–5
  • A unique identifier (student number, email etc)

What is your data protection concern? The unique identifier, or being able to trace the data back to an individual? The unique identifier really has no place in the responses table. At worst, the response_id should be linked to a separate table. At best, the students table should simply record whether they answered, and nothing that links to what they answered.

But it may well remain possible to re-identify a 55-year-old black male in a class of 50 students. At that level you will never be able to truly anonymise without loss of data. There could be just one person who is black, or one person who is over 45, or in some classes one male/female!

If you have 5000 responses, these things become less likely. But you may still need to give careful thought to 'small numbers'. The trouble with diversity/inclusion is that you may want the small numbers... They are, after all, who you are worried about excluding, presumably!

So, returning to the question: how is the raw data held? If it's a CSV file, then of course R can't stop you looking at the data; you could just open the CSV in Excel. If it's a SQL database, then exactly what data you can access can be controlled by the database admin. That's not an R thing.

Once the data is in R, can you hide it? It seems a very odd concept. All I can say is that a variable within a function is not global. So (I think):

require(skimr)   # provides skim()

myFunc <- function() {
  myData <- read.csv("myData.csv")
  skim(myData)   # returns summaries; myData itself stays local
}

myFunc()

would, I think, mean you are not able to see the raw data from the CSV.

But nothing would stop you typing `read.csv("myData.csv")` elsewhere! And if you need your data more than once, that feels like you are just going to be re-reading the data a lot.
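One workable compromise along those lines is to keep all reading inside functions that return only aggregate results, never the raw table. The function name, file name, and formula below are placeholders.

```r
# Returns a model summary only; the raw table stays local to the call
run_model <- function(path, formula) {
  d <- read.csv(path)
  summary(lm(formula, data = d))
}

# Hypothetical usage: run_model("myData.csv", inclusion ~ age)
```

Of course, this is a discipline, not a lock: nothing in R prevents calling read.csv directly.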

What I expect you are trying to do is suppress small numbers from your report, but not your analysis. The analyst should be able to see the raw data in my opinion and be bound by contract to confidentiality and to not try to re-identify a subject. The end report however will end up (or should be assumed to) in the public domain and therefore must make it impossible for anyone to re-identify a subject as they are not bound by any rules.

You might want to look at sdcTable and easySdcTable packages.

I can’t think of any way to do that.

It sounds like you're trying to protect the data from yourself. As @CALUM_POLWART points out, there are ethical concerns even with record-identification masking. Yet that's not the only ethical concern. A blind pilot can, in theory, safely operate an aircraft, but wouldn't, because the lack of feedback in executing the required procedures would make the execution unreliable.

I failed to make explicit a veiled assumption. Either you must do this with your eyes covered (in which case you are being trusted with the near equivalent of actually seeing the data and keeping it confidential), or you can only write a script, based on the characteristics given to you, that others will run against the database.

In the first case, it's a matter of degree of trust; in the second, whether the degradation in the usefulness of your work is a price worth bearing for a lack of trust concerning your ability to maintain the data in confidence.

I have seen people create a simulated dataset, based on the real dataset. Then you could write code against that to run on the original.

But.

Who is creating the simulated dataset?

I think that can be done readily. It's just a matter of using {charlatan} or a similar approach to generate arbitrary-content numeric, character and factor variables. But of course, GIGO: any statistical tests run against them will be correct in form only, assuming the characteristics of the real data don't contain any singularities, etc. It's not really satisfactory as anything other than an academic exercise. (I can't decide if the pun is intended.)
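Even without {charlatan}, a structure-only simulation is a few lines of base R. Everything below (column names, levels, n) is an assumption standing in for the real table's structure.

```r
# Simulate a table with the survey's shape but arbitrary content
set.seed(1)
n <- 200
sim <- data.frame(
  gender    = factor(sample(c("f", "m", "d"), n, replace = TRUE)),
  age       = sample(18:67, n, replace = TRUE),
  inclusion = sample(1:5, n, replace = TRUE)
)

# Any test run on this is correct in form only (GIGO)
summary(aov(inclusion ~ gender, data = sim))
```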

And for a truly heavy-duty approach that is totally fact-free.

Thank you very much for your responses and your time. I am trying to protect the data against my own eyes, exactly. And yes, it is because I'd like to preclude myself from tracing specific data back to individuals, so that I can call the survey anonymous. Thanks again for your thoughts, I appreciate the quick responses very much! I guess it is just not feasible to try and block the data from being viewed.


I think it is very legitimate to want to make sure you can't EASILY identify a participant. No name, no DoB, no student ID. DoB presumably should be changed to age; the others are not required in the analysis and so can be removed early on.

If you are concerned that you will still be able to identify a participant, then: 1. You are too close to the subjects to give a good analysis. 2. If you are not too close, then how are you identifying them? If you need to go to some other data source to check whether there is more than one 55-year-old black student in a class... here is an idea: don't go to that data source. Assume there is for the analysis. Assume there isn't for publication.

Thank you, Calum. Yes, I am not going to include names, employee IDs or dates of birth. I am a working student, that is, I am an employee myself. I am the subjects' coworker, and thus one might say I am close. Of course, I can analyse the data without actively trying to identify individuals, and I am going to, haha. But there are also legal restrictions and requirements to ensure that participants do not have to simply trust me on that. Anonymous has to mean anonymous. And I consider anonymity crucial for participation rates. I hoped there might be a coding solution to this problem. So thanks again for answering my questions! Have a nice day!

As you are trying to "hide" the data from yourself, I may suggest you subset your dataset to the minimal required set of columns. For instance, you may keep only the columns age + responses when you want to see the effect of age. Then move to the next independent variable. Load more than one independent variable only if you need them. Thus you will see connections between independent variables only if you do it intentionally.
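A sketch of that column-at-a-time discipline, with an assumed structure standing in for the real survey table:

```r
# Stand-in for the full survey table (column names assumed)
set.seed(3)
full <- data.frame(
  age       = sample(18:65, 50, replace = TRUE),
  gender    = sample(c("f", "m"), 50, replace = TRUE),
  inclusion = sample(1:5, 50, replace = TRUE)
)

# Load only the variables needed for the current question
age_only <- full[, c("age", "inclusion")]
names(age_only)
```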

Thanks, Dobrokhotov! Unfortunately, I also need to analyse interaction effects in order to control for intersectionality. For example, a nonexistent difference between heterosexual and LGBTQIA respondents on a psychometric measure of inclusion might simply be explained by the heterosexual subsample encompassing most employees with disabilities, etc. Thanks for joining in, really!
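For the record, an interaction like that is one extra term in the model formula; the variable names and simulated data below are illustrative only.

```r
# Simulated stand-in data for an interaction model (names assumed)
set.seed(7)
n <- 120
d <- data.frame(
  orientation = sample(c("hetero", "lgbtqia"), n, replace = TRUE),
  disability  = sample(c("yes", "no"), n, replace = TRUE),
  inclusion   = sample(1:5, n, replace = TRUE)
)

# Main effects plus their interaction: a * b expands to a + b + a:b
fit <- lm(inclusion ~ orientation * disability, data = d)
coef(summary(fit))
```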