Thinking like a data scientist | approaching data problems

Hey there,

By profession, I'm a sociologist, but I learned R to improve my overall research skills. One challenge I often encounter while working with data is thinking like a data scientist when approaching a data problem. Let me give you an example.

Let's say I have to remove duplicated emails from the dataset below, following these rules when deciding which duplicates to remove:

  1. If the same person was both a beneficiary and not a beneficiary, then remove the entry related to the non-beneficiary, prioritising that person as beneficiary. We have many more non-beneficiaries than beneficiaries.

  2. If the person appears several times as a non-beneficiary, prioritise keeping the type with the smallest number of entries.

  3. If the person appears several times as a beneficiary, prioritise keeping the type with the smallest number of entries.

library(tibble)

df <- tibble(
  email = c("", "", "", "", "", "", ""),
  beneficiary = c("Yes", "Yes", "Yes", "No", "No", "No", "No"),
  type = c("ERC", "FET", "ERC", "FET", "ERC", "ERC", "FET")
)

My challenge is often the approach to a problem like this. How should I start thinking about how to solve it? If I'm being honest with myself, most often I start typing code without actually knowing what my approach is.

How would you approach this?

Any ideas and thoughts are welcome. Thank you!

Hey @Paulius

A good starting point is to take a look at algorithms.
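For this particular task, one way to think algorithmically is to restate the three rules as a single ranking: within each email, beneficiary rows come before non-beneficiary rows, and among rows with the same status the rarer type comes first. Then the job reduces to "sort, and keep the first row per email". The sketch below is only one possible reading, not a definitive solution. It assumes dplyr is available, that "smallest number of entries" means the type that occurs least often across the whole dataset, and it uses made-up placeholder emails (the `example.org` addresses and the `type_n` helper column are hypothetical, not from the original data).

```r
library(dplyr)
library(tibble)

# Hypothetical data: these email addresses are placeholders invented for the sketch.
df <- tibble(
  email = c("a@example.org", "a@example.org", "b@example.org", "b@example.org",
            "c@example.org", "c@example.org", "d@example.org"),
  beneficiary = c("Yes", "No", "Yes", "Yes", "No", "No", "No"),
  type = c("ERC", "FET", "ERC", "FET", "ERC", "FET", "ERC")
)

deduped <- df %>%
  # Assumption: "smallest number of entries" is the type's row count in the whole dataset.
  add_count(type, name = "type_n") %>%
  # Rule 1: beneficiaries first. Rules 2 and 3: rarer type first.
  arrange(email, desc(beneficiary == "Yes"), type_n) %>%
  # Keep only the top-ranked row for each email.
  distinct(email, .keep_all = TRUE) %>%
  # Drop the helper column again.
  select(-type_n)

print(deduped)
```

Writing the rules down as a sort order before typing any code is the useful part. If the rules change, only the `arrange()` call needs to change.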
