Creating a subset of data based on id numbers

naja · November 14, 2019, 8:11pm

Dear RStudio community,
I am trying to create a subset of some data, but given the nature of the data I need certain conditions to be met.

The problem is that each of my rows contains a single payment, and each payment has a variable specifying a contact number. Some customers have multiple payments, which fall in different rows but are labelled with the same contact number.

Therefore, I need the subset to take into account that if it selects one payment from a contact number, it must include all other rows (payments) containing that contact number.

See an example of customer 18445 below. I need a subset to include all 9 payments that have been made under this contact number, if it is randomly selected.

I hope this makes some sense,
Many thanks,
Naja

woodward · November 14, 2019, 8:24pm

Sure, it's super easy in R. What have you tried?

library(dplyr)

your_data_frame %>%
    filter(Contact_number == 18445)

naja · November 15, 2019, 8:30am

Hi woodward, thank you for your help, but unfortunately this is not what I mean.
I need a random selection of around 3000 data entries from a dataset containing 52000 data entries. Some contact numbers are repeated several times, so if the random selection chooses one row containing a given contact number, it needs to select all rows containing that contact number. The subset will therefore end up containing many different contact numbers and not just 18445 (this was just an example to show that the contact numbers are sometimes repeated).
Do you think there is a solution for this?
Many Thanks,
Naja

woodward · November 15, 2019, 8:45am

Do you need exactly 3000 rows? That would make it rather difficult, because you would need to "randomly" choose contact numbers whose rows add up to exactly 3000, which may not even be possible.

What do you mean by random? If you take all the rows that match a contact number, this will bias the sample. Or can you randomly sample the contact numbers?

The easiest approach is to make a list of contact numbers and choose from it randomly until you have at least 3000 rows. But this might not be what you want.
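
One reading of the bias concern above is that picking rows first, and then pulling in every row for the contacts that were hit, favours contacts with many payments. The base R sketch below illustrates this on made-up data; the data frame `payments`, the sample size of 200 and the seed are all assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# hypothetical toy data: 1000 contacts with between 1 and 10 payments each
rows_per_contact <- rep(1:10, times = 100)
payments <- data.frame(
  Contact_number = rep(seq_along(rows_per_contact), times = rows_per_contact)
)

# approach A: sample rows, then take every contact that was hit
hit_rows <- sample(nrow(payments), 200)
contacts_a <- unique(payments$Contact_number[hit_rows])

# approach B: sample the contact numbers themselves
contacts_b <- sample(unique(payments$Contact_number), 200)

# average number of payments per selected contact under each approach
mean_a <- mean(rows_per_contact[contacts_a])
mean_b <- mean(rows_per_contact[contacts_b])
c(row_based = mean_a, contact_based = mean_b)
```

In this toy setup the row-based average should come out noticeably higher than the contact-based one, because a contact with ten payments is ten times as likely to be hit as a contact with one.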

naja · November 15, 2019, 9:09am

Hi woodward,
No, it does not need to be exactly 3000. I just need it to go by contact number, as I will use the sample to calculate a retention rate and therefore need all payments made under each contact number that is selected.
It would be good to randomly sample the contact numbers and then get a subset from that.
Naja
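
The idea described in this post, sampling contact numbers first and then keeping every payment for them, can be sketched in base R as follows. The data frame `payments`, its column names and `n_contacts` are hypothetical and stand in for the real data.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# hypothetical toy data; the column names are assumptions
payments <- data.frame(
  Contact_number = c(18445, 18445, 18445, 20011, 20011, 31507, 40220, 40220),
  Payment = c(50, 50, 75, 20, 20, 100, 10, 15)
)

# draw contact numbers, not rows
n_contacts <- 2  # hypothetical; pick a value that gives roughly the sample size you want
chosen <- sample(unique(payments$Contact_number), n_contacts)

# keep every payment that belongs to one of the chosen contacts
payments_sample <- payments[payments$Contact_number %in% chosen, ]
payments_sample
```

Because whole contacts are selected, the number of rows in `payments_sample` depends on which contacts were drawn and cannot be fixed in advance.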

phiggins · November 15, 2019, 10:30am

Hi naja,

Would this work to randomly sample unique contacts so that nrow is ~3000?

library(tidyverse)

# make a toy data frame
df <- data.frame(
  contact = c(1800, 1800, 1840, 1840,1840, 1840, 1865, 1865, 1890),
  payment = c(21, 43, 35, 43, 42, 56, 12, 17, 29)
)

# estimate how many contacts are needed to reach ~3000 rows
# (target row count divided by the mean number of rows per contact)
num_contacts <- df %>% 
  count(contact) %>% 
  summarise(num_contacts = 3000 / mean(n)) %>% 
  pull(num_contacts) %>% 
  round()

# identify unique contacts, then randomly sample num_contacts of them
# without replacement, capped at the number of unique contacts available
# (the toy data has only 4 contacts, so all of them are selected here)
unique_contacts <- unique(df$contact)
contacts <- data.frame(
  contact = sample(unique_contacts, size = min(num_contacts, length(unique_contacts)))
)

# then use this vector of contacts in a semi_join to select the rows from the dataframe with these contact numbers.
semi_join(df, contacts)
#> Joining, by = "contact"
#>   contact payment
#> 1    1800      21
#> 2    1800      43
#> 3    1840      35
#> 4    1840      43
#> 5    1840      42
#> 6    1840      56
#> 7    1865      12
#> 8    1865      17
#> 9    1890      29

Created on 2019-11-15 by the reprex package (v0.3.0)

woodward · November 15, 2019, 6:43pm

This chooses random contacts until you have at least 3000 rows and then constructs the sample data frame.

# make some test data
df <- data.frame(
  Contact_number = sample(200:300, 5000, replace = TRUE)
)

library(dplyr)

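# tabulate contact numbers (number of rows per contact)
contacts <- table(df$Contact_number)
# put the contacts in a random order
shuffled <- sample(contacts, length(contacts))
# running total of rows as contacts are added in that order
cumulate <- cumsum(shuffled)
# keep contacts up to the first one that brings the total to at least 3000 rows
get3000 <- cumulate[1:min(which(cumulate >= 3000))]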

# sample rows
dfsample <- df %>% 
  filter(Contact_number %in% names(get3000))
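
A quick sanity check on a sample built this way might look like the sketch below. It repeats the same selection idea in base R only, so that it runs on its own, and then confirms that the sample has at least 3000 rows and that every selected contact kept all of its rows. The seed and the names `counts`, `shuffled_ids` and `keep` are assumptions for illustration.

```r
set.seed(123)  # arbitrary seed, only so the sketch is reproducible

# test data in the same shape as above
df <- data.frame(Contact_number = sample(200:300, 5000, replace = TRUE))

# same selection idea, written with base R only
counts <- table(df$Contact_number)
shuffled_ids <- sample(names(counts))                  # contact numbers in random order
cumulate <- cumsum(as.vector(counts[shuffled_ids]))    # running row total
keep <- shuffled_ids[seq_len(which(cumulate >= 3000)[1])]
dfsample <- df[df$Contact_number %in% as.integer(keep), , drop = FALSE]

# check 1: the sample has at least 3000 rows
nrow(dfsample) >= 3000

# check 2: each sampled contact has as many rows in the sample as in the full data
sample_counts <- table(dfsample$Contact_number)
all(as.vector(sample_counts) == as.vector(counts[names(sample_counts)]))
```

If the full data had fewer than 3000 rows, `which(cumulate >= 3000)` would be empty and the selection step would need a guard; the test data here has 5000 rows, so that case does not arise.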

naja · November 21, 2019, 8:24am

Hi both! These solutions actually work so nicely. Thank you so much!