Downsampling dataset

Hello RStudio community,

I am trying to downsample my dataset. I have a dataset (a data.table) with continuous measurement ratings of some videos; there are two ratings per second in the dataset. I need to add a different set of continuous measurement ratings (of the same videos) so I can compare both sets of ratings in a correlation. However, the second dataset has 1 rating for every two seconds. I want both datasets to have the same number of ratings per video, so I want to downsample the first dataset to also have 1 rating per 2 seconds.
Does anyone know how I would be able to do this? I can't seem to figure it out.

Thank you in advance!

Can you say why you prefer downsampling to upsampling?
Wouldn't you be discarding potentially useful information?

Yes, that is true, but the second dataset doesn't contain any more ratings, so the best way to get them even is to downsample the first one. The data lost from the first set won't be a massive problem for my study.

  1. Would it work to just sample N observations from dataset 1, and N from dataset 2?

  2. I'm not sure if downsampling is the same as undersampling, but here is one way to adjust for imbalance. Put both datasets in a single data frame df, and add a column named z that is 1 for dataset 1 and 2 for dataset 2 (a setup sketch follows the table() calls below). Install and load the ROSE package. Then:

df_under <- ovun.sample(z ~ ., data=df, method="under", N=nrow(df))$data

df_over <- ovun.sample(z ~ ., data=df, method="over", N=2*nrow(df))$data

will give you one data frame of undersampled data and a second data frame of oversampled data. Each should have approximately equal numbers of dataset 1 and dataset 2 items, which you can verify with

table(df_under$z)

table(df_over$z)
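To make the combining step concrete, here is a minimal sketch. The rating vectors ratings1 and ratings2 are only placeholders for illustration; substitute your own rating columns:

install.packages("ROSE")   # one-time install
library(ROSE)

# assumed example vectors; replace with your own ratings
ratings1 <- rnorm(100)   # dataset 1: two ratings per second
ratings2 <- rnorm(25)    # dataset 2: one rating per two seconds

# stack both sets of ratings and label their origin in column z
df <- rbind(
  data.frame(rating = ratings1, z = 1),
  data.frame(rating = ratings2, z = 2)
)

# df can now be passed to the ovun.sample() calls above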

I'm sorry, I am relatively new to the RStudio world and I don't really understand the solution that you are offering. Could you maybe explain it a bit more?

If every row in both datasets is a rating, and they have an equal number of rows, then you could column bind the rating data and use exactly all of it, or am I missing something?
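For illustration, a minimal sketch of that column-bind idea, assuming two hypothetical rating vectors of equal length:

# hypothetical rating vectors of equal length
ratings1 <- rnorm(25)
ratings2 <- rnorm(25)

# column bind the two rating columns and correlate them
combined <- cbind(rating1 = ratings1, rating2 = ratings2)
cor(combined[, "rating1"], combined[, "rating2"])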

Would it work to just sample N observations from dataset 1, and N from dataset 2?
If the first dataset is called x and the second is called y, then

# assuming x and y are vectors of ratings
sample(x, 100, replace=FALSE)  # 100 random ratings from x, without replacement
sample(y, 100, replace=FALSE)  # 100 random ratings from y, without replacement

There is not an equal number of rows though, because the first dataset has more ratings than the second one (the first has two ratings per second and the second has one rating per two seconds).

I'm afraid just sampling them would be too random. The ratings are connected to 16 different videos and are made continuously while the participants are watching the videos, so the ratings are bound to a certain time aspect.

Ok. I only asked because you had said the second set doesn't have any more ratings (than the first?). Sorry for the misunderstanding.

Hi Mirthe,

In the first data set there are 2 samples per second, and in the second data set there is 1 sample every 2 seconds. Therefore, sampling in the first data set is 4x faster than in the second.

One approach is to take every 4th row of the first data set, which would then give you 1 observation every 2 seconds, the same as the second data set.

library("tibble")

set.seed(123)

# Simulate the first data set: 2 ratings per second for 50 seconds
df1 <- tibble(second = rep(1:50, each = 2),
  sample_id = 1:100, x = rnorm(100))

head(df1)
#> # A tibble: 6 x 3
#>   second sample_id       x
#>    <int>     <int>   <dbl>
#> 1      1         1 -0.560 
#> 2      1         2 -0.230 
#> 3      2         3  1.56  
#> 4      2         4  0.0705
#> 5      3         5  0.129 
#> 6      3         6  1.72

# Simulate the second data set for comparison: 1 rating every 2 seconds
df2 <- tibble(second = seq(1, 50, by = 2),
  sample_id = 1:25, x = rnorm(25))

# keep every 4th row (rows 1, 5, 9, ...): 1 rating per 2 seconds
df1_sub <- df1[seq(1, nrow(df1), 4), ]
head(df1_sub)
#> # A tibble: 6 x 3
#>   second sample_id      x
#>    <int>     <int>  <dbl>
#> 1      1         1 -0.560
#> 2      3         5  0.129
#> 3      5         9 -0.687
#> 4      7        13  0.401
#> 5      9        17  0.498
#> 6     11        21 -1.07

head(df2)
#> # A tibble: 6 x 3
#>   second sample_id       x
#>    <dbl>     <int>   <dbl>
#> 1      1         1 -0.710 
#> 2      3         2  0.257 
#> 3      5         3 -0.247 
#> 4      7         4 -0.348 
#> 5      9         5 -0.952 
#> 6     11         6 -0.0450

cor(df1_sub[["x"]], df2[["x"]])
#> [1] 0.3125283

Created on 2020-07-31 by the reprex package (v0.3.0)
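Since the real ratings are tied to 16 different videos, the same every-4th-row idea could be applied within each video. A hedged sketch with dplyr, using toy data and assuming a hypothetical video_id column with rows already ordered by time within each video:

library(dplyr)
library(tibble)

# toy data: 2 videos with 8 ratings each (video_id is an assumed column name)
ratings <- tibble(
  video_id = rep(1:2, each = 8),
  rating   = rnorm(16)
)

ratings_sub <- ratings %>%
  group_by(video_id) %>%            # handle each video separately
  slice(seq(1, n(), by = 4)) %>%    # keep every 4th rating within the video
  ungroup()

ratings_sub   # rows 1 and 5 of each video remain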

Perhaps another alternative is to simply repeat the slower data stream to 'stretch it out'. Conceptually this would be equivalent to measuring the correlation between the 'last seen rating' of both streams at every point in shared time (a short R sketch of this follows the tables below):

fast stream   slow stream
1             2
2
4
2
4             3
3
2
2
3             1

to

fast stream   slow stream
1             2
2             2
4             2
2             2
4             3
3             3
2             3
2             3
3             1
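Here is a minimal R sketch of that stretching idea, using the toy streams above and assuming the fast stream always has 4 ratings for every slow rating:

# toy streams from the tables above
fast <- c(1, 2, 4, 2, 4, 3, 2, 2, 3)
slow <- c(2, 3, 1)

# repeat each slow rating 4 times ("last seen rating"), trimmed to the fast length
slow_stretched <- rep(slow, each = 4)[seq_along(fast)]
slow_stretched
#> [1] 2 2 2 2 3 3 3 3 1

cor(fast, slow_stretched)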
