Create a subset of a panel data set

I created a panel dataset. The final goal is to run a panel regression on a subset of the data, but creating this subset is the issue.

Data example:

ID Time Variable ManyOtherVariables
1 1 123
1 2 1001
1 3 90
2 1 1111
2 2 222
2 3 2222

etc.

The subset I want is: all observations of every ID for which Variable > 1000 at Time = 2 (here that would be rows 1, 2, and 3).

I ran:

reg <- plm(y~x, data=subset(df, ID[Variable>1000]), model="within")

and variations such as:

reg <- plm(y~x, data=subset(df, Variable>1000 & Time==2), model="within")

However, this way I lose the observations for the selected IDs in time periods other than Time = 2.

I would have loved to send a reproducible example, but the issue is that I am working with data on a secure computer without access to the internet (except RStudio itself).

I hope my question is clear. If more detail is needed, please let me know.

Hi, @MLent! Thanks for including some of your data. There are a couple of things you can do to make it easier for folks here to help with your question. The first is formatting your code as code so it's easier to read and to copy and paste into an R console. Basically, you just enclose your code between two lines of three backticks, like this:

``` r
reg <- plm(y~x, data=subset(df, ID[Variable>1000]), model="within")
```

Also, to make it easier for folks here to read and work with, it's better to create an R object with your sample data and post it here. This post has some good tips for how to include sample data:

So, with your example, I would do something like the following:

``` r
# create sample data
my_data <- tibble::tribble(
  ~ID, ~Time, ~Variable,
  1, 1, 123,
  1, 2, 1001,
  1, 3, 90,
  2, 1, 1111,
  2, 2, 222,
  2, 3, 2222,
  3, 1, 200,
  3, 2, 2000,
  3, 3, 4000
)
```

(I added more fake data to make the example a bit more clear.)

To manipulate data, I like to use the dplyr package, which is part of the tidyverse. It can sometimes be a little more verbose than other ways of coding in R, but I think it makes the code easier to understand!

So here is how I would create a subset of the data you describe. First I find which IDs meet the conditions you define, and then I use those IDs to subset the full dataset.

``` r
library(dplyr)

# create vector of IDs meeting condition
my_ids <- my_data %>%
  filter(Time == 2 & Variable > 1000) %>%
  pull(ID)
my_ids
#> [1] 1 3

# subset data using that vector
my_subset <- my_data %>%
  filter(ID %in% my_ids)
my_subset
#> # A tibble: 6 x 3
#>      ID  Time Variable
#>   <dbl> <dbl>    <dbl>
#> 1     1     1      123
#> 2     1     2     1001
#> 3     1     3       90
#> 4     3     1      200
#> 5     3     2     2000
#> 6     3     3     4000
```

Created on 2018-11-15 by the reprex package (v0.2.1)
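
In case dplyr isn't installed on the secure machine, the same two-step idea (find the qualifying IDs, then keep all of their rows) can be sketched in base R. This is only a sketch: the sample data is rebuilt here as a plain data frame, and `keep_ids` is a name I made up for illustration.

``` r
# sample data from above, rebuilt as a base R data frame
my_data <- data.frame(
  ID       = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Time     = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  Variable = c(123, 1001, 90, 1111, 222, 2222, 200, 2000, 4000)
)

# IDs whose Variable exceeds 1000 at Time == 2
# (which() drops rows where the condition is NA)
keep_ids <- unique(my_data$ID[which(my_data$Time == 2 & my_data$Variable > 1000)])

# keep every row, in all time periods, that belongs to one of those IDs
my_subset <- my_data[my_data$ID %in% keep_ids, ]
```

Either version of `my_subset` should then be usable as the `data` argument in the `plm()` call from the question, in place of the `subset(...)` expression.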


Thank you very much for your help, this fully solved the problem!
