Tidying data: a beginner's question

Hi there,
I'm a beginner working with a large dataset that I want to tidy and filter. It contains 150 variables and 1000 observations. My first question: one of my variables is named "dia1" and contains codes such as DKZ455. First, I want to keep only the observations where "dia1" starts with "DK".
Second, I want to keep only the observations where "dia1" equals one of about 100 specific values on a list (DKZ455, DKJ044, etc.). Observations whose code is not on the list should be dropped.

Anybody eager to help? :))

You can use the dplyr package (together with stringr for the pattern matching) for these tasks. See this example with the built-in iris dataset:

library(dplyr)
library(stringr)

iris %>% 
    filter(str_detect(Species, "^se")) %>% # Keep only rows where Species starts with "se"
    head(5) # Show only the first 5 rows for brevity
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa

filter_list <- c("setosa", "virginica")

iris %>% 
    filter(Species %in% filter_list) %>%  # Keep only rows whose Species is on the list
    head(5) # Show only the first 5 rows for brevity
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa

Created on 2019-11-25 by the reprex package (v0.3.0.9000)
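
Applied to your case, the same two steps might look roughly like the sketch below. Note that the data frame `df`, its `id` column, and the short `code_list` are made-up stand-ins, since I don't have your data; only the column name `dia1` and the example codes come from your post.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the real dataset (the real one has 150 variables)
df <- data.frame(
  id   = 1:5,
  dia1 = c("DKZ455", "DKJ044", "DXA100", "DKA001", NA),
  stringsAsFactors = FALSE
)

# 1) Keep only rows where dia1 starts with "DK"
#    ("^" anchors the pattern to the start of the string;
#    rows where dia1 is NA are dropped by filter())
dk_only <- df %>%
  filter(str_detect(dia1, "^DK"))

# 2) Keep only rows where dia1 is one of the codes on a list
#    (hypothetical short list; the real one would hold about 100 codes)
code_list <- c("DKZ455", "DKJ044")

on_list <- df %>%
  filter(dia1 %in% code_list)
```

If your list of codes lives in a file or another data frame, you would first need to get it into a character vector like `code_list` before using `%in%`.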

If you want to learn more about this, I recommend reading this free online book

