Using pivot_longer to transform data from wide to long

I have an example dataset "dfsales" in a wide format and now I am trying to change it into a long format. I went through a lot of examples with single set of values (e.g just having month1:month7). But I need to work with a lot of column set (e.g month1:month7, goods1:goods7, built1-built7, condition1-condition7 etc). Here in this example I have just added 2 set (month1:month7 and goods1:goods7).

dfsales <- data.frame(
  subjectid = c("a","b","c","d","e","f","g","h","i","j","k"),
  location = c("NY","NC","WA","WA","OR","CA","AR","KS","AZ","VT","MA"),
  month1 = c(NA,NA,1,0,0,2,1,1,0,0,0),
  month2 = c(NA,NA,0,0,0,0,NA,0,0,0,NA),
  month3 = c(NA,1,0,1,0,0,0,1,NA,NA,NA),
  month4 = c(0,0,0,0,0,1,2,0,1,NA,0),
  month5 = c(NA,NA,NA,NA,NA,NA,0,1,1,2,0),
  month6 = c(NA,NA,0,0,0,NA,NA,0,0,0,0),
  month7 = c(0,0,0,0,0,0,NA,0,0,0,0),
  goods1 = c(1,2,1,2,0,0,1,2,2,1,0),
  goods2 = c(0,0,1,2,1,1,2,2,1,0,0),
  goods3 = c(0,1,2,1,1,NA,2,1,2,1,NA),
  goods4 = c(0,1,2,1,1,1,2,2,NA,NA,NA),
  goods5 = c(0,1,0,1,1,1,2,2,1,NA,NA),
  goods6 = c(0,1,2,1,1,1,2,2,0,0,0),
  goods7 = c(NA,1,1,1,1,1,2,2,2,NA,NA))
Created on 2021-12-03 by the reprex package (v2.0.1)

I am trying to change it into a long format. I did a bit of research and pivot_longer seemed to be the best option. I have the subjectid and location as fixed variable and month1:month7 and goods1:goods6 as the time varying variable. This is how I want my final data (artificially created a part of it) to look like with every subject having a row for each of the 7 month - subject/month. Since all the column set have the same range column1-column7, I am guessing just having one column as "month" as a time indicator would be enough.
Screen Shot 2021-12-03 at 8.55.56 PM

This is how I tried :

dfsales %>% 
  pivot_longer(
    cols = starts_with("month","goods"), 
    names_to = "month","kind_of_goods",
    names_prefix = "month","goods",
    names_transform = list(month,kind_of_goods = as.integer),
    values_to = "month_code","goods_code",
    values_drop_na = TRUE,
  )

Could anyone help me point out and explain my mistake or show if there is an easier/better way to do this? Thank you

It would be helpful if you could show an example of what your ideal output should be.

Good reprex. Like @rdr , I'm not exactly sure of your intent, but can get you over the first hump

library(tidyr)
dfsales <- data.frame(
  subjectid = c("a","b","c","d","e","f","g","h","i","j","k"),
  location = c("NY","NC","WA","WA","OR","CA","AR","KS","AZ","VT","MA"),
  month1 = c(NA,NA,1,0,0,2,1,1,0,0,0),
  month2 = c(NA,NA,0,0,0,0,NA,0,0,0,NA),
  month3 = c(NA,1,0,1,0,0,0,1,NA,NA,NA),
  month4 = c(0,0,0,0,0,1,2,0,1,NA,0),
  month5 = c(NA,NA,NA,NA,NA,NA,0,1,1,2,0),
  month6 = c(NA,NA,0,0,0,NA,NA,0,0,0,0),
  month7 = c(0,0,0,0,0,0,NA,0,0,0,0),
  goods1 = c(1,2,1,2,0,0,1,2,2,1,0),
  goods2 = c(0,0,1,2,1,1,2,2,1,0,0),
  goods3 = c(0,1,2,1,1,NA,2,1,2,1,NA),
  goods4 = c(0,1,2,1,1,1,2,2,NA,NA,NA),
  goods5 = c(0,1,0,1,1,1,2,2,1,NA,NA),
  goods6 = c(0,1,2,1,1,1,2,2,0,0,0),
  goods7 = c(NA,1,1,1,1,1,2,2,2,NA,NA))

# OP
dfsales %>% 
  pivot_longer(
    cols = starts_with("month","goods"), 
    names_to = "month","kind_of_goods",
    names_prefix = "month","goods",
    names_transform = list(month,kind_of_goods = as.integer),
    values_to = "month_code","goods_code",
    values_drop_na = TRUE,
  )
#> Error: argument is not interpretable as logical

# FIX for the first error
dfsales %>% 
  pivot_longer(!(c(subjectid,location)),       
    #names_to = c("month","kind_of_goods"),
    #names_sep  = "month*|goods*",
    #names_prefix = "month","goods",
    #names_transform = list(month,kind_of_goods = as.integer),
    #values_to = "month_code","goods_code",
    values_drop_na = TRUE
)
#> # A tibble: 122 × 4
#>    subjectid location name   value
#>    <chr>     <chr>    <chr>  <dbl>
#>  1 a         NY       month4     0
#>  2 a         NY       month7     0
#>  3 a         NY       goods1     1
#>  4 a         NY       goods2     0
#>  5 a         NY       goods3     0
#>  6 a         NY       goods4     0
#>  7 a         NY       goods5     0
#>  8 a         NY       goods6     0
#>  9 b         NC       month3     1
#> 10 b         NC       month4     0
#> # … with 112 more rows

Thank you. I just created something and edited my OP

Thank you @technocrat . The way you showed worked. However, what I was aiming is bit different. I have edited the OP for clarity.

In your code, if I also wanted to drop the "month" and "goods" part from the "name" column and just have the number, how would I do it? e.g. month4 -> 4, month7 -> 7, goods1 ->1

Is this two step solution acceptable?

library(tidyr)
dfsales <- data.frame(
  subjectid = c("a","b","c","d","e","f","g","h","i","j","k"),
  location = c("NY","NC","WA","WA","OR","CA","AR","KS","AZ","VT","MA"),
  month1 = c(NA,NA,1,0,0,2,1,1,0,0,0),
  month2 = c(NA,NA,0,0,0,0,NA,0,0,0,NA),
  month3 = c(NA,1,0,1,0,0,0,1,NA,NA,NA),
  month4 = c(0,0,0,0,0,1,2,0,1,NA,0),
  month5 = c(NA,NA,NA,NA,NA,NA,0,1,1,2,0),
  month6 = c(NA,NA,0,0,0,NA,NA,0,0,0,0),
  month7 = c(0,0,0,0,0,0,NA,0,0,0,0),
  goods1 = c(1,2,1,2,0,0,1,2,2,1,0),
  goods2 = c(0,0,1,2,1,1,2,2,1,0,0),
  goods3 = c(0,1,2,1,1,NA,2,1,2,1,NA),
  goods4 = c(0,1,2,1,1,1,2,2,NA,NA,NA),
  goods5 = c(0,1,0,1,1,1,2,2,1,NA,NA),
  goods6 = c(0,1,2,1,1,1,2,2,0,0,0),
  goods7 = c(NA,1,1,1,1,1,2,2,2,NA,NA))
tmp <- pivot_longer(data = dfsales,cols = month1:goods7,
             names_pattern = "([^\\d]+)(\\d)",
            names_to = c("type","month"))
pivot_wider(data = tmp,names_from = "type",values_from = "value",
            names_prefix = "value_")
#> # A tibble: 77 x 5
#>    subjectid location month value_month value_goods
#>    <chr>     <chr>    <chr>       <dbl>       <dbl>
#>  1 a         NY       1              NA           1
#>  2 a         NY       2              NA           0
#>  3 a         NY       3              NA           0
#>  4 a         NY       4               0           0
#>  5 a         NY       5              NA           0
#>  6 a         NY       6              NA           0
#>  7 a         NY       7               0          NA
#>  8 b         NC       1              NA           2
#>  9 b         NC       2              NA           0
#> 10 b         NC       3               1           1
#> # ... with 67 more rows

Created on 2021-12-03 by the reprex package (v2.0.1)

This works in one step, though the column names are still not what you want.

pivot_longer(data = dfsales,cols = month1:goods7,
                    names_pattern = "([^\\d]+)(\\d)",
                    names_to = c(".value","month_number"))

# A tibble: 77 x 5
   subjectid location month_number month goods
   <chr>     <chr>    <chr>        <dbl> <dbl>
 1 a         NY       1               NA     1
 2 a         NY       2               NA     0
 3 a         NY       3               NA     0
 4 a         NY       4                0     0
 5 a         NY       5               NA     0
 6 a         NY       6               NA     0
 7 a         NY       7                0    NA
 8 b         NC       1               NA     2
 9 b         NC       2               NA     0
10 b         NC       3                1     1
# ... with 67 more rows
1 Like

If that works for @cactus , I think it should go on to widen on subjectid to make the end result “tidy”.

Thank you. I was able to change it into long but I don't think the month_number thing worked well. In my original data I have a columnset ranging from 1-250 (e.g. goods1:goods250). However, the max month_number I got was "9" along with "NA". Could you please explain the error and also what this part "([^\d]+)(\d)" is saying ?

I used the last code you wrote just modifying the column names and number of available months I have.
Could the month_value just come out of the number part of the whatever column I have? e.g. I don't have month1-month250 in my dataset but I have goods1:goods250. So could these values be just picked from that? (1-250)

pivot_longer(data = dfsales,cols = condition1:goods250,
                    names_pattern = "([^\\d]+)(\\d)",
                    names_to = c(".value","month_number"))

Once again thank you for the guide !

The names_pattern argument is a regular expression, a kind of advanced wildcard notation, that describes the structure of the column names. I will give a brief description of this particular expression.
The column names have two parts, a set of letters followed by a single digit. (I know your actual data can have more than one digit but we'll stick with the original version for the moment.) The names_pattern = "([^\\d]+)(\\d)" represents those two parts. The parentheses indicate that there are two parts: ([^\\d]+) and (\\d). I will work from the inside out to interpret the first part, ([^\\d]+).
The \\d represents a numeric digit.
[^\\d] means not a numeric digit. That is, [^ ] wrapped around some text means not the text, so [^\\d] means not a numeric digit.
The plus sign means one or more of the preceding, so [^\\d]+ means one or more characters that are not numeric digits. That will capture either of the texts month or goods.
The second part of the regular expression, (\\d) is simple. It means a numeric digit.
The expression as a whole will match some non-digits followed by a digit.

With the new information, the regular expression has to capture some non-digits followed by one or more digits. That only requires adding a + to the second part of the expression.
([^\\d]+)(\\d+)
I adjusted the original data to have larger numbers following the words and the numbers in month and goods do not match in all cases.

library(tidyr)
dfsales <- data.frame(
  subjectid = c("a","b","c","d","e","f","g","h","i","j","k"),
  location = c("NY","NC","WA","WA","OR","CA","AR","KS","AZ","VT","MA"),
  month1 = c(NA,NA,1,0,0,2,1,1,0,0,0),
  month2 = c(NA,NA,0,0,0,0,NA,0,0,0,NA),
  month3 = c(NA,1,0,1,0,0,0,1,NA,NA,NA),
  month4 = c(0,0,0,0,0,1,2,0,1,NA,0),
  month5 = c(NA,NA,NA,NA,NA,NA,0,1,1,2,0),
  month6 = c(NA,NA,0,0,0,NA,NA,0,0,0,0),
  month17 = c(0,0,0,0,0,0,NA,0,0,0,0),
  goods1 = c(1,2,1,2,0,0,1,2,2,1,0),
  goods2 = c(0,0,1,2,1,1,2,2,1,0,0),
  goods3 = c(0,1,2,1,1,NA,2,1,2,1,NA),
  goods4 = c(0,1,2,1,1,1,2,2,NA,NA,NA),
  goods25 = c(0,1,0,1,1,1,2,2,1,NA,NA),
  goods106 = c(0,1,2,1,1,1,2,2,0,0,0),
  goods1077 = c(NA,1,1,1,1,1,2,2,2,NA,NA))

pivot_longer(data = dfsales,cols = month1:goods1077,
             names_pattern = "([^\\d]+)(\\d+)",
             names_to = c(".value","month_number"))
#> # A tibble: 110 x 5
#>    subjectid location month_number month goods
#>    <chr>     <chr>    <chr>        <dbl> <dbl>
#>  1 a         NY       1               NA     1
#>  2 a         NY       2               NA     0
#>  3 a         NY       3               NA     0
#>  4 a         NY       4                0     0
#>  5 a         NY       5               NA    NA
#>  6 a         NY       6               NA    NA
#>  7 a         NY       17               0    NA
#>  8 a         NY       25              NA     0
#>  9 a         NY       106             NA     0
#> 10 a         NY       1077            NA    NA
#> # ... with 100 more rows

Created on 2021-12-10 by the reprex package (v2.0.1)

2 Likes

@FJCC - Wow this is such a great explanation ! Thanks a lot. I understood all the details and I think I can replicate this in other situations too :slight_smile:

It worked as I had expected. However, I noticed that I have an extra row after the end of each month_number as "NA". i.e. After rows for subjectid "a" -> month1:month250, I have an extra row with"NA" before rows for subjectid "b" starts.
I can easily take that out but wanted to check if that is something to do with the code or how my data is structured.

I don't see the spurious NA line between subjectid a and b.

Result <- pivot_longer(data = dfsales,cols = month1:goods1077,
              names_pattern = "([^\\d]+)(\\d+)",
              names_to = c(".value","month_number"))
Result[9:12,]
# A tibble: 4 x 5
  subjectid location month_number month goods
  <chr>     <chr>    <chr>        <dbl> <dbl>
1 a         NY       106             NA     0
2 a         NY       1077            NA    NA
3 b         NC       1               NA     2
4 b         NC       2               NA     0

The NA for goods1077 is correct and the NA in the month column is there because there is no month1077.

So it must be how my data is structured. I can take a look at it.

Thank you for the codes and explanation again ! Cheers

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.