Using gather for converting wide to long data

Leon_bakker · February 26, 2020, 12:34pm

Hi,

I'm trying to convert to my wide dataset into a long form dataset.
First I had the data set below:

                 Date Temp        X1        X2        X3        X4        X5
1 2018-01-01 00:00:00  8.0 0.6472976 0.5178381 0.5609912 0.7336039 0.8199103
2 2018-01-01 01:00:00  7.9 0.6806634 0.5445307 0.5899083 0.7714185 0.8621737
3 2018-01-01 02:00:00  7.8 0.7140293 0.5712234 0.6188254 0.8092332 0.9044371
4 2018-01-01 03:00:00  7.7 0.8208000 0.6566400 0.7113600 0.9302400 1.0396800
5 2018-01-01 04:00:00  7.3 1.1110829 0.8888664 0.9629386 1.2592273 1.4073717

With the following code I was able to achieve the right format:

data2 = gather(data, house, consumption, -Date, -Temp)

                 Date Temp house consumption
1 2018-01-01 00:00:00  8.0    X1   0.6472976
2 2018-01-01 01:00:00  7.9    X1   0.6806634
3 2018-01-01 02:00:00  7.8    X1   0.7140293
4 2018-01-01 03:00:00  7.7    X1   0.8208000
5 2018-01-01 04:00:00  7.3    X1   1.1110829

However, currently I not only have consumption as a variable but two other variables as well (Elabel and Hsize). These two variables are static for each house (every house has a certain Elabel and Hsize). My dataset now looks like this:


                 Date Temp        X1 Elabel1 Hsize1        X2 Elabel2 Hsize2        X3 Elabel3 Hsize3        X4 Elabel4 Hsize4        X5 Elabel5 Hsize5        X6
1 2018-01-01 00:00:00  8.0 0.8093377       A    100 0.9413433       C     90 0.8801952       E     75 0.8984059       A     60 0.7658825       B    120 0.7069353
2 2018-01-01 01:00:00  7.9 0.9300447       A    100 1.1047173       C     90 0.8668304       E     75 0.7173619       A     60 0.8714931       B    120 0.7565921
3 2018-01-01 02:00:00  7.8 0.8468523       A    100 0.9813693       C     90 0.9104481       E     75 0.9178342       A     60 0.8925958       B    120 0.6474878
4 2018-01-01 03:00:00  7.7 1.0558602       A    100 1.3127769       C     90 1.1186276       E     75 1.0786968       A     60 0.9416125       B    120 0.8699044
5 2018-01-01 04:00:00  7.3 1.2765728       A    100 1.6924849       C     90 1.4120881       E     75 1.2462317       A     60 1.1781826       B    120 1.1912390

I'm trying to convert this to long format. Below is an example of how I would like it to look, but I can't seem to achieve this. Would any of you be able to help me?

                 Date Temp house  Elabel Hsize consumption
1 2018-01-01 00:00:00  8.0    X1    A    100   0.6472976
2 2018-01-01 01:00:00  7.9    X1    A    100   0.6806634
3 2018-01-01 02:00:00  7.8    X1    A    100   0.7140293
4 2018-01-01 03:00:00  7.7    X1    A    100   0.8208000
5 2018-01-01 04:00:00  7.3    X1    A    100   1.1110829

kuriwaki · February 26, 2020, 12:48pm

Can you post a reprex for the dataset that actually includes Elabel and Hsize?

You can probably use tidyr::pivot_longer (in tidyr v1.0.0 or above) and do something like

data_sel <- select(data, Date, Temp, Elabel, Hsize, matches("X[1-9]"))
pivot_longer(data_sel, 
             -c(Date, Temp, Elabel, Hsize),
             names_to = "house", 
             values_to = "consumption")

dromano · February 26, 2020, 12:50pm

Welcome @Leon_bakker!

Could you also post the code you used to produce your table? That way folks can follow along with what you're trying to do. And are you familiar with the reprex() function?

Leon_bakker · February 26, 2020, 1:04pm

Thanks!
I just looked up what that is, but what would I need to reprex to get what you need?

I've simply imported the data from an excel file (See links below. I can't upload xlsx files).

The file without the two variables:
https://lvbakker.stackstorage.com/s/06Eedavmm1BazdX

The file with the variables:
https://lvbakker.stackstorage.com/s/lDavhlO0H1fNKAQ

data <- read_excel("neighbourhoodtest.xlsx",sheet="Sheet1")
data <- data.frame(data)

dromano · February 26, 2020, 1:27pm

First you'll need to install the reprex package, then, to see it in action, copy (but don't paste) the following code:

# include any packages you need
library(tidyverse)

my_tibble <- 
  tibble(a = 1:3, b = letters[1:3])

my_tibble

dput(my_tibble)

Immediately afterwards, run the command reprex(), and note the output rendered in the RStudio 'Viewer' tab: all the commands are shown along with their output.

Now, copy and also paste the rendered output to an R file in your 'Source' pane, and add the following command (which you can obtain by modifying the commented output from the dput() command):

my_dup <- structure(list(a = 1:3, b = c("a", "b", "c")), row.names = c(NA, 
-3L), class = c("tbl_df", "tbl", "data.frame"))

my_dup

Finally, copy -- but don't paste -- all of the code you placed in the R file, and run reprex() again, then come back here, reply, and paste to see what happens.

dromano · February 26, 2020, 2:14pm

Oh, and P.S., @Leon_bakker: If you follow reprex() example, you can use dput() with your dataset, which is a preferable (and secure) way to share your data with others. Then we can help with your original question more easily.

Leon_bakker · February 26, 2020, 2:30pm

When I use reprex like you suggested it just provides me with the whole table as output, which comes down to 80000 lines :D. However, using pivot_longer, as @kuriwaki suggested, I have been able to get the desired result.
I've added a separator to all value columns and used the code below.

Here, I gather all value columns --> separate them into seperate measures and a houseid and then use pivot_wider to get the desired columns. I then used arrange() to sort on houseid and Date.

data2 <- data %>%
  pivot_longer(
    Consumption_1:Hsize_10,
    names_to = c("measure", "houseid"),
    names_sep = "_"
  )%>%
  pivot_wider(
    names_from = measure,
    values_from = value
  )%>% 
  arrange(
    houseid, Date
  )

dromano · February 26, 2020, 4:31pm

That can happen! That's why it's a good idea to pare the table down before using dput(), so that data looks a lot like what you showed in your original post, which you could do with:

data_small <- data %>% slice(1:5)

Then you can run dput(data_small) and post the result so people can reproduce your sample data and work with it themselves.

system · March 4, 2020, 4:31pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.