Why is pivot_longer duplicating my rows?

_shy · December 3, 2022, 8:27pm

Why is pivot_longer is creating duplicate rows in my data frame?

I am using pivot_longer to clean up my data frame and it created three duplicate rows for each plant species.

Here is a sample of the data frame:

latin_name           plant_type treatment       `ounces per plot` weight            `grams per plot`
   <chr>                <chr>      <chr>                       <dbl> <chr>                        <dbl>
 1 Agastache foeniculum wildflower tx._one_rate               0.021  wt_per_plot_one              0.609
 2 Agastache foeniculum wildflower tx._one_rate               0.021  wt_per_plot_two              1.22 
 3 Agastache foeniculum wildflower tx._one_rate               0.021  wt_per_plot_three            1.83 
 4 Agastache foeniculum wildflower tx._one_rate               0.021  wt_per_plot_four             2.44 
 5 Agastache foeniculum wildflower tx._two._rate              0.0430 wt_per_plot_one              0.609
 6 Agastache foeniculum wildflower tx._two._rate              0.0430 wt_per_plot_two              1.22 
 7 Agastache foeniculum wildflower tx._two._rate              0.0430 wt_per_plot_three            1.83 
 8 Agastache foeniculum wildflower tx._two._rate              0.0430 wt_per_plot_four             2.44 
 9 Agastache foeniculum wildflower tx._three._rate            0.0645 wt_per_plot_one              0.609
10 Agastache foeniculum wildflower tx._three._rate            0.0645 wt_per_plot_two              1.22

And here is the code chunk I used:

library(tidyverse)
seed_rate1 <- seed_rate1%>%
  pivot_longer(
    cols = starts_with("tx"),
    names_to = "treatment",
    values_to = "ounces per plot")%>%
  pivot_longer(
    cols = starts_with("wt"),
    names_to = "weight",
    values_to = "grams per plot")

And here is the first 12 rows and all columns using dput():

seed_struct <-
structure(structure(
list(
latin_name = c(
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum"
),
plant_type = c(
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower"
),
treatment = c(
"tx._one_rate",
"tx._one_rate",
"tx._one_rate",
"tx._one_rate",
"tx._two._rate",
"tx._two._rate",
"tx._two._rate",
"tx._two._rate",
"tx._three._rate",
"tx._three._rate",
"tx._three._rate",
"tx._three._rate"
),
ounces per plot = c(
0.021,
0.021,
0.021,
0.021,
0.042975207,
0.042975207,
0.042975207,
0.042975207,
0.06446281,
0.06446281,
0.06446281,
0.06446281
),
weight = c(
"wt_per_plot_one",
"wt_per_plot_two",
"wt_per_plot_three",
"wt_per_plot_four",
"wt_per_plot_one",
"wt_per_plot_two",
"wt_per_plot_three",
"wt_per_plot_four",
"wt_per_plot_one",
"wt_per_plot_two",
"wt_per_plot_three",
"wt_per_plot_four"
),
grams per plot = c(
0.609,
1.218,
1.827,
2.437,
0.609,
1.218,
1.827,
2.437,
0.609,
1.218,
1.827,
2.437
)
),
row.names = c(NA,-12L),
class = c("tbl_df", "tbl", "data.frame")
))

Unable to figure out how to fix this i tried to use distinct() with no luck either...I am new to R so I suspect there is something in the pivot_longer that I am not doing correctly. If anyone has any input I would greatly appreciate. I am new to R so very specific/ easy to understand responses would be much appreciated!

EconomiCurtis · December 3, 2022, 8:28pm

Hello.
Thanks for providing code , but you could take further steps to make it more convenient for other forum users to help you. Share some representative data that will enable your code to run and show the problematic behaviour. You might use tools such as the library datapasta, or the base function dput() to share a portion of data in code form, i.e. that can be copied from forum and pasted to R session. Reprex Guide

If I were to guess what's going on; I don't see the three extra rows per plant species. Note there are multiple distinct values of weight/grams per plot for each plant + treatment /ounces per plot combination.

EconProf · December 3, 2022, 9:11pm

I am confused. I believe that cols = starts_with() is to select columns based on their names, not by the content of the columns. There are no column names that start with "tx". Am I missing something?

_shy · December 3, 2022, 10:16pm

@EconomiCurtis I edited the post with the df using dput() towards the bottom. Does that help visualize? Sorry very new to R and still learning how all of this works!

_shy · December 3, 2022, 10:17pm

@EconProf I just used dput() for the first time and added towards the bottom of my post. Hopefully that makes it more clear!

EconProf · December 4, 2022, 1:00am

When I use your data and code, this error message appears:

#> Error in build_longer_spec():
#> ! cols must select at least one column.

The problem is that cols = starts_with("tx") will select all columns with names that start with "tx". It is not based on the values in those columns. None of your six columns has a name starting with "tx".

#> Backtrace:
#> ▆
#> 1. ├─seed_rate %>% ...
#> 2. ├─tidyr::pivot_longer(...)
#> 3. └─tidyr:::pivot_longer.data.frame(...)
#> 4. └─tidyr::build_longer_spec(...)
#> 5. └─cli::cli_abort("{.arg cols} must select at least one column.")
#> 6. └─rlang::abort(...)


<sup>Created on 2022-12-03 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

It is also not clear what your intended outcome is.

_shy · December 4, 2022, 1:59am

@EconProf My df started with many more columns: latin_name, plant_type, tx._one_rate, tx._two._rate, tx._three._rate, tx._four._rate, wt_per_plot_one, wt_per_plot_two, wt_per_plot_three, wt_per_plot_four

and then i used this code below to make the df longer instead of wider and instead of just giving me a column (for example the treatment column that was created) that showed the 4 different tx rates for each latin_name it did each tx rate 4 times for each latin_name instead of one tx level for each. there should be 4 rows for each latin_name in the treatment. Does that make sense? the df i provided with dput() was the df after using the below code chunk

seed_rate1 <- seed_rate1%>%
pivot_longer(
cols = starts_with("tx"),
names_to = "treatment",
values_to = "ounces per plot")%>%
pivot_longer(
cols = starts_with("wt"),
names_to = "weight",
values_to = "grams per plot")

EconProf · December 4, 2022, 2:07am

Thanks for the clarification. I should have figured that out!

_shy · December 4, 2022, 2:20am

@EconProf here is the first 12 rows of the full df before trying to use pivot_longer:

seed_struct <-
structure(
list(
latin_name = c(
"Agastache foeniculum",
"Asclepias syriaca",
"Asclepias tuberosa",
"Coreopsis lanceolata",
"Echinacea pallida",
"Echinacea purpurea",
"Eutrochium purpureum",
"Heliopsis helianthoides",
"Monarda fistulosa",
"Oligoneuron rigidum",
"Penstemon digitalis",
"Ratibida pinnata"
),
plant_type = c(
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower"
),
tx._one_rate = c(
0.021,
0.057,
0.057,
0.057,
0.057,
0.057,
0.021,
0.072,
0.014,
0.021,
0.007,
0.029
),
tx._two._rate = c(
0.042975207,
0.114600551,
0.114600551,
0.114600551,
0.114600551,
0.114600551,
0.042975207,
0.143250689,
0.028650138,
0.042975207,
0.014325069,
0.057300275
),
tx._three._rate = c(
0.06446281,
0.171900826,
0.171900826,
0.171900826,
0.171900826,
0.171900826,
0.06446281,
0.214876033,
0.042975207,
0.06446281,
0.021487603,
0.085950413
),
tx.four.rate = c(
0.085950413,
0.229201102,
0.229201102,
0.229201102,
0.229201102,
0.229201102,
0.085950413,
0.286501377,
0.057300275,
0.085950413,
0.028650138,
0.114600551
),
wt_per_plot_one = c(
0.609,
1.624,
1.624,
1.624,
1.624,
1.624,
0.609,
2.031,
0.406,
0.609,
0.203,
0.812
),
wt_per_plot_two = c(
1.218,
3.249,
3.249,
3.249,
3.249,
3.249,
1.218,
4.061,
0.812,
1.218,
0.406,
1.624
),
wt_per_plot_three = c(
1.827,
4.873,
4.873,
4.873,
4.873,
4.873,
1.827,
6.092,
1.218,
1.827,
0.609,
2.437
),
wt_per_plot_four = c(
2.437,
6.498,
6.498,
6.498,
6.498,
6.498,
2.437,
8.122,
1.624,
2.437,
0.812,
3.249
)
),
row.names = c(NA, 12L),
class = "data.frame"
)

EconProf · December 4, 2022, 3:52am

In your data, there is one row for Agastache foeniculum, with four "tx" values and four "wt" values. Can you explain what that means? Should your long data frame have a row for each of the 16 permutations, which is the output from your code, or does" tx_one_rate" go with "weight per plot one", and so forth, so just four rows?

_shy · December 4, 2022, 4:04am

@EconProf there are multiple plant species in the full df. Each of the plant species were planted at 4 different seed rates to compare the yield (weight) between the different seed rates. there should not be 16 rows for each species, there should only be 4 (one row for each seed rate level per species) So each species has data for their yeild weight in each of the 4 seed rates. when i do the pivot_longer() function it does do the lengthening it should, but it also duplicates everything and i cant figure out how to fix that. appreciate your help by the way!

The goal is to clean the df so I can use ggplot to compare the 4 seed rates with the yield (weight) of the species. Essentially to find the optimal seed rate.

EconProf · December 4, 2022, 4:27am

Not an elegant solution, but I think the outcome is what you want.

library(tidyverse)

seed_struct <-
  structure(
    list(
      latin_name = c(
        "Agastache foeniculum",
        "Asclepias syriaca",
        "Asclepias tuberosa",
        "Coreopsis lanceolata",
        "Echinacea pallida",
        "Echinacea purpurea",
        "Eutrochium purpureum",
        "Heliopsis helianthoides",
        "Monarda fistulosa",
        "Oligoneuron rigidum",
        "Penstemon digitalis",
        "Ratibida pinnata"
      ),
      plant_type = c(
        "wildflower",
        "wildflower",
        "wildflower",
        "wildflower",
        "wildflower",
        "wildflower",
        "wildflower",
        "wildflower",
        "wildflower",
        "wildflower",
        "wildflower",
        "wildflower"
      ),
      tx_one_rate = c(
        0.021,
        0.057,
        0.057,
        0.057,
        0.057,
        0.057,
        0.021,
        0.072,
        0.014,
        0.021,
        0.007,
        0.029
      ),
      tx_two_rate = c(
        0.042975207,
        0.114600551,
        0.114600551,
        0.114600551,
        0.114600551,
        0.114600551,
        0.042975207,
        0.143250689,
        0.028650138,
        0.042975207,
        0.014325069,
        0.057300275
      ),
      tx_three_rate = c(
        0.06446281,
        0.171900826,
        0.171900826,
        0.171900826,
        0.171900826,
        0.171900826,
        0.06446281,
        0.214876033,
        0.042975207,
        0.06446281,
        0.021487603,
        0.085950413
      ),
      tx_four_rate = c(
        0.085950413,
        0.229201102,
        0.229201102,
        0.229201102,
        0.229201102,
        0.229201102,
        0.085950413,
        0.286501377,
        0.057300275,
        0.085950413,
        0.028650138,
        0.114600551
      ),
      wt_per_plot_one = c(
        0.609,
        1.624,
        1.624,
        1.624,
        1.624,
        1.624,
        0.609,
        2.031,
        0.406,
        0.609,
        0.203,
        0.812
      ),
      wt_per_plot_two = c(
        1.218,
        3.249,
        3.249,
        3.249,
        3.249,
        3.249,
        1.218,
        4.061,
        0.812,
        1.218,
        0.406,
        1.624
      ),
      wt_per_plot_three = c(
        1.827,
        4.873,
        4.873,
        4.873,
        4.873,
        4.873,
        1.827,
        6.092,
        1.218,
        1.827,
        0.609,
        2.437
      ),
      wt_per_plot_four = c(
        2.437,
        6.498,
        6.498,
        6.498,
        6.498,
        6.498,
        2.437,
        8.122,
        1.624,
        2.437,
        0.812,
        3.249
      )
    ),
    row.names = c(NA, 12L),
    class = "data.frame"
  )


long_tx <- seed_struct %>% 
  select(!starts_with("wt")) %>%
  pivot_longer(
    cols = starts_with("tx"),
    names_to = "treatment",
    values_to = "ounces_per_plot") 

long_wt <- seed_struct %>% 
  select(!starts_with("tx")) %>%
  pivot_longer(
  cols = starts_with("wt"),
    names_to = "weight",
    values_to = "grams_per_plot")

bind_cols(long_tx, (long_wt %>% select(weight, grams_per_plot)))
#> # A tibble: 48 × 6
#>    latin_name           plant_type treatment     ounces_per_plot weight  grams…¹
#>    <chr>                <chr>      <chr>                   <dbl> <chr>     <dbl>
#>  1 Agastache foeniculum wildflower tx_one_rate            0.021  wt_per…   0.609
#>  2 Agastache foeniculum wildflower tx_two_rate            0.0430 wt_per…   1.22 
#>  3 Agastache foeniculum wildflower tx_three_rate          0.0645 wt_per…   1.83 
#>  4 Agastache foeniculum wildflower tx_four_rate           0.0860 wt_per…   2.44 
#>  5 Asclepias syriaca    wildflower tx_one_rate            0.057  wt_per…   1.62 
#>  6 Asclepias syriaca    wildflower tx_two_rate            0.115  wt_per…   3.25 
#>  7 Asclepias syriaca    wildflower tx_three_rate          0.172  wt_per…   4.87 
#>  8 Asclepias syriaca    wildflower tx_four_rate           0.229  wt_per…   6.50 
#>  9 Asclepias tuberosa   wildflower tx_one_rate            0.057  wt_per…   1.62 
#> 10 Asclepias tuberosa   wildflower tx_two_rate            0.115  wt_per…   3.25 
#> # … with 38 more rows, and abbreviated variable name ¹grams_per_plot

^{Created on 2022-12-03 with reprex v2.0.2}

_shy · December 4, 2022, 3:09pm

Thanks so much! My only question is what part do I put into my R to try it?

EconProf · December 4, 2022, 5:55pm

Not exactly sure what you mean, but it would go after your data, assuming it is the same format as the seed_struct example you posted

_shy · December 7, 2022, 12:22am

I did figure it out and this was very helpful, thanks again!

system · December 28, 2022, 12:23am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.