Iteratively insert row into data frame

aellis · February 14, 2021, 5:45pm

Hello all!

I am trying to iteratively add new observations to my dataframe df1. Instead of using rbind() to insert the observations at the end of the dataframe, I'd like to insert my observations into df1 using add_row(). I'm relatively new to R, so currently I iterate by running a for loop with the range as

row in 1:nrow(df1))

and then (given some conditions), creating a new row and using rbind to append it to df1. If I wanted to instead insert the row inside df1 (i.e. if it happens at row n, inserting at n + 1), could I still use the same range for my loop? If I have k original rows, and I insert an observation at, say, row 5, now I have k + 1 rows, and the former row 5 would instead be row 6, so I would think that I'd need to adjust my loop to a larger range. Is this true?

Also worth noting, I've been reading that it's better to vectorize operations in R because it works faster than a for loop. Is there a way to vectorize this operation? Or any recommendations on tutorials/websites where I can read up and start better understanding vectorizing?

Thanks so much.

nirgrahamuk · February 14, 2021, 5:50pm

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

aellis · February 14, 2021, 6:30pm

Sorry about that. Here is an example of what I'm trying to do, using the first 5 rows of the DAX column from the EuStockMarkets dataset.

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.3

df <- data.frame(DAX = c(1628.75, 1613.63, 1606.51, 1621.04, 1618.16))

for(row in 1:nrow(df)) {
  if({df[(row + 1), "DAX"] > df[row, "DAX"]}) {
    df <- df %>%
      add_row(DAX = (df[(row + 1), "DAX"] - df[row, "DAX"]),
              .before = df[row, ])
  }
}

^{Created on 2021-02-14 by the reprex package (v1.0.0)}

So, my goal in this reprex is to iterate over the dataframe, checking to see if DAX increased, and then inserting a row into the dataframe (between the two rows) representing the increase. If you run this output, you can pretty clearly see this doesn't work, so I don't think it would work for my dataframe. Are there any recommendations on how to make something like this work or to vectorize it?

nirgrahamuk · February 14, 2021, 6:50pm

What do you think of this approach ?

 mutate(df,
        change_in_dax = lead(DAX)-DAX) %>% 
   pivot_longer(cols = everything()) %>% 
   na.omit()

aellis · February 14, 2021, 7:58pm

Okay, I think I understand what you're recommending here. I think I may have oversimplified my reprex relative to my dataset. Here's a better example, I think. If I make the reprex something like this, using that same DAX column and adding a column to represent the (fake) day of the week when the observation was taken:

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.3

df <- data.frame(DAX = c(1628.75, 1613.63, 1606.51, 1621.04, 1618.16, 1636.78, 1632.17), 
                 Day.Of.Week = c("1", "2", "4", "5", "1", "2", "3"))

^{Created on 2021-02-14 by the reprex package (v1.0.0)}

My goal would be to maintain the structure of that dataframe I created while inserting the row. So in this case, if I want to find that change_in_dax for observations taken on consecutive days (i.e. if lead(Day.Of.Week) - Day.Of.Week = 1), I would use a for loop to do this. My pseudocode for this would look something like:

for ("every row in the dataframe"){
   if ("the observations are separated by 1 day") {
      "insert row in between the two rows with some descriptor about how the observations changed"
  }
}

And then after running that code, I would ideally end up with a df that looks something like

df <- data.frame(DAX = c(1628.75, -15.12, 1613.63, 1606.51, 14.53, 1621.04, 1618.16, 1636.78, 1632.17), 
                 Day.Of.Week = c(1, 0, 2, 4, 0, 5, 1, 3, 5))

though the Day.Of.Week could definitely be something besides 0, that's just to preserve the class of the vector as numeric, the row insertion is what I'm more concerned about. Does that still make sense? Sorry, and thank you very much for your help.

lars · February 15, 2021, 8:55am

Hi aellis ,

It's a bit of challenge as one need to have a unique identifier for each observation (row) in order to insert the new values correctly.

One possible approach is like this:

library(tidyverse)

## create dataframe
df_in <- tibble(DAX = c(1628.75, 1613.63, 1606.51, 1621.04, 1618.16, 1636.78, 1632.17),
                day_of_week = c(1, 2, 4, 5, 1, 2, 3)
                )
df_in
#> # A tibble: 7 x 2
#>     DAX day_of_week
#>   <dbl>       <dbl>
#> 1 1629.           1
#> 2 1614.           2
#> 3 1607.           4
#> 4 1621.           5
#> 5 1618.           1
#> 6 1637.           2
#> 7 1632.           3

## create temporary dataframe `df_a` which:
## 1) calculate DAX difference when previous DAX is from yesterday
## 2) add temporary column `id`, which is needed to sort the endresult correctly
df_a <- df_in %>% 
  mutate(diff = case_when(day_of_week - lag(day_of_week) == 1 ~ DAX - lag(DAX)),
         id = row_number(),
         ) %>% 
  select(id, everything()) %>%
  identity()

## create new temporary dataframe, based on `df_a`
## holding rows which need to be added
df_b <- df_a %>% 
  filter(!is.na(diff)) %>% 
  mutate(DAX = diff,
         day_of_week = 0,
         ) %>% 
  select(id, DAX, day_of_week)

## merge both dataframes, explicitely use `copy = TRUE` 
## so that rows are added and not merged!
df_out <- df_b %>% 
  full_join(df_a %>% 
              select(id, DAX, day_of_week),
            copy = TRUE
            ) %>% 
  arrange(id) %>% 
  ## remove the temporary `id` column
  select(-id)
#> Joining, by = c("id", "DAX", "day_of_week")

## show endresult
df_out
#> # A tibble: 11 x 2
#>        DAX day_of_week
#>      <dbl>       <dbl>
#>  1 1629.             1
#>  2  -15.1            0
#>  3 1614.             2
#>  4 1607.             4
#>  5   14.5            0
#>  6 1621.             5
#>  7 1618.             1
#>  8   18.6            0
#>  9 1637.             2
#> 10   -4.61           0
#> 11 1632.             3

^{Created on 2021-02-15 by the reprex package (v1.0.0)}

Note: the values of DAX appears to be incorrect, but use df_out %>% View() and you'll see all is OK.

system · February 28, 2021, 5:58pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.