Day-wise normalization in R

Hi,

I have a file in the following format:

26 Jul 20 00:01:00 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

26 Jul 20 00:02:00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

26 Jul 20 00:03:00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

26 Jul 20 00:04:00 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

26 Jul 20 00:05:00 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

26 Jul 20 00:06:00 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0

26 Jul 20 00:07:00 0 0 0 0 0 0 0 0 0 0 4 0 0 1 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0

26 Jul 20 00:08:00 0 0 0 0 0 0 0 0 0 0 3 0 0 3 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0

The first column is Date, second column is the timepoint, third to thirty-fourth columns are activity at that time point of 32 different individuals (in 32 different columns). So essentially the full data represents time-course recording of 32 different individuals for 8 days. What I want to do is normalize their activity counts for each day, so the activity of each time point is a percentage value telling what percentage of activity of the whole day occurred at that particular time point. Each timepoint activity will have to be divided by sum of all activity on all time points of one particular day for that individual and multiplied by 100.

It may be a novice question, but I will greatly appreciate any help!

Hi, this should work

suppressMessages(library(tidyverse))

df1 <- read_tsv("sample.txt", col_names = c("date", "time", paste0("p", 1:32)))
#> 
#> -- Column specification --------------------------------------------------------
#> cols(
#>   .default = col_double(),
#>   date = col_character(),
#>   time = col_time(format = "")
#> )
#> i Use `spec()` for the full column specifications.


df_tidy <- df1 %>%
  #Convert df in tidy format 
  pivot_longer(cols = p1:p32,
               names_to = "individ",
               values_to = "activity") %>%
  # Group by date and person
  group_by(date, individ) %>% 
  mutate(total_day_act = sum(activity), #find total activity in each date for each person
         activity_pct = activity/total_day_act*100) #find % of activities in each timepoint 

head(df_tidy)
#> # A tibble: 6 x 6
#> # Groups:   date, individ [6]
#>   date      time   individ activity total_day_act activity_pct
#>   <chr>     <time> <chr>      <dbl>         <dbl>        <dbl>
#> 1 26 Jul 20 01'00" p1             0          1127            0
#> 2 26 Jul 20 01'00" p2             0           974            0
#> 3 26 Jul 20 01'00" p3             0           421            0
#> 4 26 Jul 20 01'00" p4             0           775            0
#> 5 26 Jul 20 01'00" p5             0          1283            0
#> 6 26 Jul 20 01'00" p6             0          1248            0

#Convert back to "dirty" format if needed
df_dirty <- df_tidy %>%
  ungroup() %>%
  select(!c(activity, total_day_act)) %>%
  pivot_wider(names_from = individ,
              values_from = activity_pct)

#Last date has only one timepoint, hence it either 100% or NA because of division by 0
tail(df_dirty)
#> # A tibble: 6 x 34
#>   date  time     p1      p2    p3    p4      p5    p6    p7    p8    p9   p10
#>   <chr> <tim> <dbl>   <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 Au~ 23:55     0   0         0     0 0.          0     0     0     0     0
#> 2 2 Au~ 23:56     0   0.139     0     0 0.          0     0     0     0     0
#> 3 2 Au~ 23:57     0   0.555     0     0 0.          0     0     0     0     0
#> 4 2 Au~ 23:58     0   0         0     0 9.43e-2     0     0     0     0     0
#> 5 2 Au~ 23:59     0   0         0     0 3.77e-1     0     0     0     0     0
#> 6 3 Au~ 00:00   NaN NaN       NaN   NaN 1.00e+2   NaN   NaN   NaN   NaN   NaN
#> # ... with 22 more variables: p11 <dbl>, p12 <dbl>, p13 <dbl>, p14 <dbl>,
#> #   p15 <dbl>, p16 <dbl>, p17 <dbl>, p18 <dbl>, p19 <dbl>, p20 <dbl>,
#> #   p21 <dbl>, p22 <dbl>, p23 <dbl>, p24 <dbl>, p25 <dbl>, p26 <dbl>,
#> #   p27 <dbl>, p28 <dbl>, p29 <dbl>, p30 <dbl>, p31 <dbl>, p32 <dbl>
1 Like

@Dobrokhotov1989 , thank you, this works. I have another question though, suppose my recordings start from 24/05/2021 10:00:00 AM and end on 29/05/2021 09:59:00, which will be 5 complete days, is there a way so that I can consider 24/05/2021 10:00:00 to 25/05/2021 09:59:00 - 1 single day and so on? And then do the calculations on this "defined day", rather than going by absolute day value?

Let dat stand in for the integer portions of the data frame.

dat <- data.frame(
  V1 =
    c(8, 15, 12, 9, 13, 1, 19, 3, 7, 4, 5, 16, 6, 14, 18),
  V2 =
    c(12, 19, 14, 11, 17, 1, 5, 16, 10, 7, 9, 13, 6, 18, 4))

dat/colSums(dat)
#>            V1          V2
#> 1  0.05333333 0.074074074
#> 2  0.09259259 0.126666667
#> 3  0.08000000 0.086419753
#> 4  0.05555556 0.073333333
#> 5  0.08666667 0.104938272
#> 6  0.00617284 0.006666667
#> 7  0.12666667 0.030864198
#> 8  0.01851852 0.106666667
#> 9  0.04666667 0.061728395
#> 10 0.02469136 0.046666667
#> 11 0.03333333 0.055555556
#> 12 0.09876543 0.086666667
#> 13 0.04000000 0.037037037
#> 14 0.08641975 0.120000000
#> 15 0.12000000 0.024691358

It is possible. I think the easiest way should be something like this:

  1. merging date and time columns in "date time" format
  2. subtract 10 h from this new column
  3. split it back into 2 columns date and time

Check this for the date and time manipulations: 16 Dates and times | R for Data Science

Can't figure out if this is a solution! Help me understand?

Thank you so much, will do this.

Create a convenience timestamp variable

suppressPackageStartupMessages({
  library(lubridate)
})

datestring <- "24/05/2021 10:01:00"
datetime <- dmy_hms(datestring)
datetime - hours(10)
#> [1] "2021-05-24 00:01:00 UTC"

to use in group_by with lubridate::yday (assuming no year-end spanning data)

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.