Hi,
I have a file in the following format:
26 Jul 20 00:01:00 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
26 Jul 20 00:02:00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
26 Jul 20 00:03:00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
26 Jul 20 00:04:00 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
26 Jul 20 00:05:00 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
26 Jul 20 00:06:00 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0
26 Jul 20 00:07:00 0 0 0 0 0 0 0 0 0 0 4 0 0 1 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0
26 Jul 20 00:08:00 0 0 0 0 0 0 0 0 0 0 3 0 0 3 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0
The first column is Date, second column is the timepoint, third to thirty-fourth columns are activity at that time point of 32 different individuals (in 32 different columns). So essentially the full data represents time-course recording of 32 different individuals for 8 days. What I want to do is normalize their activity counts for each day, so the activity of each time point is a percentage value telling what percentage of activity of the whole day occurred at that particular time point. Each timepoint activity will have to be divided by sum of all activity on all time points of one particular day for that individual and multiplied by 100.
It may be a novice question, but I will greatly appreciate any help!
Hi, this should work
suppressMessages(library(tidyverse))
df1 <- read_tsv("sample.txt", col_names = c("date", "time", paste0("p", 1:32)))
#>
#> -- Column specification --------------------------------------------------------
#> cols(
#> .default = col_double(),
#> date = col_character(),
#> time = col_time(format = "")
#> )
#> i Use `spec()` for the full column specifications.
df_tidy <- df1 %>%
#Convert df in tidy format
pivot_longer(cols = p1:p32,
names_to = "individ",
values_to = "activity") %>%
# Group by date and person
group_by(date, individ) %>%
mutate(total_day_act = sum(activity), #find total activity in each date for each person
activity_pct = activity/total_day_act*100) #find % of activities in each timepoint
head(df_tidy)
#> # A tibble: 6 x 6
#> # Groups: date, individ [6]
#> date time individ activity total_day_act activity_pct
#> <chr> <time> <chr> <dbl> <dbl> <dbl>
#> 1 26 Jul 20 01'00" p1 0 1127 0
#> 2 26 Jul 20 01'00" p2 0 974 0
#> 3 26 Jul 20 01'00" p3 0 421 0
#> 4 26 Jul 20 01'00" p4 0 775 0
#> 5 26 Jul 20 01'00" p5 0 1283 0
#> 6 26 Jul 20 01'00" p6 0 1248 0
#Convert back to "dirty" format if needed
df_dirty <- df_tidy %>%
ungroup() %>%
select(!c(activity, total_day_act)) %>%
pivot_wider(names_from = individ,
values_from = activity_pct)
#Last date has only one timepoint, hence it either 100% or NA because of division by 0
tail(df_dirty)
#> # A tibble: 6 x 34
#> date time p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
#> <chr> <tim> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 Au~ 23:55 0 0 0 0 0. 0 0 0 0 0
#> 2 2 Au~ 23:56 0 0.139 0 0 0. 0 0 0 0 0
#> 3 2 Au~ 23:57 0 0.555 0 0 0. 0 0 0 0 0
#> 4 2 Au~ 23:58 0 0 0 0 9.43e-2 0 0 0 0 0
#> 5 2 Au~ 23:59 0 0 0 0 3.77e-1 0 0 0 0 0
#> 6 3 Au~ 00:00 NaN NaN NaN NaN 1.00e+2 NaN NaN NaN NaN NaN
#> # ... with 22 more variables: p11 <dbl>, p12 <dbl>, p13 <dbl>, p14 <dbl>,
#> # p15 <dbl>, p16 <dbl>, p17 <dbl>, p18 <dbl>, p19 <dbl>, p20 <dbl>,
#> # p21 <dbl>, p22 <dbl>, p23 <dbl>, p24 <dbl>, p25 <dbl>, p26 <dbl>,
#> # p27 <dbl>, p28 <dbl>, p29 <dbl>, p30 <dbl>, p31 <dbl>, p32 <dbl>
1 Like
@Dobrokhotov1989 , thank you, this works. I have another question though, suppose my recordings start from 24/05/2021 10:00:00 AM and end on 29/05/2021 09:59:00, which will be 5 complete days, is there a way so that I can consider 24/05/2021 10:00:00 to 25/05/2021 09:59:00 - 1 single day and so on? And then do the calculations on this "defined day", rather than going by absolute day value?
Let dat
stand in for the integer portions of the data frame.
dat <- data.frame(
V1 =
c(8, 15, 12, 9, 13, 1, 19, 3, 7, 4, 5, 16, 6, 14, 18),
V2 =
c(12, 19, 14, 11, 17, 1, 5, 16, 10, 7, 9, 13, 6, 18, 4))
dat/colSums(dat)
#> V1 V2
#> 1 0.05333333 0.074074074
#> 2 0.09259259 0.126666667
#> 3 0.08000000 0.086419753
#> 4 0.05555556 0.073333333
#> 5 0.08666667 0.104938272
#> 6 0.00617284 0.006666667
#> 7 0.12666667 0.030864198
#> 8 0.01851852 0.106666667
#> 9 0.04666667 0.061728395
#> 10 0.02469136 0.046666667
#> 11 0.03333333 0.055555556
#> 12 0.09876543 0.086666667
#> 13 0.04000000 0.037037037
#> 14 0.08641975 0.120000000
#> 15 0.12000000 0.024691358
It is possible. I think the easiest way should be something like this:
merging date and time columns in "date time" format
subtract 10 h from this new column
split it back into 2 columns date and time
Check this for the date and time manipulations: 16 Dates and times | R for Data Science
orijitghosh:
thank you, this works. I have another question though, suppose my recordings start from 24/05/2021 10:00:00 AM and end on 29/05/2021 09:59:00, which will be 5 complete days, is there a way so that I can consider 24/05/2021 10:00:00 to 25/05/2021 09:59:00 - 1 single day and so on? And then do the calculations on this "defined day", rather than going by absolute day value?
Can't figure out if this is a solution! Help me understand?
Thank you so much, will do this.
Create a convenience timestamp variable
suppressPackageStartupMessages({
library(lubridate)
})
datestring <- "24/05/2021 10:01:00"
datetime <- dmy_hms(datestring)
datetime - hours(10)
#> [1] "2021-05-24 00:01:00 UTC"
to use in group_by
with lubridate::yday
(assuming no year-end spanning data)
system
Closed
June 1, 2021, 6:12am
9
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed. If you have a query related to it or one of the replies, start a new topic and refer back with a link.