Summary Statistics for Data with Multiple Factors

Hi, I am a relatively new R user. This may be more of a data management issue than a stats issue. I have a dataset in the "vertical" (long) format. Columns are: Name (e.g., locations, about 10 of these), Date, Parameter, Code, and Result. (I think I properly loaded a reprex of my code below.)

I am trying to use summaryStats in the EnvStats package to get summary stats (mean, median, n, SD, Max, Min, etc.) for these data by Code and by Name.

I did it what I think is the hard way: filtering for one Code, and repeating this separately for each of my Codes:

library(dplyr)     # provides %>% and filter()
library(EnvStats)  # provides summaryStats()

seperate_data <- LakeReduced %>%
  filter(Code == "sc")
t1 <- summaryStats(Corrected ~ Name, data = seperate_data, digits = 0)


Is there a more efficient method?

Do I have to convert the dataset into a "Wide" format?  

Thanks so much!

Craig

Subset of Data:
Name	Date	Parameter	Code	Result
Channel	6-Mar-20	Secchi Disk	sd	0.4
Channel	6-Mar-20	Total Depth	td	0.9
Channel	6-Mar-20	Temperature	temp	20.2
Channel	6-Mar-20	Dissolved Oxygen (%)	do%	36.1
Channel	6-Mar-20	Dissolved Oxygen	do	3.3
Channel	6-Mar-20	Specific Conductance	sc	71
Channel	6-Mar-20	Secchi Disk	sd	0.3
Channel	6-Mar-20	Total Depth	td	4.1
Channel	6-Mar-20	pH	ph	5.1
Channel	6-Mar-20	ORP	orp	339
Channel	6-Mar-20	Temperature	temp	21.3
Channel	6-Mar-20	Dissolved Oxygen (%)	do%	92.3
Channel	6-Mar-20	Dissolved Oxygen	do	8.2
Channel	6-Mar-20	Specific Conductance	sc	66
Canal	6-Mar-20	Secchi Disk	sd	0.3
Canal	6-Mar-20	Total Depth	td	2.8
Canal	6-Mar-20	Temperature	temp	19.8
Canal	6-Mar-20	Specific Conductance	sc	72
Canal	6-Mar-20	Dissolved Oxygen (%)	do%	47.7
Canal	6-Mar-20	Dissolved Oxygen	do	4.4
Canal	6-Mar-20	pH	ph	4.5
Canal	6-Mar-20	ORP	orp	302
Hia	6-Mar-20	Secchi Disk	sd	0.3
Hia	6-Mar-20	Total Depth	td	3.2
Hia	6-Mar-20	Temperature	temp	20.7
Hia	6-Mar-20	Specific Conductance	sc	72
Hia	6-Mar-20	Dissolved Oxygen (%)	do%	87.6
Hia	6-Mar-20	Dissolved Oxygen	do	7.9
Hia	6-Mar-20	pH	ph	5.5
Hia	6-Mar-20	ORP	orp	318

I can't write and test code to solve your problem without seeing your data or a subset of it, but you can do this in tidyverse with a combination of group_by and summarize. So you would group your data by location and/or code, then use summarize to generate the summary stats for each group. Something in the form of...

data %>%
  group_by(Name, Code) %>%
  summarize(stddev = sd(Result),
            mean = mean(Result))
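
To spell that pattern out a bit more, here is a minimal sketch that computes the full set of statistics mentioned in the question (n, mean, median, SD, min, max) for every Code and Name combination without reshaping to wide format. It assumes the columns are named Name, Code, and Result as in the posted subset; the object names lake and stats are made up for illustration, and the few rows are copied from that subset.

```r
library(dplyr)

# Toy long-format data; values copied from the posted subset
lake <- data.frame(
  Name   = c("Channel", "Channel", "Canal", "Hia",
             "Channel", "Channel", "Canal", "Hia"),
  Code   = c("sc", "sc", "sc", "sc",
             "temp", "temp", "temp", "temp"),
  Result = c(71, 66, 72, 72, 20.2, 21.3, 19.8, 20.7)
)

# One row of summary statistics per Code/Name combination
stats <- lake %>%
  group_by(Code, Name) %>%
  summarize(
    n      = n(),
    mean   = mean(Result, na.rm = TRUE),
    median = median(Result, na.rm = TRUE),
    stddev = sd(Result, na.rm = TRUE),   # NA when a group has one value
    min    = min(Result, na.rm = TRUE),
    max    = max(Result, na.rm = TRUE),
    .groups = "drop"
  )

print(stats)
```

Because the grouping columns stay in the result, the output is already labeled by Code and Name and can be filtered or exported directly.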

Thanks @ulfelder!

I thought I attached a subset of data to the message. Sorry, first time RStudio Community user! I will try attaching it to this message.

I will follow your group_by approach, but can I use this group_by and then call summaryStats from EnvStats? (Then I don't have to type in "mean", "median", etc., and it calculates some other environmental-related stats.)

Subset of Data:

Name Date Parameter Code Result
Channel 6-Mar-20 Secchi Disk sd 0.4
Channel 6-Mar-20 Total Depth td 0.9
Channel 6-Mar-20 Temperature temp 20.2
Channel 6-Mar-20 Dissolved Oxygen (%) do% 36.1
Channel 6-Mar-20 Dissolved Oxygen do 3.3
Channel 6-Mar-20 Specific Conductance sc 71
Channel 6-Mar-20 Secchi Disk sd 0.3
Channel 6-Mar-20 Total Depth td 4.1
Channel 6-Mar-20 pH ph 5.1
Channel 6-Mar-20 ORP orp 339
Channel 6-Mar-20 Temperature temp 21.3
Channel 6-Mar-20 Dissolved Oxygen (%) do% 92.3
Channel 6-Mar-20 Dissolved Oxygen do 8.2
Channel 6-Mar-20 Specific Conductance sc 66
Canal 6-Mar-20 Secchi Disk sd 0.3
Canal 6-Mar-20 Total Depth td 2.8
Canal 6-Mar-20 Temperature temp 19.8
Canal 6-Mar-20 Specific Conductance sc 72
Canal 6-Mar-20 Dissolved Oxygen (%) do% 47.7
Canal 6-Mar-20 Dissolved Oxygen do 4.4
Canal 6-Mar-20 pH ph 4.5
Canal 6-Mar-20 ORP orp 302
Hia 6-Mar-20 Secchi Disk sd 0.3
Hia 6-Mar-20 Total Depth td 3.2
Hia 6-Mar-20 Temperature temp 20.7
Hia 6-Mar-20 Specific Conductance sc 72
Hia 6-Mar-20 Dissolved Oxygen (%) do% 87.6
Hia 6-Mar-20 Dissolved Oxygen do 7.9
Hia 6-Mar-20 pH ph 5.5
Hia 6-Mar-20 ORP orp 318
Shallow 6-Mar-20 Secchi Disk sd 0.41
Shallow 6-Mar-20 Total Depth td 2.8
Shallow 6-Mar-20 Temperature temp 19.7
Shallow 6-Mar-20 Specific Conductance sc 74
Shallow 6-Mar-20 Dissolved Oxygen (%) do% 98.3
Shallow 6-Mar-20 Dissolved Oxygen do 9
Shallow 6-Mar-20 pH ph 5.8
Shallow 6-Mar-20 ORP orp 316
Shallow 6-Mar-20 Secchi Disk sd 0.3
Shallow 6-Mar-20 Total Depth td 4.15
Shallow 6-Mar-20 Temperature temp 20.8
Shallow 6-Mar-20 Specific Conductance sc 74
Shallow 6-Mar-20 Dissolved Oxygen (%) do% 102.7
Shallow 6-Mar-20 Dissolved Oxygen do 9.2
Shallow 6-Mar-20 pH ph 5.9
Shallow 6-Mar-20 ORP orp 331
Deep 6-Mar-20 Secchi Disk sd 0.3
Deep 6-Mar-20 Total Depth td 4.15
Deep 6-Mar-20 Temperature temp 20.6
Deep 6-Mar-20 Specific Conductance sc 74
Deep 6-Mar-20 Dissolved Oxygen (%) do% 101
Deep 6-Mar-20 Dissolved Oxygen do 9.12
Deep 6-Mar-20 pH ph 6
Deep 6-Mar-20 ORP orp 350
Deep 6-Mar-20 Secchi Disk sd 0.41
Deep 6-Mar-20 Total Depth td 2.8
Deep 6-Mar-20 Temperature temp 19.7
Deep 6-Mar-20 Specific Conductance sc 74
Deep 6-Mar-20 Dissolved Oxygen (%) do% 98.1
Deep 6-Mar-20 Dissolved Oxygen do 8.97
Deep 6-Mar-20 pH ph 5.83
Deep 6-Mar-20 ORP orp 342
P-Channel 6-Mar-20 Ammonia as N NH4 0.02
P-Channel 6-Mar-20 Chlorophyll a chla 1.6
P-Channel 6-Mar-20 Color color 350
P-Channel 6-Mar-20 Nitrate/Nitrite as N nox 0.079
P-Channel 6-Mar-20 Orthophosphate as P srp 0.0053
P-Channel 6-Mar-20 pH for Color 4.6
P-Channel 6-Mar-20 Phosphorus tp 0.01
P-Channel 6-Mar-20 Total Alkalinity as CaCO3 ta 1.9
P-Channel 6-Mar-20 Total Kjeldahl Nitrogen tkn 0.88
P-Channel 6-Mar-20 Total Nitrogen tn 0.95
P-Channel 6-Mar-20 Total Suspended Solids tss 2.5
P-Channel 6-Mar-20 Turbidity turb 1

Thanks again for your quick reply.

On another note, I could not find in the help information how to load data?

Craig

Ah, okay, now I get what you're saying about summaryStats, and that would call for a slightly different approach, I think. Can you show your desired output, though? It's still not clear to me exactly how you want to group the data and which quantities you're looking to summarize.

@ulfelder--

The code from your output is perfect (see below--this will help me with this report I need to finish today!), but moving forward, is there a way to do this using that EnvStats package?

  Code Name          mean median stddev  max  min
1 chla Crystal Cove  3.8    3.8    1.41  4.8 2.8
2 chla Emerald       3.27   2.2    2.39  6   1.6
3 chla Hia-Canal     5.80   1.7    8.82 19   0.81
4 chla HIA-Eff       3.4    2.8    2.65  6.3 1.1
5 chla Hiawatha      3.50   1.95   3.78  9.1 0.99
6 chla Min-Deep      5.63   2.2   10.3  31   0.55
7 chla Min-Shallow   7.69   3.95  10.4  32   0.54

@Craigdux, looking closer at EnvStats and summaryStats(), it seems the goal of that function is to bundle together exactly the things I was doing with group_by() and summarize(). You just need to specify the grouping in your call to summaryStats(), and you can do the rest with base R. So, for example:

library(EnvStats)

Name <- rep(LETTERS[1:10], each = 3)
Code <- rep(c("temp", "ph", "do"), times = 10)
Result <- rnorm(n = 30, mean = 10, sd = 2)

data <- data.frame(Name, Code, Result)

with(data, summaryStats(Result, group = Code))

      N    Mean     SD  Median    Min     Max
do   10  8.6319 1.9881  8.1329 5.0849 12.5399
ph   10 10.4098 1.5158 10.6173 8.4258 12.4312
temp 10  9.8891 1.3578 10.0943 7.1174 11.6493

It's still not clear to me what, if anything, you want to do with the dates, or if you want to crosstab by location and code at the same time. But, hopefully, this gets you far enough to resolve additional complications on your own.

@ulfelder Thanks so much!

At this point, dates are irrelevant, so I think I can work it from here.

Thanks again!

@ulfelder Sorry again. One last question:

your code:
with(data, summaryStats(Result, group = Code))

Worked great.

However, like you stated, I do want to crosstab by location, and I can't figure out how to write the code: group = Code and Name

group = Code, Name does not work.

Thanks yet again

@Craigdux, I don't think there's a neat way to group by multiple factors with summaryStats(), which may be an argument for writing your own code to arrive at the same destination. You can certainly hack your way there, though. Here's a base R approach that uses lapply() to iterate the summarizing function over subsets of the original data set defined by one factor.

lapply(unique(data$Code), function(x) {

  df <- data[data$Code == x,]

  with(df, summaryStats(Result ~ Name))

})

The result is a list of tables returned by summaryStats(), one for each of the elements in your call to unique() (here, Code).
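
One drawback of the version above is that the list comes back unnamed, so you have to remember which table belongs to which Code. A possible variation, sketched here on simulated data (not your real lake data), is to split() the data frame first, since split() names each piece after its Code. The object names by_code and tables are just placeholders.

```r
library(EnvStats)

# Simulated data: 3 locations x 3 codes, 2 observations per combination
set.seed(1)
data <- data.frame(
  Name   = factor(rep(c("A", "B", "C"), each = 6)),
  Code   = rep(c("temp", "ph", "do"), times = 6),
  Result = rnorm(n = 18, mean = 10, sd = 2)
)

# split() returns a named list of data frames, one per Code
by_code <- split(data, data$Code)

# Run summaryStats() by Name within each Code; the names carry over
tables <- lapply(by_code, function(df) {
  summaryStats(Result ~ Name, data = df)
})

names(tables)    # the Codes, in sorted order
tables[["ph"]]   # pull out the table for a single Code
```

With that, tables[["sc"]] or tables$temp on your own data would give you the by-location table for that parameter.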

You can also do this in the tidyverse with map() in place of lapply() and some piping inside the function. I don't see any obvious gains here, though, since you probably don't want to bind the resulting tables together (e.g., with map_dfr() instead of map()) without additional columns for labeling. Those will already be there with the original group_by/summarize approach.

library(tidyverse)

map(unique(data$Code), function(x) {

  data %>%
    filter(Code == x) %>%
    with(., summaryStats(Result ~ Name))

})
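
If you would rather have a single table than a list, one more workaround that might be worth trying is to collapse the two factors into one with base R's interaction() and pass that as the group argument. This is only a sketch on simulated data, and it assumes summaryStats() treats the combined factor like any other grouping factor; I have not checked it against your real data set.

```r
library(EnvStats)

# Simulated data: 3 locations x 3 codes, 2 observations per combination
set.seed(1)
data <- data.frame(
  Name   = rep(c("A", "B", "C"), each = 6),
  Code   = rep(c("temp", "ph", "do"), times = 6),
  Result = rnorm(n = 18, mean = 10, sd = 2)
)

# Build one factor whose levels are the Code/Name combinations;
# drop = TRUE discards combinations that never occur in the data
data$CodeName <- interaction(data$Code, data$Name, sep = " / ", drop = TRUE)

# One row of statistics per Code/Name combination
res <- with(data, summaryStats(Result, group = CodeName))
res
```

The row labels will be the pasted-together level names (for example "ph / A"), so this is handy for a quick look but less convenient than the group_by/summarize output if you need Code and Name as separate columns.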

@ulfelder Thanks again-- I will try this tonight!
