Hi, I am a relatively new R user. This may be more of a data management issue than a stats issue. I have a dataset in the "vertical" (long) format. Columns are: Name (locations, about 10 of these), Date, Parameter, Code, and Result. (I think I properly loaded a reprex of my code below.)
I am trying to use summaryStats in the EnvStats package to get summary stats (mean, median, n, SD, Max, Min, etc.) for these data by Code and by Name.
I did it what I think is the hard way: filtering for one Code, then repeating this separately for each of my Codes:
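library(dplyr)     # provides %>% and filter(); without it the reprex fails with: could not find function "%>%"
library(EnvStats)  # provides summaryStats()
# Keep only the specific conductance ("sc") rows
seperate_data <- LakeReduced %>%
  filter(Code == "sc")
# Summary statistics for that one Code, broken out by Name
t1 <- summaryStats(Corrected ~ Name, data = seperate_data, digits = 0)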
Is there a more efficient method?
Do I have to convert the dataset into a "Wide" format?
Thanks so much!
Craig
Subset of Data:
| Name | Date | Parameter | Code | Result |
|---|---|---|---|---|
| Channel | 6-Mar-20 | Secchi Disk | sd | 0.4 |
| Channel | 6-Mar-20 | Total Depth | td | 0.9 |
| Channel | 6-Mar-20 | Temperature | temp | 20.2 |
| Channel | 6-Mar-20 | Dissolved Oxygen (%) | do% | 36.1 |
| Channel | 6-Mar-20 | Dissolved Oxygen | do | 3.3 |
| Channel | 6-Mar-20 | Specific Conductance | sc | 71 |
| Channel | 6-Mar-20 | Secchi Disk | sd | 0.3 |
| Channel | 6-Mar-20 | Total Depth | td | 4.1 |
| Channel | 6-Mar-20 | pH | ph | 5.1 |
| Channel | 6-Mar-20 | ORP | orp | 339 |
| Channel | 6-Mar-20 | Temperature | temp | 21.3 |
| Channel | 6-Mar-20 | Dissolved Oxygen (%) | do% | 92.3 |
| Channel | 6-Mar-20 | Dissolved Oxygen | do | 8.2 |
| Channel | 6-Mar-20 | Specific Conductance | sc | 66 |
| Canal | 6-Mar-20 | Secchi Disk | sd | 0.3 |
| Canal | 6-Mar-20 | Total Depth | td | 2.8 |
| Canal | 6-Mar-20 | Temperature | temp | 19.8 |
| Canal | 6-Mar-20 | Specific Conductance | sc | 72 |
| Canal | 6-Mar-20 | Dissolved Oxygen (%) | do% | 47.7 |
| Canal | 6-Mar-20 | Dissolved Oxygen | do | 4.4 |
| Canal | 6-Mar-20 | pH | ph | 4.5 |
| Canal | 6-Mar-20 | ORP | orp | 302 |
| Hia | 6-Mar-20 | Secchi Disk | sd | 0.3 |
| Hia | 6-Mar-20 | Total Depth | td | 3.2 |
| Hia | 6-Mar-20 | Temperature | temp | 20.7 |
| Hia | 6-Mar-20 | Specific Conductance | sc | 72 |
| Hia | 6-Mar-20 | Dissolved Oxygen (%) | do% | 87.6 |
| Hia | 6-Mar-20 | Dissolved Oxygen | do | 7.9 |
| Hia | 6-Mar-20 | pH | ph | 5.5 |
| Hia | 6-Mar-20 | ORP | orp | 318 |
I can't write and test code to solve your problem without seeing your data or a subset of it, but you can do this in tidyverse with a combination of group_by and summarize. So you would group your data by location and/or code, then use summarize to generate the summary stats for each group. Something in the form of...
data %>%
group_by(Name, Code) %>%
summarize(stddev = sd(Result),
mean = mean(Result))
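To make that pattern a little more concrete, here is a minimal sketch that adds the other statistics mentioned in the question (n, median, min, max). The data frame name `lake` is hypothetical; it holds only a handful of rows copied from the subset posted in this thread, not the full dataset, and the output column names are just one possible choice.

```r
library(dplyr)

# A few rows copied from the data subset in this thread (long format).
# `lake` is a hypothetical name used only for this illustration.
lake <- data.frame(
  Name   = c("Channel", "Channel", "Channel", "Channel",
             "Canal", "Canal", "Hia", "Hia"),
  Code   = c("sd", "sc", "sd", "sc", "sd", "sc", "sd", "sc"),
  Result = c(0.4, 71, 0.3, 66, 0.3, 72, 0.3, 72)
)

# One row of summary statistics per Name/Code combination
stats <- lake %>%
  group_by(Name, Code) %>%
  summarize(
    N      = n(),
    Mean   = mean(Result),
    SD     = sd(Result),      # NA when a group has a single observation
    Median = median(Result),
    Min    = min(Result),
    Max    = max(Result),
    .groups = "drop"          # return an ungrouped tibble
  )

print(stats)
```

Because the grouping columns stay in the result, the output is already labeled by Name and Code and can be filtered or exported directly.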
I thought I attached a subset of data to the message. Sorry, first-time RStudio Community user! I will try attaching it to this message.
I will follow your group_by approach, but can I use group_by and then call summaryStats from EnvStats? (Then I don't have to type in "mean", "median", etc., and it calculates some other environment-related stats.)
Subset of Data:
| Name | Date | Parameter | Code | Result |
|---|---|---|---|---|
| Channel | 6-Mar-20 | Secchi Disk | sd | 0.4 |
| Channel | 6-Mar-20 | Total Depth | td | 0.9 |
| Channel | 6-Mar-20 | Temperature | temp | 20.2 |
| Channel | 6-Mar-20 | Dissolved Oxygen (%) | do% | 36.1 |
| Channel | 6-Mar-20 | Dissolved Oxygen | do | 3.3 |
| Channel | 6-Mar-20 | Specific Conductance | sc | 71 |
| Channel | 6-Mar-20 | Secchi Disk | sd | 0.3 |
| Channel | 6-Mar-20 | Total Depth | td | 4.1 |
| Channel | 6-Mar-20 | pH | ph | 5.1 |
| Channel | 6-Mar-20 | ORP | orp | 339 |
| Channel | 6-Mar-20 | Temperature | temp | 21.3 |
| Channel | 6-Mar-20 | Dissolved Oxygen (%) | do% | 92.3 |
| Channel | 6-Mar-20 | Dissolved Oxygen | do | 8.2 |
| Channel | 6-Mar-20 | Specific Conductance | sc | 66 |
| Canal | 6-Mar-20 | Secchi Disk | sd | 0.3 |
| Canal | 6-Mar-20 | Total Depth | td | 2.8 |
| Canal | 6-Mar-20 | Temperature | temp | 19.8 |
| Canal | 6-Mar-20 | Specific Conductance | sc | 72 |
| Canal | 6-Mar-20 | Dissolved Oxygen (%) | do% | 47.7 |
| Canal | 6-Mar-20 | Dissolved Oxygen | do | 4.4 |
| Canal | 6-Mar-20 | pH | ph | 4.5 |
| Canal | 6-Mar-20 | ORP | orp | 302 |
| Hia | 6-Mar-20 | Secchi Disk | sd | 0.3 |
| Hia | 6-Mar-20 | Total Depth | td | 3.2 |
| Hia | 6-Mar-20 | Temperature | temp | 20.7 |
| Hia | 6-Mar-20 | Specific Conductance | sc | 72 |
| Hia | 6-Mar-20 | Dissolved Oxygen (%) | do% | 87.6 |
| Hia | 6-Mar-20 | Dissolved Oxygen | do | 7.9 |
| Hia | 6-Mar-20 | pH | ph | 5.5 |
| Hia | 6-Mar-20 | ORP | orp | 318 |
| Shallow | 6-Mar-20 | Secchi Disk | sd | 0.41 |
| Shallow | 6-Mar-20 | Total Depth | td | 2.8 |
| Shallow | 6-Mar-20 | Temperature | temp | 19.7 |
| Shallow | 6-Mar-20 | Specific Conductance | sc | 74 |
| Shallow | 6-Mar-20 | Dissolved Oxygen (%) | do% | 98.3 |
| Shallow | 6-Mar-20 | Dissolved Oxygen | do | 9 |
| Shallow | 6-Mar-20 | pH | ph | 5.8 |
| Shallow | 6-Mar-20 | ORP | orp | 316 |
| Shallow | 6-Mar-20 | Secchi Disk | sd | 0.3 |
| Shallow | 6-Mar-20 | Total Depth | td | 4.15 |
| Shallow | 6-Mar-20 | Temperature | temp | 20.8 |
| Shallow | 6-Mar-20 | Specific Conductance | sc | 74 |
| Shallow | 6-Mar-20 | Dissolved Oxygen (%) | do% | 102.7 |
| Shallow | 6-Mar-20 | Dissolved Oxygen | do | 9.2 |
| Shallow | 6-Mar-20 | pH | ph | 5.9 |
| Shallow | 6-Mar-20 | ORP | orp | 331 |
| Deep | 6-Mar-20 | Secchi Disk | sd | 0.3 |
| Deep | 6-Mar-20 | Total Depth | td | 4.15 |
| Deep | 6-Mar-20 | Temperature | temp | 20.6 |
| Deep | 6-Mar-20 | Specific Conductance | sc | 74 |
| Deep | 6-Mar-20 | Dissolved Oxygen (%) | do% | 101 |
| Deep | 6-Mar-20 | Dissolved Oxygen | do | 9.12 |
| Deep | 6-Mar-20 | pH | ph | 6 |
| Deep | 6-Mar-20 | ORP | orp | 350 |
| Deep | 6-Mar-20 | Secchi Disk | sd | 0.41 |
| Deep | 6-Mar-20 | Total Depth | td | 2.8 |
| Deep | 6-Mar-20 | Temperature | temp | 19.7 |
| Deep | 6-Mar-20 | Specific Conductance | sc | 74 |
| Deep | 6-Mar-20 | Dissolved Oxygen (%) | do% | 98.1 |
| Deep | 6-Mar-20 | Dissolved Oxygen | do | 8.97 |
| Deep | 6-Mar-20 | pH | ph | 5.83 |
| Deep | 6-Mar-20 | ORP | orp | 342 |
| P-Channel | 6-Mar-20 | Ammonia as N | NH4 | 0.02 |
| P-Channel | 6-Mar-20 | Chlorophyll a | chla | 1.6 |
| P-Channel | 6-Mar-20 | Color | color | 350 |
| P-Channel | 6-Mar-20 | Nitrate/Nitrite as N | nox | 0.079 |
| P-Channel | 6-Mar-20 | Orthophosphate as P | srp | 0.0053 |
| P-Channel | 6-Mar-20 | pH for Color | | 4.6 |
| P-Channel | 6-Mar-20 | Phosphorus | tp | 0.01 |
| P-Channel | 6-Mar-20 | Total Alkalinity as CaCO3 | ta | 1.9 |
| P-Channel | 6-Mar-20 | Total Kjeldahl Nitrogen | tkn | 0.88 |
| P-Channel | 6-Mar-20 | Total Nitrogen | tn | 0.95 |
| P-Channel | 6-Mar-20 | Total Suspended Solids | tss | 2.5 |
| P-Channel | 6-Mar-20 | Turbidity | turb | 1 |
Thanks again for your quick reply.
On another note, I could not find in the help information how to load data?
Ah, okay, now I get what you're saying about summaryStats, and that would call for a slightly different approach, I think. Can you show your desired output, though? It's still not clear to me exactly how you want to group the data and which quantities you're looking to summarize.
The code from your output is perfect (see below--this will help me with this report I need to finish today!), but moving forward, is there a way to do this using that EnvStats package?
@Craigdux, looking closer at EnvStats and summaryStats(), it seems the goal of that function is to bundle together exactly the things I was doing with group_by() and summarize(). You just need to specify the grouping in your call to summaryStats(), and you can do the rest with base R. So, for example:
library(EnvStats)
Name <- rep(LETTERS[1:10], each = 3)
Code <- rep(c("temp", "ph", "do"), times = 10)
Result <- rnorm(n = 30, mean = 10, sd = 2)
data <- data.frame(Name, Code, Result)
with(data, summaryStats(Result, group = Code))
N Mean SD Median Min Max
do 10 8.6319 1.9881 8.1329 5.0849 12.5399
ph 10 10.4098 1.5158 10.6173 8.4258 12.4312
temp 10 9.8891 1.3578 10.0943 7.1174 11.6493
It's still not clear to me what, if anything, you want to do with the dates, or if you want to crosstab by location and code at the same time. But, hopefully, this gets you far enough to resolve additional complications on your own.
@Craigdux, I don't think there's a neat way to group by multiple factors with summaryStats(), which may be an argument for writing your own code to arrive at the same destination. You can certainly hack your way there, though. Here's a base R approach that uses lapply() to iterate the summarizing function over subsets of the original data set defined by one factor.
The result is a list of tables returned by summaryStats(), one for each of the elements in your call to unique() (here, Code).
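The code for this step is not included in the thread as captured here, so what follows is only a minimal sketch of what such an lapply() approach might look like, not the original code. It assumes the same Name, Code, and Result columns as above; the simulated data, the object name `stats_by_code`, and the choice to subset by Code and group by Name within each subset are all illustrative assumptions.

```r
library(EnvStats)

# Simulated long-format data (hypothetical values): 3 locations x 3 codes,
# with 4 replicate measurements for each combination.
set.seed(1)
data <- expand.grid(
  Rep  = 1:4,
  Code = c("temp", "ph", "do"),
  Name = c("Canal", "Channel", "Hia"),
  stringsAsFactors = FALSE
)
data$Result <- rnorm(n = nrow(data), mean = 10, sd = 2)

# Iterate over each Code, subset the data, and summarize Result by Name
codes <- unique(data$Code)
stats_by_code <- lapply(codes, function(cd) {
  sub <- data[data$Code == cd, ]
  summaryStats(sub$Result, group = factor(sub$Name))
})
names(stats_by_code) <- codes  # label each table with its Code

stats_by_code[["ph"]]          # summary table for pH, one row per Name
```

Swapping the roles of Code and Name in the sketch would instead give one table per location with one row per Code.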
You can also do this in the tidyverse with map() in place of lapply() and some piping inside the function. I don't see any obvious gains here, though, since you probably don't want to bind the resulting tables together (e.g., with map_dfr() instead of map()) without additional columns for labeling. Those will already be there with the original group_by/summarize approach.
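For comparison, a tidyverse version might look something like the sketch below. It is a hedged illustration rather than tested production code: the data are simulated, `stats_by_code` is a hypothetical name, and splitting the data frame with base `split()` before `map()` is just one of several ways to set up the iteration.

```r
library(dplyr)
library(purrr)
library(EnvStats)

# Simulated long-format data (hypothetical values), same layout as the thread
set.seed(42)
data <- expand.grid(
  Rep  = 1:4,
  Code = c("temp", "ph", "do"),
  Name = c("Canal", "Channel", "Hia"),
  stringsAsFactors = FALSE
)
data$Result <- rnorm(n = nrow(data), mean = 10, sd = 2)

# Split into one data frame per Code, then map summaryStats() over the pieces
stats_by_code <- data %>%
  split(.$Code) %>%
  map(function(d) {
    d %>% with(summaryStats(Result, group = factor(Name)))
  })

names(stats_by_code)   # one list element per Code
stats_by_code$do       # summary table for dissolved oxygen, by Name
```

The result is still a named list of separate tables, which is why binding them into one data frame would need extra labeling columns.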