Hi!
How can I apply an extrapolation factor (a weight) to my data? The factor already exists and states how many households each sampled household represents. After the extrapolation, I run statistical analyses such as ANOVA and Welch tests, including pre-tests and post-hoc tests, and plot the results with ggplot.
My real data set contains more than 17,000 households, which represent about 3 million households.
Here is some sample data:
library(tidyverse)
Data
hh <- c(1,2,3,4,5,6,7,8,9,10) # household
hh_weight <- c(1.5,10.4,2.4,5.1,8.4,4.7,3.1,7.4,7.9,11.1) # weight of the single household
dv <- c(1.,0.5,0.9,2,4,1,0.3,4,0.1,3) # dependent variable
dv_log <- log10(dv) # log10 of the dependent variable
iv <- c(1,2,2,1,3,2,1,3,2,3) # independent variable
dat <- data.frame(hh,hh_weight,dv,dv_log,iv) # build data frame
dat <- mutate(dat,dv_kat = cut(dv,breaks = c(0,0.3,0.5,0.7,0.9,1.12,1.43,2.01,3.31,105),labels = c("5","4","3","2","1","2","3","4","5"))) # categories of dependent variable
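A minimal sketch of what "applying the factor" could mean for simple summaries, using only base R. It assumes the factor acts as a plain frequency-style weight: the sum of `hh_weight` is the number of households represented, and `weighted.mean()` gives the extrapolated group mean. The names `n_represented`, `unweighted_mean` and `weighted_mean` are only illustrative.

```r
# Sample data from above, repeated so the snippet runs on its own
hh_weight <- c(1.5, 10.4, 2.4, 5.1, 8.4, 4.7, 3.1, 7.4, 7.9, 11.1)
dv <- c(1, 0.5, 0.9, 2, 4, 1, 0.3, 4, 0.1, 3)
iv <- c(1, 2, 2, 1, 3, 2, 1, 3, 2, 3)
dat <- data.frame(hh_weight, dv, iv)

# Number of households the sample stands for (sum of the weights)
n_represented <- sum(dat$hh_weight)

# Unweighted mean of dv per level of iv
unweighted_mean <- tapply(dat$dv, dat$iv, mean)

# Weighted mean of dv per level of iv (assumption: hh_weight is the weight)
weighted_mean <- sapply(
  split(dat, dat$iv),
  function(d) weighted.mean(d$dv, d$hh_weight)
)

print(n_represented)
print(rbind(unweighted_mean, weighted_mean))
```

Comparing the two rows shows how much the weights shift each group mean in this toy data.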
Statistical analysis (here: anova)
(Usually I run pre-tests before the ANOVA or Welch test/Welch ANOVA, and post-hoc tests afterwards.)
anova_dv_kat <- aov(iv~dv_kat,data=dat)
summary(anova_dv_kat)
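One possible way to bring the weight into the ANOVA is the `weights` argument that `aov()` passes on to `lm()`. This is only a sketch under assumptions: it models `dv_log` by `iv` as a factor (not the formula used above), and it treats `hh_weight` as a weighted-least-squares weight, so the group means are weighted but the standard errors and p-values do not account for the survey design.

```r
# Sample data from above, repeated so the snippet runs on its own
hh_weight <- c(1.5, 10.4, 2.4, 5.1, 8.4, 4.7, 3.1, 7.4, 7.9, 11.1)
dv <- c(1, 0.5, 0.9, 2, 4, 1, 0.3, 4, 0.1, 3)
iv <- c(1, 2, 2, 1, 3, 2, 1, 3, 2, 3)
dat <- data.frame(hh_weight, dv_log = log10(dv), iv = factor(iv))

# Unweighted one-way ANOVA for comparison
anova_unweighted <- aov(dv_log ~ iv, data = dat)

# Weighted fit: weights are handed through to lm() (assumption: hh_weight
# is used as a least-squares weight, not as a full survey design)
anova_weighted <- aov(dv_log ~ iv, data = dat, weights = hh_weight)

summary(anova_unweighted)
summary(anova_weighted)
```

With the default treatment contrasts, the intercept of the weighted fit equals the weighted mean of `dv_log` in the first `iv` group, which is a quick way to check that the weights were picked up.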
Plot
text_labels1 <- group_by(
dat %>% drop_na(dv_kat, iv, dv),
iv
) %>% summarise(
textlabel_top = paste("n ==", format(n(), big.mark = " ",justify = "left")),
textlabel_mid = paste("µ == ", sprintf('%.2f',mean(dv))), # sprintf() has no justify argument
textlabel_bot = paste("µ[log]"," == ", sprintf('%.2f',mean(dv_log))),
y_top = 200,
y_mid = 120,
y_bot = 80
)
# plot
ggplot(dat,mapping = aes(
x = as.numeric(iv),
y = dv
)) +
geom_text(
data = text_labels1,
aes(label = textlabel_top,color=NULL,
y=y_top),show.legend = FALSE,size=4,parse = TRUE
) +
geom_text(
data = text_labels1,
aes(label = textlabel_mid,
color=NULL,
y=y_mid),show.legend = FALSE,size=4,parse = TRUE
) +
geom_text(
data = text_labels1,
aes(label = textlabel_bot,
color=NULL,
y=y_bot),show.legend = FALSE,size=4,parse = TRUE
) +
geom_hline(yintercept = 1e+00,linetype="dotted")+ # horizontal line marks 100% agreement of the access times
geom_jitter(position=position_jitter(0.25),aes(color = factor(dv_kat)), alpha=0.6)+
stat_boxplot(aes(group=iv),geom ='errorbar',width=0.25)+
geom_boxplot(aes(group=iv),outlier.color = "transparent",fill="transparent") +
scale_color_manual(values = c("red4", "red3", "orange", "green3", "green4"), labels = c("very poor", "poor", "moderate", "good", "very good"),guide = guide_legend(shape = c(rep(16, 7), NA, NA))) +
scale_x_continuous(breaks = 1:3, name = "IV", labels = c("1", "2", "3")) +
scale_y_continuous(trans='log10', limits = c(0.006,200))+
scale_fill_manual(breaks = 1:3, name = "IV", labels = c("1", "2", "3")) +
labs(x = "IV",
y = "XYZ-Fakt (log)",
color = "XYZ:") +
theme_bw()+
theme(legend.position = "bottom",legend.background = element_rect(color = "black",fill = "transparent"),legend.key.size = unit(0.3,"cm"))
I hope I have made this clear enough.