 # include survey weight

Hi!

How can I include an extrapolation factor to my data? This factor already exists and it says, for how many households one single household stands. After doing that extrapolation, I do some statistical analysis like anova and welch-test including pre- and post-hoc-tests, and for depicting the results a ggplot.

My real data set includes more than 17000 households, which represent about 3 millions of households.

Here is a samle data:

``````library(tidyverse)
``````

## Data

``````hh <- c(1,2,3,4,5,6,7,8,9,10) # household
hh_weight <- c(1.5,10.4,2.4,5.1,8.4,4.7,3.1,7.4,7.9,11.1) # weight of the single household
dv <- c(1.,0.5,0.9,2,4,1,0.3,4,0.1,3) # dependent variable
dv_log <- log10(dv) # log10 of the dependent variable
iv <- c(1,2,2,1,3,2,1,3,2,3) # independent variable
dat <- data.frame(hh,hh_weight,dv,dv_log,iv) # build data frame
dat <- mutate(dat,dv_kat = cut(dv,breaks =  c(0,0.3,0.5,0.7,0.9,1.12,1.43,2.01,3.31,105),labels = c("5","4","3","2","1","2","3","4","5"))) # categories of dependent variable
``````

## Statistical analysis (here: anova)

Usually, I do before the anova or welch-test/welch-anova pre-tests and afterwards post-hoc-tests)

``````anova_dv_kat <- aov(iv~dv_kat,data=dat)
summary(anova_dv_kat)
``````

## Plot

``````text_labels1 <- group_by(
dat %>% drop_na(dv_kat, iv, dv),
iv
) %>% summarise(
textlabel_top = paste("n ==", format(n(), big.mark = " ",justify = "left")),
textlabel_mid = paste("µ == ", sprintf('%.2f',mean(dv),justify = "left")),
textlabel_bot = paste("µ[log]"," == ", sprintf('%.2f',mean(dv_log),justify = "left")),
y_top = 200,
y_mid = 120,
y_bot = 80
)

# plotten
ggplot(dat,mapping = aes(
x = as.numeric(iv),
y = dv
)) +
geom_text(
data = text_labels1,
aes(label = textlabel_top,color=NULL,
y=y_top),show.legend = FALSE,size=4,parse = TRUE
) +
geom_text(
data = text_labels1,
aes(label = textlabel_mid,
color=NULL,
y=y_mid),show.legend = FALSE,size=4,parse = TRUE
) +
geom_text(
data = text_labels1,
aes(label = textlabel_bot,
color=NULL,
y=y_bot),show.legend = FALSE,size=4,parse = TRUE
) +
geom_hline(yintercept = 1e+00,linetype="dotted")+ # horizontale Linie zeigt die 100%ige Übereinstimmung der Zugangszeiten an
geom_jitter(position=position_jitter(0.25),aes(color = factor(dv_kat)), alpha=0.6)+
stat_boxplot(aes(group=iv),geom ='errorbar',width=0.25)+
geom_boxplot(aes(group=iv),outlier.color = "transparent",fill="transparent") +
scale_color_manual(values = c("red4", "red3", "orange", "green3", "green4"), labels = c("sehr schlecht", "schlecht", "mäßig", "gut", "sehr gut"),guide = guide_legend(shape = c(rep(16, 7), NA, NA))) +
scale_x_continuous(breaks = 1:3, name = "IV", labels = c("1", "2", "3")) +
scale_y_continuous(trans='log10', limits = c(0.006,200))+
scale_fill_manual(breaks = 1:3, name = "IV", labels = c("1", "2", "3")) +
labs(x = "IV",
y = "XYZ-Fakt (log)",
color = "XYZ:") +
theme_bw()+
theme(legend.position = "bottom",legend.background = element_rect(color = "black",fill = "transparent"),legend.key.size = unit(0.3,"cm"))
``````

Hope, I made it clear enough.

Weights can be specified by a `weights` argument, but should not be used with an `Error` term, and are incompletely supported (e.g., not by `model.tables` )

Thank you for that easy solution! Is there any possibility to check, if R really used the weights in the equations? In my ggplot, it still shows me (at least in my label) the number of the households surveyed and not the households including the weights.

I suppose run the code as you did without the weights and compare that analysis to the one with the weights included...

The `weights` function didn't change anything, so I installed the `survey` package. After subsetting the weighted data, I had to change my `hist` functions into `svyhist` . Do you know, how I can do statistical analysis with that package? The normal `leveneTest()` , `describeBy()` , `aov()` , `oneway.test()` , `pairwise.t.test()` , `games_howell_test()` , and my ggplot didn't work anymore.

I am suprised by this statement, as I can see that weights param would influence aov

``````op <- options(contrasts = c("contr.helmert", "contr.poly"))
( npk.aov <- aov(yield ~ block + N*P*K, npk) )

# add some arbitrary weights
npk\$w <- 1:nrow(npk)
#try the weights
( npk.aov2 <- aov(yield ~ block + N*P*K, npk, weights=w) )

summary(npk.aov)
summary(npk.aov2)
Df Sum Sq Mean Sq F value  Pr(>F)
block        5  343.3   68.66   4.447 0.01594 *
N            1  189.3  189.28  12.259 0.00437 **
P            1    8.4    8.40   0.544 0.47490
K            1   95.2   95.20   6.166 0.02880 *
N:P          1   21.3   21.28   1.378 0.26317
N:K          1   33.1   33.14   2.146 0.16865
P:K          1    0.5    0.48   0.031 0.86275
Residuals   12  185.3   15.44
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(npk.aov2)
Df Sum Sq Mean Sq F value  Pr(>F)
block        5   4628   925.6   8.761 0.00107 **
N            1   1537  1537.5  14.553 0.00246 **
P            1    253   252.9   2.393 0.14780
K            1   1140  1140.0  10.791 0.00652 **
N:P          1    308   307.9   2.915 0.11350
N:K          1    376   375.8   3.558 0.08370 .
P:K          1    331   330.8   3.132 0.10217
Residuals   12   1268   105.6
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
``````

In my example, nothing changes, when I only add the `weights` parameter.

really ?

``````library(tidyverse)
hh <- c(1,2,3,4,5,6,7,8,9,10) # household
hh_weight <- c(1.5,10.4,2.4,5.1,8.4,4.7,3.1,7.4,7.9,11.1) # weight of the single household
dv <- c(1.,0.5,0.9,2,4,1,0.3,4,0.1,3) # dependent variable
dv_log <- log10(dv) # log10 of the dependent variable
iv <- c(1,2,2,1,3,2,1,3,2,3) # independent variable
dat <- data.frame(hh,hh_weight,dv,dv_log,iv) # build data frame
dat <- mutate(dat,dv_kat = cut(dv,breaks =  c(0,0.3,0.5,0.7,0.9,1.12,1.43,2.01,3.31,105),labels = c("5","4","3","2","1","2","3","4","5")))

anova_dv_kat <- aov(iv~dv_kat,data=dat)
summary(anova_dv_kat)

anova_dv_kat2 <- aov(iv~dv_kat,data=dat,weights=hh_weight)
summary(anova_dv_kat2)
``````

This portion of your data shows differences from applying the weights at least.

Oh, it changes, at least in the aov. But it doesn't show it in the number of observations and it doesn't seem to make sense.

Here the result of my real unweighted data:

``````               Df Sum Sq Mean Sq F value Pr(>F)
hh_wohnbdl      8   38.5   4.818   36.75 <2e-16 ***
Residuals   16434 2154.6   0.131
---
``````

And the result of my weighted data:

``````               Df Sum Sq Mean Sq F value Pr(>F)
hh_wohnbdl      8   7292   911.5   34.23 <2e-16 ***
Residuals   16434 437642    26.6
---
``````

And the `weights` parameter doesn't work with the other tests and is there a possibility to bring it into the ggplot?

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.