include survey weight

Hi!

How can I include an extrapolation factor to my data? This factor already exists and it says, for how many households one single household stands. After doing that extrapolation, I do some statistical analysis like anova and welch-test including pre- and post-hoc-tests, and for depicting the results a ggplot.

My real data set includes more than 17000 households, which represent about 3 millions of households.

Here is a samle data:

library(tidyverse)

Data

hh <- c(1,2,3,4,5,6,7,8,9,10) # household
hh_weight <- c(1.5,10.4,2.4,5.1,8.4,4.7,3.1,7.4,7.9,11.1) # weight of the single household
dv <- c(1.,0.5,0.9,2,4,1,0.3,4,0.1,3) # dependent variable
dv_log <- log10(dv) # log10 of the dependent variable
iv <- c(1,2,2,1,3,2,1,3,2,3) # independent variable
dat <- data.frame(hh,hh_weight,dv,dv_log,iv) # build data frame
dat <- mutate(dat,dv_kat = cut(dv,breaks =  c(0,0.3,0.5,0.7,0.9,1.12,1.43,2.01,3.31,105),labels = c("5","4","3","2","1","2","3","4","5"))) # categories of dependent variable

Statistical analysis (here: anova)

Usually, I do before the anova or welch-test/welch-anova pre-tests and afterwards post-hoc-tests)

anova_dv_kat <- aov(iv~dv_kat,data=dat)
summary(anova_dv_kat)

Plot

text_labels1 <- group_by(
  dat %>% drop_na(dv_kat, iv, dv),
  iv
) %>% summarise(
  textlabel_top = paste("n ==", format(n(), big.mark = " ",justify = "left")),
  textlabel_mid = paste("µ == ", sprintf('%.2f',mean(dv),justify = "left")),
  textlabel_bot = paste("µ[log]"," == ", sprintf('%.2f',mean(dv_log),justify = "left")),
  y_top = 200,
  y_mid = 120,
  y_bot = 80
)

# plotten
ggplot(dat,mapping = aes(
  x = as.numeric(iv),
  y = dv
)) +
  geom_text(
    data = text_labels1,
    aes(label = textlabel_top,color=NULL,
        y=y_top),show.legend = FALSE,size=4,parse = TRUE
  ) +
  geom_text(
    data = text_labels1,
    aes(label = textlabel_mid,
        color=NULL,
        y=y_mid),show.legend = FALSE,size=4,parse = TRUE
  ) +
  geom_text(
    data = text_labels1,
    aes(label = textlabel_bot,
        color=NULL,
        y=y_bot),show.legend = FALSE,size=4,parse = TRUE
  ) +
  geom_hline(yintercept = 1e+00,linetype="dotted")+ # horizontale Linie zeigt die 100%ige Übereinstimmung der Zugangszeiten an
  geom_jitter(position=position_jitter(0.25),aes(color = factor(dv_kat)), alpha=0.6)+
  stat_boxplot(aes(group=iv),geom ='errorbar',width=0.25)+
  geom_boxplot(aes(group=iv),outlier.color = "transparent",fill="transparent") +
  scale_color_manual(values = c("red4", "red3", "orange", "green3", "green4"), labels = c("sehr schlecht", "schlecht", "mäßig", "gut", "sehr gut"),guide = guide_legend(shape = c(rep(16, 7), NA, NA))) +
  scale_x_continuous(breaks = 1:3, name = "IV", labels = c("1", "2", "3")) +
  scale_y_continuous(trans='log10', limits = c(0.006,200))+
  scale_fill_manual(breaks = 1:3, name = "IV", labels = c("1", "2", "3")) +
  labs(x = "IV",
       y = "XYZ-Fakt (log)",
       color = "XYZ:") + 
  theme_bw()+
  theme(legend.position = "bottom",legend.background = element_rect(color = "black",fill = "transparent"),legend.key.size = unit(0.3,"cm"))

Hope, I made it clear enough.

Weights can be specified by a weights argument, but should not be used with an Error term, and are incompletely supported (e.g., not by model.tables )

The weights function didn't change anything, so I installed the survey package. After subsetting the weighted data, I had to change my hist functions into svyhist . Do you know, how I can do statistical analysis with that package? The normal leveneTest() , describeBy() , aov() , oneway.test() , pairwise.t.test() , games_howell_test() , and my ggplot didn't work anymore.

I am suprised by this statement, as I can see that weights param would influence aov

op <- options(contrasts = c("contr.helmert", "contr.poly"))
( npk.aov <- aov(yield ~ block + N*P*K, npk) )


# add some arbitrary weights
npk$w <- 1:nrow(npk)
#try the weights 
( npk.aov2 <- aov(yield ~ block + N*P*K, npk, weights=w) )

summary(npk.aov)
summary(npk.aov2)
Df Sum Sq Mean Sq F value  Pr(>F)   
block        5  343.3   68.66   4.447 0.01594 * 
  N            1  189.3  189.28  12.259 0.00437 **
  P            1    8.4    8.40   0.544 0.47490   
K            1   95.2   95.20   6.166 0.02880 * 
  N:P          1   21.3   21.28   1.378 0.26317   
N:K          1   33.1   33.14   2.146 0.16865   
P:K          1    0.5    0.48   0.031 0.86275   
Residuals   12  185.3   15.44                   
---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(npk.aov2)
Df Sum Sq Mean Sq F value  Pr(>F)   
block        5   4628   925.6   8.761 0.00107 **
  N            1   1537  1537.5  14.553 0.00246 **
  P            1    253   252.9   2.393 0.14780   
K            1   1140  1140.0  10.791 0.00652 **
  N:P          1    308   307.9   2.915 0.11350   
N:K          1    376   375.8   3.558 0.08370 . 
P:K          1    331   330.8   3.132 0.10217   
Residuals   12   1268   105.6                   
---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In my example, nothing changes, when I only add the weights parameter.

really ?

library(tidyverse)
hh <- c(1,2,3,4,5,6,7,8,9,10) # household
hh_weight <- c(1.5,10.4,2.4,5.1,8.4,4.7,3.1,7.4,7.9,11.1) # weight of the single household
dv <- c(1.,0.5,0.9,2,4,1,0.3,4,0.1,3) # dependent variable
dv_log <- log10(dv) # log10 of the dependent variable
iv <- c(1,2,2,1,3,2,1,3,2,3) # independent variable
dat <- data.frame(hh,hh_weight,dv,dv_log,iv) # build data frame
dat <- mutate(dat,dv_kat = cut(dv,breaks =  c(0,0.3,0.5,0.7,0.9,1.12,1.43,2.01,3.31,105),labels = c("5","4","3","2","1","2","3","4","5")))

anova_dv_kat <- aov(iv~dv_kat,data=dat)
summary(anova_dv_kat)

anova_dv_kat2 <- aov(iv~dv_kat,data=dat,weights=hh_weight)
summary(anova_dv_kat2)

This portion of your data shows differences from applying the weights at least.

Oh, it changes, at least in the aov. But it doesn't show it in the number of observations and it doesn't seem to make sense.

Here the result of my real unweighted data:

               Df Sum Sq Mean Sq F value Pr(>F)    
hh_wohnbdl      8   38.5   4.818   36.75 <2e-16 ***
Residuals   16434 2154.6   0.131                   
---

And the result of my weighted data:

               Df Sum Sq Mean Sq F value Pr(>F)    
hh_wohnbdl      8   7292   911.5   34.23 <2e-16 ***
Residuals   16434 437642    26.6                   
---

And the weights parameter doesn't work with the other tests and is there a possibility to bring it into the ggplot?

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.

I suppose run the code as you did without the weights and compare that analysis to the one with the weights included...

Thank you for that easy solution! Is there any possibility to check, if R really used the weights in the equations? In my ggplot, it still shows me (at least in my label) the number of the households surveyed and not the households including the weights.