How can I apply frequency weights (like SPSS) in R?

Hi,
I would like to apply so-called frequency weights in R, the way it is done in SPSS.
In SPSS it is just "weight by" a variable and that's it.
Is it possible to do something similar in R?
I want to do this because I have aggregated data, so weights should be applied.

I have read this:
https://stackoverflow.com/questions/67703362/how-can-i-apply-frequency-weights-like-spss-in-rstudio

https://www.reddit.com/r/rstats/comments/2z8qhb/how_do_i_weight_by_spss_function_from_r/

https://stackoverflow.com/questions/7026549/weight-data-with-r-part-ii/7026980#7026980

but it is still unclear to me how to do it.
Any help will be much appreciated.

See the FAQ: How to do a minimal reproducible example reprex for beginners. Relatively few members here use SPSS so the question is obscure. Some representative data with the function to be applied and the identification of the weights variables will help us to help you find the right argument or an alternative function with weightings.

Hi again,

The data are aggregated data and this is why it needs weighting.

temp <- structure(list(Existence = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("not_exists", 
"exists"), class = "factor"), Group = structure(c(2L, 2L, 1L, 
1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("Group_1", 
"Group_2"), class = "factor"), Sex = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("F",  
"M"), class = "factor"), Side = structure(c(1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("right", 
"left"), class = "factor"), r1 = c(35, 77, 28, 70, 37, 75, 24, 
74, 23, 27, 21, 31, 20, 30, 18, 34), r2 = c(21, 91, 7, 91, 17, 
95, 8, 90, 18, 32, 5, 47, 19, 31, 7, 45), r3 = c(62, 50, 41, 
57, 47, 65, 35, 63, 28, 22, 29, 23, 28, 22, 31, 21)), class = "data.frame", row.names = c(NA, 
-16L), variable.labels = structure(character(0), .Names = character(0)), codepage = 65001L)

I would like to calculate the odds ratio (OR), its confidence interval and the p-value, after first weighting the cases, I think separately by each of the variables r1, r2 and r3. From what I have read, SPSS weights by frequency, if I am not mistaken.
I would like to place the results in a data frame.
For example, the results calculated in SPSS (OR, CI, p-value) comparing Group_1 and Group_2 for females, right side, are as follows:

[image]

I have a fair bit of knowledge of both SPSS and R. In SPSS you have the luxury of specifying a weight variable once; SPSS will then typically apply the weights in every subsequent run and say so on the printouts.

In R it is a completely different case. Not every function accepts weights, because that was never a general requirement. So you will have to use specific functions, or specific versions of functions, that can take a weight variable (many packages offer this for some functions).
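As one illustration of a function that does accept weights, base R's glm() has a weights argument. The sketch below is only an assumption about how it could be applied to the question's data: it retypes the four rows of temp for Sex "F" and Side "right", uses r1 as the frequency count, and fits a logistic regression. The names d, fit and res are made up for this example, and the intervals are Wald intervals, which may or may not be what SPSS reports.

```r
# Four rows of `temp` for Sex == "F" and Side == "right", with r1 as the count.
d <- data.frame(
  Existence = factor(c("exists", "not_exists", "exists", "not_exists"),
                     levels = c("not_exists", "exists")),
  Group = factor(c("Group_2", "Group_2", "Group_1", "Group_1"),
                 levels = c("Group_1", "Group_2")),
  r1 = c(35, 77, 28, 70)
)

# Logistic regression with the cell counts passed as frequency weights.
fit <- glm(Existence ~ Group, family = binomial, data = d, weights = r1)

# Odds ratio (Group_2 vs Group_1), Wald 95% CI and p-value in a data frame.
est <- coef(summary(fit))["GroupGroup_2", ]
ci  <- confint.default(fit)["GroupGroup_2", ]
res <- data.frame(
  OR      = exp(est[["Estimate"]]),
  ci_low  = exp(ci[[1]]),
  ci_high = exp(ci[[2]]),
  p_value = est[["Pr(>|z|)"]]
)
res
```

Because the weights here are whole-number counts and the binomial family has a fixed dispersion, this fit should agree with fitting the same model to one row per case.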

The other option is to duplicate each row as many times as its weight says and then run the analysis as if the data were "unweighted". Some people do this as a way to deal with it, but it can become cumbersome if you have really large or complicated weights.
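A minimal base R sketch of that row-duplication idea, on a tiny made-up data frame (agg, outcome and n are hypothetical names, not taken from the question's data):

```r
# Aggregated data: one row per category plus a whole-number count.
agg <- data.frame(outcome = c("exists", "not_exists"), n = c(3, 2))

# Repeat each row index n times, then drop the count column.
expanded <- agg[rep(seq_len(nrow(agg)), times = agg$n), "outcome", drop = FALSE]
rownames(expanded) <- NULL

# The expanded data now have one row per case.
nrow(expanded)
table(expanded$outcome)
```

This relies on rep() with a times vector, so it only makes sense when the counts are non-negative whole numbers.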


Thank you @GreyMerchant ,
Could you please explain how I can convert my data to "unweighted" data?

If the weights are whole numbers, then use tidyr::uncount:

library(tidyr)

(df <- tibble(x = c("a", "b"), n = c(2, 3)))
uncount(df, n)

This is not working on my data. My data (temp) are not weighted but aggregated. What I meant was that I want to turn my aggregated data into non-aggregated data.

The example @nirgrahamuk posted would work. Alternatively, something like this will work:

library(tidyverse)

df <- data.frame(person_id = c(0,1,2,3,4,5), weight = c(1,1,2,1,1,3))

df
#>   person_id weight
#> 1         0      1
#> 2         1      1
#> 3         2      2
#> 4         3      1
#> 5         4      1
#> 6         5      3

df %>%
  uncount(weight) %>%
  rename(weight_person_id = person_id)
#>   weight_person_id
#> 1                0
#> 2                1
#> 3                2
#> 4                2
#> 5                3
#> 6                4
#> 7                5
#> 8                5
#> 9                5

Created on 2022-01-26 by the reprex package (v2.0.1)

As you can see, the second printout repeats each ID a number of times equal to its weight. You can then perform a lookup to add the remaining information back to the rows, and then you will have the new data frame. So instead of one row with a weight of 3, you now have 3 rows of weight 1 each, which together represent the original weighted row. As nirgrahamuk mentioned, this only works for whole numbers. So if you have, say, 2.51, you will have to convert all weights into whole numbers (e.g. 2.51 x 100 = 251, so you will have 251 rows created from the original one row with a weight of 2.51). This grows quickly when you have a really messy or involved weighting scheme, as you can end up with a lot of rows.

Can you please explain it based on my data ?

Can you explain your data? How should the r1, r2, r3 values be interpreted, and what do they mean?

    Existence   Group Sex  Side r1 r2 r3
1      exists Group_2   F right 35 21 62

Of course, here you are: the data are about three symptoms (variables r1, r2, r3) in the right and left knee (Side), divided into two groups (Group_1: muscle type, Group_2: bone type), with the factor variable Sex (F and M) and the variable Existence indicating the presence or absence of a particular symptom within each combination of Sex, Side and Group.
The data are aggregated, summarised data, i.e. frequency counts, not individual patients' detailed records.

Ok, then from your temp data.frame :

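library(tidyr)

# total number of cases across r1, r2 and r3
(totsum1 <- sum(temp$r1+temp$r2+temp$r3))

# one row per symptom (name) and count (value)
(temp2 <- pivot_longer(data=temp,
                      cols=c(r1,r2,r3)))

# repeat each row `value` times: one row per case
(temp3 <- uncount(temp2,value))
nrow(temp3) # should equal totsum1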

This is very helpful, thank you. Now my question is how to calculate OR like this:

[image]

This would be for Group_1 and Group_2 for Female, right hand side.
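For a single 2x2 comparison like this one, the counts can also be used directly without expanding the rows. The sketch below is an assumption about what is wanted, not a reproduction of the SPSS output (which is not shown in the thread): it retypes the four r1 counts for Sex "F" and Side "right", builds the table with xtabs(), and computes the odds ratio for Group_2 versus Group_1 with a Wald 95% interval and an uncorrected Pearson chi-square p-value. The names cnt, tab and res are made up for the example.

```r
# r1 counts from `temp` for Sex == "F" and Side == "right".
cnt <- data.frame(
  Existence = c("exists", "not_exists", "exists", "not_exists"),
  Group     = c("Group_2", "Group_2", "Group_1", "Group_1"),
  r1        = c(35, 77, 28, 70)
)

# xtabs() sums the counts into a Group x Existence table.
tab <- xtabs(r1 ~ Group + Existence, data = cnt)

# Odds of "exists" in Group_2 divided by odds of "exists" in Group_1.
odds_g2 <- as.numeric(tab["Group_2", "exists"] / tab["Group_2", "not_exists"])
odds_g1 <- as.numeric(tab["Group_1", "exists"] / tab["Group_1", "not_exists"])
or <- odds_g2 / odds_g1

# Wald interval on the log scale: SE = sqrt of the summed reciprocal counts.
se <- sqrt(sum(1 / tab))
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se)

# Pearson chi-square test without continuity correction.
p <- chisq.test(tab, correct = FALSE)$p.value

res <- data.frame(OR = or, ci_low = ci[1], ci_high = ci[2], p_value = p)
res
```

Swapping which group is the reference inverts the odds ratio, so the direction should be checked against the SPSS table before comparing numbers.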

There is a convenient package you can use for odds ratios: oddsratio. See an example below. Some additional details here: Introducing R package 'oddsratio' | R-bloggers

## Example with glm()
library(oddsratio)
# load data (source: http://www.ats.ucla.edu/stat/r/dae/logit.htm) and
# fit model
fit_glm <- glm(admit ~ gre + gpa + rank,
               data = data_glm,
               family = "binomial"
) # fit model
# Calculate OR for specific increment step of continuous variable
or_glm(data = data_glm, model = fit_glm, incr = list(gre = 380, gpa = 5))
#>   predictor oddsratio ci_low (2.5) ci_high (97.5)          increment
#> 1       gre     2.364        1.054          5.396                380
#> 2       gpa    55.712        2.229       1511.282                  5
#> 3     rank2     0.509        0.272          0.945 Indicator variable
#> 4     rank3     0.262        0.132          0.512 Indicator variable
#> 5     rank4     0.212        0.091          0.471 Indicator variable
# Calculate OR and change the confidence interval level
or_glm(
  data = data_glm, model = fit_glm,
  incr = list(gre = 380, gpa = 5), ci = .70
)
#>   predictor oddsratio ci_low (15) ci_high (85)          increment
#> 1       gre     2.364       1.540        3.647                380
#> 2       gpa    55.712      10.084      314.933                  5
#> 3     rank2     0.509       0.366        0.706 Indicator variable
#> 4     rank3     0.262       0.183        0.374 Indicator variable
#> 5     rank4     0.212       0.136        0.325 Indicator variable
## Example with MASS:glmmPQL()
# load data
library(MASS)
data(bacteria)
fit_glmmPQL <- glmmPQL(y ~ trt + week,
                       random = ~ 1 | ID,
                       family = binomial, data = bacteria,
                       verbose = FALSE
)
# Apply function
or_glm(data = bacteria, model = fit_glmmPQL, incr = list(week = 5))
#> Warning: No confident interval calculation possible for 'glmmPQL' models.
#>   predictor oddsratio ci_low ci_high          increment
#> 1   trtdrug     0.296     NA      NA Indicator variable
#> 2  trtdrug+     0.454     NA      NA Indicator variable
#> 3      week     0.485     NA      NA                  5

Created on 2022-01-26 by the reprex package (v2.0.0)

Can you please show me how to do it on my data ?

I found this on SO that says it is rather difficult to do it in R:

https://stackoverflow.com/questions/55543139/using-a-column-that-contains-a-frequency-weight-count-in-r

Hi Andrzej,

I was an SPSS user who fortunately switched to R a few years ago.

The SPSS weighting function you mention is wrong because it replicates or "clones" observations. Of course it is handy, but it is incorrect for estimating confidence intervals: when the cases are cloned, the variance of the variables is false, and therefore the statistical inference is incorrect.

If you are going to work with weighted data seriously, consider spending some time learning how to use the survey package. It is fine.

For your specific example (with only categorical variables) I think you can get log odds ratios for weighted data using weighted logistic regression with the survey package.

