Chi-square analysis with missing data, what should I do?

Hello,

Here is how I ran the chi-square analysis without handling the missing values in my smoking status variable (I did not get any error message). Does this code automatically take care of the missing data, so that I can report its output? If not, how should I handle the missingness?

data_1$disease <- cut(data_1$Post.score, breaks = c(0, 5, 100), labels = c("none", "disease"))
chisq.test(data_1$Smoking.Status, data_1$disease, correct = TRUE)

Here is my data:

structure(list(Smoking.Status = c("smoking", "smoking", "smoking", 
"smoking", "smoking", "non-smoking", "smoking", "non-smoking", 
"non-smoking", "non-smoking", "smoking", "non-smoking", "non-smoking", 
"smoking", "non-smoking", "smoking", "smoking", "non-smoking", 
"non-smoking", "", "", "", "", "", "", "", "non-smoking", "", 
"", "non-smoking", "smoking", "non-smoking", "non-smoking", "smoking", 
"non-smoking", "non-smoking", "non-smoking", "non-smoking", "non-smoking", 
"", "non-smoking", "smoking", "non-smoking", "non-smoking", "smoking", 
"non-smoking", "smoking"), Post.score = c(1.309408341, 7.213930348, 
25.26690391, 12.92719168, 8.702064897, 5.556698909, 16.09399246, 
8.097784568, 4.505119454, 1.120709783, 1.708011387, 5.040871935, 
0.937744204, 6.898584906, 16.31768953, 5.823792932, 3.003754693, 
1.416005149, 44.515357, 4.358683314, 5.233572398, 0.376175549, 
38.43137255, 22.97383535, 1.367088608, 7.234251969, 8.444902163, 
5.696202532, 6.324262169, 3.12922542, 8.610271903, 53.125, 4.962950198, 
7.529843893, 2.871287129, 3.155728333, 15.67839196, 3.181336161, 
3.718393654, 3.9408867, 29.10839161, 21.28337983, 7.73073889, 
12.6340882, 18.53658537, 17.49837978, 15.8557047)), row.names = c(NA, 
47L), class = "data.frame")

chisq.test() treats the empty strings in Smoking.Status as a third level of that variable, so they are counted as a category rather than handled as missing data. You can see this by running

str(Test1)

on the Test1 object in my code. I replaced the empty strings with the word "blank" to show that the test result is the same. I also compared filtering out rows with empty strings and replacing the empty strings with NA to show that those two results are the same.

data_1 <- structure(list(Smoking.Status = c("smoking", "smoking", "smoking", 
                                            "smoking", "smoking", "non-smoking", "smoking", "non-smoking", 
                                            "non-smoking", "non-smoking", "smoking", "non-smoking", "non-smoking", 
                                            "smoking", "non-smoking", "smoking", "smoking", "non-smoking", 
                                            "non-smoking", "", "", "", "", "", "", "", "non-smoking", "", 
                                            "", "non-smoking", "smoking", "non-smoking", "non-smoking", "smoking", 
                                            "non-smoking", "non-smoking", "non-smoking", "non-smoking", "non-smoking", 
                                            "", "non-smoking", "smoking", "non-smoking", "non-smoking", "smoking", 
                                            "non-smoking", "smoking"), 
                         Post.score = c(1.309408341, 7.213930348, 
                                        25.26690391, 12.92719168, 8.702064897, 5.556698909, 16.09399246, 
                                        8.097784568, 4.505119454, 1.120709783, 1.708011387, 5.040871935, 
                                        0.937744204, 6.898584906, 16.31768953, 5.823792932, 3.003754693, 
                                        1.416005149, 44.515357, 4.358683314, 5.233572398, 0.376175549, 
                                        38.43137255, 22.97383535, 1.367088608, 7.234251969, 8.444902163, 
                                        5.696202532, 6.324262169, 3.12922542, 8.610271903, 53.125, 4.962950198, 
                                        7.529843893, 2.871287129, 3.155728333, 15.67839196, 3.181336161, 
                                        3.718393654, 3.9408867, 29.10839161, 21.28337983, 7.73073889, 
                                        12.6340882, 18.53658537, 17.49837978, 15.8557047)), 
                    row.names = c(NA, 47L), class = "data.frame")

# original test
data_1$disease <- cut(data_1$Post.score, breaks = c(0, 5, 100), labels = c("none", "disease"))
Test1 <- chisq.test(data_1$Smoking.Status,data_1$disease,correct=TRUE)
#> Warning in chisq.test(data_1$Smoking.Status, data_1$disease, correct = TRUE):
#> Chi-squared approximation may be incorrect
Test1
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  data_1$Smoking.Status and data_1$disease
#> X-squared = 2.5837, df = 2, p-value = 0.2748
Test1$observed
#>                      data_1$disease
#> data_1$Smoking.Status none disease
#>                          4       6
#>           non-smoking   10      12
#>           smoking        3      12

# New column with the word "blank" replacing ""
data_1$Smoking_3level <- ifelse(data_1$Smoking.Status == "","blank",data_1$Smoking.Status)
Test2 <- chisq.test(data_1$Smoking_3level,data_1$disease,correct=TRUE)
#> Warning in chisq.test(data_1$Smoking_3level, data_1$disease, correct = TRUE):
#> Chi-squared approximation may be incorrect
#Same result as Test1
Test2
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  data_1$Smoking_3level and data_1$disease
#> X-squared = 2.5837, df = 2, p-value = 0.2748
Test2$observed
#>                      data_1$disease
#> data_1$Smoking_3level none disease
#>           blank          4       6
#>           non-smoking   10      12
#>           smoking        3      12

# Filter out rows with Smoking.Status == ""
data_1_filtered <- data_1[data_1$Smoking.Status != "",]
nrow(data_1_filtered)
#> [1] 37
Test3 <- chisq.test(data_1_filtered$Smoking.Status,data_1_filtered$disease,correct=TRUE)
Test3
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  data_1_filtered$Smoking.Status and data_1_filtered$disease
#> X-squared = 1.5418, df = 1, p-value = 0.2144
Test3$observed
#>                               data_1_filtered$disease
#> data_1_filtered$Smoking.Status none disease
#>                    non-smoking   10      12
#>                    smoking        3      12

#Replace "" with NA.
data_1$SmokingNA <- ifelse(data_1$Smoking.Status == "", NA, data_1$Smoking.Status)
Test4 <- chisq.test(data_1$SmokingNA,data_1$disease,correct=TRUE)
#Same result as Test3
Test4
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  data_1$SmokingNA and data_1$disease
#> X-squared = 1.5418, df = 1, p-value = 0.2144
Test4$observed
#>                 data_1$disease
#> data_1$SmokingNA none disease
#>      non-smoking   10      12
#>      smoking        3      12

Created on 2021-11-30 by the reprex package (v2.0.1)
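
If the empty strings come from reading a file, they can be marked as missing at import time instead of afterwards. This is a minimal sketch, assuming the data were read from a CSV file (the thread does not say how they were loaded). The `csv_text` below is a made-up four-row stand-in, not the real file.

```r
# Hypothetical stand-in for the real file: two rows have no smoking status.
csv_text <- "Smoking.Status,Post.score
smoking,1.31
,4.36
non-smoking,5.56
,5.23"

# Default import: blank character fields are not NA
# (they arrive as empty strings, or as a "" factor level before R 4.0).
raw <- read.csv(text = csv_text)

# Telling read.csv() that "" means missing turns them into NA on import.
marked <- read.csv(text = csv_text, na.strings = c("", "NA"))

sum(raw$Smoking.Status == "")       # blank entries, counted as a category
sum(is.na(marked$Smoking.Status))   # the same entries, now NA
```

With the blanks stored as NA, chisq.test() drops those rows when it builds the table, as Test4 shows.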

FJCC, we were working on solutions at the same time. Yours is more informative, but I'll post mine as well. Often with these kinds of things, I try the test by explicitly removing the "missing" cases to see if the answer is the same. In this case, it isn't, for the reasons FJCC mentions above.

``` r
suppressPackageStartupMessages(library(dplyr))

# create dataframe
data_1 <- structure(list(Smoking.Status = c("smoking", "smoking", "smoking", 
                                            "smoking", "smoking", "non-smoking", "smoking", "non-smoking", 
                                            "non-smoking", "non-smoking", "smoking", "non-smoking", "non-smoking", 
                                            "smoking", "non-smoking", "smoking", "smoking", "non-smoking", 
                                            "non-smoking", "", "", "", "", "", "", "", "non-smoking", "", 
                                            "", "non-smoking", "smoking", "non-smoking", "non-smoking", "smoking", 
                                            "non-smoking", "non-smoking", "non-smoking", "non-smoking", "non-smoking", 
                                            "", "non-smoking", "smoking", "non-smoking", "non-smoking", "smoking", 
                                            "non-smoking", "smoking"), Post.score = c(1.309408341, 7.213930348, 
                                                                                      25.26690391, 12.92719168, 8.702064897, 5.556698909, 16.09399246, 
                                                                                      8.097784568, 4.505119454, 1.120709783, 1.708011387, 5.040871935, 
                                                                                      0.937744204, 6.898584906, 16.31768953, 5.823792932, 3.003754693, 
                                                                                      1.416005149, 44.515357, 4.358683314, 5.233572398, 0.376175549, 
                                                                                      38.43137255, 22.97383535, 1.367088608, 7.234251969, 8.444902163, 
                                                                                      5.696202532, 6.324262169, 3.12922542, 8.610271903, 53.125, 4.962950198, 
                                                                                      7.529843893, 2.871287129, 3.155728333, 15.67839196, 3.181336161, 
                                                                                      3.718393654, 3.9408867, 29.10839161, 21.28337983, 7.73073889, 
                                                                                      12.6340882, 18.53658537, 17.49837978, 15.8557047)), row.names = c(NA, 
                                                                                                                                                        47L), class = "data.frame")
# create categories
data_1$disease <- cut(data_1$Post.score, breaks = c(0, 5, 100), labels = c("none", "disease"))

# chi2 test
chisq.test(data_1$Smoking.Status, data_1$disease, correct=TRUE)
#> Warning in chisq.test(data_1$Smoking.Status, data_1$disease, correct = TRUE):
#> Chi-squared approximation may be incorrect
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  data_1$Smoking.Status and data_1$disease
#> X-squared = 2.5837, df = 2, p-value = 0.2748

# remove cases with missing smoking status
data_1_filtered <- data_1 %>% 
  filter(Smoking.Status != "") 

# chi2 test w/o missing
chisq.test(data_1_filtered$Smoking.Status, data_1_filtered$disease, correct=TRUE)  
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  data_1_filtered$Smoking.Status and data_1_filtered$disease
#> X-squared = 1.5418, df = 1, p-value = 0.2144


# as an aside, you have a dichotomous IV and a continuous DV, so you might try a t-test
suppressPackageStartupMessages(library(rstatix))
data_1 %>% 
  filter(Smoking.Status != "") %>% 
  t_test(Post.score ~ Smoking.Status,
         detailed = TRUE)
#> # A tibble: 1 × 15
#>   estimate estimate1 estimate2 .y.    group1  group2    n1    n2 statistic     p
#> *    <dbl>     <dbl>     <dbl> <chr>  <chr>   <chr>  <int> <int>     <dbl> <dbl>
#> 1    0.771      11.5      10.7 Post.… non-sm… smoki…    22    15     0.219 0.828
#> # … with 5 more variables: df <dbl>, conf.low <dbl>, conf.high <dbl>,
#> #   method <chr>, alternative <chr>
```

Created on 2021-11-30 by the reprex package (v2.0.1)

Hi @FJCC, so there is a difference between keeping or renaming the empty values (Test1 or Test2) and dropping them (Test3 or Test4). Which result is more reliable? It also occurs to me that we could impute the missing data instead of using either option above. Do you think we can do that?

Without knowing your data, I cannot suggest the best response to the missing data. Also, I have no experience with data imputation, so my opinion on that is not worth much.
Do you have any information about why the Smoking.Status is missing? Is that random or is it more likely for smokers or non-smokers?
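
One rough check that needs only the posted data is whether missingness is related to the outcome. The sketch below assumes the counts from Test1$observed: the "missing" row is the blank row, and the "recorded" row pools smokers and non-smokers. The names `miss_tab` and `miss_check` are only illustrative. It cannot say why the values are missing, only whether the disease split looks different in the rows that lack a smoking status.

```r
# Counts taken from Test1$observed in the thread:
# blank status -> 4 none, 6 disease; recorded status -> 13 none, 24 disease.
miss_tab <- matrix(
  c(4, 6,
    13, 24),
  nrow = 2, byrow = TRUE,
  dimnames = list(Smoking.Status = c("missing", "recorded"),
                  disease = c("none", "disease"))
)

# Fisher's exact test, since one of the cells is small.
miss_check <- fisher.test(miss_tab)

miss_tab
miss_check$p.value
```

A large p-value here would not show that the data are missing at random. It would only mean that this small table gives no sign of a link between missingness and disease status.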

I don't think it is random. This is a multi-center study, and most of the missing data come from one specific center. Does this mean I should not remove those data?

I am going to mislead you if I act as if I know how best to handle this. I can help with doing things with R. Making decisions about someone else's data, about which I know nothing, is more than I can comfortably do.

No worries!! Can I ask why you used "correct = TRUE" and not "correct = FALSE" in the chi-squared code? How do I know when to use true vs false?

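The continuity correction is meant for tables where some cells have low counts. The smoking/none cell has only 3 observations, so the correction seems appropriate. Note that chisq.test() applies it only to 2-by-2 tables, which is why Test1 and Test2 (3-by-2 tables) are reported without it even though correct = TRUE was set.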
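
To see what `correct` does, the tests can be rerun directly on the count tables reported in the thread. This is a sketch that assumes the counts from Test3$observed and Test1$observed. The object names are illustrative.

```r
# 2-by-2 table of complete cases (counts from Test3$observed).
tab_2x2 <- matrix(
  c(10, 12,
    3, 12),
  nrow = 2, byrow = TRUE,
  dimnames = list(Smoking.Status = c("non-smoking", "smoking"),
                  disease = c("none", "disease"))
)

with_yates <- chisq.test(tab_2x2, correct = TRUE)
without_yates <- chisq.test(tab_2x2, correct = FALSE)

# The correction shrinks the statistic, making the test more conservative.
with_yates$statistic
without_yates$statistic

# Expected counts: the approximation warning appears when any is below 5.
with_yates$expected

# 3-by-2 table including the blank level (counts from Test1$observed).
tab_3x2 <- rbind(blank = c(4, 6), tab_2x2)

# correct = TRUE is ignored here because the table is not 2-by-2.
# suppressWarnings() hides the small-expected-count warning seen in Test1.
three_level <- suppressWarnings(chisq.test(tab_3x2, correct = TRUE))
three_level$method
```

When expected counts are small, `fisher.test(tab_2x2)` is a common alternative that does not rely on the chi-squared approximation.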

Imputation is useful if you have long surveys and a few individuals fail to answer a question here and there. A respondent might answer 30 out of 31 questions, and that one missing value could otherwise cost you all of the data from that survey. Imputing a small proportion of missing values avoids discarding too much data. In this case, there are only two variables, and roughly 20% of the smoking values (10 of 47) are missing. Imputation here is a terrible idea: you do not have enough data to really know the underlying distribution. All you would be doing by using the mean is cementing in the sampling mistakes and inflating the degrees of freedom to drive a significant outcome.
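
The share of missing values can be checked directly. This is a small sketch that rebuilds the Smoking.Status counts reported in the thread (15 smoking, 22 non-smoking, 10 blank) rather than using the original data frame. The variable names are illustrative.

```r
# Rebuild the Smoking.Status counts reported in the thread.
status <- c(rep("smoking", 15), rep("non-smoking", 22), rep("", 10))

# Mark the empty strings as missing.
status[status == ""] <- NA

n_total <- length(status)            # all rows
n_missing <- sum(is.na(status))      # rows without a smoking status
n_complete <- n_total - n_missing    # rows left for a complete-case test
missing_share <- mean(is.na(status)) # proportion missing

round(100 * missing_share, 1)        # about 21 percent
```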