analysis help--choosing method of analysis

Hi, I am doing my first statistical analysis on my own, and I am trying to study the effects that family, education, and institutions have on attitudes towards premarital sex. I originally thought that I would conduct a contingency table analysis, but when running my code in R I kept getting a message that the chi-squared approximation may be incorrect. So I am wondering whether I should conduct an ANOVA or a regression analysis instead. Does anyone have any help or advice? I can attach the R syntax file from my attempt at the contingency analysis if that would be helpful. Thank you!

The message usually indicates that some of the expected cell counts used in the calculation are very small (typically below 5), due to an insufficiently large number of observations. Beyond that it's hard to say without a reproducible example (see the FAQ: What's a reproducible example (`reprex`) and how do I do one?) and the characteristics of your data: the number of rows and variables, whether you have them in a data frame, and the type of each variable.
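As a rough sketch of how to check this, you can look at the expected counts that `chisq.test()` computes. The table below is entirely made up (the names `attitude` and `education` and all the counts are assumptions, not your data); with cells this sparse, Fisher's exact test is one commonly used alternative.

```r
# Hypothetical 2 x 3 table of counts, invented purely for illustration.
tab <- matrix(c(3, 1, 4,
                2, 5, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(attitude  = c("accept", "reject"),
                              education = c("low", "mid", "high")))

# Expected counts under independence. Cells below about 5 are what
# trigger "Chi-squared approximation may be incorrect".
res <- suppressWarnings(chisq.test(tab))
res$expected

# Fisher's exact test does not rely on the large-sample approximation.
ft <- fisher.test(tab)
ft$p.value
```

If your own table shows many small expected counts, collapsing rare categories or collecting more observations are other options to consider.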

More generally, in this sort of problem you want to identify an outcome, conventionally called y, as a function of some other set of variables, x_1 ... x_n. How each is encoded has a big influence on your choice of tools.

For example, y may be binary: yes/no, TRUE/FALSE, 1/0. A binary variable is a categorical variable that can take on only one of two values. On the other hand, y may be numeric. For example, if you are measuring attitudes toward premarital sex by gathering data on the number of sexual partners a subject reports before a first marriage, if any, you may have numbers ranging from 0 to 32, say (depending on the person, the culture, and other intangibles). Other categorical data may take one of several different values, say flavors of ice cream.
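To make the distinction concrete, here is a small sketch with an invented data frame (the column names `approves`, `partners`, and `flavor` and their values are hypothetical). It shows one way to tell R which columns are categorical by converting them to factors.

```r
# Hypothetical survey data; names and values are made up for illustration.
survey <- data.frame(
  approves = c("yes", "no", "no", "yes"),                 # binary
  partners = c(0, 3, 1, 12),                              # numeric count
  flavor   = c("vanilla", "mint", "mint", "chocolate")    # several categories
)

# Encode the categorical columns as factors so modeling functions
# treat them as categories rather than text.
survey$approves <- factor(survey$approves, levels = c("no", "yes"))
survey$flavor   <- factor(survey$flavor)

str(survey)
nlevels(survey$approves)  # 2
nlevels(survey$flavor)    # 3
```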

Take a look at one of the built-in datasets:

data(mtcars)
str(mtcars)
'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Think about which of these are continuous, which are categorical, and which are binary. In this set of automobile data, miles per gallon, mpg, is a continuous variable, as is horsepower, hp. You might ask whether mpg is affected by hp, and choose a linear regression model:

fit <- lm(mpg ~ hp, data = mtcars)
summary(fit)

Call:
lm(formula = mpg ~ hp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.7121 -2.1122 -0.8854  1.5819  8.2360 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
hp          -0.06823    0.01012  -6.742 1.79e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.863 on 30 degrees of freedom
Multiple R-squared:  0.6024,	Adjusted R-squared:  0.5892 
F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

and you'd conclude that there is a clear negative relationship: if hp had no effect on mpg, a result at least this strong would turn up by chance alone less than once in 5 million times (the p-value is 1.788e-07).

Try

fit <- lm(carb ~ gear, data = mtcars)
summary(fit)

and think about what kind of data (continuous or categorical) the number of carburetors and gears represent.
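One way to explore that question, offered only as a sketch, is to list the distinct values each variable takes and cross-tabulate them. With 32 cars spread over this many cells the table is sparse, which is the same situation that tends to produce the chi-squared warning.

```r
data(mtcars)

# Both variables take only a handful of distinct whole-number values.
sort(unique(mtcars$gear))
sort(unique(mtcars$carb))

# Cross-tabulate the counts of cars for each gear/carb combination.
tab <- table(gear = mtcars$gear, carb = mtcars$carb)
tab
```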

Then do the same with your own data to understand how each variable is represented.
