Am I interpreting this scatter plot correctly? ggplot in R

I am not sure if I am interpreting this scatter plot correctly.

It's supposed to help analyse any potential relationship between the unemployment rate and crime.

  • X axis = unemployment rate
  • Y axis = crime occurrences (3 types of crime analysed)
  • Plus a geom_smooth() for each crime type

3 questions:

  • Does this graph show that crimes are more likely to be committed when the unemployment rate is lower?

  • Does each dot show how many crimes were committed at a given unemployment rate? E.g. does the uppermost red dot show that over 12k anti-social behaviour crimes were recorded when the unemployment rate was a bit below 4.25%?

  • Do the geom_smooth() lines suggest how each crime type's occurrences changed over time, OR how they change depending on the unemployment rate?

Here's the code I used to plot it:

library(ggplot2)
library(ggnewscale) # for new_scale_color()

ggplot() +
  ggtitle("Crime rate and unemployment rate relationship \nin regions with lower unemployment rate volatility") +
  geom_point(aes(x = Unemployment_rate, y= Crime_occurrences, colour = Crime),
             size = 2, data = df) +
  scale_x_continuous(breaks = seq(from = 3.75, to = 5, by = 0.25), #scale X axis
                     limits = c(3.75,5),
                     labels = function(x) paste0(x,"%"))+ #add % symbol to x axis values
  scale_y_continuous(breaks = seq(from = 30000, to = 95000, by = 5000)) +
  xlab("Unemployment Rate") +
  ylab("Crime Occurrences") +
  new_scale_color()+
  labs(linetype = "Crime") +
  geom_smooth(method = 'lm',se=F,
              aes(x=Unemployment_rate,y = Crime_occurrences, group=1,
                  color="Anti-social behaviour"),
              lty = 6, data=smooth_antisocial_L) +
  geom_smooth(method = 'lm',se=F,
              aes(x=Unemployment_rate,y = Crime_occurrences, group=1,
                  color="Theft"),
              lty = 6, data=smooth_theft_L) +
  geom_smooth(method = 'lm',se=F,
              aes(x=Unemployment_rate,y = Crime_occurrences, group=1,
                  color="Violence and sexual offences"),
              lty = 6, data=smooth_violence_L) +
  labs(colour = "Smooth Conditional Means")

Here's the data frame I used to generate this plot:

structure(list(Date = structure(c(17897, 17897, 17897, 17928, 
17928, 17928, 17956, 17956, 17956, 17987, 17987, 17987, 18017, 
18017, 18017, 18048, 18048, 18048, 18078, 18078, 18078, 18109, 
18109, 18109, 18140, 18140, 18140, 18170, 18170, 18170, 18201, 
18201, 18201, 18231, 18231, 18231, 18262, 18262, 18262, 18293, 
18293, 18293, 18322, 18322, 18322, 18353, 18353, 18353, 18383, 
18383, 18383, 18414, 18414, 18414, 18444, 18444, 18444, 18475, 
18475, 18475, 18506, 18506, 18506, 18536, 18536, 18536), class = "Date"), 
    Crime = c("Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences", 
    "Anti-social behaviour", "Theft", "Violence and sexual offences"
    ), Crime_occurrences = c(48701L, 65999L, 63295L, 48466L, 
    60502L, 59562L, 55761L, 66675L, 68746L, 60397L, 63669L, 65965L, 
    61850L, 65725L, 68863L, 60862L, 62455L, 70496L, 61875L, 56217L, 
    65130L, 57331L, 54841L, 60444L, 49365L, 52528L, 59115L, 50363L, 
    55277L, 62828L, 43449L, 53850L, 59680L, 39847L, 49695L, 60358L, 
    42248L, 54099L, 62524L, 40940L, 51389L, 60469L, 50036L, 43857L, 
    60206L, 85213L, 30102L, 51524L, 90380L, 31124L, 58183L, 76542L, 
    33063L, 60959L, 61060L, 38352L, 67812L, 61428L, 39178L, 67569L, 
    53077L, 39215L, 63079L, 55887L, 41773L, 61591L), Unemployment_rate = c(4.12344811627314, 
    4.12344811627314, 4.12344811627314, 4.14095110134931, 4.14095110134931, 
    4.14095110134931, 4.01323345624432, 4.01323345624432, 4.01323345624432, 
    3.98165688212034, 3.98165688212034, 3.98165688212034, 3.9949985797279, 
    3.9949985797279, 3.9949985797279, 4.02198311038492, 4.02198311038492, 
    4.02198311038492, 3.97224528388891, 3.97224528388891, 3.97224528388891, 
    3.9191851565279, 3.9191851565279, 3.9191851565279, 3.97596949462614, 
    3.97596949462614, 3.97596949462614, 3.80352996332542, 3.80352996332542, 
    3.80352996332542, 3.90768007484014, 3.90768007484014, 3.90768007484014, 
    3.84982022142404, 3.84982022142404, 3.84982022142404, 4.0182521800768, 
    4.0182521800768, 4.0182521800768, 4.0334041605285, 4.0334041605285, 
    4.0334041605285, 3.92106608927081, 3.92106608927081, 3.92106608927081, 
    3.89612131845226, 3.89612131845226, 3.89612131845226, 3.90307987656759, 
    3.90307987656759, 3.90307987656759, 3.86867718232534, 3.86867718232534, 
    3.86867718232534, 3.94431710971825, 3.94431710971825, 3.94431710971825, 
    4.33339138157948, 4.33339138157948, 4.33339138157948, 4.63534527853297, 
    4.63534527853297, 4.63534527853297, 4.82676294010233, 4.82676294010233, 
    4.82676294010233)), row.names = c(NA, -66L), class = "data.frame")

And here's one of the geom_smooth() dput to give you an idea:

structure(list(Date = structure(c(17897, 17928, 17956, 17987, 
18017, 18048, 18078, 18109, 18140, 18170, 18201, 18231, 18262, 
18293, 18322, 18353, 18383, 18414, 18444, 18475, 18506, 18536, 
17897, 17928, 17956, 17987, 18017, 18048, 18078, 18109, 18140, 
18170, 18201, 18231, 18262, 18293, 18322, 18353, 18383, 18414, 
18444, 18475, 18506, 18536, 17897, 17928, 17956, 17987, 18017, 
18048, 18078, 18109, 18140, 18170, 18201, 18231, 18262, 18293, 
18322, 18353, 18383, 18414, 18444, 18475, 18506, 18536, 17897, 
17928, 17956, 17987, 18017, 18048, 18078, 18109, 18140, 18170, 
18201, 18231, 18262, 18293, 18322, 18353, 18383, 18414, 18444, 
18475, 18506, 18536), class = "Date"), Crime = c("Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour", 
"Anti-social behaviour", "Anti-social behaviour", "Anti-social behaviour"
), Crime_occurrences = c(48701L, 48466L, 55761L, 60397L, 61850L, 
60862L, 61875L, 57331L, 49365L, 50363L, 43449L, 39847L, 42248L, 
40940L, 50036L, 85213L, 90380L, 76542L, 61060L, 61428L, 53077L, 
55887L, 48701L, 48466L, 55761L, 60397L, 61850L, 60862L, 61875L, 
57331L, 49365L, 50363L, 43449L, 39847L, 42248L, 40940L, 50036L, 
85213L, 90380L, 76542L, 61060L, 61428L, 53077L, 55887L, 48701L, 
48466L, 55761L, 60397L, 61850L, 60862L, 61875L, 57331L, 49365L, 
50363L, 43449L, 39847L, 42248L, 40940L, 50036L, 85213L, 90380L, 
76542L, 61060L, 61428L, 53077L, 55887L, 48701L, 48466L, 55761L, 
60397L, 61850L, 60862L, 61875L, 57331L, 49365L, 50363L, 43449L, 
39847L, 42248L, 40940L, 50036L, 85213L, 90380L, 76542L, 61060L, 
61428L, 53077L, 55887L), Unemployment_rate = c(3.13281377800783, 
4.68247349426148, 5.0708122696351, 3.60769292318816, 3.259676887651, 
4.24086209425895, 5.20425572086481, 3.85900970262247, 3.18049050596868, 
4.02517515965457, 5.07463979478007, 3.77262836457398, 3.00553644073387, 
4.29397104451947, 4.95863831882363, 3.66848172440438, 2.81131839449418, 
4.23454968410189, 4.79139912788739, 4.14272711242814, 2.90440048952152, 
4.28118824655063, 4.56621383409869, 4.33612987136883, 2.79802032127443, 
4.63665149464923, 4.15610221124255, 4.29820710838942, 3.08966319084375, 
4.50379542173275, 3.98279027057451, 4.10049174296061, 3.15150059517075, 
4.46358064141772, 4.09111105457073, 4.19768568734536, 3.09433421567558, 
3.65653388653173, 4.4653881330391, 3.99786361805527, 3.27106119142821, 
3.86911315140672, 4.31748261760943, 4.1730633389162, 3.13332712805949, 
3.58533538113168, 4.43826006085208, 4.2423583156529, 3.22884092778213, 
3.93320874039025, 4.50210360585639, 4.40885544627843, 2.99681254988402, 
4.02503140646339, 4.81764323428107, 4.29412945148553, 2.90903701310954, 
3.76788548732822, 5.02382063022771, 3.98352122641776, 3.01997997605704, 
3.67161258712684, 4.80322174913054, 4.08967096149463, 2.99042563020473, 
3.97182605242641, 4.85515814694348, 3.79490967669574, 3.25568537435778, 
4.3194319371037, 4.41226516242903, 3.48732625541084, 3.50626064652609, 
4.38753537321373, 4.37107017836121, 3.51240224077197, 4.06197213710062, 
4.39271393019063, 4.62628095567074, 4.25259850335592, 4.131002249127, 
4.89588411607636, 4.9378241924801, 4.57667055644844, 3.86912441545878, 
5.32040137455687, 5.402969264185, 4.71455670620866)), class = "data.frame", row.names = c(NA, 
-88L))

Thanks in advance for any help you can give!

Preliminary plotting using geom_smooth is a good first step, but the strength of the association should then be tested with a model.

suppressPackageStartupMessages({
  library(ggplot2)
})

DF <- data.frame(
  Date = c(
    17897, 17897, 17897, 17928,
    17928, 17928, 17956, 17956, 17956, 17987, 17987, 17987, 18017,
    18017, 18017, 18048, 18048, 18048, 18078, 18078, 18078, 18109,
    18109, 18109, 18140, 18140, 18140, 18170, 18170, 18170, 18201,
    18201, 18201, 18231, 18231, 18231, 18262, 18262, 18262, 18293,
    18293, 18293, 18322, 18322, 18322, 18353, 18353, 18353, 18383,
    18383, 18383, 18414, 18414, 18414, 18444, 18444, 18444, 18475,
    18475, 18475, 18506, 18506, 18506, 18536, 18536, 18536
  ),
  Crime = as.factor(c(
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences",
    "Anti-social behaviour", "Theft", "Violence and sexual offences"
  )), 
  Crime_occurrences = c(
    48701L, 65999L, 63295L, 48466L,
    60502L, 59562L, 55761L, 66675L, 68746L, 60397L, 63669L, 65965L,
    61850L, 65725L, 68863L, 60862L, 62455L, 70496L, 61875L, 56217L,
    65130L, 57331L, 54841L, 60444L, 49365L, 52528L, 59115L, 50363L,
    55277L, 62828L, 43449L, 53850L, 59680L, 39847L, 49695L, 60358L,
    42248L, 54099L, 62524L, 40940L, 51389L, 60469L, 50036L, 43857L,
    60206L, 85213L, 30102L, 51524L, 90380L, 31124L, 58183L, 76542L,
    33063L, 60959L, 61060L, 38352L, 67812L, 61428L, 39178L, 67569L,
    53077L, 39215L, 63079L, 55887L, 41773L, 61591L
  ), 
  Unemployment_rate = c(4.12344811627314,
    4.12344811627314, 4.12344811627314, 4.14095110134931, 4.14095110134931,
    4.14095110134931, 4.01323345624432, 4.01323345624432, 4.01323345624432,
    3.98165688212034, 3.98165688212034, 3.98165688212034, 3.9949985797279,
    3.9949985797279, 3.9949985797279, 4.02198311038492, 4.02198311038492,
    4.02198311038492, 3.97224528388891, 3.97224528388891, 3.97224528388891,
    3.9191851565279, 3.9191851565279, 3.9191851565279, 3.97596949462614,
    3.97596949462614, 3.97596949462614, 3.80352996332542, 3.80352996332542,
    3.80352996332542, 3.90768007484014, 3.90768007484014, 3.90768007484014,
    3.84982022142404, 3.84982022142404, 3.84982022142404, 4.0182521800768,
    4.0182521800768, 4.0182521800768, 4.0334041605285, 4.0334041605285,
    4.0334041605285, 3.92106608927081, 3.92106608927081, 3.92106608927081,
    3.89612131845226, 3.89612131845226, 3.89612131845226, 3.90307987656759,
    3.90307987656759, 3.90307987656759, 3.86867718232534, 3.86867718232534,
    3.86867718232534, 3.94431710971825, 3.94431710971825, 3.94431710971825,
    4.33339138157948, 4.33339138157948, 4.33339138157948, 4.63534527853297,
    4.63534527853297, 4.63534527853297, 4.82676294010233, 4.82676294010233,
    4.82676294010233
  ))

p <- ggplot(DF,aes(Unemployment_rate,Crime_occurrences))
p + geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal()
#> `geom_smooth()` using formula 'y ~ x'

p + geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ Crime) +
  theme_minimal()
#> `geom_smooth()` using formula 'y ~ x'

fit <- lm(Crime_occurrences ~ Unemployment_rate, data = DF)
summary(fit)
#> 
#> Call:
#> lm(formula = Crime_occurrences ~ Unemployment_rate, data = DF)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -27148  -7191   2708   6467  33154 
#> 
#> Coefficients:
#>                   Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)          70916      23671   2.996  0.00389 **
#> Unemployment_rate    -3508       5835  -0.601  0.54989   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 11550 on 64 degrees of freedom
#> Multiple R-squared:  0.005614,   Adjusted R-squared:  -0.009923 
#> F-statistic: 0.3613 on 1 and 64 DF,  p-value: 0.5499

fit <- glm(Crime_occurrences ~ Unemployment_rate + Crime, data = DF)
summary(fit)
#> 
#> Call:
#> glm(formula = Crime_occurrences ~ Unemployment_rate + Crime, 
#>     data = DF)
#> 
#> Deviance Residuals: 
#>    Min      1Q  Median      3Q     Max  
#> -20871   -6755     362    4597   32818  
#> 
#> Coefficients:
#>                                   Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)                          71252      21686   3.286  0.00168 **
#> Unemployment_rate                    -3508       5327  -0.658  0.51267   
#> CrimeTheft                           -6613       3180  -2.080  0.04169 * 
#> CrimeViolence and sexual offences     5606       3180   1.763  0.08287 . 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 111229212)
#> 
#>     Null deviance: 8590446865  on 65  degrees of freedom
#> Residual deviance: 6896211149  on 62  degrees of freedom
#> AIC: 1416
#> 
#> Number of Fisher Scoring iterations: 2

Created on 2021-01-17 by the reprex package (v0.3.0.9001)

I have an explainer on interpreting the output of linear regression models that may help.

Although it's possible to draw a line through the cloud of points that minimises its distance to all of them, and that line has a slope direction (in this case, from higher crime occurrences at lower unemployment rates down to lower crime occurrences at higher unemployment rates), a linear regression model will tell us more.

First, look at

F-statistic: 0.3613 on 1 and 64 DF,  p-value: 0.5499

The F-statistic summarises the overall fit of the model, and its p-value is the probability of seeing an association this strong between Crime_occurrences and Unemployment_rate if there were really no relationship at all. 0.5499 is quite high.

At this point, there is nothing left to see for the first model.

The second model takes the type of crime into account via the Crime variable. It shows that Theft has a modest depressive effect on Crime_occurrences, but there's still a 0.04169 probability that this is just chance.
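
If it helps, you don't have to read these numbers off summary() by eye: base R exposes them directly. A short sketch against the fit objects above:

```r
# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|)
fit <- lm(Crime_occurrences ~ Unemployment_rate, data = DF)
coefs <- coef(summary(fit))
coefs["Unemployment_rate", "Pr(>|t|)"]   # p-value for the slope, ~0.55

# Overall F-test p-value, reconstructed from the summary's fstatistic slot
fs <- summary(fit)$fstatistic
pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)  # ~0.55
```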

Just read your article, thanks a lot! I understand that the p-value gives the chance of the relationship between the variables being due to pure chance (which, in this case, seems fairly likely).

However, I still don't understand what Intercept means, or how I can use the data below to draw conclusions beyond "the p-value shows that the relationship between the variables might be due to chance".

#> Coefficients:
#>                   Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)          70916      23671   2.996  0.00389 **
#> Unemployment_rate    -3508       5835  -0.601  0.54989  

And here, how come there is no anti-social behaviour?
And what do these coefficients show? If the analysis is about the relationship between the unemployment rate and the three types of crime, why does the table below include both the unemployment rate AND the types of crime?

Coefficients:
#>                                   Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)                          71252      21686   3.286  0.00168 **
#> Unemployment_rate                    -3508       5327  -0.658  0.51267   
#> CrimeTheft                           -6613       3180  -2.080  0.04169 * 
#> CrimeViolence and sexual offences     5606       3180   1.763  0.08287 .

#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -27148  -7191   2708   6467  33154 

#> Residual standard error: 11550 on 64 degrees of freedom
#> Multiple R-squared:  0.005614,   Adjusted R-squared:  -0.009923 
#> F-statistic: 0.3613 on 1 and 64 DF,  p-value: 0.5499

Finally, what does the F-statistic tell me in this case?

You explain that "An F-statistic that approaches 1 precludes us from rejecting the null hypothesis of no relation and to conclude that there is, in fact, no relationship at all between one or more of the independent variables and the dependent."

In this case, since F is fairly close to 1 (F = 0.3613), does this reinforce the assumption that any relationship between the variables is null?

Thank you in advance for dealing with my ignorance, the more you dumb it down the more grateful I am ^^"

My father used to say

Any fool can make things complicated, son, aim to make them simple

OK

  1. Intercept is where the fitted line crosses the y-axis, i.e. the model's predicted value of the outcome when all predictors are zero
  2. Visualization alone gives the intuition that the x-y pairs mostly fall at some distance from the best-fit trend line, especially when looking at the shaded confidence bands, and the p-values confirm it.
  3. Negative results are good results, too. Remember that we are dealing with observational data. There could be a relation. The most we can say with these data is that we cannot reject the "null" hypothesis that there isn't one. So it doesn't mean that there couldn't be a relationship. More data could lead to a different conclusion
  4. The second fit holds anti-social behaviour constant. R picks the first level of the factor (alphabetical by default) as the reference category, which is why "Anti-social behaviour" plays that role here. Given its lack of influence in the lm model, the choice may not matter much.
  5. The second fit includes both the continuous and categorical variables to test their joint influence. Try running glm with only the categorical variable on the right-hand side of ~.
  6. The p-value attached to the F-statistic is the more important number, since in the usual case we only model data where we expect at least some relationship, which was not the case in the toy model in my post, which was me simulating a monkey throwing darts.
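
Points 1 and 4 can be seen directly in code. A sketch using the DF from the reprex above:

```r
# "Anti-social behaviour" is absent from the coefficient table because it is
# the reference level: R takes the first factor level (alphabetical by
# default) as the baseline, and CrimeTheft / CrimeViolence... are estimated
# as shifts relative to it.
levels(factor(DF$Crime))[1]   # "Anti-social behaviour"

fit2 <- glm(Crime_occurrences ~ Unemployment_rate + Crime, data = DF)
# Intercept = predicted occurrences at Unemployment_rate = 0 for the
# reference crime type (an extrapolation here, since the data never gets
# near 0% unemployment).

# To make, say, Theft the baseline instead:
DF$Crime <- relevel(factor(DF$Crime), ref = "Theft")
fit3 <- glm(Crime_occurrences ~ Unemployment_rate + Crime, data = DF)
# Now the other two crime types appear as shifts relative to Theft.
```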

When you progress in ordinary least squares linear regression modeling, you will next run into other diagnostic issues, such as outliers, heteroscedasticity, normality of studentized residuals and influential points. These checks exist to help us spot violations of the underlying assumptions behind the model.

  • The regression model is linear in parameters (powers of variables are OK, but the coefficients have to enter linearly)
  • The mean of the residuals is zero: mean(mod$residuals)
  • Homoscedasticity of residuals, i.e. equal variance (residual plots should look flat and even)
  • No autocorrelation of residuals: acf(lmMod$residuals) or lmtest::dwtest(lmMod)
  • The X variables and residuals are uncorrelated: cor.test(cars$speed, mod$residuals)
  • The number of observations must be greater than the number of Xs
  • The variability in the X values is positive: var(cars$speed)
  • The regression model is correctly specified (e.g. if Y varies inversely with X, the model should reflect that)
  • No perfect multicollinearity: car::vif
  • Normality of residuals: qqnorm in the diagnostic plots

Be especially careful of autocorrelation when dealing with time series.
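
In base R, most of these checks are only a line or two. A sketch against the lm fit from the reprex above (lmtest and car are the packages named in the checklist, shown commented out in case they aren't installed):

```r
fit <- lm(Crime_occurrences ~ Unemployment_rate, data = DF)

mean(residuals(fit))   # should be very close to 0
plot(fit, which = 1)   # residuals vs fitted: look for a flat, even band
plot(fit, which = 2)   # normal Q-Q plot of the residuals
acf(residuals(fit))    # autocorrelation: worth checking for monthly data

# With the suggested packages installed:
# lmtest::dwtest(fit)  # Durbin-Watson test for autocorrelated residuals
# car::vif(fit2)       # variance inflation factors, for a multi-predictor fit
```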

For a better treatment of regression at an intermediate level, see Faraway. For an advanced treatment, there's Harrell, which I believe has been posted as a free PDF.


You're a life saver.

If I may ask, where can I read about how to interpret that data?

I am not sure what the following means and/or what to draw from it:

#> Deviance Residuals: 
#>    Min      1Q  Median      3Q     Max  
#> -20871   -6755     362    4597   32818  

Coefficients:
#>                                   Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)                          71252      21686   3.286  0.00168 **
#> Unemployment_rate                    -3508       5327  -0.658  0.51267   
#> CrimeTheft                           -6613       3180  -2.080  0.04169 * 
#> CrimeViolence and sexual offences     5606       3180   1.763  0.08287 .

#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -27148  -7191   2708   6467  33154 

#> Residual standard error: 11550 on 64 degrees of freedom
#> Multiple R-squared:  0.005614,   Adjusted R-squared:  -0.009923 
#> F-statistic: 0.3613 on 1 and 64 DF,  p-value: 0.5499