Zooming out of a linear regression plot

Dear community,

I hope you can help me out. I am doing a linear regression in RStudio for the first time and I wanted to get a nice plot of the data with a regression line.
The data I use consits out of 21.000 observations and I do not know if this is the reason, but you can not see any trends within the plot, because the amount of dots is too low or the scale is wrong?

Does anybody have an idea on how to solve this problem?

Thank you very much in advance.

This is the code I used:

plot(ESS_Datensatz_bereinigt$stfjb, ESS_Datensatz_bereinigt$happy, main= "Zusammenhang zwischen Arbeits- und Lebenszufriedenheit",
xlab= "Arbeitszufriedenheit", ylab = "Lebenszufriedenheit",
abline(lm(ESS_Datensatz_bereinigt$stfjb ~ ESS_Datensatz_bereinigt$happy), col ="red"))

I also tried it using the ggplot2- package, but that didn*t help:


Hi

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

Or can you find a way to share your used data? Probably you cannot mimick this with inbuilt data set.
Wouldn't you agree that it's a bit strange to have datapoints spreaded equally over the whole range?
To me it seems you have just multiple readouts per datapoint, just because people chose points on a scale from 1 to 10. So actually in the plot you hide some data. You could add some jitter to show the overlap, or think if another representation might be better, e.g. a boxplot or a violin-plot. Or a ballon-plot, where the size of the point actually shows the count at this location.

Hi Matthias, hi Andres,

thank you very much for your help!

The data I use are provided by the European Social Survey. I uploaded my dataset, I hope it works: https://www.file-upload.net/download-13702472/ESS1-8e01.sav.html

You are right: The 2 variables were raised with a 11-point likert scale.
You'll probably need the "haven" package, since this is a .sav-file.

I want to do a linear regression with "happy" (General happiness) being my dependent variable and "stfjb" (job satisfaction) being my independet variable. The idea of the work is to prove the hypothesis that the more satisfied you are with your job, the more satisfied you are with life in general.

The regression itselfs works with p < 0,001, but I just can*t plot a graph showing the regression...

Option 1: Add some jitter and transparency to have the single counts identifiable:

> ggplot(ESS_bereinigt, aes(x=as.numeric(stfjb), 
>                           y = as.numeric(happy))) + 
>   geom_jitter(width = 0.4, height = 0.4, alpha =0.05) + 
>   theme_classic() +
>   scale_x_continuous(breaks = seq(0,10,1)) +
>   scale_y_continuous(breaks = seq(0,10,1)) +
>   geom_smooth(method = "lm", na.rm=TRUE)

grafik

Option 2: Violin Plots: I purposely increased the width to show the crowding in the 8-8 area. Also the scale="count" is important to show the areas with more counts.

ggplot(ESS_bereinigt, aes(x=as.numeric(stfjb), 
                          y = as.numeric(happy))) + 
  geom_violin(aes(group = as.numeric(stfjb)),
              scale = "count", width = 1.75,
              fill = "grey50", alpha = 0.5) + 
  theme_classic() +
  scale_x_continuous(breaks = seq(0,10,1)) +
  scale_y_continuous(breaks = seq(0,10,1)) +
  geom_smooth(method = "lm")

grafik

Boxplots actually don't look good, just because the median is then also one of the numbers.

Option 3: Count the number of observations per X-Y-pair, then define the size of the points depending on the counts.

ESS_bereinigt %>%
  group_by(cntry, stfjb, happy) %>%
  mutate(happy_count = n()) %>%
ggplot(aes(x=as.numeric(stfjb), 
                          y = as.numeric(happy))) + 
  geom_point(aes(size = happy_count)) + 
  theme_classic() + scale_size_continuous(range = c(1,11)) +
  scale_x_continuous(breaks = seq(0,10,1)) +
  scale_y_continuous(breaks = seq(0,10,1)) +
  geom_smooth(method = "lm", na.rm=TRUE, size = 1.5)

grafik

PS: Maybe it's the other way around, people that are more happy in their life can also enjoy their work more!? :wink:

Wow, thank you very much, Matthias!

I like option 3 a lot! Is there a way to change the colours of the dots?

For example counts < 50 = red, 50= green, 100= blue and so on...? It would make it much easier for the reader to see where crowdings are....

Isn't this illustrative enough? Actually I had a mistake inside, as it counted the occurrences per country so there are still many different points per position. Still you need all the points for the linear fit but at least they should have the same property.

Just remove cntry from the "group_by", then add "colour = happy_count" to the geom_point and define the colours with "scale_colour_gradientn()", here the "values = c(0.25, 0.75, 1)" defines the position of each colour on a scale from 0 to 1. 0.25 corresponds to 500, 0.75 to 1500 and 1 to the highest value (2164 actually) Therefor 2000 is not complete blue. Sorry I am not very proficient with gradient colours...

ESS_bereinigt %>%
  group_by(stfjb, happy) %>%
  mutate(happy_count = n()) %>%
ggplot(aes(x=as.numeric(stfjb), 
                          y = as.numeric(happy))) + 
  geom_point(aes(size = happy_count,
                 colour = happy_count)) + 
  theme_classic() + scale_size_continuous(range = c(1,11)) +
  scale_x_continuous(breaks = seq(0,10,1)) +
  scale_y_continuous(breaks = seq(0,10,1)) +
  scale_colour_gradientn(colours = c("grey50", "green", "blue4"),
                         guide = "legend",
                         values = c(0.25, 0.75, 1)) +
  geom_smooth(method = "lm", na.rm=TRUE, size = 1.5)

grafik

1 Like

Thank you very much, Matthias :slight_smile:

If your question's been answered (even if by you), would you mind choosing a solution? (See FAQ below for how).

Having questions checked as resolved makes it a bit easier to navigate the site visually and see which threads still need help.

Thanks

https://stat4everyone.shinyapps.io/lowess/

I hope this will be helpful to you.
you can understand quick to see the example data.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.