How can I calculate the correlation between two variables with R?
cor(data$variable~data$variable2)
or is this wrong?
You don't use ~ here, since cor() doesn't accept formulas. Just supply vectors to the x and y parameters of the function instead.
cor(iris$Sepal.Length, iris$Sepal.Width)
#> [1] -0.1175698
Created on 2020-08-10 by the reprex package (v0.3.0)
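For comparison, here is a small sketch using the built-in iris data. It shows the two-vector call next to cor.test(), which (unlike cor()) does have a formula method; note that it expects the one-sided form ~ u + v rather than u ~ v. The names r_vectors and r_formula are just illustrative.

```r
# cor() takes the two numeric vectors as separate arguments.
r_vectors <- cor(iris$Sepal.Length, iris$Sepal.Width)

# cor.test() has a formula method; the formula is one-sided: ~ u + v.
test_result <- cor.test(~ Sepal.Length + Sepal.Width, data = iris)
r_formula <- unname(test_result$estimate)

r_vectors
#> [1] -0.1175698

# Both routes give the same estimate.
all.equal(r_vectors, r_formula)
```

If you only need the coefficient, cor() is enough; cor.test() additionally reports a p-value and a confidence interval.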
What about cor(data$variable1, data$variable2)?
Why would you use vectors? I don't understand this
I have a csv file with different variables and I would like to calculate the correlations between a few of them.
By using data$variable1 and data$variable2 you are just accessing numeric vectors that are stored in columns of a data frame, so referring to them as data$variable1 and data$variable2 works fine.
For instance:
class(iris$Sepal.Length)
#> [1] "numeric"
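Since you mention a csv file with several variables, here is a rough sketch of that workflow. I don't have your file, so as a stand-in it writes the built-in iris data to a temporary csv first; csv_path, vars and cor_matrix are made-up names, and you would replace the column names with your own.

```r
# Stand-in for your own file: write iris to a temporary CSV.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Read the CSV back in as a data frame.
data <- read.csv(csv_path)

# Select the numeric columns of interest; cor() on a data frame
# returns the matrix of all pairwise correlations.
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
cor_matrix <- cor(data[, vars])
cor_matrix

# A single pair can be read off by row and column name.
cor_matrix["Sepal.Length", "Sepal.Width"]
```

If your columns contain missing values, cor() returns NA for the affected pairs unless you set the use argument, for example use = "complete.obs".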
So I can calculate correlations with my code?
Yes, changing the ~ (tilde) to a , (comma), as you did here:
cor(data$variable1, data$variable2)
makes it work.
So let's say I have a negative correlation.
Which variable influences the other negatively?
And which variable is on the x-axis and which is on the y-axis if you imagine a scatterplot?
It's not that one influences the other negatively, but that their relationship is negative. It doesn't matter which is the X variable and which is the Y variable in the correlation calculations.
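A quick way to convince yourself of this, sketched with the built-in iris data (r_xy and r_yx are just illustrative names): swap the two arguments and compare the results.

```r
# Same pair of variables, arguments in both orders.
r_xy <- cor(iris$Sepal.Length, iris$Sepal.Width)
r_yx <- cor(iris$Sepal.Width, iris$Sepal.Length)

# The coefficient is the same either way, including its negative sign.
all.equal(r_xy, r_yx)
```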
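The population correlation coefficient is defined as

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$

But what we want, and what cor() calculates, is the sample correlation, which has n - 1 as the denominator at the start of the formula:

$$r_{xy} = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

Here \bar{x} and \bar{y} are the sample means and s_x and s_y are the sample standard deviations. This is the following in R code: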
n <- length(x)
(1 / (n - 1)) * sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y)))
If you inspect the above calculation, you'll notice that ((x - mean(x)) / sd(x)) and ((y - mean(y)) / sd(y)) will end up just being two vectors which we'll multiply together, and as a * b is the same as b * a, it makes no difference which one is the "x" and which one is the "y" when you calculate the correlation.
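To make that concrete, here is a minimal self-contained sketch. The data are simulated purely for illustration (the seed, the sample size and the -0.5 slope are arbitrary choices of mine), and z_x and z_y are made-up names for the two standardised vectors.

```r
set.seed(1)
x <- rnorm(20)
y <- -0.5 * x + rnorm(20)  # illustrative data only

n <- length(x)

# Standardise each variable: subtract the mean, divide by the sample sd.
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Element-wise products commute, so the order of the vectors is irrelevant.
r_manual_xy <- sum(z_x * z_y) / (n - 1)
r_manual_yx <- sum(z_y * z_x) / (n - 1)

all.equal(r_manual_xy, r_manual_yx)

# The manual calculation agrees with cor().
all.equal(r_manual_xy, cor(x, y))
```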
As for plotting it as a scatterplot, here's a little walkthrough that might explain it, starting with the above calculation in a data.frame context.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(patchwork)
set.seed(0)
d_f <- tibble(x = rnorm(10),
              y = rnorm(10) * x)
# Add the pieces of the formula as columns, then the per-row product
d_f <- d_f %>%
  mutate(x_mean = mean(x),
         y_mean = mean(y),
         x_sd = sd(x),
         y_sd = sd(y)) %>%
  mutate(row_to_sum = ((x - x_mean) / x_sd) * ((y - y_mean) / y_sd))
n <- nrow(d_f) # number of observations
our_correlation <- (1 / (n - 1)) * sum(d_f$row_to_sum)
our_correlation
#> [1] -0.6649819
cor_correlation <- cor(x = d_f$x,
                       y = d_f$y)
cor_correlation
#> [1] -0.6649819
our_correlation == cor_correlation
#> [1] TRUE
# NOW FOR SOME PLOTTING
# Variable x on X-axis
var_x <- d_f %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  labs(title = "Variable *x* on X-axis") +
  coord_equal()
var_x
# Variable y on X-axis
var_y <- d_f %>%
  ggplot(aes(x = y, y = x)) +
  geom_point() +
  labs(title = "Variable *y* on X-axis") +
  coord_equal()
var_y
# Some empty filler panels (no geom) to show the *mirroring*
upper_right <- d_f %>%
  ggplot(aes(x = x, y = y)) +
  coord_equal()
lower_left <- d_f %>%
  ggplot(aes(x = y, y = x)) +
  coord_equal()
(var_x + upper_right) / (lower_left + var_y) # patchwork composition of plots
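One caveat about the our_correlation == cor_correlation check above: comparing two floating-point results with == can return FALSE when they differ only by rounding error, even though it came out TRUE in this run. A more forgiving sketch uses all.equal(), which compares within a small tolerance. This version recreates the same simulated data in base R; manual and builtin are illustrative names.

```r
set.seed(0)
x <- rnorm(10)
y <- rnorm(10) * x
n <- length(x)

# Manual sample correlation, as in the formula above.
manual <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)
builtin <- cor(x, y)

# all.equal() tolerates tiny numerical differences; wrap it in isTRUE()
# because it returns a description rather than FALSE on mismatch.
isTRUE(all.equal(manual, builtin))
```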