# How to calculate correlations of two variables

How can I calculate the correlation between two variables with R?

`cor(data$variable~data$variable2)`
Or is this wrong?

You don't use `~` here since `cor()` doesn't accept formulas. Just supply vectors to the `x` and `y` parameters of the function instead.

``````
cor(iris$Sepal.Length, iris$Sepal.Width)
#> [1] -0.1175698
``````

Created on 2020-08-10 by the reprex package (v0.3.0)
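If the formula style feels more natural, base R's `cor.test()` (from the stats package) does have a formula method, written as `~ x + y` together with a `data` argument. A minimal sketch using the built-in `iris` data; `res` is just an illustrative name:

```r
# cor.test() accepts a one-sided formula of the form ~ u + v
res <- cor.test(~ Sepal.Length + Sepal.Width, data = iris)

# The estimate is the same Pearson correlation that cor() returns
unname(res$estimate)
cor(iris$Sepal.Length, iris$Sepal.Width)
```

Besides the estimate, `cor.test()` also reports a p-value and a confidence interval, so it is the heavier tool; plain `cor()` is enough if only the coefficient is needed.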

Why would you use vectors? I don't understand this.
I have a CSV file with different variables and I would like to calculate the correlations between a few of them.

By using `data$variable1` and `data$variable2` you are just accessing numeric vectors that are stored in the columns of a data frame. Referring to them as `data$variable1` and `data$variable2` works fine. For instance:

``````
class(iris$Sepal.Length)
#> [1] "numeric"
``````
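For the CSV case, the workflow might look roughly like the sketch below. To keep it self-contained it first writes the built-in `iris` data to a temporary file; `csv_path`, `my_data` and `vars` are hypothetical names, and in practice `read.csv()` would point at your own file and column names.

```r
# Write a small CSV to a temporary file so the example can run anywhere;
# with real data you would skip this step and use your own file path.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Read the CSV into a data frame
my_data <- read.csv(csv_path)

# Correlation between two columns (each column is a numeric vector)
cor(my_data$Sepal.Length, my_data$Sepal.Width)

# Correlation matrix for a few numeric columns at once
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
cor(my_data[, vars])
```

Passing a data frame of numeric columns to `cor()` returns a matrix with one row and one column per variable, which is handy when more than two variables are of interest.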


So I can calculate correlations with my code?

Yes, changing the `~` (tilde) to a `,` (comma) makes it work.

So let's say I have a negative correlation.
Which variable influences the other negatively?
Which variable is on the x-axis and which is on the y-axis if you imagine a scatterplot?

It's not that one influences the other negatively, but that their relationship is negative. It doesn't matter which is the X variable and which is the Y variable in the correlation calculations.

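The population correlation coefficient is defined as

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\operatorname{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$

But what we want, and what `cor` calculates, is the sample correlation, which uses `n - 1` as the denominator at the start of the formula:

$$r_{xy} = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

In R code, that is: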

``````
n <- length(x)
(1 / (n - 1)) * sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y)))
``````

If you inspect the above calculation, you'll notice that `((x - mean(x)) / sd(x))` and `((y - mean(y)) / sd(y))` will end up just being two vectors which we'll multiply together, and as `a * b` is the same as `b * a`, by definition it makes no difference which one is the "x" and which one is the "y" when you calculate the correlation.
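A quick way to convince yourself of this is to compute the sum in both orders and compare. This is only a sketch on the built-in `iris` data; the names `zx`, `zy`, `r_xy` and `r_yx` are made up for the illustration.

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width
n <- length(x)

# Standardise each vector once
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# Swapping the order of the product does not change the sum
r_xy <- sum(zx * zy) / (n - 1)
r_yx <- sum(zy * zx) / (n - 1)

# Tolerance-based comparisons, which are safer than == for doubles
all.equal(r_xy, r_yx)
all.equal(cor(x, y), cor(y, x))
all.equal(r_xy, cor(x, y))
```

Each comparison should come back `TRUE`: the hand-rolled value is the same in either order, and it agrees with `cor()` regardless of which vector is passed first.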

As for plotting this in a scatterplot, here's a little walkthrough that might explain it, starting with the above calculation in a data.frame context.

``````
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(patchwork)
set.seed(0)

d_f <- tibble(x = rnorm(10),
              y = rnorm(10) * x)

d_f <- d_f %>%
  mutate(x_mean = mean(x),
         y_mean = mean(y),
         x_sd = sd(x),
         y_sd = sd(y)) %>%
  mutate(row_to_sum = ((x - x_mean) / x_sd) * ((y - y_mean) / y_sd))

n <- nrow(d_f)  # number of observations, needed for the n - 1 denominator
our_correlation <- (1 / (n - 1)) * sum(d_f$row_to_sum)
our_correlation
#> [1] -0.6649819

cor_correlation <- cor(x = d_f$x,
                       y = d_f$y)
cor_correlation
#> [1] -0.6649819

our_correlation == cor_correlation
#> [1] TRUE

# NOW FOR SOME PLOTTING
# Variable x on X-axis
var_x <- d_f %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  labs(title = "Variable *x* on X-axis") +
  coord_equal()

var_x
``````

[Plot: variable x on the X-axis]

``````
# Variable y on X-axis
var_y <- d_f %>%
  ggplot(aes(x = y, y = x)) +
  geom_point() +
  labs(title = "Variable *y* on X-axis") +
  coord_equal()

var_y
``````

[Plot: variable y on the X-axis]

``````
# Some fillers to show the *mirroring*
upper_right <- d_f %>%
  ggplot(aes(x = x, y = y)) +
  coord_equal()

lower_left <- d_f %>%
  ggplot(aes(x = y, y = x)) +
  coord_equal()

(var_x + upper_right) / (lower_left + var_y) # patchwork composition of plots
``````

[Plot: patchwork composition of the four panels]

The relationship is exactly the same, just "flipped" or "mirrored" if you like. You can imagine the red line below as a mirror, or as a crease you fold over.
