How can I calculate the correlation between two variables with R?

cor(data$variable~data$variable2)

or is this wrong?

You don't use `~` here since `cor()` doesn't accept formulas. Just supply vectors to the `x` and `y` parameters of the function instead.

```
cor(iris$Sepal.Length, iris$Sepal.Width)
#> [1] -0.1175698
```

Created on 2020-08-10 by the reprex package (v0.3.0)
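If you do prefer formula notation, one option is `cor.test()` from the base `stats` package, which has a formula method of the form `~ u + v`. This is only a sketch of an alternative, not something the question requires; the `iris` columns stand in for your own variables.

```r
# cor() wants two vectors; cor.test() also accepts a one-sided formula
ct <- cor.test(~ Sepal.Length + Sepal.Width, data = iris)

# The estimate is the same Pearson correlation that cor() returns
ct$estimate

# Compare against the vector interface (unname() drops the "cor" label)
r_vec <- cor(iris$Sepal.Length, iris$Sepal.Width)
all.equal(unname(ct$estimate), r_vec)
#> [1] TRUE
```

Note that `cor.test()` also runs a significance test, so it returns more than the single number `cor()` gives you.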

What about `cor(data$variable1, data$variable2)`?

Why would you use vectors? I don't understand this.

I have a csv file with different variables and I would like to calculate the correlations between a few of them.

By using `data$variable1` and `data$variable2` you are just accessing numeric vectors that are stored in columns of a data frame. Referring to them as `data$variable1` and `data$variable2` works fine.

For instance:

```
class(iris$Sepal.Length)
#> [1] "numeric"
```

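For the CSV case, a minimal sketch might look like the following. It assumes the columns of interest are numeric; the temporary file and the `iris` column names are stand-ins for your own file and variables.

```r
# Stand-in for your own file: write iris to a temporary CSV, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)
data <- read.csv(csv_path)

# Correlation between two columns
cor(data$Sepal.Length, data$Sepal.Width)

# Correlation matrix for several numeric columns at once
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
cor_matrix <- cor(data[, vars])
cor_matrix
```

If your columns contain missing values, `cor()` returns `NA` by default; passing `use = "complete.obs"` is one way to drop the incomplete rows first.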

So I can calculate correlations with my code?

Yes, changing from the `~` tilde to a `,` comma, as you did in `cor(data$variable1, data$variable2)`, makes it work.

So let's say I have a negative correlation.

Which variable influences the other negatively?

Which variable is on the x-axis and which is on the y-axis if you imagine a scatterplot?

It's not that one influences the other negatively, but that their relationship is negative. It doesn't matter which is the X variable and which is the Y variable in the correlation calculations.
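A quick way to convince yourself of this is to swap the arguments. The sketch below uses two `iris` columns as example data.

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# Swapping the arguments gives the same coefficient
r_xy <- cor(x, y)
r_yx <- cor(y, x)
all.equal(r_xy, r_yx)
#> [1] TRUE
```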

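The population correlation coefficient is defined as

$$\rho_{X,Y} = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{x_i - \mu_X}{\sigma_X}\right) \left(\frac{y_i - \mu_Y}{\sigma_Y}\right)$$

But what we want, and what `cor` calculates, is the sample correlation, which uses `n - 1` as the denominator at the start of the formula:

$$r_{xy} = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

Which is the following in R code: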

```
n <- length(x)
(1 / (n - 1)) * sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y)))
```

If you inspect the above calculation, you'll notice that `((x - mean(x)) / sd(x))` and `((y - mean(y)) / sd(y))` will end up just being two vectors which we'll multiply together, and as `a * b` is the same as `b * a`, by definition it makes no difference which one is the "x" and which one is the "y" when you calculate the correlation.
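To check the hand-written formula against `cor()` on real numbers, here is a small sketch using two `iris` columns as example data. It compares with `all.equal()` rather than `==`, since floating-point results computed by different routes may differ in the last digits.

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width
n <- length(x)

# Standardise both vectors, multiply element-wise, sum, divide by n - 1
manual_r <- (1 / (n - 1)) * sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y)))

# Same formula with the roles of x and y swapped
swapped_r <- (1 / (n - 1)) * sum(((y - mean(y)) / sd(y)) * ((x - mean(x)) / sd(x)))

builtin_r <- cor(x, y)

all.equal(manual_r, builtin_r)
#> [1] TRUE
all.equal(manual_r, swapped_r)
#> [1] TRUE
```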

As for plotting this in a scatterplot, here's a little walk-through that might explain it, starting with the above calculation in a data.frame context.

```
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(patchwork)

set.seed(0)
d_f <- tibble(x = rnorm(10),
              y = rnorm(10) * x)

# Standardise each variable and multiply the results row by row
d_f <- d_f %>%
  mutate(x_mean = mean(x),
         y_mean = mean(y),
         x_sd = sd(x),
         y_sd = sd(y)) %>%
  mutate(row_to_sum = ((x - x_mean) / x_sd) * ((y - y_mean) / y_sd))

n <- nrow(d_f)  # number of observations
our_correlation <- (1 / (n - 1)) * sum(d_f$row_to_sum)
our_correlation
#> [1] -0.6649819

cor_correlation <- cor(x = d_f$x,
                       y = d_f$y)
cor_correlation
#> [1] -0.6649819

our_correlation == cor_correlation
#> [1] TRUE

# NOW FOR SOME PLOTTING
# Variable x on X-axis
var_x <- d_f %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  labs(title = "Variable *x* on X-axis") +
  coord_equal()
var_x
```

```
# Variable y on X-axis
var_y <- d_f %>%
  ggplot(aes(x = y, y = x)) +
  geom_point() +
  labs(title = "Variable *y* on X-axis") +
  coord_equal()
var_y
```

```
# Some fillers to show the *mirroring*
upper_right <- d_f %>%
  ggplot(aes(x = x, y = y)) +
  coord_equal()
lower_left <- d_f %>%
  ggplot(aes(x = y, y = x)) +
  coord_equal()

# patchwork composition of plots
(var_x + upper_right) / (lower_left + var_y)
```

