How to calculate correlations of two variables

How can I calculate the correlation between two variables with R?

cor(data$variable~data$variable2)
or is this wrong?

You don't use ~ here since cor() doesn't accept formulas. Just supply vectors to the x and y parameters of the function instead.

cor(iris$Sepal.Length, iris$Sepal.Width)
#> [1] -0.1175698

Created on 2020-08-10 by the reprex package (v0.3.0)
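If you prefer formula notation, one option is `cor.test()` from base R's stats package, which does have a formula method. The sketch below assumes the built-in iris data rather than your own file. Note that the formula is one-sided, written as `~ var1 + var2`, not `var1 ~ var2`.

```r
# cor.test() accepts a one-sided formula plus a data argument
ct <- cor.test(~ Sepal.Length + Sepal.Width, data = iris)

# The estimate is the same Pearson correlation that cor() returns
r_from_test <- unname(ct$estimate)
r_from_cor  <- cor(iris$Sepal.Length, iris$Sepal.Width)

all.equal(r_from_test, r_from_cor)
```

Besides the estimate, `ct` also holds a p-value and a confidence interval, which plain `cor()` does not give you.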

What about cor(data$variable1, data$variable2)?
Why would you use vectors? I don't understand this.
I have a CSV file with different variables and I would like to calculate the correlations between a few of them.

With data$variable1 and data$variable2 you are simply accessing numeric vectors that are stored as columns of a data frame, so referring to them that way works fine.

For instance:

class(iris$Sepal.Length)
#> [1] "numeric"
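Since the goal is correlations between a few columns of a CSV file, here is a minimal sketch of getting them all at once. It assumes the built-in iris data as a stand-in; the column names in `vars` are placeholders for whichever numeric columns your own file has (loaded with something like `read.csv()`).

```r
# Placeholder column selection; swap in the numeric columns from your own data
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length")

# Passing a data frame of numeric columns to cor() returns a correlation matrix
cor_matrix <- cor(iris[, vars])
round(cor_matrix, 2)

# A single pair can be pulled out by row and column name
cor_matrix["Sepal.Length", "Sepal.Width"]
```

If your data contains missing values, `cor()` returns NA for the affected pairs by default; adding `use = "complete.obs"` is one way to drop incomplete rows first.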


So I can calculate correlations with my code?

Yes, switching from the tilde (~) to a comma (,), as you did in cor(data$variable1, data$variable2), makes it work.

So let's say I have a negative correlation.
Which variable influences the other negatively?
And which variable is on the x-axis and which is on the y-axis if you imagine a scatterplot?

It's not that one influences the other negatively, but that their relationship is negative. It doesn't matter which is the X variable and which is the Y variable in the correlation calculations.
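A quick way to convince yourself of this is to swap the arguments and compare. This is just a sketch using the built-in iris data:

```r
# Same two columns, passed in both orders
r_xy <- cor(iris$Sepal.Length, iris$Sepal.Width)
r_yx <- cor(iris$Sepal.Width, iris$Sepal.Length)

# Compare with a tolerance rather than ==, since these are doubles
all.equal(r_xy, r_yx)
```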

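The population correlation coefficient is defined as

$$\rho_{X,Y} = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{x_i - \mu_X}{\sigma_X}\right) \left(\frac{y_i - \mu_Y}{\sigma_Y}\right)$$

But what we want, and what cor() calculates, is the sample correlation, which uses n - 1 as the denominator at the start of the formula:

$$r_{xy} = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

Here $\bar{x}$ and $\bar{y}$ are the sample means, and $s_x$ and $s_y$ are the sample standard deviations, which is what sd() returns.

In R code, this is: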

n <- length(x)
(1 / (n - 1)) * sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y)))

If you inspect the above calculation, you'll notice that ((x - mean(x)) / sd(x)) and ((y - mean(y)) / sd(y)) are just two vectors that get multiplied together element by element. Since a * b is the same as b * a, it makes no difference which one is the "x" and which one is the "y" when you calculate the correlation.

As for plotting this as a scatterplot, here's a little walkthrough that might explain it, starting with the above calculation in a data.frame context.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(patchwork)
set.seed(0) 

d_f <- tibble(x = rnorm(10),
              y = rnorm(10) * x)

d_f <- d_f %>% 
  mutate(x_mean = mean(x),
         y_mean = mean(y),
         x_sd = sd(x),
         y_sd = sd(y)) %>% 
  mutate(row_to_sum = ((x - x_mean) / x_sd) * ((y - y_mean) / y_sd))

n <- nrow(d_f)  # number of observations; needed for the n - 1 denominator
our_correlation <- (1 / (n - 1)) * sum(d_f$row_to_sum)
our_correlation 
#> [1] -0.6649819

cor_correlation <- cor(x = d_f$x,
                       y = d_f$y)
cor_correlation
#> [1] -0.6649819

# Compare with a tolerance; == on doubles can fail because of rounding
isTRUE(all.equal(our_correlation, cor_correlation))
#> [1] TRUE

# NOW FOR SOME PLOTTING
# Variable x on X-axis
var_x <- d_f %>% 
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  labs(title = "Variable *x* on X-axis") +
  coord_equal()

var_x


# Variable y on X-axis
var_y <- d_f %>% 
  ggplot(aes(x = y, y = x)) +
  geom_point() +
  labs(title = "Variable *y* on X-axis") +
  coord_equal()

var_y


# Some fillers to show the *mirroring*
upper_right <- d_f %>% 
  ggplot(aes(x = x, y = y)) + 
  coord_equal()

lower_left <- d_f %>% 
  ggplot(aes(x = y, y = x)) +
  coord_equal()

(var_x + upper_right) / (lower_left + var_y) #patchwork composition of plots


The relationship is exactly the same, just "flipped" or "mirrored" if you like. You can imagine the red line below corresponding to a mirror, or a crease you fold over.
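To see the mirror for yourself, one option is to draw both orientations in a single panel together with the line y = x. This is only a sketch: it rebuilds the same simulated data with base R, and the names `both`, `horizontal`, `vertical` and `orientation` are made up for this illustration.

```r
library(ggplot2)
set.seed(0)

# Same simulated data as in the walkthrough: x first, then y built from x
x <- rnorm(10)
d_f <- data.frame(x = x, y = rnorm(10) * x)

# Stack both orientations: (x, y) and the swapped (y, x)
both <- rbind(
  data.frame(horizontal = d_f$x, vertical = d_f$y, orientation = "x on X-axis"),
  data.frame(horizontal = d_f$y, vertical = d_f$x, orientation = "y on X-axis")
)

# The red line y = x acts as the mirror: every point has a twin reflected across it
mirror_plot <- ggplot(both, aes(x = horizontal, y = vertical, shape = orientation)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, colour = "red") +
  coord_equal()

mirror_plot
```

Swapping the axes reflects each point across that line, which changes how the cloud looks but not how strongly the two variables move together.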
