Calculate Pearson and then Spearman correlation matrices using all numerical variables

How do I restrict the calculation to the numerical variables only when computing the Pearson and then the Spearman correlation matrices? And why can't R find my data frame, which is sitting in the environment? When I run str(Data) it produces output, yet the rest of the code doesn't work.


# Calculate the Pearson and then the Spearman correlation matrices using all numerical variables
str(Data) 
#> Error in str(Data): object 'Data' not found
Data.cor = cor(Data)
#> Error in is.data.frame(x): object 'Data' not found
cor(Data, x,y, method = c("pearson"), use = "complete.obs") 
#> Error in cor(Data, x, y, method = c("pearson"), use = "complete.obs"): unused argument (y)
cor(Data, x,y, method = c("spearman"), use = "complete.obs") 
#> Error in cor(Data, x, y, method = c("spearman"), use = "complete.obs"): unused argument (y)

If I understand correctly, you want the correlations between all numeric columns of your dataset. Is that correct?

If so, you can do something like the following.

dataset <- ggplot2::diamonds
#> Registered S3 methods overwritten by 'ggplot2':
#>   method         from 
#>   [.quosures     rlang
#>   c.quosures     rlang
#>   print.quosures rlang

dataset_with_numeric_columns_only <- Filter(f = is.numeric,
                                            x = dataset)

cor(x = dataset_with_numeric_columns_only,
    method = "pearson",
    use = "complete.obs")
#>            carat       depth      table      price           x           y
#> carat 1.00000000  0.02822431  0.1816175  0.9215913  0.97509423  0.95172220
#> depth 0.02822431  1.00000000 -0.2957785 -0.0106474 -0.02528925 -0.02934067
#> table 0.18161755 -0.29577852  1.0000000  0.1271339  0.19534428  0.18376015
#> price 0.92159130 -0.01064740  0.1271339  1.0000000  0.88443516  0.86542090
#> x     0.97509423 -0.02528925  0.1953443  0.8844352  1.00000000  0.97470148
#> y     0.95172220 -0.02934067  0.1837601  0.8654209  0.97470148  1.00000000
#> z     0.95338738  0.09492388  0.1509287  0.8612494  0.97077180  0.95200572
#>                z
#> carat 0.95338738
#> depth 0.09492388
#> table 0.15092869
#> price 0.86124944
#> x     0.97077180
#> y     0.95200572
#> z     1.00000000

cor(x = dataset_with_numeric_columns_only,
    method = "spearman",
    use = "complete.obs")
#>            carat       depth      table      price           x           y
#> carat 1.00000000  0.03010375  0.1949803 0.96288280  0.99611660  0.99557175
#> depth 0.03010375  1.00000000 -0.2450611 0.01001967 -0.02344221 -0.02542522
#> table 0.19498032 -0.24506114  1.0000000 0.17178448  0.20223061  0.19573406
#> price 0.96288280  0.01001967  0.1717845 1.00000000  0.96319611  0.96271882
#> x     0.99611660 -0.02344221  0.2022306 0.96319611  1.00000000  0.99789493
#> y     0.99557175 -0.02542522  0.1957341 0.96271882  0.99789493  1.00000000
#> z     0.99318344  0.10349836  0.1598782 0.95723227  0.98735532  0.98706751
#>               z
#> carat 0.9931834
#> depth 0.1034984
#> table 0.1598782
#> price 0.9572323
#> x     0.9873553
#> y     0.9870675
#> z     1.0000000

Created on 2019-05-25 by the reprex package (v0.3.0)

I didn't have your dataset, so I used diamonds from ggplot2, as it has columns of several types. I used Filter to select the numeric columns, which is much faster than dplyr::select_if.
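On the errors in the question, here is a hedged guess, since I can't see your session. If that output was produced with reprex, the code ran in a fresh R session, so objects that exist only in your interactive environment (such as Data) are not visible there, which would explain "object 'Data' not found". The "unused argument (y)" error arises because cor() accepts a data frame or matrix as x plus one optional y, so passing Data, x and y positionally leaves one argument with nowhere to go. Below is a minimal sketch of the same approach applied to a data frame named Data. The data frame and its columns (id, height, weight, age) are made up for illustration; replace them with your own data, created or loaded inside the script.

```r
# Hypothetical stand-in for the asker's `Data`; the columns are invented.
# In a reprex, the data must be created or loaded inside the script itself.
Data <- data.frame(
  id     = c("a", "b", "c", "d", "e"),
  height = c(150, 160, 170, 180, 190),
  weight = c(50, 58, 69, 77, 95),
  age    = c(30, 25, NA, 40, 35),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns (base R, no extra packages needed).
numeric_data <- Filter(f = is.numeric, x = Data)

# Pass the whole data frame as `x`; no separate column arguments are needed.
# use = "complete.obs" drops rows that contain any missing value.
pearson_cor  <- cor(x = numeric_data, method = "pearson",  use = "complete.obs")
spearman_cor <- cor(x = numeric_data, method = "spearman", use = "complete.obs")

pearson_cor
spearman_cor
```

Both results are square matrices with one row and one column per numeric variable; the character column id is dropped by Filter before cor() is called.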

Hope this helps.


Brilliant! Thanks very much.
