Correlation matrix : A quick start guide to analyze, format and visualize a correlation matrix using R software

This article has been updated; you are now consulting an old release of it.

What is correlation matrix

A correlation matrix (or a covariance matrix) is used to investigate the dependence between multiple variables at the same time. The result is a table containing the correlation coefficients between each variable and the others. There are different methods for correlation analysis: Pearson correlation, and the rank-based Spearman and Kendall correlations. These methods are discussed in the next sections. A correlation matrix can be visualized using a correlogram. The aim of this article is to show you how to compute and visualize a correlation matrix in R.

Note that online software is also available here to compute correlation matrix and to plot a correlogram without any installation.

Correlation analysis in R

The R function cor() can be used to compute a correlation matrix. A simplified format of the function is :

# x is a matrix or data.frame
cor(x, method = c("pearson", "kendall", "spearman"))

The method argument indicates which correlation coefficient is computed. The default is the Pearson correlation coefficient, which measures the linear dependence between two variables. The Kendall and Spearman methods are non-parametric, rank-based correlation coefficients.
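As a minimal sketch of how the method argument changes the result, the code below computes the three coefficients for a single pair of variables from the built-in mtcars data. The choice of mpg and wt is only for illustration.

```r
# Compare the three coefficients for one pair of mtcars variables
x <- mtcars$mpg
y <- mtcars$wt

r_pearson  <- cor(x, y, method = "pearson")   # linear association
r_spearman <- cor(x, y, method = "spearman")  # Pearson correlation of the ranks
r_kendall  <- cor(x, y, method = "kendall")   # based on concordant and discordant pairs

round(c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall), 3)

# Spearman's rho is Pearson's r computed on the ranked values
all.equal(r_spearman, cor(rank(x), rank(y)))
```

The last line illustrates why Spearman's method is called rank-based: it only depends on the ordering of the values, not on their magnitudes.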

Data for correlation analysis

The mtcars data is used in the following examples to compute the correlation matrix.

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Correlation matrix

mcor <- cor(mtcars)
mcor
         mpg     cyl    disp      hp     drat      wt    qsec      vs       am    gear     carb
mpg   1.0000 -0.8522 -0.8476 -0.7762  0.68117 -0.8677  0.4187  0.6640  0.59983  0.4803 -0.55093
cyl  -0.8522  1.0000  0.9020  0.8324 -0.69994  0.7825 -0.5912 -0.8108 -0.52261 -0.4927  0.52699
disp -0.8476  0.9020  1.0000  0.7909 -0.71021  0.8880 -0.4337 -0.7104 -0.59123 -0.5556  0.39498
hp   -0.7762  0.8324  0.7909  1.0000 -0.44876  0.6587 -0.7082 -0.7231 -0.24320 -0.1257  0.74981
drat  0.6812 -0.6999 -0.7102 -0.4488  1.00000 -0.7124  0.0912  0.4403  0.71271  0.6996 -0.09079
wt   -0.8677  0.7825  0.8880  0.6587 -0.71244  1.0000 -0.1747 -0.5549 -0.69250 -0.5833  0.42761
qsec  0.4187 -0.5912 -0.4337 -0.7082  0.09120 -0.1747  1.0000  0.7445 -0.22986 -0.2127 -0.65625
vs    0.6640 -0.8108 -0.7104 -0.7231  0.44028 -0.5549  0.7445  1.0000  0.16835  0.2060 -0.56961
am    0.5998 -0.5226 -0.5912 -0.2432  0.71271 -0.6925 -0.2299  0.1683  1.00000  0.7941  0.05753
gear  0.4803 -0.4927 -0.5556 -0.1257  0.69961 -0.5833 -0.2127  0.2060  0.79406  1.0000  0.27407
carb -0.5509  0.5270  0.3950  0.7498 -0.09079  0.4276 -0.6562 -0.5696  0.05753  0.2741  1.00000

The table above shows the correlation coefficients between all possible pairs of variables.

If your data contain missing values, use the following R code to handle them by case-wise deletion.

cor(mtcars, use = "complete.obs")
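Since mtcars itself has no missing values, the sketch below introduces a few NA values into a copy (an assumption made purely for illustration) and compares case-wise deletion with pairwise deletion, another option accepted by the use argument of cor().

```r
# mtcars has no missing values, so add a few to a copy (illustration only)
cars_na <- mtcars
cars_na$mpg[c(1, 5)] <- NA
cars_na$hp[10] <- NA

# Case-wise deletion: a row containing any NA is dropped for every pair
c_complete <- cor(cars_na, use = "complete.obs")

# Pairwise deletion: each pair uses all rows where both variables are observed
c_pairwise <- cor(cars_na, use = "pairwise.complete.obs")

# Default behaviour (use = "everything"): NA propagates to the result
c_default <- cor(cars_na)

c_complete["wt", "qsec"]  # computed on the 29 complete rows
c_pairwise["wt", "qsec"]  # computed on all 32 rows (wt and qsec have no NA)
c_default["mpg", "wt"]    # NA, because mpg contains missing values
```

Pairwise deletion keeps more data per coefficient, but each entry of the matrix may then be based on a different set of rows.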

Correlation significance levels (p-value)

The output of the cor() function is the matrix of correlation coefficients between each variable and the others. However, the function doesn't report the correlation significance levels (p-values). The Hmisc R package, described next, can be used to calculate them.
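For a single pair of variables, a p-value can also be obtained with the base R function cor.test(). This is not the approach used in the rest of this article; it is shown only as a small sketch of what a correlation p-value is, using mpg and wt as an arbitrary example pair.

```r
# Test whether the Pearson correlation between mpg and wt differs from zero
ct <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")

ct$estimate   # correlation coefficient, same value as cor(mtcars)["mpg", "wt"]
ct$p.value    # p-value of the test
ct$conf.int   # 95% confidence interval (available for Pearson only)
```

Because cor.test() handles one pair at a time, it becomes tedious for a full matrix, which is where a matrix-wide function is more convenient.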

The function rcorr() from the Hmisc package can be used to compute the significance levels for Pearson and Spearman correlations. It computes Pearson's r or Spearman's rho rank correlation coefficients for all possible pairs of columns in the data table.

The simplified format is :

rcorr(x, type=c("pearson","spearman"))

x should be a matrix. The correlation type can be pearson or spearman.

library(Hmisc)
rcorr(as.matrix(mtcars)) 
       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00
n= 32 
P
     mpg    cyl    disp   hp     drat   wt     qsec   vs     am     gear   carb  
mpg         0.0000 0.0000 0.0000 0.0000 0.0000 0.0171 0.0000 0.0003 0.0054 0.0011
cyl  0.0000        0.0000 0.0000 0.0000 0.0000 0.0004 0.0000 0.0022 0.0042 0.0019
disp 0.0000 0.0000        0.0000 0.0000 0.0000 0.0131 0.0000 0.0004 0.0010 0.0253
hp   0.0000 0.0000 0.0000        0.0100 0.0000 0.0000 0.0000 0.1798 0.4930 0.0000
drat 0.0000 0.0000 0.0000 0.0100        0.0000 0.6196 0.0117 0.0000 0.0000 0.6212
wt   0.0000 0.0000 0.0000 0.0000 0.0000        0.3389 0.0010 0.0000 0.0005 0.0146
qsec 0.0171 0.0004 0.0131 0.0000 0.6196 0.3389        0.0000 0.2057 0.2425 0.0000
vs   0.0000 0.0000 0.0000 0.0000 0.0117 0.0010 0.0000        0.3570 0.2579 0.0007
am   0.0003 0.0022 0.0004 0.1798 0.0000 0.0000 0.2057 0.3570        0.0000 0.7545
gear 0.0054 0.0042 0.0010 0.4930 0.0000 0.0005 0.2425 0.2579 0.0000        0.1290
carb 0.0011 0.0019 0.0253 0.0000 0.6212 0.0146 0.0000 0.0007 0.7545 0.1290       

As an output, the rcorr() function returns a list with the following elements:

- r : the matrix of correlations
- n : the matrix of the number of observations used in analyzing each pair of variables
- P : the p-values corresponding to the significance levels of the correlations
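The elements of this list can be accessed with the $ operator. The sketch below also reshapes the two matrices into a long table with one row per pair of variables; the helper name flatten_corr is hypothetical and not part of Hmisc.

```r
library(Hmisc)

res <- rcorr(as.matrix(mtcars))

# Access individual elements of the returned list
res$r["mpg", "wt"]   # correlation coefficient
res$P["mpg", "wt"]   # corresponding p-value
res$n["mpg", "wt"]   # number of observations used for this pair

# Hypothetical helper: one row per variable pair (upper triangle only)
flatten_corr <- function(r, p) {
  ut <- upper.tri(r)
  data.frame(
    row    = rownames(r)[row(r)[ut]],
    column = colnames(r)[col(r)[ut]],
    cor    = r[ut],
    p      = p[ut]
  )
}

flat <- flatten_corr(res$r, res$P)

# Pairs with the smallest p-values first
head(flat[order(flat$p), ])
```

Only the upper triangle is kept because the matrices are symmetric, so each pair appears once.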

Correlogram : Visualization of correlation matrix

Several methods are available to plot a correlogram in R. You can use the R symnum() function, the corrplot() function or scatter plots to graph a correlation matrix.

Use symnum function

The R function symnum() replaces correlation coefficients with symbols according to their value. It takes the correlation matrix as an argument :

symnum(mcor)
     m cy ds h dr w q v a g cr
mpg  1                        
cyl  + 1                      
disp + *  1                   
hp   , +  ,  1                
drat , ,  ,  . 1              
wt   + ,  +  , ,  1           
qsec . .  .  ,      1         
vs   , +  ,  , .  . , 1       
am   . .  .    ,  ,     1     
gear . .  .    ,  .     , 1   
carb . .  .  ,    . , .     1 
attr(,"legend")
[1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

As indicated in the legend, correlation coefficients between 0 and 0.3 (in absolute value) are replaced by a space (" "); coefficients between 0.3 and 0.6 are replaced by "."; and so on.
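As a small variation, symnum() can keep the full column names instead of abbreviating them. The abbr.colnames argument belongs to symnum() but is not used in the example above, so treat this as an optional sketch.

```r
mcor <- cor(mtcars)

# Keep the column names unabbreviated
sym <- symnum(mcor, abbr.colnames = FALSE)
sym

# The legend is stored as an attribute of the result
attr(sym, "legend")

# Symbols are based on absolute values: mpg and wt are negatively
# correlated (about -0.87), yet the pair falls in the 0.8 to 0.9 class
mcor["wt", "mpg"]
unclass(sym)[6, 1]   # row wt, column mpg
```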

Correlogram using R corrplot function

You need to install the corrplot package, which provides a graphical display of a correlation matrix in R.

To read more about the corrplot() function, see : visualize a correlation matrix using corrplot.

The function corrplot() takes the correlation matrix as its first argument. The second argument (type="upper") is used to display only the upper triangle of the correlation matrix.

library(corrplot)
corrplot(mcor, type="upper", order="hclust", tl.col="black", tl.srt=45)

[Figure: correlogram of the mtcars correlation matrix]

Positive correlations are displayed in blue and negative correlations in red. Color intensity and circle size are proportional to the correlation coefficients. On the right side of the correlogram, the color legend shows the correlation coefficients and their corresponding colors.

The correlation matrix is reordered according to the correlation coefficients using the "hclust" (hierarchical clustering) method. tl.col (for text label color) and tl.srt (for text label string rotation) are used to change the color and rotation of the text labels.
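The correlogram can also be combined with significance levels. The sketch below is one possible way to do it, assuming the cor.mtest() helper and the p.mat, sig.level and insig arguments of the corrplot package; none of these are used in the example above, and the 0.05 threshold is an arbitrary choice.

```r
library(corrplot)

mcor <- cor(mtcars)

# Matrix of p-values for every pair of variables (helper from corrplot)
p_mat <- cor.mtest(mtcars, conf.level = 0.95)$p

# Leave blank the cells whose correlation is not significant at the 5% level
corrplot(mcor, type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45,
         p.mat = p_mat, sig.level = 0.05, insig = "blank")
```

Blanking non-significant cells keeps attention on the correlations that are supported by the data, at the cost of hiding the weaker ones entirely.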

Infos

This analysis was performed using R (ver. 3.1.0).
