This vignette describes the “analysis” features of COINr. By this, we mean functions that retrieve statistical measures from the coin in various ways. This excludes things like sensitivity analysis, which involves tinkering with the construction methodology.
In short, here we discuss obtaining indicator statistics, correlations, data availability, and some slightly more complex ideas such as Cronbach’s alpha and principal component analysis.
Indicator statistics can be obtained using the get_stats() function.
The resulting data frame has 18 columns, which is hard to display concisely here, so we will look at the columns in groups of five. Each row corresponds to one of the indicators in the targeted data set, and the columns are statistics: here, familiar quantities such as the minimum, maximum, mean and median.
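To make this concrete, here is a base-R sketch (with made-up data, and not COINr's actual implementation) of the kind of per-indicator statistics reported in these first columns:

```r
# a small data frame standing in for an indicator data set (made-up values)
X <- data.frame(Ind1 = c(1, 2, 3, 4, 5),
                Ind2 = c(10, NA, 30, 40, 50))

# basic statistics per indicator, as in the first columns of get_stats() output
basic_stats <- data.frame(
  Min    = sapply(X, min, na.rm = TRUE),
  Max    = sapply(X, max, na.rm = TRUE),
  Mean   = sapply(X, mean, na.rm = TRUE),
  Median = sapply(X, median, na.rm = TRUE)
)
basic_stats
```

Each row of `basic_stats` is one indicator, mirroring the layout described above.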
In the first three columns here we find the standard deviation, skewness and kurtosis. The remaining two columns are “N.Avail”, which is the number of non-NA data points, and “N.NonZero”, the number of non-zero points. The latter can be of interest because some indicators may have a high proportion of zeroes, which can be problematic for some operations, such as taking logarithms.
Here we have “N.Unique”, which is the number of unique data points
(i.e. excluding duplicate values). The following three columns are
similar to previous columns, e.g. “Frc.Avail” is the fraction of data
availability, as opposed to the number of available points (N.Avail).
The final column, “Flag.Avail”, is a logical flag: if the data
availability (“Frc.Avail”) is below the limit specified by the
t_avail argument of
get_stats(), it will be
flagged as “LOW”.
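The availability columns and the flag can be sketched in base R as follows (made-up data; the "OK" label for non-flagged indicators is an assumption for illustration, not necessarily get_stats()'s exact output):

```r
x <- c(2.1, NA, 0, 3.4, NA, 1.8)        # one indicator with missing values (made-up)

n_avail   <- sum(!is.na(x))             # "N.Avail": number of non-NA points
n_nonzero <- sum(x != 0, na.rm = TRUE)  # "N.NonZero": number of non-zero points
frc_avail <- n_avail / length(x)        # "Frc.Avail": fraction of data availability

# flag low availability against a threshold, as get_stats() does via t_avail
t_avail <- 0.75
flag_avail <- ifelse(frc_avail < t_avail, "LOW", "OK")
```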
The first two of these final columns are analogous to “Flag.Avail” and have thresholds which are controlled by arguments to get_stats(). The final column is a basic test for outliers
which is commonly used in composite indicators, for example in the COIN
Tool. This is the same process as used in
check_SkewKurt() which will flag “OUT” if the absolute
skewness is greater than a set threshold (default 2) AND kurtosis is
greater than a threshold (default 3.5). In short, indicators that are
flagged here could be considered for outlier treatment.
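The flagging logic can be illustrated with a rough base-R sketch, using simple moment-based definitions of skewness and kurtosis (conventions vary, and this may differ in detail from check_SkewKurt()):

```r
# moment-based sample skewness and (non-excess) kurtosis; conventions vary
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
kurt <- function(x) mean((x - mean(x))^4) / sd(x)^4

x <- c(rep(1, 20), 50)  # a heavily skewed made-up indicator

# flag "OUT" if |skew| > 2 AND kurtosis > 3.5, as described above
flag_outlier <- ifelse(abs(skew(x)) > 2 && kurt(x) > 3.5, "OUT", "OK")
```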
The same kind of analysis can be performed for units, rather than indicators. Here, the main thing of interest is data availability, which can be obtained through the get_data_avail() function.
Here we see the output is a list with two data frames. The first is a summary for each unit:
Each unit has its number of missing points, zero points, missing-or-zero points, as well as the percentage data availability and percentage non-zero. The “ByGroup” data frame gives data availability within aggregation groups:
Here we just view the first few columns to save space. The values are the fraction of indicator availability within each aggregation group.
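The underlying per-unit availability calculation can be sketched in base R (made-up data, not COINr's internal code):

```r
# indicator data for three units (rows), made-up values
X <- data.frame(Ind1 = c(1, NA, 3),
                Ind2 = c(NA, NA, 6),
                Ind3 = c(7, 8, 9),
                row.names = c("UnitA", "UnitB", "UnitC"))

n_miss    <- rowSums(is.na(X))      # number of missing points per unit
dat_avail <- 1 - n_miss / ncol(X)   # fraction of data availability per unit
```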
Correlations can be obtained and viewed directly using the plot_corr() function, which is explained in the Visualisation vignette. Here, we explore the functions for obtaining correlation matrices, flags and p-values.
The most general-purpose function for obtaining correlations between indicators is the get_corr() function (which is called by
plot_corr()). This allows almost any set of
indicators/aggregates to be correlated against almost any other set. We
won’t go over the full functionality here because this is covered in the Visualisation vignette. However, to
demonstrate a couple of examples we begin by building the full example
coin up to the aggregated data set.
Now we can take some examples. First, to get the correlations between indicators within the Environmental group:
Here we see that the default output of
get_corr() is a
long-format correlation table. If you want the wide format, set
make_long = FALSE.
```r
# get correlations
cmat <- get_corr(coin, dset = "Raw", iCodes = list("Environ"),
                 Levels = 1, make_long = FALSE)

# examine first few rows
round_df(head(cmat), 2)
#>     CO2 Forest MatCon PrimEner Renew
#> 1  1.00     NA   0.72     0.33 -0.60
#> 2    NA   1.00     NA       NA  0.35
#> 3  0.72     NA   1.00     0.38 -0.42
#> 4  0.33     NA   0.38     1.00    NA
#> 5 -0.60   0.35  -0.42       NA  1.00
```
This gives the more classical-looking correlation matrix, although the long format can sometimes be easier to work with for further processing. One further option that is worth mentioning is pval: by default, get_corr() will return any correlations with a p-value greater than 0.05 as NA, indicating that these correlations are insignificant at the 5% significance level. You can adjust this threshold by changing pval, or disable it completely by setting pval = 0.
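This masking behaviour can be sketched with base R's cor.test() (this mimics the idea, not COINr's internal code):

```r
# return the correlation only if significant at the given level, else NA
cor_if_sig <- function(a, b, pval = 0.05) {
  ct <- cor.test(a, b)
  if (ct$p.value < pval) unname(ct$estimate) else NA
}

x <- 1:10
y <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)    # strongly related to x (made-up)
z <- c(5, 9, 1, 8, 2, 10, 3, 7, 4, 6)    # essentially unrelated to x (made-up)

cor_if_sig(x, y)   # significant: the correlation is returned
cor_if_sig(x, z)   # insignificant: NA is returned
```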
On the subject of p-values, COINr includes the get_pvals() function, which can be used to get p-values of correlations between columns of a supplied matrix or data frame. This cannot be used directly on a coin and is more of a helper function, but may still be useful.
Two further functions are of interest regarding correlations. The first is get_corr_flags(). This is a useful function for finding correlations between indicators that exceed or fall below a given threshold, within aggregation groups:
Here we have identified any correlations above 0.75, from the “Normalised” data set, that are between indicators in the same group in level 2. Actually 0.75 is quite low for searching for “high correlations”, but it is used as an example here because the example data set doesn’t have any very high correlations.
By setting thresh_type = "low" we can similarly look for low/negative correlations:
Our example has some fairly significant negative correlations! All within the “Institutional” group, and with the Technical Barriers to Trade indicator.
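The idea behind flagging correlations against a threshold can be sketched in base R (made-up data; get_corr_flags() additionally restricts comparisons to within aggregation groups):

```r
# three made-up indicators: a and b are strongly correlated, c is not
X <- data.frame(a = c(1, 2, 3, 4, 5),
                b = c(1.1, 2.2, 2.9, 4.1, 5.2),
                c = c(5, 3, 4, 1, 2))

cmat <- cor(X)

# find indicator pairs above the threshold (upper triangle avoids duplicates
# and the diagonal)
high <- which(cmat > 0.75 & upper.tri(cmat), arr.ind = TRUE)
pairs <- data.frame(Ind1 = rownames(cmat)[high[, 1]],
                    Ind2 = colnames(cmat)[high[, 2]])
pairs
```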
A final function to mention is get_denom_corr(). This is related to the operation of denominating indicators (see the Denomination vignette), and identifies any indicators whose absolute correlation with any of the supplied denominators exceeds a given threshold. This can help to decide whether to denominate an indicator, and with what: if an indicator is strongly related to a denominator, it depends on it, which may be a reason to denominate.
```r
get_denom_corr(coin, dset = "Raw", cor_thresh = 0.7)
#>          Ind  Denom Corr
#> 70  CultGood Energy 0.71
#> 119 CultGood    GDP 0.85
#> 52       FDI Energy 0.74
#> 101      FDI    GDP 0.85
#> 99     Goods    GDP 0.80
#> 102   PRemit    GDP 0.71
#> 116 Research    GDP 0.78
#> 100 Services    GDP 0.77
#> 66     StMob Energy 0.73
#> 115    StMob    GDP 0.84
```
Using a threshold of 0.7, and examining the raw data set, we see that several indicators are strongly related to the denominators, a classical example being export value of goods (Goods) being well correlated with GDP. Many of these pairs flagged here are indeed used as denominators in the ASEM example, but also for conceptual reasons.
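A minimal base-R sketch of this check (made-up data; get_denom_corr() works directly on the coin and its stored denominators):

```r
# made-up indicator and denominator data for the same five units
inds <- data.frame(Goods    = c(100, 220, 310, 405, 540),
                   Openness = c(0.4, 0.8, 0.3, 0.6, 0.5))
denoms <- data.frame(GDP = c(1000, 2100, 3050, 4000, 5200))

# absolute correlation of each indicator with each denominator
cors <- abs(cor(inds, denoms))

# indicators exceeding the threshold are candidates for denomination
flagged <- rownames(cors)[cors[, "GDP"] > 0.7]
flagged
```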
A first simple tool is to calculate Cronbach’s alpha. This can be done with any group of indicators, either the full set, or else targeting specific groups.
This simply calculates Cronbach’s alpha (a measure of statistical consistency) for the “P2P” group (People to People connectivity, in this case).
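For reference, Cronbach’s alpha can be computed directly from its standard formula; here is a base-R sketch (the helper name cronbach is made up for illustration and is not COINr’s function):

```r
# Cronbach's alpha for a data frame of k indicator columns:
# alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
cronbach <- function(X) {
  X <- X[complete.cases(X), ]   # drop rows with missing values
  k <- ncol(X)
  (k / (k - 1)) * (1 - sum(apply(X, 2, var)) / var(rowSums(X)))
}

# perfectly consistent made-up items give a high alpha
cronbach(data.frame(a = 1:5, b = 2 * (1:5)))
```

Values closer to 1 indicate greater internal consistency among the indicators.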
Another multivariate analysis tool is principal component analysis
(PCA). Although, like correlation, this is built into base R, the
get_PCA() function makes it easier to obtain PCA for groups
of indicators, following the structure of the index.
```r
l_pca <- get_PCA(coin, dset = "Raw", iCodes = "Sust", out2 = "list")
#> Warning in PCAwts(parents[ii]): 1 missing values found. Removing 1 rows with missing values in order to perform
#> PCA. You can also try imputing data first to avoid this.
#> Warning in PCAwts(parents[ii]): 22 missing values found. Removing 14 rows with missing values in order to perform
#> PCA. You can also try imputing data first to avoid this.
#> Warning in PCAwts(parents[ii]): 8 missing values found. Removing 7 rows with missing values in order to perform
#> PCA. You can also try imputing data first to avoid this.
```
The function can return its results either as a list, or appended to
the coin if
out2 = "coin". Here the output is a list and we
will explore its contents. First, note the warnings above due to missing
data, which can be suppressed using
nowarnings = TRUE. The
output list looks like this:
I.e. we have a data frame of “PCA weights” and some PCA results. We ignore the weights for the moment and look closer at the PCA results:
```r
str(l_pca$PCAresults, max.level = 2)
#> List of 3
#>  $ Environ :List of 3
#>   ..$ wts   : num [1:5] -0.581 0.214 -0.543 -0.313 0.473
#>   ..$ PCAres:List of 5
#>   .. ..- attr(*, "class")= chr "prcomp"
#>   ..$ iCodes: chr [1:5] "CO2" "Forest" "MatCon" "PrimEner" ...
#>  $ Social  :List of 3
#>   ..$ wts   : num [1:9] -0.417 -0.324 0.401 -0.191 0.302 ...
#>   ..$ PCAres:List of 5
#>   .. ..- attr(*, "class")= chr "prcomp"
#>   ..$ iCodes: chr [1:9] "CPI" "FemLab" "FreePress" "NGOs" ...
#>  $ SusEcFin:List of 3
#>   ..$ wts   : num [1:5] 0.472 0.451 -0.492 -0.227 -0.53
#>   ..$ PCAres:List of 5
#>   .. ..- attr(*, "class")= chr "prcomp"
#>   ..$ iCodes: chr [1:5] "GDPGrow" "NEET" "PrivDebt" "PubDebt" ...
```
get_PCA() will run a separate PCA for each
aggregation group within the specified level. In this case, it has run
three: one for each of the “Environ”, “Social” and “SusEcFin” groups.
Each of these contains wts, a set of PCA weights for that group; PCAres, which is the direct output of stats::prcomp(); and iCodes, the corresponding vector of indicator codes for the group.
We can do some basic PCA analysis with base R’s tools using the “PCAres” objects, e.g.:
```r
# summarise PCA results for "Social" group
summary(l_pca$PCAresults$Social$PCAres)
#> Importance of components:
#>                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
#> Standard deviation     2.2042 1.1256 0.9788 0.78834 0.77153 0.56836 0.42463
#> Proportion of Variance 0.5398 0.1408 0.1065 0.06905 0.06614 0.03589 0.02003
#> Cumulative Proportion  0.5398 0.6806 0.7871 0.85611 0.92225 0.95814 0.97817
#>                            PC8     PC9
#> Standard deviation     0.36068 0.25760
#> Proportion of Variance 0.01445 0.00737
#> Cumulative Proportion  0.99263 1.00000
```
See the documentation of stats::prcomp() and elsewhere for more resources on PCA in R.
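As a self-contained example, here is stats::prcomp() applied to a small made-up indicator set; the first-component loadings are the kind of quantity that get_PCA() turns into weights:

```r
# three made-up, strongly related indicators
X <- data.frame(i1 = c(1, 2, 3, 4, 5),
                i2 = c(2.0, 3.9, 6.1, 8.0, 10.1),
                i3 = c(5, 4, 3, 2, 1))

pca <- prcomp(X, scale. = TRUE)   # PCA on standardised indicators
w1  <- pca$rotation[, 1]          # loadings on the first principal component
summary(pca)                      # variance explained per component
```

Because the three indicators are nearly perfectly correlated, the first component captures almost all of the variance here.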
Now turning to the weighting,
get_PCA() also outputs a
set of “PCA weights”. These are attached to the output list as shown above, or, if out2 = "coin", written to the weights list inside the coin when weights_to is also specified. See
the Weights vignette for some more details on
this. Note that PCA weights come with a number of caveats: please read
the documentation for
get_PCA() for a discussion of this.