We use the iris dataset, and ampute the data:
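library(miceRanger)
# Load data
data(iris)
# Ampute the data. iris contains no missing values by default.
ampIris <- amputeData(iris, perc = 0.25)
# Run the imputation. m = 8 is assumed from the "8 datasets" mentioned further down.
miceObj <- miceRanger(ampIris, m = 8, verbose = FALSE)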
miceRanger comes with an array of diagnostic plots that tell you how valid the imputations may be, how they are distributed, which variables were used to impute other variables, and so on.
We can take a look at the imputed distributions compared to the original distribution for each variable:
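A minimal sketch of the call that produces this plot, assuming `miceObj` is the object returned by `miceRanger` in the setup. The guard and the `m = 8` value are assumptions added so the snippet runs on its own; `vars = "allNumeric"` limits the plot to the numeric columns.

```r
library(miceRanger)

# Reuse miceObj from the setup; rebuild it only if it is missing.
# m = 8 is an assumption taken from the "8 datasets" mentioned in the text.
if (!exists("miceObj")) {
  miceObj <- miceRanger(amputeData(iris, perc = 0.25), m = 8, verbose = FALSE)
}

# Density of the imputed values against the original, nonmissing values.
plotDistributions(miceObj, vars = "allNumeric")
```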
The red line is the density of the original, nonmissing data. The smaller, black lines are the density of the imputed values in each of the datasets. A mismatch between them is not a problem in itself, but it may indicate that your data was not Missing Completely at Random (MCAR).
We are probably interested in knowing how our values between datasets converged over the iterations. The plotCorrelations function shows you a boxplot of the correlations between imputed values in every combination of datasets, at each iteration:
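A hedged sketch of the call, again assuming `miceObj` from the setup (the guard and `m = 8` are assumptions so the snippet runs standalone):

```r
library(miceRanger)

# Reuse miceObj from the setup; rebuild it only if it is missing (m = 8 assumed).
if (!exists("miceObj")) {
  miceObj <- miceRanger(amputeData(iris, perc = 0.25), m = 8, verbose = FALSE)
}

# Boxplots of between-dataset correlations of the imputed values, per iteration.
plotCorrelations(miceObj, vars = "allNumeric")
```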
Different correlation measures can be plotted through the function's arguments. See ?plotCorrelations for more details.
Sometimes, if the missing data locations are correlated with higher or lower values, we need to run multiple iterations for the process to converge to the true theoretical mean (given the information that exists in the dataset). We can see if the imputed data converged, or if we need to run more iterations:
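The convergence plot can be drawn with something like the following, assuming `miceObj` from the setup (guard and `m = 8` are assumptions for a standalone run):

```r
library(miceRanger)

# Reuse miceObj from the setup; rebuild it only if it is missing (m = 8 assumed).
if (!exists("miceObj")) {
  miceObj <- miceRanger(amputeData(iris, perc = 0.25), m = 8, verbose = FALSE)
}

# Center and dispersion of the imputed values at each iteration.
plotVarConvergence(miceObj, vars = "allNumeric")
```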
It doesn’t look like this dataset had a convergence issue. We wouldn’t expect one, since we amputed the data above completely at random for each variable. When plotting categorical variables, the center and dispersion metrics plotted are the percent of the mode and the entropy, respectively.
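To make the two categorical metrics concrete, here is a small base R sketch. `modePercent` and `shannonEntropy` are hypothetical helper names, not miceRanger functions, and the natural logarithm is an assumption; the package may compute entropy with a different base.

```r
# Hypothetical helpers for illustration only; not part of miceRanger.

# Share of observations that fall in the most frequent level.
modePercent <- function(x) {
  max(table(x)) / length(x)
}

# Shannon entropy of the level frequencies (natural log assumed).
shannonEntropy <- function(x) {
  p <- as.numeric(table(x)) / length(x)
  p <- p[p > 0]  # drop empty levels so log(0) never appears
  -sum(p * log(p))
}

# iris$Species has three equally frequent levels.
modePercent(iris$Species)
shannonEntropy(iris$Species)
```

For a perfectly balanced factor such as `iris$Species`, the mode percent is 1/3 and the entropy is at its maximum, `log(3)`. A converging categorical imputation should settle around stable values of both.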
Random Forests give us a cheap way to determine model error without cross validation. Each model returns the OOB accuracy for classification, and r-squared for regression. We can see how these converged as the iterations progress:
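A sketch of the corresponding call, assuming `miceObj` from the setup (guard and `m = 8` are assumptions for a standalone run):

```r
library(miceRanger)

# Reuse miceObj from the setup; rebuild it only if it is missing (m = 8 assumed).
if (!exists("miceObj")) {
  miceObj <- miceRanger(amputeData(iris, perc = 0.25), m = 8, verbose = FALSE)
}

# Out-of-bag model error for each imputed variable, per iteration.
plotModelError(miceObj, vars = "allNumeric")
```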
It looks like the variables were imputed with a reasonable degree of accuracy. That spike after the first iteration was due to the nature of how the missing values are filled in before the models are run.
Now let’s plot the variable importance for each imputed variable. The top axis contains the variable that was used to impute the variable on the left axis.
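A sketch of the call, assuming `miceObj` from the setup (guard and `m = 8` are assumptions for a standalone run):

```r
library(miceRanger)

# Reuse miceObj from the setup; rebuild it only if it is missing (m = 8 assumed).
if (!exists("miceObj")) {
  miceObj <- miceRanger(amputeData(iris, perc = 0.25), m = 8, verbose = FALSE)
}

# Importance of each predictor (top axis) for each imputed variable (left axis).
plotVarImportance(miceObj)
```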
The variable importance metric used is returned by ranger when importance = 'impurity'. Due to large possible variances in the returned value, the data plotted here has been 0-1 scaled within each imputed variable. Use display = 'Absolute' to show unscaled variable importance.
We are probably interested in how “certain” we were of our imputations. We can get a feel for the variance experienced for each imputed value between the datasets by using the plotImputationVariance function:
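The function in question is `plotImputationVariance`. A minimal sketch of the call, assuming `miceObj` from the setup (guard and `m = 8` are assumptions for a standalone run):

```r
library(miceRanger)

# Reuse miceObj from the setup; rebuild it only if it is missing (m = 8 assumed).
if (!exists("miceObj")) {
  miceObj <- miceRanger(amputeData(iris, perc = 0.25), m = 8, verbose = FALSE)
}

# Spread of the imputed values for each sample across the datasets.
plotImputationVariance(miceObj)
```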
When plotting categorical data, the distribution of the number of unique imputed levels is compared to the theoretical distribution of unique levels, given they were drawn randomly. You can see that most of the samples received only 1 distinct imputed level across our 8 datasets, which means that the imputation process was fairly ‘certain’ of that imputed class. According to the graph, most of our samples would have had 3 different levels drawn, if they were drawn randomly for each dataset.
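The random-draw baseline can be approximated with a short simulation in base R. This is only an illustration of the idea, not necessarily how miceRanger computes its theoretical distribution. The 8 datasets come from the text; the 10,000 replications and the use of the full `iris$Species` frequencies are assumptions.

```r
set.seed(1)

m <- 8  # number of imputed datasets, as stated in the text

# Class frequencies to draw from; iris has three balanced species.
classProbs <- prop.table(table(iris$Species))

# For each simulated sample, draw m classes at random and count the distinct ones.
nUnique <- replicate(
  10000,
  length(unique(sample(
    names(classProbs), m,
    replace = TRUE, prob = as.numeric(classProbs)
  )))
)

# Share of simulated samples with 1, 2 or 3 distinct levels.
round(prop.table(table(nUnique)), 3)
```

With three equally likely classes and 8 draws, the exact probability of seeing all three is 1 - 3(2/3)^8 + 3(1/3)^8, roughly 0.88. That is why a sample with a single imputed level across all datasets is strong evidence that the models agree.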
When plotting the variance of numeric features, the standard deviation of the imputed values is calculated for each sample. This is then compared to the total population standard deviation. The percentage of samples with an SD below the population SD is shaded in the densities above, and the quantile is shown in the title. The iris dataset tends to be full of correlation, so all of our imputations had an SD lower than the population SD; however, this will not always be the case.
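The comparison can be reproduced by hand for one variable. The sketch below assumes `ampIris` and `miceObj` from the setup (rebuilt with an assumed `m = 8` if missing) and takes the population SD from the nonmissing values of `Sepal.Length`; the plotting function may define it slightly differently.

```r
library(miceRanger)

# Reuse ampIris and miceObj from the setup; rebuild them only if missing (m = 8 assumed).
if (!exists("miceObj") || !exists("ampIris")) {
  ampIris <- amputeData(iris, perc = 0.25)
  miceObj <- miceRanger(ampIris, m = 8, verbose = FALSE)
}

# One completed dataset per imputation.
completed <- completeData(miceObj)

# Rows where Sepal.Length was amputed, i.e. where values were imputed.
missingRows <- which(is.na(ampIris$Sepal.Length))

# Matrix of imputed values: one row per sample, one column per dataset.
imps <- sapply(completed, function(d) d$Sepal.Length[missingRows])

# SD of each sample's imputed values across datasets vs. the population SD.
sampleSD <- apply(imps, 1, sd)
popSD <- sd(ampIris$Sepal.Length, na.rm = TRUE)

# Fraction of imputed samples whose SD falls below the population SD.
belowPop <- mean(sampleSD < popSD)
belowPop
```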