For illustration, we selected the Cleveland Clinic Heart Disease data set from the University of California, Irvine (UCI) machine learning repository (Dua and Graff 2017). Below, we use eleven variables: five are continuous, four are dichotomous, and two are ordinal categorical.
library(modgo)
data("Cleveland", package = "modgo")
# Specifying dichotomous and ordinal categorical variables
binary_variables <- c("Sex","HighFastBloodSugar","CAD","ExInducedAngina")
categorical_variables <- c("Chestpaintype","RestingECG")
nrep <- 500
plot_variables <- c("Age", "STDepression", binary_variables[c(1,3)], categorical_variables)
In this section, we run modgo with its default settings. For modgo to produce results that mimic the original data set efficiently, the user needs to specify which variables are dichotomous and which are ordinal categorical. All other variables are treated as continuous. All modgo runs in this and the following sections produce 500 data sets through the specification nrep = 500; the default is 100.
Figure 1 shows the correlation plots for the default modgo run, and Figure 2 displays the distribution plots for the original data set and one simulated data set. By default, the displayed simulated data set is the first one. Moreover, all plots are restricted to a selected subset of the variables.
test <- modgo(data = Cleveland,
              bin_variables = binary_variables,
              categ_variables = categorical_variables,
              nrep = nrep)
modgo provides an option to simulate only those subjects (instances) that fulfill a specific requirement. In the simplest case (Section 2.1), the user can specify an upper boundary, a lower boundary, or an interval for a variable. Alternatively, the user may specify a combination of variables and thresholds.
Three steps are required when subjects need to fulfill a specific selection criterion for a continuous variable. First, the name of the variable for which the threshold is set needs to be specified. Second, the left and right boundaries need to be specified. Third, a data frame with three columns is defined: Column 1 holds the name of the threshold variable, Column 2 the left boundary, i.e., the lower bound, and Column 3 the right boundary, i.e., the upper bound. The data frame is then passed to modgo using the thresh_var argument. In the example, all subjects have to be at least 66 years old. The selection variable therefore is Age, with left threshold 65 and right threshold NA, which stands for infinity, i.e., no upper bound.
If the percentage of simulated samples fulfilling the indicated threshold requirements is less than 10%, modgo stops to avoid excessive computation time. However, users can force the requested simulation to run by setting thresh_force = TRUE.
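To give a feeling for when the 10% rule would be triggered, the following base R sketch estimates the share of draws that satisfy a lower threshold. It does not reproduce modgo's internal check: the normal distribution, its mean and standard deviation, the strict inequalities, and all object names are assumptions made for illustration only.

```r
# Illustrative sketch only; this is not modgo's internal check.
set.seed(1)

# Hypothetical stand-in for a simulated continuous variable
# (mean and sd are assumed values, not taken from the Cleveland data)
sim_values <- rnorm(10000, mean = 50, sd = 8)

# Same layout as the thresholds in the text: NA means "no bound on this side"
left  <- 65
right <- NA

# Assumption: strict inequalities are used at both boundaries
lower_ok <- if (is.na(left))  TRUE else sim_values > left
upper_ok <- if (is.na(right)) TRUE else sim_values < right

# Share of draws that would be kept
accept_rate <- mean(lower_ok & upper_ok)

# TRUE means fewer than 10% of the draws fulfill the requirement
below_ten_percent <- accept_rate < 0.10
print(accept_rate)
print(below_ten_percent)
```

With a low acceptance share, most simulated subjects are discarded, which is why a run of this kind can become expensive.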
Figure 3 shows the correlation plot for this illustration. Substantial differences between the original and the simulated correlation plots can be observed for RestingECG and several other variables. Figure 4 displays the corresponding distribution plot. The age distribution is shifted, as expected. Furthermore, the proportion of subjects with coronary artery disease (CAD = 1) is higher in the simulated than in the original data set.
Variables <- c("Age")
thresh_left <- c(65)
thresh_right <- c(NA)
thresholds <- data.frame(Variables, thresh_left, thresh_right)
print(as.matrix(thresholds))
## Variables thresh_left thresh_right
## [1,] "Age" "65" NA
test_thresh <- modgo(data = Cleveland,
                     bin_variables = binary_variables,
                     categ_variables = categorical_variables,
                     thresh_var = thresholds,
                     nrep = nrep,
                     thresh_force = TRUE)
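The same three-column layout should also cover the interval case mentioned above, where both boundaries are set. The following is a minimal sketch; the boundary values 50 and 65 and the object name thresholds_interval are hypothetical and chosen only for illustration.

```r
# Hypothetical interval: both a left and a right boundary are given for Age.
# The column names follow the layout used in the text.
thresholds_interval <- data.frame(Variables    = "Age",
                                  thresh_left  = 50,
                                  thresh_right = 65)
print(thresholds_interval)
```

Such a data frame would be passed through the thresh_var argument in the same way as the one-sided threshold.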
For continuous variables, modgo provides the option to add normally distributed noise with mean 0 and variance \(\sigma_{p}^2\). The perturbation is constructed so that the variance of the perturbed variable is identical to the variance of the original variable. This option permits the generation of values for continuous variables that were not observed in the original data set.
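One way in which added noise can leave the variance unchanged is sketched below in base R. This is an illustration under stated assumptions, not modgo's implementation: it assumes that the noise variance is \(\sigma_{p}^2 = p\,\sigma^2\) for a perturbation proportion \(p\), and that the centered original variable is shrunk by \(\sqrt{1-p}\), so that the total variance is \((1-p)\sigma^2 + p\,\sigma^2 = \sigma^2\). The variable x and all parameter values are hypothetical.

```r
# Illustrative sketch only; assumptions are marked in the comments.
set.seed(42)

# Hypothetical continuous variable (values are made up)
x <- rnorm(100000, mean = 130, sd = 18)

p      <- 0.7     # assumed perturbation proportion
sigma2 <- var(x)  # variance of the original variable

# Assumption: noise has mean 0 and variance sigma_p^2 = p * sigma^2
noise <- rnorm(length(x), mean = 0, sd = sqrt(p * sigma2))

# Shrink the centered variable by sqrt(1 - p), then add the noise,
# so that (1 - p) * sigma^2 + p * sigma^2 = sigma^2
x_pert <- mean(x) + sqrt(1 - p) * (x - mean(x)) + noise

# Ratio of perturbed to original variance; close to 1 up to sampling error
var_ratio <- var(x_pert) / var(x)
print(var_ratio)
```

Because the noise is drawn from a continuous distribution, the perturbed values generally differ from every observed value, which matches the stated purpose of the option.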
To specify which variables are to be perturbed and to which degree, i.e., percentage, the user needs to provide modgo with a named vector of the percentages, with the corresponding variable names as the names of the vector.
As in the previous examples, Figure 5 shows the correlation plots for the run with perturbation, and Figure 6 displays the distribution plots. Figure 6 shows that the distributions of both resting blood pressure and cholesterol change substantially due to the perturbation.
# Create named vector
perturb_vector <- c(0.9,0.7)
names(perturb_vector) <- c("RestingBP","Cholsterol")
test_pertru <- modgo(data = Cleveland,
                     bin_variables = binary_variables,
                     categ_variables = categorical_variables,
                     pertr_vec = perturb_vector,
                     nrep = nrep)