This vignette is organized as follows:

Getting started

Exploring the data

Model specification

Parameter estimation

Robust prediction of the area-level means

Mean square prediction error

In small area estimation (SAE), we distinguish two types of models:

- model A: basic area-level model (also known as Fay-Herriot model),
- model B: basic unit-level model,

The classification of the models (A or B) is from Rao (2003, Chapter 7). The current version of the
package implements the following estimation methods under the
**unit-level model (model B)**:

- maximum-likelihood (ML) estimation,
- robust Huber type
*M*-estimation.

The package can be installed from CRAN using
`install.packages("rsae")`

. Once the `rsae`

package has been installed, we need to load it to the current session by
`library("rsae")`

.

**Work flow**

The prediction of the area-level means takes three steps.

- model specification using
`saemodel()`

, - model fitting using
`fitsaemodel()`

, - prediction of the random effects and the area-specific means using
`robpredict()`

; this step includes the computation of the mean square prediction error.

We use the `landsat`

data of Battese et
al. (1988), which is loaded by

The `landsat`

data is a compilation of survey and
satellite data from Battese et al. (1988). It
consists of data on segments (primary sampling unit; 1 segement approx.
250 hectares) under corn and soybeans for 12 counties in north-central
Iowa.

- The survey data on the areas under corn and soybeans (reported in hectares) in the 37 segments of the 12 counties have been determined by USDA Statistical Reporting Service staff, who interviewed farm operators. A segment is about 250 hectares.
- For the LANDSAT satellite data, information is recorded as “pixels”. A pixel is about 0.45 hectares.

In the three smallest counties (Cerro Gordo, Hamilton, and Worth), data is available only for one sample segment. All other counties have data for more than one sample segment (i.e., unbalanced data). The largest area (Hardin) covers six units.

The data for the observations 32, 33, and 34 are shown below

```
> landsat[32:34,]
SegmentsInCounty SegementID HACorn HASoybeans PixelsCorn PixelsSoybeans
32 556 1 88.59 102.59 220 262
33 556 2 88.59 29.46 340 87
34 556 3 165.35 69.28 355 160
MeanPixelsCorn MeanPixelsSoybeans outlier CountyName
32 325.99 177.05 FALSE Hardin
33 325.99 177.05 TRUE Hardin
34 325.99 177.05 FALSE Hardin
```

We added the variable `outlier`

to the original data. It
flags observation 33 as an outlier, which is in line with the discussion
in Battese et al. (1988).

We consider estimating the parameters of the basic unit-level model (Battese et al. , 1988)

\[ \begin{equation*} \mathrm{HACorn}_{i,j} = \alpha + \beta_1 \mathrm{PixelsCorn}_{i,j} + \beta_2 \mathrm{PixelsSoybeans}_{i,j} + u_i + e_{i,j}, \end{equation*} \]

where \(j=1,\ldots, n_i\), \(i=1, \ldots,12\), and

- \(\alpha\), \(\beta_1\), and \(\beta_2\) are unknown real-valued coefficients,
- the \(u_i\)’s are area-specific random effects, \(u_i \sim N(0, \sigma_u^2)\), and \(\sigma_u^2 \geq 0\) is unknown,
- the \(e_i\)’s are residual errors, \(e_{ij} \sim N(0, \sigma_e^2)\), and \(\sigma_e^2 > 0\) is unkown,
- the \(u_i\)’s and \(e_{ij}\)’s are independent.

The model is defined with the help of
the`saemodel()`

function:

```
> bhfmodel <- saemodel(formula = HACorn ~ PixelsCorn + PixelsSoybeans,
+ area = ~ CountyName,
+ data = subset(landsat, subset = (outlier == FALSE)))
```

where

`formula`

defines the fixed-effect part of the model (the`~`

operator separates dependent and independent variables; by default, the model includes a regression intercept),`area`

specifies the area-level random effect (variable`CountyName`

serves as area identifier; note that the argument`area`

is also a`formula`

object),`data`

specifies the`data.frame`

(here, we consider the subset of observations that are not flagged as outliers).

If you need to know more about a particular model, you can use the
`summary()`

method.

Having specified `bhfmodel`

, we consider estimating its
parameters by different estimation method.

The maximum likelihood (ML) estimates of the parameters are computed by

On print, object `mlfit`

shows the following.

```
> mlfit
ESTIMATES OF SAE-MODEL (model type B)
Method: Maximum likelihood estimation
---
Fixed effects
Model: HACorn ~ (Intercept) + PixelsCorn + PixelsSoybeans
Coefficients:
(Intercept) PixelsCorn PixelsSoybeans
50.9676 0.3286 -0.1337
---
Random effects
Model: ~1| CountyName
(Intercept) Residual
Std. Dev. 11.00 11.72
---
Number of Observations: 36
Number of Areas: 12
```

Inferential statistics for the object `mlfit`

are computed
by the `summary()`

method.

```
> summary(mlfit)
ESTIMATION SUMMARY
Method: Maximum likelihood estimation
---
Fixed effects
Value Std.Error t-value df p-value
(Intercept) 50.96756 23.47509 2.17113 22 0.0410 *
PixelsCorn 0.32858 0.04798 6.84772 22 7.05e-07 ***
PixelsSoybeans -0.13371 0.05306 -2.51984 22 0.0195 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Function `coef()`

extracts the estimated coefficients from
a fitted model.

- By default, function
`coef()`

is called with argument`type = "both"`

, which implies that the fixed effects and the random effect variances are returned. - Calling the function with
`type = "ranef"`

or`type = "fixef"`

returns the random effect variances or the fixed effects, respectively.

Function `convergence()`

can be useful if the algorithm
did not converge; see Appendix.

The Huber-type *M*-estimator is appropriate for situations
where the response variable is supposed to be (moderately) contaminated
by outliers. The *M*-estimator downweights residual outliers, but
it *does not limit* the effect of high-leverage observations.

The Huber-type *M*-estimator of `bhfmodel`

is

where `k`

is the robustness tuning constant of the Huber
\(\psi\)-function (\(0<k < \infty\)). On print, we
have

```
> huberfit
ESTIMATES OF SAE-MODEL (model type B)
Method: Huber-type M-estimation
Robustness tuning constant: k = 1.5
---
Fixed effects
Model: HACorn ~ (Intercept) + PixelsCorn + PixelsSoybeans
Coefficients:
(Intercept) PixelsCorn PixelsSoybeans
50.2849 0.3285 -0.1392
---
Random effects
Model: ~1| CountyName
(Intercept) Residual
Std. Dev. 11.95 12.36
---
Number of Observations: 36
Number of Areas: 12
```

If the algorithm did not converge, see paragraph *safe mode*
(below) and Appendix. Inferential statistics for
`huberfit`

are computed by the `summary()`

method.

```
> summary(huberfit)
ESTIMATION SUMMARY
Method: Huber-type M-estimation
Robustness tuning constant: k = 1.5
---
Fixed effects
Value Std.Error t-value df p-value
(Intercept) 50.28489 24.84651 2.02382 22 0.0553 .
PixelsCorn 0.32845 0.05077 6.46906 22 1.65e-06 ***
PixelsSoybeans -0.13916 0.05618 -2.47711 22 0.0214 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
---
Degree of downweighting/winsorization:
sum(wgt)/n
fixeff 0.9841
residual var 0.9723
area raneff var 0.9841
```

The output is separated into 2 blocks. The first block shows inferential statistics of the estimated fixed effects. The second block reports the degree of downweighting that is applied to outlying residuals at the final iteration (by estimating equations, EE, separately). The more the value of “sum(wgt)/n” deviates from 1.0, the more downweighting has been applied.

The methods `coef()`

and `convergence()`

are
also available.

In the safe mode, the algorithm is initialized by a
high-breakdown-point regression estimator. **Note.** In
order to use the safe mode, the `robustbase`

package of Maechler et al. (2021) must be installed.

The safe mode is entered by specifying one the following initialization methods:

`init = "lts"`

: least trimmed squares (LTS) regression estimator; see Rousseeuw (1984) and Rousseeuw and van Driessen (2006),`init = "s"`

: regression*S*-estimator; see Rousseeuw and Yohai (1984) and Salibian-Barerra and Yohai (2006)

in the call of `fitsaemodel()`

. The safe mode uses a
couple of internal tests to check whether the estimates at consecutive
iterations behave well. Notably, it tries to detect cycles in the
sequence iteratively refined estimates; see Schoch
(2012).

The package implements the following methods to predict the area-level means:

- empirical best linear unbiased predictor (EBLUP); see Rao (2003, Chapter 7.2),
- robust prediction method of Copt and Victoria-Feser (2009); see also Heritier et al. (2009, p. 113–114).

EBLUP and robust predictions are computed by

where

`fit`

is a fitted model (ML estimate or*M*-estimate),`k`

is the robustness tuning constant of the Huber \(\psi\)-function for robust prediction. By default`k`

is`NULL`

which means that the procedure inherits the tuning constant`k`

that has been used in fitting the model; see`fitsaemodel()`

. For the ML estimator, \(k\) is taken as a large value that “represents” infinity; see argument`k_Inf`

in`fitsaemodel.control()`

.`reps`

and`progress_bar`

are used in the computation of the mean square prediction error (see below).

In the `landsat`

data, the county-specific population
means of pixels of the segments under corn and soybeans are recorded in
the variables MeanPixelsCorn and MeanPixelsSoybeans, respectively. Each
sample segment in a particular county is assigned the county-specific
means. Therefore, the population mean of the variables MeanPixelsCorn
and MeanPixelsSoybeans occurs \(n_i\)
times. The unique county-specific population means are obtained
using

```
> dat <- unique(landsat[-33, c("MeanPixelsCorn", "MeanPixelsSoybeans", "CountyName")])
> dat <- cbind(rep(1,12), dat)
> rownames(dat) <- dat$CountyName
> dat <- dat[,1:3]
> dat
rep(1, 12) MeanPixelsCorn MeanPixelsSoybeans
Cerro Gordo 1 295.29 189.70
Hamilton 1 300.40 196.65
Worth 1 289.60 205.28
Humboldt 1 290.74 220.22
Franklin 1 318.21 188.06
Pocahontas 1 257.17 247.13
Winnebago 1 291.77 185.37
Wright 1 301.26 221.36
Webster 1 262.17 247.09
Hancock 1 314.28 198.66
Kossuth 1 298.65 204.61
Hardin 1 325.99 177.05
```

The first column of `dat`

is a dummy variable (because
`bhfmodel`

has a regression intercept). The second and third
column represent, respectively, the county-specific population means of
segments under corn and soybeans.

Consider the ML estimate, `mlfit`

, of the
`bhfmodel`

model. The EBLUP of the area-level means is
computed by

```
> pred <- robpredict(mlfit, areameans = dat)
> pred
Robustly Estimated/Predicted Area-Level Means:
raneff fixeff area mean
Cerro Gordo -0.348 122.629 122.281
Hamilton 2.731 123.379 126.110
Worth -11.522 118.677 107.154
Humboldt -8.313 117.053 108.741
Franklin 13.641 130.380 144.021
Pocahontas 9.529 102.425 111.954
Winnebago -9.043 122.052 113.009
Wright 1.648 120.358 122.006
Webster 11.082 104.073 115.155
Hancock -3.229 127.671 124.442
Kossuth -14.621 121.740 107.119
Hardin 8.445 134.408 142.853
```

because (by default) `k = NULL`

. By explicitly specifying
`k`

in the call of `robpredict()`

, we can—in
principle—compute robust predictions for the ML estimate instead of the
EBLUP.

Consider the Huber *M*-estimate `huberfit`

(with
tuning constant `k = 1.5`

, see call of above). If
`k`

is not specified in the call of
`robpredict()`

, robust predictions are computed with \(k\) equal to 1.5; otherwise, the robust
predictions are based on the value of `k`

in the call of
`robpredict()`

.

Object `pred`

is a `list`

with slots
`"fixeff"`

, `"raneff"`

, `"means"`

,
`"mspe"`

, etc. For instance, the predicted means can be
extracted by `pred$means`

or
`pred[["means"]]`

.

A `plot()`

function is available to display the predicted
means.

Function `residuals()`

computes the residuals.

Function `robpredict()`

can be used to compute bootstrap
estimates of the mean squared prediction errors (MSPE) of the predicted
area-level means; see Sinha and Rao (2009). To
compute the MSPE, we must specify the number of bootstrap replicates
`(reps)`

. If `reps = NULL`

, the MSPE is not
computed.

Consider (for instance) the ML estimate. EBLUP and MSPE of the EBLUP based on 100 bootstrap replicates are computed by

```
> pred <- robpredict(mlfit, areameans = dat, reps = 100,
+ progress_bar = FALSE)
> pred
Robustly Estimated/Predicted Area-Level Means:
raneff fixeff area mean mspe
Cerro Gordo -0.348 122.629 122.281 82.418
Hamilton 2.731 123.379 126.110 99.031
Worth -11.522 118.677 107.154 60.627
Humboldt -8.313 117.053 108.741 64.231
Franklin 13.641 130.380 144.021 39.708
Pocahontas 9.529 102.425 111.954 39.070
Winnebago -9.043 122.052 113.009 43.537
Wright 1.648 120.358 122.006 42.533
Webster 11.082 104.073 115.155 32.976
Hancock -3.229 127.671 124.442 32.567
Kossuth -14.621 121.740 107.119 28.018
Hardin 8.445 134.408 142.853 20.681
(mpse: 100 boostrap replicates)
```

The number of `reps = 100`

is kept small for illustration
purposes; in practice, we should choose larger values. The progress bar
has been disabled as it is not suitable for the automatic vignette
generation.

A visual display of the predicted area-level means obtains by
`plot(pred, type = "l", sort = "means")`

.

COPT, S. and M.-P. VICTORIA-FESER
(2009). *Robust prediction in mixed linear models*, Tech. rep.,
University of Geneva.

BATTESE, G. E., R. M. HARTER and W. A.
FULLER (1988). An error component model for prediction of county crop
areas using, *Journal of the American Statistical Association*
**83**, 28–36. DOI: 10.2307/2288915

FELLNER, W. H. (1986). Robust estimation of variance components,
*Technometrics* **28**, 51–60. DOI:
10.1080/00401706.1986.10488097

HERITIER, S., E. CANTONI, S. COPT, and
M.-P. VICTORIA-FESER (2009). *Robust Methods in Biostatistics*,
New York: John Wiley & Sons.

MAECHLER, M., P. J. ROUSSEEUW, C. CROUX, V. TODOROC, A. RUCKSTUHL, M.
SALIBIAN-BARRERA, T. VERBEKE, M. KOLLER, M. KOLLER, E. L. T. CONCEICAO
and M. A. DI PALMA (2023).
*robustbase: Basic Robust Statistics*. R package version 0.99-1.
URL:
CRAN.R-project.org/package=robustbase.

RAO, J.N.K. (2003). *Small Area Estimation*, New York: John
Wiley and Sons.

RICHARDSON, A. M. and A. H. WELSH
(1995). Robust Restricted Maximum Likelihood in Mixed Linear Models,
*Biometrics* **51**, 1429–1439. DOI: 10.2307/2533273

ROUSSEEUW, P. J. (1984). Least Median of Squares Regression,
*Journal of the American Statistical Association*
**79**, 871–880. DOI: 10.2307/2288718

ROUSSEEUW, P. J. and K. VAN DRIESSEN
(2006). Computing LTS Regression for Large Data Sets, *Data Mining
and Knowledge Discovery* **12**, 29–45. DOI:
10.1007/s10618-005-0024-4

ROUSSEEUW, P. J. and V. YOHAI (1984).
Robust Regression by Means of S Estimators, in *Robust and Nonlinear
Time Series Analysis*, ed. by FRANKE, J., W. HÄRDLE AND R. D.
MARTIN, New York: Springer, 256–274.

SALIBIAN-BARRERA, M. and V. J. YOHAI
(2006). A Fast Algorithm for S-Regression Estimates, *Journal of
Computational and Graphical Statistics* **15**,
414–427. DOI:
10.1198/106186006x113629

SCHOCH, T. (2012). Robust Unit-Level Small Area Estimation: A Fast
Algorithm for Large Datasets. *Austrian Journal of Statistics*
**41**, 243–265. DOI:
10.17713/ajs.v41i4.1548

SINHA, S.K. and J.N.K. RAO (2009).
Robust small area estimation. *Canadian Journal of Statistics*
**37**, 381–399. DOI: 10.1002/cjs.10029

Suppose that we computed

The algorithm did not converge and we get the following output

```
> est
THE ALGORITHM DID NOT CONVERGE!
---
1) use convergence() of your fitted model to learn more
2) study the documentation using the command ?fitsaemodel
3) you may call fitsaemodel with 'init' equal to (either) 'lts'
or 's' (this works also for ML, though it may no be very efficient)
4) if it still does not converge, the last resort is to modify
'acc' and/or 'niter'
ESTIMATES OF SAE-MODEL (model type B)
Method: Maximum likelihood estimation
---
Fixed effects
Model: HACorn ~ (Intercept) + PixelsCorn + PixelsSoybeans
Coefficients:
(Intercept) PixelsCorn PixelsSoybeans
NA NA NA
---
Random effects
Model: ~1| CountyName
(Intercept) Residual
Std. Dev. NA NA
---
Number of Observations: 36
Number of Areas: 12
```

To learn more, we call `convergence(est)`

and get

```
> convergence(est)
CONVERGENCE REPORT
NOTE: ALGORITHM DID NOT CONVERGE!
---
User specified number of iterations (niter) and
numeric precision (acc):
niter acc
overall loop 3 1e-05
fixeff 200 1e-05
residual var 200 1e-05
area raneff var 100 1e-05
---
Number of EE-specific iterations in each call
(given the user-defined specs), reported for
each of the3overall iterations:
fixeff residual var area raneff var
1 2 2 20
2 2 2 13
3 2 2 14
```

Clearly, we have deliberately set the number of (overall or outer
loop) iterations equal to `niter = 3`

to illustrate the
behavior; see also “niter = 3” on the line “overall loop” in the above
table. As a consequence, the algorithm failed to converge.

The maximum number of iterations for the fixed effects (“fixeff”) is
200, for the residual variance estimator (“residual var”) is 200, for
the area-level random effect variance is (“area raneff var”) is 100. The
default values can be modified; see documentation of
`fitsaemodel.control()`

.

The last table in the above output shows the number of iterations
that the algorithm required for the `niter = 3`

overall loop
iteration (i.e., in the first loop, we have `fixef = 2`

,
`residual var = 2`

and `area raneff var = 18`

iterations). From this table we can learn how to adjust the default
values in case the algorithm does not converge.