In 2016, the American Statistical Association released its first-ever
position paper, warning of the problems with significance testing and
“p-values.” Though the issues had been well known for many years, it was
“significant” that the ASA finally took a stand. Let’s use the **lsa**
data in this package to illustrate.

According to the Kaggle entry, this is a

…Law School Admissions dataset from the Law School Admissions Council (LSAC). From 1991 through 1997, LSAC tracked some twenty-seven thousand law students through law school, graduation, and sittings for bar exams. …The dataset was originally collected for a study called ‘LSAC National Longitudinal Bar Passage Study’ by Linda Wightman in 1998.

Here is an overview of the variables:

```
data(lsa)
names(lsa)
#> [1] "age" "decile1" "decile3" "fam_inc" "lsat" "ugpa"
#> [7] "gender" "race1" "cluster" "fulltime" "bar"
```

Most of the names are self-explanatory, but we’ll note that: The ‘age’ variable is apparently birth year, with e.g. 67 meaning 1967. The two decile scores are class standing in the first and third years of law school, and ‘cluster’ refers to the reputed quality of the law school. Two variables of particular interest might be the student’s score on the Law School Admission Test (LSAT) and a logical variable indicating whether the person passed the bar examination.

There is concern that the LSAT and other similar tests may be heavily influenced by family income, and thus unfair, especially to underrepresented minorities. To investigate this, let’s consider the estimated coefficients in a linear model for the LSAT:

```
w <- lm(lsat ~ .,lsa) # predict lsat from all other variables
summary(w)
#>
#> Call:
#> lm(formula = lsat ~ ., data = lsa)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -19.290 -2.829 0.120 2.888 16.556
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 31.985789 0.448435 71.328 < 2e-16 ***
#> age 0.020825 0.005842 3.565 0.000365 ***
#> decile1 0.127548 0.020947 6.089 1.15e-09 ***
#> decile3 0.214950 0.020919 10.275 < 2e-16 ***
#> fam_inc 0.300858 0.035953 8.368 < 2e-16 ***
#> ugpa -0.278173 0.080431 -3.459 0.000544 ***
#> gendermale 0.513774 0.060037 8.558 < 2e-16 ***
#> race1black -4.748263 0.198088 -23.970 < 2e-16 ***
#> race1hisp -2.001460 0.203504 -9.835 < 2e-16 ***
#> race1other -0.868031 0.262529 -3.306 0.000947 ***
#> race1white 1.247088 0.154627 8.065 7.71e-16 ***
#> cluster2 -5.106684 0.119798 -42.627 < 2e-16 ***
#> cluster3 -2.436137 0.074744 -32.593 < 2e-16 ***
#> cluster4 1.210946 0.088478 13.686 < 2e-16 ***
#> cluster5 3.794275 0.124477 30.482 < 2e-16 ***
#> cluster6 -5.532161 0.210751 -26.250 < 2e-16 ***
#> fulltime2 -1.388821 0.116213 -11.951 < 2e-16 ***
#> barTRUE 1.749733 0.102819 17.018 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.197 on 20782 degrees of freedom
#> Multiple R-squared: 0.3934, Adjusted R-squared: 0.3929
#> F-statistic: 792.9 on 17 and 20782 DF, p-value: < 2.2e-16
```

There are definitely some salient racial aspects here, but, staying with the income issue, look at the coefficient for family income, 0.3009. The p-value is essentially 0, which in an academic research journal would classically be heralded with much fanfare, termed “very highly significant,” with a 3-star insignia. Indeed, the latter is seen in the output above. But actually, the impact of family income is not significant in practical terms. Here’s why:

Family income in this dataset is measured in quintiles. So this
estimated coefficient says that, for example, if we compare people who
grew up in the bottom 20% of income with those who were raised in the
next 20%, the mean LSAT score rises by only about 1/3 of 1 point, on a
test where scores are typically in the 20s, 30s and 40s. The 95%
confidence interval (CI), (0.2304,0.3714), again indicates that the
effect size here is very small.

Mathematically, testing for a 0 effect is equivalent to checking
whether the CI contains 0. But this misses the point of the CI,
which is to (a) give us an idea of the effect *size*, and (b) to
indicate *how accurate* our estimate of that size is. Aspect (a)
is given by the location of the center of the interval, while (b) is
seen from the CI’s radius.
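As a side note, a CI like the one quoted above can be obtained directly with R’s **confint** function, applied to our fitted model:

```
# 95% CI for the family income coefficient in the model w;
# confint() computes estimate plus/minus t-quantile times std. error
confint(w, 'fam_inc')
# approximately (0.2304, 0.3714), as quoted above
```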

So family income is not an important factor after all, and the significance test was highly misleading.

Some who read the above may object, “Sure, there may sometimes be a difference between statistical significance and practical significance. But I just want to check whether my model fits the data.” Actually, it’s the same problem.

*“I just want to check whether my model fits the data”*

For instance, suppose we are considering adding an interaction term between race and undergraduate GPA to our above model. Let’s fit this more elaborate model, then compare.

```
w1 <- lm(lsat ~ .+race1:ugpa,lsa) # add interaction
summary(w1)
#>
#> Call:
#> lm(formula = lsat ~ . + race1:ugpa, data = lsa)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -19.1783 -2.8065 0.1219 2.8879 16.0633
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 26.574993 1.219611 21.790 < 2e-16 ***
#> age 0.020612 0.005837 3.531 0.000415 ***
#> decile1 0.127585 0.020926 6.097 1.10e-09 ***
#> decile3 0.213918 0.020902 10.234 < 2e-16 ***
#> fam_inc 0.295042 0.035939 8.210 2.35e-16 ***
#> ugpa 1.417659 0.363389 3.901 9.60e-05 ***
#> gendermale 0.513686 0.059986 8.563 < 2e-16 ***
#> race1black 4.121631 1.439354 2.864 0.004194 **
#> race1hisp 1.378504 1.570833 0.878 0.380191
#> race1other 2.212299 1.976702 1.119 0.263073
#> race1white 6.838251 1.201559 5.691 1.28e-08 ***
#> cluster2 -5.105703 0.119879 -42.590 < 2e-16 ***
#> cluster3 -2.427800 0.074862 -32.430 < 2e-16 ***
#> cluster4 1.208794 0.088453 13.666 < 2e-16 ***
#> cluster5 3.777611 0.124422 30.361 < 2e-16 ***
#> cluster6 -5.565130 0.210945 -26.382 < 2e-16 ***
#> fulltime2 -1.406151 0.116132 -12.108 < 2e-16 ***
#> barTRUE 1.743800 0.102855 16.954 < 2e-16 ***
#> ugpa:race1black -2.876555 0.460281 -6.250 4.20e-10 ***
#> ugpa:race1hisp -1.022786 0.494210 -2.070 0.038508 *
#> ugpa:race1other -0.941852 0.617940 -1.524 0.127479
#> ugpa:race1white -1.737553 0.370283 -4.693 2.72e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.193 on 20778 degrees of freedom
#> Multiple R-squared: 0.3948, Adjusted R-squared: 0.3942
#> F-statistic: 645.4 on 21 and 20778 DF, p-value: < 2.2e-16
```

Indeed, the Black and white interaction terms with undergraduate GPA are “very highly significant.” But does that mean we should use the more complex model? Let’s check the actual impact of including the interaction terms, by doing predictions from both models on an example X value:

```
typx <- lsa[1,-5] # set up an example case
predict(w,typx) # no-interaction model
#> 2
#> 40.2294
predict(w1,typx) # with-interaction model
#> 2
#> 40.2056
```

We see here that adding the interaction terms changed the prediction, and thus the estimated value of the regression function, by only about 0.02 out of a 40.23 baseline. So, while the test has validated our with-interaction model, we may well prefer the simpler, no-interaction model.
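A single example case may not be representative, of course. As a further check, one might compare the two models’ fitted values over the entire dataset, say via the distribution of the absolute differences:

```
# absolute differences between the two models' fitted values,
# over all cases in lsa; a small typical discrepancy suggests
# the interaction terms matter little in practical terms
summary(abs(fitted(w) - fitted(w1)))
```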

*“I just want to know whether the effect is positive or negative”*

Here we have a different problem: bias. Our linear model is just that,
a model, and its imperfection will induce a bias. This could change the
estimated effect from positive to negative or vice versa, **even with an
infinite amount of data**. As the dataset size n grows, the
variance of the estimated parameters goes to 0, but the bias won’t go away.
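To illustrate with a toy example (not the **lsa** data): suppose the true regression function is quadratic, say E(Y | X) = (X - 0.75)², with X uniform on (0,1), but we fit a straight line. Increasing X actually *raises* the mean of Y for X > 0.75, yet the fitted slope converges to -0.5, a sign error that no amount of data will repair:

```
# model misspecification: quadratic truth, linear fit
set.seed(9999)
n <- 100000 # large n; the variance is nearly gone, but the bias remains
x <- runif(n)
y <- (x - 0.75)^2 + rnorm(n, sd=0.1)
coef(lm(y ~ x))['x'] # approximately -0.5, despite a positive effect for x > 0.75
```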

*Bottom line:*

We must not take small p-values literally.

The central issue in the above examples, and essentially in any other
testing situation, is that *a significance test is not answering the
question of interest to us.*

We wish to know whether family income plays a substantial role in the LSAT, not whether there is any relation at all, no matter how meaningless. Similarly, we wish to know whether the interaction between race and GPA is substantial enough to include it in our model, not whether there is any interaction at all, no matter how tiny.

The question at hand in research studies is rarely, if ever, whether a quantity is 0.000… to infinitely many decimal places. And as noted, our measuring instruments are not this accurate in the first place; there will always be systematic bias, in our model, our dataset and so on.

Thus in almost all cases, significance tests don’t address the issue of interest, which is whether some population quantity is substantial enough to be considered important. Analysts should not be misled by words like “significant.” Modern statistical practice places reduced value, or in the view of many, no value at all, on significance testing.