Motivation

TL;DR: The core idea of this method: use the jackknife to estimate the distribution of prediction errors, then take quantiles of that distribution as the upper and lower bounds of a confidence interval, which quantifies the uncertainty.

Quantifying the uncertainty over the predictions of existing deep learning models remains a challenging problem.

Deep learning models are increasingly popular in various application domains. A key question often asked of such models is: “Can we trust this particular model prediction?” This is highly relevant in applications wherein predictions are used to inform critical decision-making.

Existing methods for uncertainty estimation do not guarantee that the estimated intervals (1) cover the true prediction targets with high probability or (2) discriminate between high- and low-confidence predictions.


  • Frequentist coverage: denotes whether the estimated confidence interval covers the true targets.

  • Discrimination: denotes whether the model can distinguish high-confidence predictions (regions with dense training data) from low-confidence ones (regions with scarce training data).

Existing methods for uncertainty estimation are based predominantly on Bayesian neural networks.

  • Bayesian neural networks require significant modifications to the training procedure.
  • Approximating the posterior distribution could jeopardize both the coverage and the discrimination performance of the resulting credible intervals.

Contributions

  • Propose the discriminative jackknife (DJ), a method for estimating predictive uncertainty inspired by the jackknife leave-one-out (LOO) resampling procedure.

  • To avoid exhaustively retraining the model for each sample, they adopt higher-order influence functions to approximate the effect of removing each sample.

  • DJ is applied post hoc, after model training; it improves coverage and discrimination without any modifications to the underlying predictive model.

Preliminaries

Learning setup

Consider a standard supervised learning setup: we minimize the prediction loss on the training data $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^{n}$.
$$\hat{\theta} = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \ell\left(\theta; (x_i, y_i)\right)$$

Uncertainty Quantification

We aim to estimate the uncertainty in the model’s prediction through the pointwise confidence interval $\mathcal{C}(x;\hat{\theta})$.
$$\mathcal{C}(x;\hat{\theta}) = \left[\, \hat{y}^{-}(x),\ \hat{y}^{+}(x) \,\right]$$
The degree of uncertainty in the model’s prediction is quantified by the interval width
$$\left| \mathcal{C}(x;\hat{\theta}) \right| = \hat{y}^{+}(x) - \hat{y}^{-}(x)$$

Frequentist coverage

This is satisfied if the confidence interval $\mathcal{C}(x;\hat{\theta})$ covers the true target $y$ with a prespecified coverage probability of $(1-\alpha), \alpha \in (0,1)$.
$$\mathbb{P}\left( y \in \mathcal{C}(x;\hat{\theta}) \right) \geq 1 - \alpha$$

Discrimination

The confidence interval should be wider for test points with less accurate predictions, i.e., the interval width $|\mathcal{C}(x;\hat{\theta})|$ should increase with the prediction error at $x$.
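
As a concrete illustration, here is a minimal sketch (not from the paper) of how one might measure both properties on a held-out set; the arrays of targets, predictions, and interval bounds are hypothetical inputs, and the width–error correlation is only one crude proxy for discrimination:

```python
import numpy as np

def empirical_coverage(y, lo, hi):
    """Fraction of test points whose true target lies inside its interval."""
    return float(np.mean((lo <= y) & (y <= hi)))

def width_error_correlation(y, y_hat, lo, hi):
    """Correlation between interval width and absolute error: a crude
    proxy for discrimination (wider intervals should track larger errors)."""
    return float(np.corrcoef(hi - lo, np.abs(y - y_hat))[0, 1])

# Toy usage with dummy numbers:
y     = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.7, 3.5, 4.1])
w     = np.array([0.2, 0.5, 0.6, 0.3])            # per-point half-widths
lo, hi = y_hat - w, y_hat + w
print(empirical_coverage(y, lo, hi))              # want >= 1 - alpha
print(width_error_correlation(y, y_hat, lo, hi))
```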

Discriminative Jackknife

Classical Jackknife

The jackknife quantifies predictive uncertainty in terms of the average prediction error, estimated via a leave-one-out (LOO) construction: systematically leave out each sample in $\mathcal{D}_n$ and evaluate the error of the retrained model on the left-out sample.

For a target coverage of $(1-\alpha)$, the naive jackknife interval is
$$\mathcal{C}_{\alpha}(x;\hat{\theta}) = \left[\, \hat{y}(x) - \hat{\mathcal{Q}}^{+}_{\alpha}(\mathcal{R}),\ \hat{y}(x) + \hat{\mathcal{Q}}^{+}_{\alpha}(\mathcal{R}) \,\right], \qquad \mathcal{R} = \left\{\, \left| y_i - \hat{y}_{-i}(x_i) \right| \,\right\}_{i=1}^{n},$$
where $\hat{y}_{-i}$ denotes the model trained on $\mathcal{D}_n \setminus \{(x_i, y_i)\}$.
$\hat{\mathcal{Q}}^{+}_{\alpha}(\mathcal{R})$: the $\lceil (1-\alpha)(n+1) \rceil$-th smallest element of $\mathcal{R}$.
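
A minimal sketch of this construction, assuming a scikit-learn-style regressor with `fit`/`predict`; note that it retrains the model $n$ times, which is exactly the cost DJ avoids:

```python
import numpy as np
from sklearn.base import clone

def naive_jackknife_interval(model, X, y, x_test, alpha=0.1):
    """Naive jackknife: LOO residuals give a constant-width interval."""
    n = len(X)
    residuals = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i                    # leave sample i out
        m_i = clone(model).fit(X[mask], y[mask])    # retrain without it
        residuals[i] = abs(y[i] - m_i.predict(X[i:i + 1])[0])
    # Q^+_alpha(R): the (1 - alpha)(n + 1)-th smallest LOO residual
    k = min(int(np.ceil((1 - alpha) * (n + 1))), n)
    q = np.sort(residuals)[k - 1]
    y_hat = clone(model).fit(X, y).predict(np.atleast_2d(x_test))[0]
    return y_hat - q, y_hat + q                     # width 2q for every x
```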

The interval width is constant, $|\mathcal{C}_{\alpha}(x;\hat{\theta})| = 2\,\hat{\mathcal{Q}}^{+}_{\alpha}(\mathcal{R})$ for every $x$, which renders discrimination impossible.

DJ Confidence Intervals

$$\mathcal{C}_{\alpha}(x;\hat{\theta}) = \left[\, \hat{\mathcal{Q}}^{-}_{\alpha}\!\left( \left\{ \hat{y}_{-i}(x) - r_i \right\}_{i=1}^{n} \right),\ \hat{\mathcal{Q}}^{+}_{\alpha}\!\left( \left\{ \hat{y}_{-i}(x) + r_i \right\}_{i=1}^{n} \right) \,\right], \qquad r_i = \left| y_i - \hat{y}_{-i}(x_i) \right|$$
$\mathcal{G}_{\alpha,\gamma}$ is a quantile function applied to the elements of the set of marginal prediction errors $\mathcal{R} = \{r_i\}_{i=1}^{n}$ and the set of local prediction variabilities $\mathcal{V}(x) = \{\hat{y}_{-i}(x) - \hat{y}(x)\}_{i=1}^{n}$.

  • The marginal prediction error is constant, i.e., it does not depend on $x$; hence it contributes to coverage but not to discrimination.

  • The local variability term depends on $x$, hence it fully determines the discrimination performance.


The confidence interval is bounded by
(Equation: upper and lower bounds on the DJ interval width.)
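
A minimal sketch of a DJ-style (jackknife+-flavoured) interval, assuming the $n$ LOO predictors are already available, e.g. approximated with influence functions as in the next section; `loo_pred` is a hypothetical callable returning the prediction of the model trained without sample $i$:

```python
import numpy as np

def dj_interval(loo_pred, X, y, x_test, alpha=0.1):
    """DJ-style interval built from n leave-one-out predictors."""
    n = len(X)
    # marginal LOO errors r_i and LOO predictions at the test point
    r = np.array([abs(y[i] - loo_pred(i, X[i])) for i in range(n)])
    v = np.array([loo_pred(i, x_test) for i in range(n)])
    k = min(int(np.ceil((1 - alpha) * (n + 1))), n)
    lo = np.sort(v - r)[max(n - k, 0)]   # lower quantile of {y_-i(x) - r_i}
    hi = np.sort(v + r)[k - 1]           # upper quantile of {y_-i(x) + r_i}
    return lo, hi
```

Unlike the naive jackknife, the width here varies with $x$ through the LOO predictions, which is what restores discrimination.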

Efficient Implementation via Influence Functions

$$\hat{\theta}_{-i} = \arg\min_{\theta \in \Theta} \sum_{j \neq i} \ell\left(\theta; (x_j, y_j)\right)$$

To avoid retraining, approximate each $\hat{\theta}_{-i}$ using higher-order influence functions.
Influence functions enable efficient computation of the effect of a training point $(x_i, y_i)$ on $\hat{\theta}$: they evaluate the change in $\hat{\theta}$ when $(x_i, y_i)$ is up-weighted by a small factor $\epsilon$.

$$\hat{\theta}_{\epsilon, i} = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{j=1}^{n} \ell\left(\theta; (x_j, y_j)\right) + \epsilon\, \ell\left(\theta; (x_i, y_i)\right)$$
$$\mathcal{I}^{(1)}(x_i, y_i) = \left. \frac{d \hat{\theta}_{\epsilon, i}}{d \epsilon} \right|_{\epsilon = 0} = -\, H_{\hat{\theta}}^{-1} \nabla_{\theta}\, \ell\left(\hat{\theta}; (x_i, y_i)\right), \qquad H_{\hat{\theta}} = \frac{1}{n} \sum_{j=1}^{n} \nabla^{2}_{\theta}\, \ell\left(\hat{\theta}; (x_j, y_j)\right)$$
Removing a training point is equivalent to up-weighting it by $\epsilon = -\frac{1}{n}$.
$$\hat{\theta}_{-i} \approx \hat{\theta} - \frac{1}{n}\, \mathcal{I}^{(1)}(x_i, y_i) = \hat{\theta} + \frac{1}{n}\, H_{\hat{\theta}}^{-1} \nabla_{\theta}\, \ell\left(\hat{\theta}; (x_i, y_i)\right)$$
$$\hat{\theta}_{-i} \approx \hat{\theta} + \sum_{k=1}^{K} \frac{\epsilon^{k}}{k!}\, \mathcal{I}^{(k)}(x_i, y_i), \qquad \epsilon = -\frac{1}{n}, \quad \mathcal{I}^{(k)}(x_i, y_i) = \left. \frac{d^{k} \hat{\theta}_{\epsilon, i}}{d \epsilon^{k}} \right|_{\epsilon = 0}$$
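
A minimal sketch of the first-order step, using ridge regression so that the Hessian is available in closed form (a deep model would use Hessian-vector products instead); this is an illustration under those assumptions, not the paper's implementation:

```python
import numpy as np

def loo_params_via_influence(X, y, lam=1e-2):
    """First-order influence-function approximation of all LOO parameters
    for ridge regression: theta_-i ~= theta + (1/n) H^{-1} grad_i."""
    n, d = X.shape
    # full-data fit of (1/n) * sum of squared losses + (lam/2) * ||theta||^2
    theta = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
    H_inv = np.linalg.inv(X.T @ X / n + lam * np.eye(d))  # inverse Hessian
    loo_thetas = []
    for i in range(n):
        grad_i = (X[i] @ theta - y[i]) * X[i]   # per-sample loss gradient
        # removing point i == up-weighting it by eps = -1/n (first order);
        # the paper's higher-order terms would refine this Taylor step
        loo_thetas.append(theta + H_inv @ grad_i / n)
    return theta, np.array(loo_thetas)
```

The resulting `loo_thetas` can supply the `loo_pred` callable used in the DJ interval sketch above, replacing $n$ retrainings with one Hessian factorization.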

Experiments

(Figures: experimental results.)