Motivation
TL;DR: The main idea of this method is to use the jackknife to estimate the distribution of the prediction error, then apply a quantile threshold to that distribution to obtain upper and lower bounds on the error, which are used to quantify uncertainty.
Quantifying the uncertainty over the predictions of existing deep learning models remains a challenging problem.
Deep learning models are increasingly popular in various application domains. A key question often asked of such models is "Can we trust this particular model prediction?" This is highly relevant in applications where predictions inform critical decision-making.
Existing methods for uncertainty estimation are based on Bayesian neural networks. They do not guarantee that the resulting intervals (1) cover the true prediction targets with high probability or (2) discriminate between high- and low-confidence predictions.
Frequentist coverage: denotes whether the estimated confidence interval covers the true targets.
Discrimination: denotes whether the model is able to discriminate between high-confidence predictions (regions with dense training data) and low-confidence ones (regions with scarce training data).
Existing methods rely predominantly on Bayesian neural networks, which have two drawbacks:
- Bayesian neural networks require significant modifications to the training procedure.
- Approximating the posterior distribution can jeopardize both the coverage and the discrimination performance of the resulting credible intervals.
Contributions
Propose the discriminative jackknife (DJ), inspired by the jackknife leave-one-out (LOO) resampling procedure, to estimate the uncertainty of model predictions.
To avoid exhaustively retraining the model for each left-out sample, they adopt higher-order influence functions to approximate the impact of removing each sample.
DJ is applied post hoc: it improves coverage and discrimination without any modifications to the underlying predictive model.
Preliminaries
Learning setup
Consider a standard supervised learning setup in which we minimize the prediction loss on the training data $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^n$.
Uncertainty Quantification
We aim to estimate the uncertainty in the model's prediction through the pointwise confidence interval $\mathcal{C}(x;\hat{\theta})$.
The degree of uncertainty in the model's prediction is quantified by the width of this interval.
Frequentist coverage
This is satisfied if the confidence interval $\mathcal{C}(x;\hat{\theta})$ covers the true target $y$ with a prespecified coverage probability of $(1-\alpha), \alpha \in (0,1)$.
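As a concrete illustration, here is a minimal sketch of how frequentist coverage can be checked empirically on a held-out set; the function name `empirical_coverage` and the toy numbers are illustrative, not from the paper.

```python
import numpy as np

def empirical_coverage(y_true, lower, upper):
    """Fraction of test targets falling inside their confidence intervals.

    For a well-calibrated method with target coverage (1 - alpha),
    this fraction should be at least (1 - alpha).
    """
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    return np.mean((lower <= y_true) & (y_true <= upper))

# Toy example with a 90% target (alpha = 0.1):
y  = np.array([1.2, 0.7, 3.1, 2.0])
lo = np.array([0.9, 0.1, 2.5, 2.2])
hi = np.array([1.5, 1.0, 3.5, 3.0])
print(empirical_coverage(y, lo, hi))  # 0.75 -> undercovers the 0.9 target
```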
Discrimination
The confidence interval is wider for test points with less accurate predictions.
Discriminative Jackknife
Classical Jackknife
The jackknife quantifies predictive uncertainty in terms of the average prediction error, estimated via a leave-one-out (LOO) construction: each sample in $\mathcal{D}_n$ is systematically left out, and the error of the retrained model is evaluated on the left-out sample.
For a target coverage of $(1-\alpha)$, the naive jackknife interval is centered at the model prediction with half-width $\mathcal{\hat Q}_\alpha^+(\mathcal{R})$, where $\mathcal{R}$ is the set of LOO residuals and $\mathcal{\hat Q}_\alpha^+(\mathcal{R})$ denotes the $(1-\alpha)(n+1)$-th smallest element of $\mathcal{R}$.
The interval width is constant, which renders discrimination impossible.
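A minimal sketch of the naive jackknife construction described above, assuming hypothetical user-supplied `fit` and `predict` routines; note that it retrains the model $n$ times, which is exactly the cost the influence-function approximation below avoids.

```python
import numpy as np

def naive_jackknife_interval(x_test, fit, predict, X, y, alpha=0.1):
    """Naive jackknife: model prediction +/- a quantile of LOO residuals."""
    n = len(y)
    residuals = []
    for i in range(n):
        mask = np.arange(n) != i                 # leave sample i out
        theta_minus_i = fit(X[mask], y[mask])    # retrain without (x_i, y_i)
        residuals.append(abs(y[i] - predict(theta_minus_i, X[i])))
    # (1-alpha)(n+1)-th smallest LOO residual, capped at the largest one
    k = min(int(np.ceil((1 - alpha) * (n + 1))), n)
    q = np.sort(residuals)[k - 1]
    theta = fit(X, y)                            # model trained on all data
    pred = predict(theta, x_test)
    return pred - q, pred + q                    # same width for every x_test
```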
DJ Confidence Intervals
$\mathcal{G}_{\alpha,\gamma}$ is a quantile function applied to the elements of two sets: the marginal prediction errors $\mathcal{R}$ and the local prediction variability terms $\mathcal{V}$.
The prediction-error term is constant, i.e., it does not depend on $x$; hence it contributes to coverage but not to discrimination.
The local variability term depends on $x$, hence it fully determines the discrimination performance.
The confidence interval is bounded below and above by $\mathcal{G}_{\alpha,\gamma}$ applied to $\mathcal{R}$ and $\mathcal{V}$, centered at the model prediction.
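Since these notes do not reproduce the exact form of $\mathcal{G}_{\alpha,\gamma}$, the following is only a sketch in the spirit of a jackknife+-style construction that combines the two sets; the function name, the use of `np.quantile`, and the way the quantiles are taken are assumptions, not the paper's exact formula.

```python
import numpy as np

def dj_style_interval(loo_preds, residuals, alpha=0.1):
    """Jackknife+-style sketch: shift each LOO prediction at the test
    point by the corresponding LOO residual, then take quantiles.

    loo_preds : f(x; theta_hat_{-i}) for each i -- carries the local
                variability v_i(x), so the width adapts to x
    residuals : r_i = |y_i - f(x_i; theta_hat_{-i})| -- marginal errors
    """
    loo_preds = np.asarray(loo_preds)
    residuals = np.asarray(residuals)
    lower = np.quantile(loo_preds - residuals, alpha)       # lower endpoint
    upper = np.quantile(loo_preds + residuals, 1 - alpha)   # upper endpoint
    return lower, upper
```

Because `loo_preds` varies with the test point $x$ while `residuals` does not, the interval widens where the LOO models disagree (scarce data) and tightens where they agree (dense data), giving the discrimination the naive jackknife lacks.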
Efficient Implementation via Influence Functions
Approximate the LOO parameters $\hat\theta_{-i}$ using higher-order influence functions.
Influence functions enable efficient computation of the effect of a training point $(x_i,y_i)$ on $\hat\theta$: they evaluate the change in $\hat\theta$ if $(x_i,y_i)$ were up-weighted by a small factor $\epsilon$.
Removing a training point is equivalent to up-weighting it by $\epsilon = -\frac{1}{n}$.
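A first-order sketch of this approximation (the paper uses higher-order terms; this shows only the leading one). Here `grad_i` and `hessian` are assumed to be supplied by the user's autodiff framework, with `hessian` being the Hessian of the average training loss at $\hat\theta$.

```python
import numpy as np

def loo_params_first_order(theta_hat, grad_i, hessian, n):
    """First-order influence-function approximation of theta_hat_{-i}.

    theta_hat : (d,)   parameters trained on all n samples
    grad_i    : (d,)   gradient of the loss at (x_i, y_i), at theta_hat
    hessian   : (d, d) Hessian of the average training loss at theta_hat

    Up-weighting (x_i, y_i) by eps shifts the optimum by roughly
    -eps * H^{-1} grad_i; removal corresponds to eps = -1/n, giving
    theta_hat_{-i} ~= theta_hat + (1/n) H^{-1} grad_i, with no retraining.
    """
    return theta_hat + np.linalg.solve(hessian, grad_i) / n
```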