Introduction
The combination, or aggregation, of differing probability distributions into
a single one could result in beneficial effects, since the differences
between various forecast systems provide a better understanding of the
uncertainty about the target quantities, and the aggregates may reflect more
accurately the information. However,
the biggest advantage of aggregation is that the forecaster is not forced to decide
a priori which forecast system is the most reliable at the actual point of issuing
a forecast, because the combination method will be optimized at each forecast run
by taking into consideration the quality of the forecast from previous time steps.
Thus, the data themselves will automatically lead to the optimal decision
incorporating
all available information about the different deficiencies and strengths of the individual forecast systems.
In econometrics and related disciplines, the combination of forecasts has a
long tradition starting with suggesting the use of empirical
weights derived from “out of sample” forecast variances. An overview of the
last 40 years of forecast combination in the field of economics
can be found in .
was one of the first who outlined the advantages of forecast combinations in
meteorology and showed different methods of combining the
output of different hydrological models. In different
combination methods for hydrological forecast models are compared.
compare different model averaging approaches, showing that a
simple regression method could result in improvements comparable to more
sophisticated methods.
In general the challenge of model combination is that, apart from the simple
model averaging methodologies, different weights need to be assigned
according to the quality of the forecast of the preceding days and periods. A
frequently used method for model averaging and forecast combination is the
method of Bayesian model averaging (BMA) introduced by and
, where the weights are based on posterior model probabilities
within a Bayesian framework. The BMA method has been applied in the field of
ensemble forecast calibration () and for flood
forecasting purposes, e.g. in , ,
, and .
In and the term “calibration” is used to
describe the statistical consistency between the distributional forecasts and
the observations and is a joint property of the predictions and the events
that materialize. A state of the art calibration and bias correction method
is non-homogeneous Gaussian regression (NGR), also known as the ensemble
model output statistic (EMOS) technique of . It fits a single
parametric predictive probability density function (pdf) using summary
statistics from the (multi-model) ensemble and corrects simultaneously for
biases and dispersion errors. Also, NGR has been applied many times
successfully for calibrating and combining hydro-meteorological ensemble
forecasts (see for example ).
The Beta-transformed linear pooling (BLP) approach, which has been developed
recently by and for combining predictive
distributions, will be tested and compared with the NGR and the BMA in this
study. To the author's knowledge the BLP and the associated estimation of
weights, which assign relative importance to the individual predictive
distributions, have not been applied to hydrological forecasts so far.
Before the combination methods are applied, the errors of the hydrological
model are corrected in order to minimize the difference between the last
available observation and the predictions at the time of initialization of
the forecast. This process of error correction is later on called
post-processing, since it starts after completing the hydrological
simulations and predictions given meteorological observations or forecasts.
Depending on the post-processing method, quantiles or pdfs for future
streamflows will be derived for each single forecast time step. Whereas
quantile regression (QR) methods () and modifications of
them will lead to predictions of quantiles, a predictive pdf can be derived
for example by the recently developed waveVARX method ()
directly. For more details of these post-processing methods, the reader is
referred to , whereas the objective of this paper will be the
analysis of combination methods of forecasts. In the next section the three
combination methods and the applied verification measures will be described.
After the presentation of the data and the results, the outcome of the
comparison will be discussed and summarized in the conclusions.
Methods
Three different combination methods have been applied to the flood
forecasting system for the Sihl River at station Zurich (Switzerland), where
two meteorological forecasts, the 16-member COSMO-LEPS ()
and the deterministic C7 system (produced at MeteoSwiss with ≈ 7 km
resolution) are implemented (a detailed description can be found in
, , and ).
In a first step the hydrological modelling errors of all these forecasts will
be minimized, using a QR method in combination with neural networks (QRNN,
). This will result in direct estimates of the
inverse cumulative density function (i.e. the quantile function), which in
turn allows the derivation of the predictive uncertainty (see for example
, , and , where the
application of the QR in order to estimate predictive uncertainties (PUs) is
outlined). If the number of estimated quantiles within the domain {0<τ<1} is sufficiently large, the resulting distribution could be considered
to be continuous. In this study the number of quantiles is set to nine with
probability levels τ=0.01,0.1,0.2,0.25,0.5,0.75,0.8,0.9,0.99. In the cdf or pdf is constructed by combining step
interpolation of probability densities for specified τ-quantiles with
exponential lower and upper tails, which will be called the empirical method
(EMP). Alternatively the pdf could be constructed by monotone re-arranging of
the τ-quantiles and estimating a log-normal distribution (LN) to these
quantiles for each lead time Δt. The advantage of the quantile
re-arranging and the distribution fitting is 2-fold and efficiently prevents
known problems occurring with QR: firstly it eliminates the problem of
crossing of different quantiles (i.e. the unrealistic but possible outcome of
the non-linear optimization problem yielding lower quantiles for higher
streamflow values – – e.g. the value of the 0.90
quantile is higher than the value of the 0.95 quantile), and secondly it
permits the extrapolation to extremes not included in the training sample
().
This QRNN method will be applied to each ensemble member of the COSMO-LEPS
forecasts, resulting in 16 forecasts of quantiles, and to the
C7 forecasts. have
examined averaging quantiles of continuous distributions given by multiple
information sources rather than averaging probabilities. Both approaches of
probability and quantile averaging have been applied in this paper for
averaging the post-processed ensemble prediction system (EPS) based
streamflow forecasts in order to get one predictive pdf or quantile forecast.
Before applying the probability averaging approach, a pdf has been
constructed by the LN method, i.e. a log-normal distribution has been fitted
to the re-arranged τ-quantiles.
Thus, in total there are five different forecasts available after
post-processing, two based on the application of the QRNN method for the
COSMO-LEPS with probability averaging (p.aver.) or quantile averaging
(q.aver.), two post-processed C7 forecasts based on QRNN with the EMP and the
LN approach, and one forecast based on the waveVARX method. Additionally the
raw COSMO-LEPS forecast will be included in the following combination
procedures as well (see Fig. ).
Set of six different forecast models available for combination, five
post-processed plus one raw forecast. For the quantile averaging (M1) and the
probability averaging (M2) method, an example of averaging two ensemble
members is indicated.
Three different methods will be tested for optimally combining these six
forecast models (M1, …, M6), which allow us to assign different
weights to the raw and five post-processed forecasts. For the application of
the first two methods, BMA and NGR, the streamflow values have been
transformed to the normal space by the help of the normal quantile
transformation ().
Bayesian model averaging (BMA)
If the combination is calculated within a Bayesian framework by using weights
corresponding to the posterior model probabilities, it is usually referred to
as BMA and follows from direct application of Bayes' theorem as explained in
e.g. and .
In the statistical BMA model is extended to dynamical forecast
models, where each forecast and/or ensemble member is represented by a
probabilistic distribution for which a weight is assigned based on the past
performance of each individual forecast. These weights are used to combine
all distributions into one single mixture distribution. Therefore the BMA
predictive model of the quantity of interest y is given by
p(y|k1,…,kM)=∑m=1Mhmgm(y|km),
where hm is the posterior probability (i.e. weight) of forecast km, the
best forecast derived from its performance in the training period, and the
conditional pdf of y on km, gm(y|km), given that km is the best
forecast in the ensemble with m=1,…,M members, or models. The
transformation of the streamflow values to the normal space beforehand allows
the application of the BMA method based on mixtures of univariate normal
distributions. In the work of and variants
of the BMA method have been applied, which allow the direct usage of the cdfs
for estimating the weighting parameters. However, in this study these BMA
approaches have not been implemented, and the estimated medians (τ=0.5) from the five post-processing methods and from the raw COSMO-LEPS are
taken as input only in order to allow better comparison with the following
NGR approach.
Non-homogeneous Gaussian regression (NGR)
Another possibility to address underdispersion and forecast bias is the use
of the NGR method, also known as EMOS, and is based on multiple linear
regression for linear variables, such as temperature or streamflows, and
logistic regression for binary variables, such as precipitation occurrence or
freezing. More information about the MOS technique can be found for example
in and . Its extension for ensembles is
explained in and a brief summary of this method is given
hereafter. Let y denote again the variable of interest (e.g. streamflow)
and let k1,…,kM be the corresponding forecast of the M
ensemble members or models. If N(μ,σ2) denotes a
Gaussian density with mean μ and variance σ2, the NGR predictive
distribution is given by
y|k1,…,kM∼N(a0+a1k1+⋯+aMkM,b0+b1s2),wheres2=1M∑m=1Mkm-1M∑m=1Mkm2.
Thus the predictive mean is equal to the regression estimates with
coefficients a0,…,am,b0, and b1 and forms a bias-corrected
weighted average of the different forecasts (ensemble members), whereas the
predictive variance depends linearly on the variance of the forecast models
(ensemble members). Although modifications for the NGR exist for non-normal
distributed variates (see for example ), the
streamflow values have been transformed to the normal space for comparison
reasons and the medians (τ=0.5) from the five post-processing methods
and from the raw COSMO-LEPS are taken as input as in the BMA method.
Beta-transformed linear pool (BLP)
In it has been stated that any non-trivially weighted average
of distinct probability forecasts will be uncalibrated and lack sharpness,
even when the individual forecasts have been calibrated. Hence they suggested
a composite of the traditional linear pool with a beta transform. The
aggregation method introduced by and
considers the BLP for a set of predictive cdfs F1,…,FM as
F(y)=Bα,β∑m=1MωmFmy
for y∈R, where Bα,β denotes the cdf of the standard
Beta distribution with parameters α>0 and β>0 and ω1,…,ωM are non-negative weights that sum to 1. The BLP density
forecast for the component densities fi,…,fM then is
f(y)=∑m=1Mωmfmybα,β∑m=1MωmFmy,
with parameters α>0 and β>0 of the Beta density function
bα,β. For α=β=1 the BLP corresponds to the
traditional linear opinion pool.
Thus Bα,β can be interpreted as a parametric calibration
function for combining F1,…,FM with mixture weights ω∈ΔM, which assign relative importance to the individual
predictive distributions. The parameters α>0 and β>0 and the
weights ω1,…,ωM are estimated with the maximum
likelihood method. The log-likelihood function for the BLP model
(Eq. ) is
ℓ(ω1,…,ωM;α,β)=∑j=1Jlog(f(yj))=∑j=1Jlog∑m=1Mωmfmjyj+∑j=1Jlogbα,β∑m=1MωmFmjyj=∑j=1Jlog∑m=1Mωmfmjyj+∑j=1J(α-1)log∑m=1MωmFmjyj+(β-1)log1-∑m=1MωmFmjyj+JlogB(α,β)
where B is the classical Beta function.
This BLP approach has been applied now to combine the different forecast
systems. The quantiles resulting from the QRNN method (models M1, M4, and M5)
forecasts have been converted to pdfs by applying the LN method (by fitting a
log-normal distribution to the re-arranged τ quantiles).
Verification
Although probability and quantile forecasts are both probabilistic products,
the former is expressed in terms of a probability (e.g. that a certain
threshold will be exceeded) and the latter is given by a quantile for a
particular probability level of interest (). Since the
outputs of the QRNN model are quantiles, it is reasonable to evaluate the
performance with a skill score which has been developed for predictive
quantiles (), known as the quantile score (QS). It
is based on an asymmetric piecewise linear function, the so-called check
function, ρτyi-qτ,i, which is a function of the
probability level τ(0<τ<1) and the error between the observation
yi and the quantile forecast qτ,i for i=1,…,N, where N
is the sample size. The check function is defined as
ρτyi-qτ,i=τyi-qτ,i∀yi≥qτ,iτ-1yi-qτ,i∀yi<qτ,i
and the QS results as the mean of the check function with penalties 1-τ
and τ for under- and over-forecasting (see ):
QS=1-τN∑i:yi<qτ,i(qτ,i-yi)+τN∑i:yi≥qτ,i(yi-qτ,i).
The CRPS compares the forecast probability distribution with the observation
and both are represented as cdfs. If F is the predictive cdf and y is the
verifying observation, showed that the CRPS can be defined
equivalently as a standard form,
CRPS(F,y)=∫-∞∞F(t)-I{y≤t}2dt,and as=2∫01Iy<F-1(τ)-τF-1(τ)-ydτ.
Thus, in the standard form (Eq. ) an ensemble of predictions can
be converted into a piecewise constant cdf with jumps at the different models
(ensemble members), and I{.} is a Heaviside step function, with a single
step from 0 to 1 at the observed value of the variable. The equivalence of
Eqs. ()–() was noted by . For the
quantile forecast qτ=F-1(τ), the integrand in
Eq. () equals the quantile score, i.e. the mean of the check
function (Eq. ). That means the CRPS corresponds to the integral of
the QS over all thresholds, or likewise the integral of the QS over all
probability levels ( and ). Hence, the CRPS
averages over the complete range of forecast thresholds and probability
levels, whereas the QS looks at specific τ-quantiles; thus, it is more
efficient in revealing deficiencies in different parts of the distributions,
especially with respect to the tails of the distribution. Both verification
measures are negatively oriented, meaning the smaller the better.
Results
COSMO-LEPS and C7 forecasts are available from 24 February 2010 to
27 April 2016 once a day with hourly time resolution, which have been
post-processed in order to derive predictive distributions and quantile
forecasts. To calibrate and validate the post-processing parameters (QRNN and
waveVARX), the data sets of available hourly observations and corresponding
simulations have been split into two halves (calibration period: 2010–2012;
validation period: 2013–2016). The results of the validation, which are not
shown due to lack of space, highlight the improvements of the QRNN method
(similar to the results shown in ).
The weighting parameters of the combination methods are estimated by applying
a moving window with a size of 7 days (168 h) for optimization. Different
window sizes have been tested as well, but 7 days was chosen finally as a
trade-off between computing time and efficiency. In Fig. an
example of the temporal evolution of the hourly weights for a lead time of
48 h for the three combination methods is shown.
Hourly weights of the BMA (a), NGR (b), and
BLP (c) methods estimated for a lead time of 48 h. The six
forecasts are the QRNN method for the COSMO-LEPS with quantile averaging
(QRNN-CL-q.) – M1, probability averaging (QRNN-CL-p.) – M2, the
waveVARX(-CL) method – M3, the raw COSMO-LEPS (CL) forecast – M4, the
two post-processed C7 forecasts based on QRNN with the EMP – M5, and the
LN approach – M6.
Probability integral transform (PIT) of the raw and three combined
forecasts at a lead time of 48 h.
Before the forecast skill of the three combination methods are compared, the
statistical consistency between the predictive cdf and the observations are
analysed with the help of the probability integral transform (PIT) as
proposed by (see Fig. ). In the case of
well-calibrated forecasts, the sequence of PIT values will follow a uniform
distribution U(0,1). U-shaped PIT histograms indicate underdispersed
forecasts with too little spread on average, and inverse U-shaped histograms
correspond to overdispersed forecasts (see for example
).
The question now is whether there are significant differences between the
three combination methods. Therefore the QS has been applied at first to
highlight possible differences between the combination methods in more
detail.
In Fig. the results of the QS at four lead times for the raw
COSMO-LEPS (C-L, black line) and for the three combination methods BLP (red
line), NGR (green line), and BMA (blue line) are shown and compared to the QS
results of the raw C-L (black circles). Additionally, a simple quantile
mapping (QM) is applied (cyan diamonds) to the raw C-L forecasts in order to
evaluate the positive effect of using more complex methods. Thereby the cdf
of the raw forecast is matched to the cdf of the observations. As mentioned
in , QM is highly effective for bias correction, but ensemble
spread reliability problems cannot be solved properly.
Quantile score (QS) for various lead times and the three combination
methods in comparison to the raw COMSO-LEPS and a simple quantile mapping
(QM) approach.
In Fig. the CRPS results of the six forecast models are shown in
comparison to the BLP in order to demonstrate the motivation of aggregating
these systems. As can be seen clearly, the combined forecast outperforms each
of the individual forecasts in view of the CRPS.
CRPS of the six forecast models: COSMO-LEPS with quantile averaging
(QRNN-CL-q.) – M1, probability averaging (QRNN-CL-p.) – M2, the
waveVARX(-CL) method – M3, the raw COSMO-LEPS (CL) forecast – M4, the
two post-processed C7 forecasts based on QRNN with the EMP – M5, and the
LN approach – M6. Additionally, the CRPS of the BLP combined forecast is
shown.
The CRPS for the raw C-L, the QM approach and the three combination methods
is shown in Fig. .
CRPS of the raw and combined forecasts.
Discussion
So far most of the studies comparing the results of the BMA and the NGR
approach have not found any preference (see for example ). In
this paper these two methods are checked against the BLP, which has not been
used for hydrological purposes until now. In a first step the weights derived
for each individual, raw and post-processed, forecast system are compared.
The pattern of these optimized weights in Fig. shows rather
vague similarities between the three combination methods. The BLP and the NGR
are in general more spiky, with rapid changes between consecutive hours. This
could result from problems in convergence from the optimization algorithm
applied for estimating the parameters (“constrOptim” in R ).
In general the weights show some periodicity, which indicates that some
models are more appropriate to be used in certain seasons and for certain
flow conditions during a year. However, the limited amount of data does not
allow us to draw clear conclusions.
The results of the PIT clearly indicate that all three combinations result in
well-calibrated forecasts with close to uniform histograms. In Fig.
the examples for the 48 h forecast are given, highlighting the heavy
underdispersiveness of the raw forecasts. The same behaviour is visible for
almost all lead times; however, the raw COSMO-LEPS forecasts get less
underdispersed with increasing lead time, since the spread and the
uncertainty in the ensemble increase.
The analysis of the QS (Fig. ) shows slightly better results for
the BLP, followed by the NGR and BMA. The raw COSMO-LEPS (C-L) and the QM are
much worse, especially for smaller lead times. It is interesting to see that
the QS of the raw C-L follows a straight line for smaller lead times (6 and
12 h) in the same manner as one would expect from deterministic forecasts,
because of the underdispersiveness of the C-L at the beginning of the
forecast horizon. The slope of this line is an indicator of the size of the
(positive) bias. The QM at a lead time of 6 h is also a straight line,
however, with an opposite but much smaller and negative slope (bias) in
comparison to the raw C-L. With increasing lead times the QS of the raw C-L
and the QM forecasts come closer to the combined forecasts for probability
levels between 0.1 and 0.5. This is most probably caused by the increased
spread of the ensemble. However, for a lead time of 24 and 48 h, the raw C-L
forecasts still show the worst behaviour at higher flows, whereas the QM
method performs at a lead time of 48 h almost as well as the combination
methods, apart from the forecasts around the median.
As already stated previously, the comparison of the CRPS of the different
post-processed methods and the aggregated ones (e.g. BLP) clearly identifies
the advantage of combination (Fig. ). The CRPS, i.e. the integral
of the QS, for the different combination methods (Fig. ) confirms
the results of the QS. In general the results of the BLP are slightly better
than the NGR and BMA results. It seems that for those periods of lead times,
where the BLP is not superior (e.g. around 20 h), the optimization routines
had problems on convergence. However, further analysis will be necessary. The
comparison with the QM approach confirmed the results of , since
the forecast quality did not show any improvements at the first lead times
because of the underdispersiveness of the raw C-L. Thus, the more complex
combination by far outperforms the QM method.