Error reduction and representation in stages (ERRIS) in hydrological modelling for ensemble streamflow forecasting

This study develops a new error modelling method for ensemble short-term and real-time streamflow forecasting, called error reduction and representation in stages (ERRIS). The novelty of ERRIS is that it does not rely on a single complex error model but runs a sequence of simple error models through four stages. At each stage, an error model attempts to incrementally improve over the previous stage. Stage 1 establishes parameters of a hydrological model and parameters of a transformation function for data normalization, Stage 2 applies a bias correction, Stage 3 applies autoregressive (AR) updating, and Stage 4 applies a Gaussian mixture distribution to represent model residuals. In a case study, we apply ERRIS for one-step-ahead forecasting at a range of catchments. The forecasts at the end of Stage 4 are shown to be much more accurate than at Stage 1 and to be highly reliable in representing forecast uncertainty. Specifically, the forecasts become more accurate by applying the AR updating at Stage 3, and more reliable in uncertainty spread by using a mixture of two Gaussian distributions to represent the residuals at Stage 4. ERRIS can be applied to any existing calibrated hydrological models, including those calibrated to deterministic (e.g. least-squares) objectives.


Introduction
Streamflow forecasts have long been used to support management of river conditions, such as flood emergency response and optimal water allocation. Recently, much research has been carried out on ensemble streamflow forecasting (e.g. Alfieri et al., 2013;Bennett et al., 2014a;Demargne et al., 2014;Thielen et al., 2009), encouraged by research communities such as the Hydrological Ensemble Prediction Experiment (HEPEX -http://hepex.org/). In recognition that streamflow forecasts can be subject to significant errors, forecast ensembles are used to represent forecast uncertainty. In producing ensemble forecasts, one aims to reduce forecast uncertainty as much as possible to give the most accurate forecasts. One also aims to represent the remaining forecast uncertainty reliably to give the right distribution among ensemble members.
Streamflow forecasts are usually made by initializing hydrological models (e.g. conceptual rainfall-runoff models) and then forcing them with forecast rainfall. There are a number of sources of errors in streamflow forecasts, including errors in measurement of rainfall and streamflow, errors in hydrological model structure, errors in model parameters, and errors in forecast rainfall. Ideal hydrological error quantification would account for each individual source of errors explicitly and reliably, such that all sources of errors would accumulate to accurately represent overall errors in the streamflow forecasts. Various attempts have been made to identify and decompose the sources of errors, using methods such as sequential optimization and data assimilation (Vrugt et al., 2005), sequential assimilation (Moradkhani et al., 2005), the Bayesian total error analysis (BATEA) (Kavetski et al., 2006a, b;Kuczera et al., 2006), and the integrated Bayesian Uncertainty estimator (IBUNE) (Ajami et al., 2007). Such methods are useful for attempting to separate the major sources of errors, identifying deficiencies of model structure, performing parameter sensitivity analyses and comparing different hydrological models, without confounding input and output errors. However, because of a lack of information on the different sources of errors and on how they interact with each other, it is highly challenging to apply an error decomposition approach to arrive at statistically reliable overall errors in streamflow forecasts .
An alternative approach is to consider only the overall errors of forecasts, without attempting to explain the sources of errors. An estimate of the overall error of a forecast is the residual, defined as the difference between modelled streamflow and observations. We now concentrate our discussion on residuals, but we will continue to refer to models of residuals as "error models", following common practice. Residuals of a series of forecasts form a time series. The most traditional and simplest error model, related to the classical least-squares calibration, is based on the assumption of uncorrelated homoscedastic Gaussian residuals in the time series of residuals (Diskin and Simon, 1977). This assumption is generally not valid for hydrological applications, where residuals are frequently autocorrelated, heteroscedastic and non-Gaussian (Kuczera, 1983;Sorooshian and Dracup, 1980). More sophisticated error models have been developed to address correlation, variance structure and the distribution of residuals. Autoregressive models have been widely used to account for autocorrelation of residuals (e.g. Bates and Campbell, 2001;Xiong and O'Connor, 2002). Heteroscedasticity may be explicitly dealt with by describing the variance of residuals as a function of some statedependent variables (e.g. observed streamflow, dry/wet seasons) (e.g. Evin et al., 2013;Pianosi and Raso, 2012;Schaefli et al., 2007;Yang et al., 2007). Non-Gaussianity of residuals may be explicitly represented by non-Gaussian probability distributions (e.g. Marshall et al., 2006;Schaefli et al., 2007;Schoups and Vrugt, 2010). Heteroscedasticity and non-Gaussianity of residuals may also be dealt with implicitly, and often more conveniently, by using data transformation to normalize the residuals and stabilize their variance, such as the normal quantile transform (Kelly and Krzysztofowicz, 1997;Krzysztofowicz, 1997;Montanari and Brath, 2004), the Box-Cox transformation (Thyer et al., 2002) and the log-sinh transformation (Wang et al., 2012). Solomatine and Shrestha (2009) presented an alternative method of predicting residual error distributions using machine learning techniques. They built a non-linear regression model to predict the forecast quantiles at each lead time. Their method is not based on an autoregressive model but captured recent information about the model error with a non-linear regression.
Broadly, previous attempts to model residuals can be divided into "post-processor" methods that separate the estimation of hydrological model parameters from the estimation of error model parameters, and "joint inference" methods that estimate all parameters at once. Post-processor methods (e.g. Evin et al., 2014) are often held to be less theoretically desirable than joint inference methods (e.g. Kuczera, 1983;Bates and Campbell, 2001). This is because joint inference methods aspire to a complete description of the behaviour of errors, including behaviours that arise from interactions be-tween parameters from hydrological and error models (see discussion in Evin et al., 2014). Unfortunately, joint inference methods can have serious limitations for operational forecasting of streamflows. Li et al. (2015) showed that a joint inference method caused poor performance in the hydrological model when it was isolated from the error model (we will call this the "base" hydrological model). Error models that account for autocorrelated residuals usually have less influence on forecasts as lead time increases. Thus, as lead time increases, and the influence of the error model decreases, the quality of the forecast relies on the performance of the base hydrological model and the quality of meteorological forecasts (Bennett et al., 2014a). Evin et al. (2014) demonstrated another (and perhaps more egregious) limitation of joint inference methods: joint estimation can result in deleterious interference between error model and hydrological model parameters, leading to poor out-of-sample streamflow predictions. In our experience, interactions between parameters of the hydrological model and the error model can make it very difficult to calibrate the models jointly. The shape of the distribution of forecast residuals can change markedly after hydrological model forecasts are updated, for example with an autoregressive error model. Despite considerable progress in hydrological uncertainty modelling, few studies in the literature present model forecasts (or simulations) that are practically reliable when error updating is applied (e.g. Gragne et al., 2015;Schoups and Vrugt, 2010). This paper develops a new error modelling method, called error reduction and representation in stages (ERRIS), for real-time and short-term streamflow forecasting applications. ERRIS is a further development of the restricted autoregressive model (Li et al., 2015) and a seasonal error model developed by Li et al. (2013). ERRIS is a post-processing method developed to deal with the overall errors of streamflow forecasts resulting from hydrological uncertainty only. We assume that errors in streamflow forecasts due to weather forecasts (precipitation in particular) will be considered separately by using ensemble weather forecasts (Bennett et al., 2014a;Robertson et al., 2013;Shrestha et al., 2013), and we do not consider these in this paper. For convenience, in this study we use the term streamflow forecast to mean onestep-ahead model prediction of streamflow, given observed weather and streamflow up to just before the forecast start time and assuming a one-step-ahead weather forecast that turns out to perfectly match observations. In future work, we will extend ERRIS to multiple-step-ahead streamflow forecasting.
In this study we use the term "ensemble" to mean a set of equally probable realizations of future streamflow that represents the hydrological model uncertainty. The forecasts based on ERRIS are not typical probabilistic forecasts (Gneiting and Katzfuss, 2014), which explicitly provide the predictive distribution of future streamflow. For ERRIS, the probability distribution may be theoretically derived for one-step-ahead forecasts based on the distributional assumption of model residuals. However, we can only obtain the predictive distribution of ERRIS forecasts at multiple step by means of Monte Carlo simulation.
The novelty of ERRIS is that it does not rely on a single complex error model but runs a sequence of simple error models through multiple stages. We start with a very simple model of independent Gaussian residuals after data transformation to determine hydrological model parameters. At each subsequent stage, an error model is introduced to improve over the previous stage and to finalize the representation, including associated parameter values, of one particular statistical feature (bias, correlation in residuals or a non-Gaussian distribution). ERRIS progressively refines model features, focusing only on a small number of model parameters at each stage. This is achieved by estimating the values for a core set of parameters at each stage and holding them constant at subsequent stages. In doing so, ERRIS avoids the problems associated with parameter interactions that can occur under joint inference methods.
This paper is organized as follows. The ERRIS method is described in detail in Sect. 2. A case study is introduced in Sect. 3. Major results are presented in Sect. 4, followed by discussion and further results in Sect. 5. Conclusions are made in Sect. 6. We start from a simplified version of the seasonally invariant error model described by Li et al. (2013) to calibrate the hydrological model in the ERRIS method. At Stage 1, we apply the log-sinh transformation (Wang et al., 2012) where a and b are transformation parameters, to the raw values of streamflow Q. We assume at this stage that hydrological model forecast residuals are independent and, in the transformed space, follow a Gaussian distribution with a constant variance. The forecast variance in the original (untransformed) space is not a constant but is dependent on the magnitude of simulated streamflow through the back-transformation. The log-sinh transformation has been applied to a wide range of hydrological data (e.g. Li et al., 2013;Peng et al., 2014;Robertson et al., 2013;Shrestha et al., 2015;Zhao et al., 2015) including extreme daily streamflow values (Bennett et al., 2014b) to normalize data and stabilize variance, and has been shown to perform at least as well as other commonly used transformations (Del Giudice et al., 2013;Wang et al., 2012).
We denote the observed and simulated streamflows at day t by Q(t) and Q(t), respectively. The error model at Stage 1 is mathematically specified as where N denotes a Gaussian distribution of the model residuals in the transformed space at Stage 1, with mean Z 1 (t) and standard deviation σ 1 . We will use similar notations (e.g. Q, Z, Z and σ ) for all stages in the ERRIS method, with stages distinguished by subscripts (i.e. 1, 2, 3, 4). No autocorrelation within the forecast residuals is assumed at Stage 1. This avoids the potential parameter interference between the autocorrelation parameter and hydrological model parameters (e.g. parameters describing time persistence of the hydrograph) when the hydrological model is jointly calibrated with the error model. Stage 1 of the ERRIS method is summarized in Table 1. At the end of Stage 1, the simulated streamflow Q(t) is taken as the forecast median of the ensemble streamflow forecast.

Stage 2: linear bias correction
At Stage 1, we assume that the hydrological simulation is overall unbiased. However, the hydrological model often overestimates low flows and underestimates high flows. At Stage 2, we adopt a simple but effective bias-correction scheme firstly introduced by Wang et al. (2014) to revise the forecast value made at Stage 1. This bias correction describes the forecast bias in the transformed domain by a linear function. Because the bias correction is applied to transformed data, it is able to cope with conditional biases (biases that vary with flow magnitude) that are often present in hydrological model simulations, even if these vary in a strongly non-linear way. We express the specific error model structure of Stage 2 as where c and d represent the intercept and slope parameters of the bias correction and σ 2 denotes the standard deviation of the residuals at Stage 2. The slope parameter d allows much flexibility in the bias correction. When d equals 1, this bias correction becomes a simple additive correction. When d equals 0, the bias correction forces the forecast to approach a constant (in additional to uncertainty). This may happen when the hydrological forecast performs worse than climatology (i.e. long-term average). When d is greater than 1, the bias correction can correct the very strongly conditional biases, as might be found in ephemeral catchments. At the end of Stage 2, the forecast median in the original space is revised to where f −1 (x) = b −1 arsinh {exp(bx) − a} is the backtransformation of the log-sinh transformation given in Eq. (1).

Stage 3: AR updating
At Stage 3, we no longer assume that forecast residuals are independent, and use an AR-based error model to describe the correlation structure of forecast residuals. The AR-based error model enables the ERRIS method to correct forecast residuals based on the latest available observations of streamflow. Specifically, we assume that the forecast residuals at Stage 2 follow a restricted AR error model described by Li et al. (2015). The error model at Stage 3 can be written as is the updated streamflow without applying the restriction, and ρ and σ 3 are the lag-1 autocorrelation parameter and the standard deviation of the residuals at Stage 3, respectively. Li et al. (2015) demonstrated that when AR models are applied to normalized residuals without restriction, overcorrection of forecasts can occur, particularly at the peak or on the rise of a hydrograph. Equation (8) uses the restricted AR error model to reduce the tendency to overcorrect forecasts. In Eq. (8) the forecast median, denoted by Q 3 (t), is given by The forecast at Stage 3 updates Q 2 (t) based on the latest observed streamflow Q(t −1) and its difference from Q 2 (t −1). Therefore, more information (i.e. streamflow observations at the previous time step) is required to generate streamflow forecasts at Stage 3 than at the previous two stages.

Stage 4: residual distribution refinement
In Sect. 4, we will demonstrate that the residuals after Stages 1 and 2 are well described by Gaussian distributions, but the shape of the residual distribution after Stage 3 dramatically changes. In particular, the distribution of the residuals after Stage 3 looks more peaked and has longer tails than a Gaussian distribution. The reason for the non-Gaussian residuals after Stage 3 is as follows. The AR updating at Stage 3 is very effective in correcting small residuals especially at hydrograph recession and therefore reducing residuals to very small values. The updating, however, is not very effective around peaks, where the residuals remain large even in the transformed space. This results in a centrally peaked and long-tailed distribution of residuals after Stage 3. At Stage 4, we use a non-Gaussian distribution to describe the model residuals from Stage 3. Several longtailed distributions have been used in hydrological modelling studies, such as the finite mixture distribution (Schaefli et al., 2007;Smith et al., 2010), the exponential power distribution (Schoups and Vrugt, 2010) and Student's t distribution (Marshall et al., 2006). In this study, we assume that the model residuals can be grouped into two categories with respect to variance and thus choose a two-component Gaussian mixture distribution. It is possible to use more than two components, but we will show in our case study that two components are sufficient. We discuss the possibility of using other long-tailed distributions in Sect. 5.1.
Using a two-component Gaussian mixture distribution, we express the residual model at Stage 4 as where MN Z 4 (t), σ 2 4, 1 , σ 2 4, 2 , w represents a mixture of two Gaussian distributions N Z 4 (t), σ 2 4, 1 and N Z 4 (t), σ 2 4, 2 with weights w and 1 − w. The corresponding probability density function of MN Z 4 (t), σ 2 4, 1 , σ 2 4, 2 , w , denoted by PDF Z(t)| Z 4 (t), σ 2 4, 1 , σ 2 4, 2 , w , can be explicitly written as a weighted sum of two Gaussian probability density functions: where φ is the probability density function (PDF) of a Gaussian distribution. We assume that σ 4, 1 < σ 4, 2 to make the two components identifiable. This assumption implies that w represents the probability associated with the mixture component that has a smaller variance.

Model estimation
The maximum likelihood estimation (Li et al., 2013;Wang et al., 2009) is used to estimate model parameters at all four stages. Denote the parameter set as θ S for Stage S. The likelihood functions for the four stages are given by for S = 1, 2, 3, and where J z→Q = 1/ tanh {a + bQ(t)} is the Jacobian determinant of the log-sinh transformation. At Stage 1, the hydrological model parameters, transformation parameters (a and b) and the residual standard deviation (σ 1 ) are jointly estimated by maximizing the likelihood function. It is also possible to use a set of parameters already calibrated for the hydrological model (using a different objective, such as the least of the sum of squared errors) and estimate at Stage 1 only the transformation parameters and the residual standard deviation (see discussion in Sect. 5.2). At the end of Stage 1, the values of the hydrological parameters and the transformation parameters are concluded, without further changes in subsequent stages.
At Stage 2, the bias-correction parameters (c and d) and the residual standard deviation (σ 2 ) are estimated by maximizing the likelihood function. At the end of Stage 2, the values of the bias-correction parameters are concluded. At Stage 3, the autocorrelation coefficient (ρ) and the residual standard deviation (σ 3 ) are estimated. At the end of Stage 3, the value of the autocorrelation coefficient is concluded. At Stage 4, the model residual parameters (σ 4, 1 , σ 4, 2 and w) are finalized. The variances at Stages 1-3 (i.e. σ 1 , σ 2 and σ 3 ) are not used to generate forecasts but only for estimating parameters at corresponding stages. We use maximum likelihood at each stage to estimate parameters, and this requires us to specify the variance of residuals at each stage.
The shuffled complex evolution (SCE) algorithm (Duan et al., 1994) is used to maximize the log likelihood function at Stage 1, where a number of parameters are required to be calibrated. The simplex algorithm (Nelder and Mead, 1965) is used in the likelihood-based calibration at other stages, where fewer parameters are present. We use different optimization algorithms because the simplex algorithm is more computationally efficient when the number of parameters is small.

Model verification
We use several performance measures to evaluate the ensemble forecasts derived at each stage. The evaluation criteria suggested by Engeland et al. (2010) are used to test for important attributes of ensemble forecasts including reliability, sharpness and efficiency.
Reliability is often described as the property of statistical consistency, which allows ensemble forecasts to reproduce the frequency of an event. Reliability can be checked by the forecast probability integral transform (PIT) of streamflow observations, defined by where F t is the forecast cumulative distribution function of the streamflow at time t. In the case of zero flows, we use the pseudo-PIT (Wang and Robertson, 2011), which is randomly generated from a uniform distribution with a range [0, π t ]. If a forecast is reliable, π t follows a uniform distribution over [0, 1]. We graphically examine π t with the corresponding theoretical quantile of the uniform distribution using the PITuniform probability plot (or simply PIT plot; also called the predictive quantile-quantile plot; Renard et al., 2010). The PIT plot is closely related to the rank histogram (Gneiting et al., 2005;Hamill, 2001). From our experience, the PIT plot is more suitable than the rank histogram for the experiments where observations are abundant (such as daily or sub-daily forecasting verification). A perfectly reliable forecast follows the 1 : 1 line. A Kolmogorov-Smirnov significant band can be included in the PIT plots to as a test of uniformity (Laio and Tamea, 2007). In addition, PIT plots can be summarized by the α index , defined by where π * t is the sorted π t in increasing order. The α index represents the total deviation of π * t from the corresponding uniform quantile (i.e. the tendency to deviate from the 1 : 1 line in PIT diagrams). The range of the α index is from 0 (worst reliability) to 1 (perfect reliability).
Sharpness is a measure of the spread of the forecast probability distribution. Sharp forecasts with narrow forecast intervals are usually preferred by forecast users as they reduce the range of possible outcomes that are anticipated -that is, it is easier to make decisions with sharp forecasts. However, if a sharp forecast is unreliable, it is overconfident and is likely to lead to poor decisions. Thus, sharp forecasts are desirable, but only if the forecasts are also reliable. We use the average width of the 95 % forecast intervals (AWCI) to indicate forecast sharpness (Gneiting et al., 2007). Wider forecast intervals suggest less sharp forecasts. In order to compare the sharpness across different catchments, we define a score relative AWCI with respect to a reference forecast: where AWCI REF is AWCI calculated from the reference forecast. The reference forecast in this study is generated by resampling historical streamflows. To issue a reference forecast for a given month/year (e.g. February 1999), we randomly draw a sample of 1000 daily streamflows that occur in that month (e.g. February) from other years (e.g. years other than 1999) with replacement. As only 14 years of data are used in this study, the reference forecast for each month is more robust than the similar reference forecast for each day. The relative AWCI is unitless and the maximum is 1, corresponding to the sharpest forecast.
The efficiency (or accuracy) of a forecast is commonly used to assess deterministic (single-valued) forecasts. For the ensemble forecasts we generate here, we measure the efficiency with the well-known Nash-Sutcliffe efficiency (NSE) (Nash and Sutcliffe, 1970), calculated for the forecast mean. A greater value of NSE indicates a more accurate forecast mean and thus better forecast efficiency. We also use relative bias to assess how the forecast mean deviates from observations.
We evaluate the overall forecast skill with a skill score derived from the widely used continuous ranked probability score (CRPS) (Gneiting and Katzfuss, 2014;Grimit et al., 2006;Wang et al., 2009) (denoted by CRPS_SS). CRPS is a negatively oriented score: a smaller value of CRPS indicates a better forecast. As with the relative AWCI, the skill score CRPS_SS is defined as the normalized version of CRPS with respect to a reference forecast: where CRPS REF is CRPS calculated from the reference forecast (already defined for Eq. 18, above). The maximum of CRPS_SS is 1, corresponding to a perfectly skillful forecast. While a decomposition of CRPS is available that gives an indication of reliability (Hersbach, 2000), we do not use this.
PIT plots are a stronger test of reliability (Candille and Talagrand, 2005), and accordingly we focus on PIT plots to discuss reliability.
3 Case study

Study region and data
We select six catchments in southeastern Australia and three catchments in the United States (US) for this study (Fig. 1), from a range of climatic and hydrological conditions. The streamflow data for the Australian catchments are obtained from the Catchment Water Yield Estimation Tool (CWYET) data set (Vaze et al., 2011). The rainfall and potential evaporation data for the Australian catchments are taken from the Australian Water Availability Project (AWAP) data set (Jones et al., 2009). All data for the US catchments are taken from the Model Intercomparison Experiment (MOPEX) data set (Duan et al., 2006). The Abercrombie and Emu catchments have many instances of zero flow (  Table 3.

Cross-validation
Daily streamflow is simulated with the GR4J rainfall-runoff model (Perrin et al., 2003) and then forecasted with ERRIS as described in Sect. 3. GR4J is a widely used conceptual model that was designed to be as parsimonious as possible. Its four parameters describe the depth of a production store (X1), groundwater exchange (X2), the depth of a routing store (X3) and the length of unit hydrographs (X4). Forecasts are generated from "perfect" (observed) deterministic rainfall forecasts at a lead time of 1 day (i.e. one time step ahead). All results reported in this study are based on cross-validation unless specified. Cross-validation allows us to generalize the forecast skill to data outside the sample period. Because of data availability, we choose different study periods for Australian and US catchments. For Australian catchments, data from 1990 to 1991 are used to warm up the hydrological model and the data from 1992 to 2005 are used to generate a leave-two-years-out cross-validation (i.e. effectively 14fold cross-validation). For a particular year, we remove the streamflow data from this year and the following year and apply ERRIS to forecast the streamflow for the year. The removal of the data from the following year aims to minimize the impact of streamflow memory on model performance. For US catchments, the data from 1979 to 1980 are used in the warm-up period and the data from 1981 to 1998 are used for a leave-two-years-out cross-validation (i.e. effectively 18fold cross-validation).

Results
Figure 2 compares forecasts at different stages for an example period. In this example, we generate daily streamflow forecasts for the Mitta Mitta catchment in the period between 1 July 2000 and 31 December 2000. The forecast mean and the 95 % forecast interval are plotted against observations. The forecast at Stage 1 (the base hydrological model forecast) frequently overestimates low flows, such as in the period between July and September. For high-flow periods (e.g. October), the forecast mean is generally more accurate but virtually all observations lie within the 95 % forecast intervals, suggesting that the forecast intervals are too wide (i.e. the forecasts may be underconfident). The forecast mean at Stage 2 is closer to the observations and the 95 % forecast intervals tend to be narrower. Stage 2 tends to overestimate high flows less than Stage 1, but introduces the problem of underestimating high flows in some instances (e.g. September).
The AR error updating applied in Stage 3 significantly reduces the forecast residuals, as we expect given that streamflows are usually heavily autocorrelated. The forecasts at Stage 3 are not only more accurate but also more certain, indicated by the considerably narrower 95 % forecast intervals. The differences between Stage 3 and Stage 4 are not evident in the time-series plots, in essence because Stage 4 is an attempt to address issues of reliability, which is difficult to see when forecast intervals are so narrow. We give a detailed view of changes to reliability at each stage below. Figure 3 summarizes the performance at each stage for all catchments, and generally confirms the improvements in performance at each stage observed in Fig. 2. In general, Stage 1 and Stage 2 are similarly efficient (Fig. 3b), skillful (Fig. 3c), sharp (Fig. 3d) and reliable (Fig. 3e). As we expect, Stage 2 forecasts are consistently less biased than Stage 1 (Fig. 3a) (except for the Hope catchment, where many instances of zero flow occur; see Table 2). Stage 3 is generally much more efficient and skillful than Stages 1 and 2. A partial exception to this is the Abercrombie catchment, which is less efficient at Stage 3 than Stage 2. The Abercrombie catchment experiences low (to zero) flows, but is also punctuated by abrupt high flows. Stage 3 is based on the time persistence of the residuals and may introduce more errors when flows change abruptly, which sometimes occurs in the Abercrombie catchment. In addition, residuals tend to be larger at higher flows and because NSE is a measure of squared residuals, it tends to give more weights to residuals at high flows. This causes the Abercrombie Stage 3 forecasts to be less efficient than those of Stage 2.
As we expect, Stage 3 forecasts are notably sharper than those at Stage 2 (Fig. 3d). However, this sharpness is not supported by reliability: Stage 3 forecasts tend to be much less reliable than all other stages (Fig. 3e). Figure 4 illustrates the reliability of the forecasts at each stage in more detail with the PIT plots. As PIT values are highly autocorrelated, we have to "thin" them in order to make the Kolmogorov-Smirnov significant band applicable (Zhao et al., 2015). We generate PIT plots from every 30th forecast to eliminate the autocorrelation. The PIT plots show that the forecasts at the first two stages are reliable (as with the α index in Fig. 3e). However, for Stage 3 the points on the PIT plots deviate substantially from the 1 : 1 line, with a clear S-shape pattern for almost all catchments (the exception is the Tarwin catchment). A traditional interpretation of this S shape is that the forecasts are underconfident (Laio and Tamea, 2007). However, in this case, the S shape is caused by the high level of kurtosis in the distribution of the residuals, as we will show below. The α index from Stage 3 is smaller than those indices from Stages 1 and 2 (the Tarwin catchment is the only exception), confirming the lack of the reliability at Stage 3. Stage 4 consistently improves the reliability of the forecast after the AR updating. The PIT plot at Stage 4 is much closer to the 1 : 1 line than that at Stage 3, and this is reflected by the α index, which increases for all catchments. Stage 4 corrects the underconfident forecasts from Stage 3 and slightly decreases the sharpness from Stage 3 (Fig. 3d).
At Stage 3, unreliable forecasts are caused by representing the model residual by an inappropriate (Gaussian) probability distribution. We compare the underlying density of the model residuals at Stage 3, ε(t) = Z 3 (t)− Z 3 (t) (fitted by the nonparametric density estimation), with the fitted parametric densities for different distributions in Fig. 5. The fitted Gaussian density is flatter than the underlying density of ε(t) in order to match the tails for each catchment. This suggests that the residual distribution is more peaked and has longer tails than the Gaussian distribution. As we have seen above, forecast residuals are, in general, dramatically reduced by the AR error updating. Unfortunately, this reduction in residuals does not occur at all events, especially where abrupt changes in flow occur (and hence the assumption of strong autocorrelation breaks down). Thus, the magnitude of the forecast residuals at Stage 3 for a small proportion of events is large relative to the majority of events. As we have seen, the practical implication of the dichotomous behaviour of the residuals is that their distribution is still bell-shaped and symmetric but is peakier and has a much longer tail than the Gaussian distribution. The Gaussian mixture distribution treats model residuals as two groups with different variances. The Gaussian mixture distribution is able to capture the peak and tails of the underlying residual density for all catchments, resulting in reliable ensemble forecasts that also have a highly accurate forecast mean. As we note in the introduction, however, other distributions have also been used to describe "peaky" data, and we explore these in the next section. To provide a basis for any future comparisons with this study, we include example parameter values for each stage in Table 4 (derived by calibrating each stage to the full set of data -i.e. without cross-validation). We note that (1) the variance parameter at Stage 3 is always much smaller than at Stage 1 and Stage 2, which leads to the dramatic reduction in the width of forecast intervals at this stage, and (2) that the w parameter that weights the component of the Gaussian mixture distribution with smaller variance is always greater than 0.5, confirming that the majority of residuals take a narrow range of values as we have described.

Testing an alternative residual distribution
It is possible to use long-tailed distributions other than the Gaussian mixture distribution at Stage 4. For example, Student's t distribution is a simple long-tailed distribution that has been used in hydrological modelling (e.g. Marshall et al., 2006). In this section we investigate whether Student's t distribution is a viable alternative to the Gaussian mixture distribution at Stage 4. To do this, we modify the model residual in Eq. (12) as follows: where ξ(t) is assumed to independently follow a Student's t distribution with ν degrees of freedom, and r is a scale parameter describing the spread and variation of the model residuals.
We first examine how well Student's t distribution can fit the residual distribution at Stage 4 for all nine catchments (Fig. 5). High peaks and long tails of the residual densities can be captured reasonably well by Student's t distribution for nearly all catchments. The fitted densities of Student's t distribution appear more "peaked" for most catchments than those of the Gaussian mixture distribution, which is originally used at Stage 4. Figure 6 further investigates how Student's t distribution can fit the upper quantile of the model residuals. There is a clear tendency of Student's t distribution to overestimate the upper quantile (e.g. 98 % or higher) of the model residuals (especially for the Australian catchments). These upper quantiles are more accurately estimated by the Gaussian mixture distribution. This implies that Student's t distribution often has tails that are too long. We note, however, that if the ERRIS method is tested on other catchments, it is possible that Student's t distribution may describe the residuals better than the Gaussian mixture distribution in some cases. A disadvantage of the very long tail of Student's t distribution is that it can be problematic for operational forecasting. The degrees of freedom, ν, determine how heavy the tails of Student's t distribution are. Table 5 presents the two calibrated parameters (i.e. ν and r) for all catchments. Calibrated ν values are less than 2 for eight out of nine catchments. The exception is the Hope catchment, and even here the calibrated ν is very close to 2. It is well known that for degrees of freedom less than 2, Student's t distribution is so heavy-tailed that the variance is infinite (if 1 < ν ≤ 2) or even undefined (if ν ≤ 1). This is obviously undesirable for operational forecasting: it can cause a few forecast ensemble members to be so large that the forecast mean becomes implausibly large. Figure 7 compares the forecast mean with observations if the model residual is revised as Eq. (19). In all catchments, in some cases forecast mean values are unrealistically large even as observations are relatively small. Student's t distribution is thus prone to being too long-tailed to be practically implemented. Therefore, we do not recommend using Student's t distribution to describe the residual distribution at Stage 4, and advocate the Gaussian mixture distribution as a practical alternative.

Testing an alternative calibration of the hydrological model
In this study, we apply a likelihood-based calibration at Stage 1 to derive the distribution of the forecast residuals. However, in operational practice forecasters may prefer to use their own methods for calibrating hydrological models (or it may be onerous to recalibrate large numbers of hydrological models, whatever method is used). It is possible to simply "bolt on" the ERRIS method to existing hydrological models. We simply need to calibrate the transformation parameters and the model residual standard deviation at Stage 1 while fixing the hydrological parameters to those already calibrated. We demonstrate this by first calibrating hydrological models with a simple least-squares objective. We then apply the ERRIS method and repeat the cross-validation analysis. Figure 8, an analog to Fig. 3, summarizes forecast performance when the hydrological model is calibrated to a leastsquares objective. The least-squares calibration essentially maximizes NSE as an objective, but the corresponding crossvalidated NSE is not necessarily always greater than that of the likelihood-based calibration. The forecast performance from the two different calibrations can differ markedly at Stage 1 but is largely similar after the AR error updating at Stages 3 and 4. Thus, ERRIS is flexible enough to accommodate existing hydrological models. Figure 9, an analogue to Fig. 4, compares the PIT plots for different catchments when the hydrological model is leastsquares-calibrated. The main change is that the forecasts at Stage 1 are no longer reliable in many instances. This is caused by the least-squares calibration, which does not ensure the forecast residuals are Gaussian (even after the logsinh transformation). The PIT plots derived from Stage 2 and  Stage 3 in Fig. 9 show a very similar pattern to their counterparts in Figure 4. It suggests that poor reliability at Stage 3 occurs irrespective of the calibration strategy employed for the hydrological model. As with Fig. 4, Fig. 9 shows the Gaussian mixture distribution used at Stage 4 effectively ameliorates the problems with the reliability of Stage 3.

Discussion
There are several advantages of using a multi-stage error model compared to a single complex error model. (1) The parameter estimation in ERRIS is relatively simple, and hence computationally efficient. Only a small number of parameters are estimated at each stage. Joint parameter estimations associated with a single complicated error model are often more computationally demanding.
(2) Interference between parameters is minimized. The parameters of a single complex model can confound each other and the contribution of one parameter can sometimes be explained by others. For example, the hydrological model parameters describing soil moisture storage capacity may interfere strongly with the error parameters describing bias. Interference between parameters can make the parameter estimation unstable, because more than one set of parameters can achieve a similar ob- jective function value, and thus over-fit parameters.
(3) In operational forecasting it is often important that individual components of the forecasting model can function independently. For example, if forecasts are issued to long lead times, the influence of an AR model diminishes as lead time extends. Thus forecasts at long lead times rely strongly on the hydrological model (and, in our case, a bias correction) to be plausible, even with perfect meteorological forcings. If all parameters are estimated jointly, it is difficult to guarantee that each component of a forecasting model can operate independently. In addition, because stages are independent, it is possible to change a stage without affecting other stages, making the ERRIS approach easy to extend or modify. This paper is aimed at developing a staged error model suitable for eventual use in an operational ensemble forecasting system. We have focused on presenting the theoretical underpinnings of this approach, and have limited its testing to forecasting with "perfect" (observed) rainfall forecasts at a lead time of 1 day. Operational systems routinely forecast to long lead times, and use uncertain rainfall forecasts to force hydrological models. In future work we will extend the validation of this model to forecast multiple lead times, and couple the ERRIS approach with reliable ensemble rainfall forecasts Shrestha et al., 2015).
The staged approach of ERRIS sets it apart from several predecessors, for example the hydrological uncertainty processor (HUP) and the dynamic uncertainty model by regression on absolute error (DUMBRAE). HUP is a Bayesian forecasting system to produce probabilistic streamflow forecasts (Kelly and Krzysztofowicz, 1997;Krzysztofowicz, 1999Krzysztofowicz, , 2001Krzysztofowicz and Kelly, 2000;Reggiani et al., 2009;Todini, 2008). HUP and ERRIS have some similarities: (1) both are post-processors of deterministic hydrological models for hydrological uncertainty quantification, (2) both apply transformation to normalize data, (3) both use a linear regression in the transformed space for bias correction, and (4) both use an autoregressive model to update hydrological simulation. However, ERRIS differs fundamentally from HUP by being implemented in stages. As we have noted, the staged approach avoids unwanted interaction between parameters, and ensures the base hydrological model performs as strongly as possible. In addition, some other technical advances distinguish ERRIS from HUP. For instance, ERRIS applies a restricted autoregressive model in order to avoid the possible overcorrection from the ordinary autoregressive model used in HUP. ERRIS uses a mixture of two Gaussian distributions for the residual distribution, which is more flexible than a Gaussian distribution used in HUP to describe the peak, shoulder and tail of the distribution. Pianosi and Raso (2012) developed DUMBRAE to quantify predictive uncertainty of deterministic hydrological models. Unlike ERRIS, DUMBRAE does not apply data transformation and considers an error model in the original space. To deal with heteroscedastic residual errors, DUMB-RAE explicitly formulates the error variance as a function of time series of absolute hydrological model errors and several independent predictors (such as precipitation). The dynamic variance model of DUMBRAE is an interesting alternative to the method we have presented here. As with HUP, another major difference between ERRIS and DUMBRAE is staged error modelling that allows ERRIS to characterize the forecast error in stages and to avoid potential parameter interference and ensure robust performance of the base hydrological model.

Summary and conclusions
In this study, we introduce the error reduction and representation in stages (ERRIS) method to update errors and quantify uncertainty in streamflow forecasts. The first stage of  ERRIS employs a simple error model that assumes independent Gaussian residuals after the log-sinh transformation. The second stage applies a bias correction that is able to correct conditional and unconditional biases, including the sometimes strongly non-linear biases that occur in ephemeral catchments. The third stage exploits autocorrelation in residuals with an AR model to dramatically reduce forecast residuals, but this results in unreliable ensemble forecasts. In the fourth stage a Gaussian mixture distribution is used to describe the residuals, resulting in ensemble forecasts that are both highly accurate and very reliable. Based on extensive validation of ERRIS, the accuracy of the forecast mean is slightly improved by the bias correction at Stage 2 and is considerably improved by the updating at Stage 3. The reliability of the forecasts at Stage 3 becomes a problem, because the shape of the residual distribution dramatically changes. The revision of the residual distribution at Stage 4 is effective for representing non-Gaussian residuals and leading to highly reliable forecasts. The Gaussian mixture distribution is showed to be more suitable than the Student's t distribution for describing the residuals after updating. ERRIS was designed with operational forecasting in mind, and we have shown that it is flexible enough to adapt to existing calibrated hydrological models.

Data availability
All data for the US catchments are taken from the MOPEX data set, which are available at http://www.nws.noaa.gov/ ohd/mopex/mo_datasets.htm. All data for the Australian catchments are available from the corresponding author, Ming Li (ming.li@csiro.au), upon request, because the underlying data sets CWYET and AWAP (see Sect. 3.1) are not publicly available.