Accounting for three sources of uncertainty in ensemble hydrological forecasting



Introduction
The complexity of hydrometeorological systems is such that it is not possible to perfectly represent their "true" descriptive physical processes, and even less to integrate them forward in time with mathematical models. These models are only approximations of varying quality of the variables of interest, yet they have proved skillful and useful for water resource management and hazard prevention (e.g., Bartholmes et al., 2009; Pagano et al., 2014; Demargne et al., 2014).
Inadequacies between simulations or predictions and observations can be largely attributed to the many sources of uncertainty that are located along the hydrometeorological chain (e.g., Walker et al., 2003; Beven and Binley, 2014). Hence, it is widely accepted that improvement of the forecast ought to go through understanding and reducing the sources of uncertainty (e.g., Liu and Gupta, 2007). These sources differ in nature, ranging from epistemic uncertainty, due to the imperfection of our knowledge, to variability uncertainty, due to the inherent variability of the system (e.g., Walker et al., 2003; Beven, 2008). They also differ in location, i.e., where they lie in the hydrometeorological modeling chain: meteorological forcing, model parameters and structure, hydrological initial conditions, and, to a lesser extent, observations (Walker et al., 2003; Vrugt and Robinson, 2007; Ajami et al., 2007; Salamon and Feyen, 2010).
As all models are exposed to these sources of uncertainty, they necessarily lead to imperfect forecasts. It is thus possible, and frequent, that several models can simulate the process of interest with the same accuracy. These simulations are equally likely in the mathematical sense; this is referred to as the principle of equifinality (Beven and Binley, 1992).
Ensembles provide a probabilistic answer to the equifinality problem. They are a collection of deterministic predictions issued by different models for the same event, which attempts to produce a representative sample of the future. They can be built by a suitable method wherever a source of uncertainty needs to be put under scrutiny. Additionally, the ensemble mean is generally more skillful than deterministic systems and offers a better ability to forecast extreme events (e.g., Wetterhall et al., 2013).
As the sources of uncertainty differ in their location, nature, and statistical properties, specific tools are needed to decipher them efficiently (Liu and Gupta, 2007). A wide range of methods has been developed over the past years to cater to hydrological forecasting needs.
At the beginning of the 1990s, meteorologists pioneered the operational use of ensembles by constructing meteorological ensemble prediction systems (MEPSs), mostly to take into account imperfect initial conditions, an uncertainty source of prime importance in view of the chaotic nature of atmospheric physics. Several methods have been proposed to tackle this issue. For instance, to define the initial condition uncertainty, the European Centre for Medium-Range Weather Forecasts (ECMWF) generates an ensemble by initiating its model with singular vectors (Molteni et al., 1996), to which a stochastic scheme is added to deal with the uncertainty of the model physical parametrization (Buizza et al., 1999).
The increasing accessibility of MEPSs has benefited the hydrology community in issuing probabilistic hydrological forecasts that take meteorological forcing uncertainty into account with hydrological ensemble prediction systems (HEPSs; e.g., Cloke and Pappenberger, 2009; Brochero et al., 2011; Boucher et al., 2012; Abaza et al., 2014). Since 2007, The Observing System Research and Predictability Experiment (THORPEX) Interactive Grand Global Ensemble (TIGGE) has allowed free access to meteorological ensemble forecasts for hydrologists and other researchers. This database regroups the outputs of nine operational atmospheric models around the world, which can be downloaded in GRIB2 format.
A lot of attention has been paid to the identification of hydrological model parameters and the non-uniqueness of the solutions. Among other techniques, Vrugt et al. (2003) proposed the shuffled complex evolution metropolis algorithm (SCEM-UA), a calibration technique that retains several sets of parameters instead of a single one for a more realistic assessment of parameter uncertainty. Beven and Binley (1992) suggested a more comprehensive approach for model acceptance or rejection with the generalized likelihood uncertainty estimation (GLUE), which allows one to include different forms of competing models. Gourley and Vieux (2006) asserted that dealing only with input and parameter uncertainty is likely to issue unreliable forecasts and that hydrological model structural uncertainty should be deciphered explicitly. This statement is substantiated by Clark et al. (2008), who compared 79 unique model structures and concluded that a single structure is unlikely to perform better than the others in all situations. Poulin et al. (2011) added that structural uncertainty is larger than parameter estimation uncertainty and produces more diverse outputs. Combining dissimilar hydrological model structures has proved to possess great potential (Breuer et al., 2009), even with simple combination patterns (Ajami et al., 2006; Velázquez et al., 2011; Seiller et al., 2012).
Initial condition uncertainty has also aroused scientific interest. Many studies using various data assimilation techniques to incorporate observations within the simulation processes demonstrated that the specification of catchment descriptive states is a crucial aspect of short- and medium-range forecasting (DeChant and Moradkhani, 2011; Lee et al., 2011). Among them, sequential data assimilation techniques such as the particle filter (e.g., DeChant and Moradkhani, 2012; Thirel et al., 2013), the ensemble Kalman filter (EnKF) (e.g., Weerts and El Serafy, 2006; Rakovec et al., 2012), and variants (Noh et al., 2013, 2014; Chen et al., 2013; McMillan et al., 2013) can substantially improve forecasting skills over the open-loop scheme (i.e., no data assimilation is performed) by reducing and characterizing the uncertainty in initial conditions.
Considerable efforts have been made in the development of these sophisticated techniques, giving rise to many tools that have individually been shown to be useful. As Bourdin et al. (2012) pointed out, "to date, applications of ensemble methods in streamflow forecasting have typically focused on only one or two error sources [. . .] A challenge will be to develop ensemble streamflow forecasts that sample a wider range of predictive uncertainty". As underlined, the forecasting tools frequently tackle different sources of uncertainty and therefore do not exclude each other but can be seen as complementary, combining their assets to compose an overall better system. The present study identifies three efficient tools, namely a hydrological multimodel approach, the EnKF, and MEPS forcing, that are used together to decipher the traditional hydrometeorological sources of uncertainty. The scope of the paper is to identify how they complement each other, to assess their individual contribution to hydrological forecast reliability and accuracy, and to eventually evaluate the possibility of achieving reliability without resorting to post-processing. This is achieved by issuing hindcasts on 20 catchments using the aforementioned techniques, either individually or combined, to investigate their specific role in the forecasting process. Each of them produces an ensemble that can be cascaded through the next ensemble technique in order to produce a larger ensemble that possesses a more comprehensive error handling. Finally, if all sources of error are accounted for, the ensemble should generate a forecast that is reliable (Bourdin et al., 2012). This paper is organized as follows: Sect. 2 presents the catchments, models, EnKF basics, and scores; Sect. 3 sums up the systems' specificities and their respective performances, followed by a conclusion in Sect. 4.

Catchments and hydrometeorological data
Twenty catchments located in the south of the province of Québec were selected for this study (Fig. 1). The catchments experience a mixed hydrological regime, with a spring freshet resulting from the important winter snow cover and a lesser second peak in autumn. There is little or no human intervention on the catchments.
The climatology of the catchments is varied (Table 1), particularly in terms of annual snowfall and annual total precipitation.The differences in the catchments' physical characteristics (area, length, slope, etc.) and climatology are reflected in their streamflow statistics (e.g., average streamflow, coefficient of variation).
Daily total precipitation, maximum and minimum temperature, and streamflows were provided by the Centre d'Expertise Hydrique du Québec. They performed kriging of the observations over a 0.1° resolution grid, to which a temperature correction with an elevation gradient of −0.005 °C m−1 is added. The database is split into three periods: 1990-2000 for the calibration of the models and October 2005-October 2008 for the spin up, while November 2008-December 2010 is committed to the hydrological forecast assessment.
The MEPSs used as inputs to the hydrological models were retrieved from the TIGGE database. The temperature and precipitation forecasts from the ECMWF were chosen for this study. They are formed by 50 exchangeable members (Fraley et al., 2010) with a 6 h time step and a 10-day horizon. However, after conversion from Greenwich time to local Québec time, the horizon reduces to 9 days. For the sake of the study and to match the common framework of the hydrological models, weather forecasts are aggregated at a daily time step starting at 06:00 EST (12:00 UTC). The ECMWF raw forecasts are provided on a regular grid with a 0.5° horizontal resolution (N200 Gaussian grid), which is too coarse for this application, especially for the smallest catchments. To ensure that several representative grid points are situated within each catchment boundary, meteorological forecasts are downscaled to a 0.1° resolution during data retrieval by using bilinear interpolation (e.g., Gaborit et al., 2013). The interpolation also allows one to take into account the contribution of grid points that are close to but not directly situated within catchment boundaries, and thus allows for a better description of each catchment's meteorological conditions. As the rainfall-runoff models are lumped, a single representative point forecast is obtained for each MEPS member by averaging the downscaled grid points situated within the catchment boundaries.
The weather forecasts display acceptable performance over the 20 selected catchments. In fact, in the initial group of 38 catchments, 18 displayed unsatisfactory performances, so they were withdrawn from the experiment from the beginning, as pre-processing the meteorological inputs falls outside the scope of the project. When compared to the meteorological observations, the precipitation and temperature MCRPS (mean continuous ranked probability score) over the 9 days (see Sect. 2.4) remain below 3 mm and 3 °C, respectively, for the remaining 20 catchments. Other scores have been evaluated (Nash-Sutcliffe efficiency, root-mean-square error, mean absolute error, normalized root-mean-square error ratio) and are in agreement with the MCRPS values, confirming the exclusion of the aforementioned 18 catchments. An alternative to the ECMWF ensemble forecasts is used to simulate a deterministic meteorological forcing with equivalent theoretical skill. For this purpose, a single member is drawn randomly among the 50 exchangeable members.

Models, snow module, and evapotranspiration
The multimodel ensemble is composed of 20 conceptual lumped models. In this study, their outputs are pooled together with equal weights or studied individually. The models were initially selected by Perrin (2000) for their conceptual and structural diversity and revised by Seiller et al. (2012). They present various degrees of complexity: 4 to 10 calibrated parameters and 2 to 7 reservoirs to describe the main hydrological processes (Table 2). Model selection is a key element for an efficient multimodel ensemble, as the diversity among models contributes to encompassing the error in model conceptualization and structure (Viney et al., 2009). Close attention has been paid to the diversity of the different components of the models, especially regarding the representation of the different storages and flows. This maximizes the chance of encompassing the most effective way to describe storage and routing by providing an ensemble of likely descriptions of the processes. All models were derived from existing ones, keeping their main specificities but adapting them to a common framework in which every snow module and model set shares the same inputs, namely precipitation and potential evapotranspiration. The models, in their original form, are either lumped (GR4J, GARDENIA, HBV, MOHYSE, etc.) or use a spatial discretization of the catchment (CEQUEAU, TOPMODEL, SACRAMENTO, etc.). Models that were initially semi-distributed have been converted into lumped models (Perrin, 2000). This has been done in order to facilitate their integration in the common framework used in this study and for computational requirements.
The 20 conceptual lumped models are applied in a traditional way; i.e., no subsequent spatial discretization has been performed, hydrological processes are computed at the catchment scale, and the parametrization is uniform over the entire catchment. Despite their simplicity and the approximations they rely on, they have been shown to perform well and to be competitive with more complex ones, especially when combined (Thiboult and Anctil, 2015b). When a hydrological model originally included its own snow accumulation and melt module or evapotranspiration formulation, these were omitted and replaced by CemaNeige and Oudin's potential evapotranspiration formulation, respectively. Thus, the same snow accumulation and melt module and the same evapotranspiration formulation have been used for all hydrological models. A detailed description of the models' structure can be found in Perrin (2000).
CemaNeige, a degree-day snow accounting routine, is used to model the catchment snow processes (Valéry et al., 2014). It divides the catchment into five elevation bands and requires two parameters to be calibrated: a snowmelt factor and a cold-content factor. As it is calibrated conjointly with the individual models according to an objective function based on streamflow observations, its parameter values depend on the hydrological model with which it is coupled. The 20 hydrological models therefore have precipitation inputs that are driven by the same snow accounting routine but parametrized differently. Thus, part of the uncertainty related to the snowmelt module is taken into account through dissimilar parameter sets that drive the state of snowpack accumulation and melting.
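To illustrate the degree-day principle, a bare-bones daily step is sketched below. This is not CemaNeige itself (it omits the elevation bands and the cold-content state); the function name and the simple rain/snow partition at 0 °C are our assumptions.

```python
def degree_day_step(swe, precip, temp, melt_factor, t_melt=0.0):
    """One daily step of a simplified degree-day snow routine (sketch).

    swe: snow water equivalent (mm); precip: daily precipitation (mm);
    temp: mean air temperature (deg C); melt_factor: calibrated melt
    rate (mm per day per deg C). Returns the updated SWE and the liquid
    water (rain plus melt) passed on to the hydrological model.
    """
    # Partition precipitation into snow (temp <= 0 C) and rain.
    snow = precip if temp <= 0.0 else 0.0
    rain = precip - snow
    swe += snow
    # Melt is proportional to the temperature excess above t_melt,
    # bounded by the available snowpack.
    melt = min(swe, melt_factor * max(temp - t_melt, 0.0))
    swe -= melt
    return swe, rain + melt
```

In the study's setup, `melt_factor` (and the cold-content parameter omitted here) would be calibrated jointly with each hydrological model, so the same routine yields 20 different parametrizations.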
All models were given the same potential evapotranspiration input, which is computed following the formula from Oudin et al. (2005) that relies on the mean air temperature and the calculated extraterrestrial radiation.
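The Oudin et al. (2005) formulation is simple enough to sketch directly. The function name is ours, and the extraterrestrial radiation is assumed to be precomputed from latitude and day of year; the constants follow the published formula.

```python
def oudin_pet(re_mj_m2_day, temp_c):
    """Potential evapotranspiration (mm/day) after Oudin et al. (2005).

    re_mj_m2_day: extraterrestrial radiation (MJ m-2 day-1), computed
    from latitude and day of year; temp_c: mean daily air temperature
    (deg C). PET is zero whenever temp_c + 5 <= 0.
    """
    lam = 2.45    # latent heat of vaporization (MJ kg-1)
    rho = 1000.0  # density of water (kg m-3)
    if temp_c + 5.0 > 0.0:
        # Re / (lam * rho) is in m/day; scale by (T + 5)/100 and
        # convert to mm/day.
        return re_mj_m2_day / (lam * rho) * (temp_c + 5.0) / 100.0 * 1000.0
    return 0.0
```

For example, with Re = 30 MJ m-2 day-1 and a 15 °C mean temperature, the sketch yields roughly 2.4 mm/day.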

Forecasting approaches
Two approaches are used and compared for forecasting: the open loop and the EnKF. Regardless of the method used, the meteorological observations over the 3 years preceding the forecast period are used for model spin up to provide better estimates of initial catchment conditions.

Open-loop forecasting
When the open-loop forecast is activated, the state variables are obtained in simulation mode and used as a starting point to initiate the hydrological forecast. The simulation and forecast steps then alternate as follows: (1) the models are forced with observations up to the first day t of the forecast, and (2) the models are next forced with meteorological forecasts to issue the hydrological predictions until t + 9. The procedure is repeated as the models are brought forward in time with the observations from t.
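The alternation of steps (1) and (2) can be sketched as follows. `model_step` stands for a hypothetical one-day model integration returning updated states and a flow; the function and argument names are ours.

```python
def open_loop_forecasts(model_step, states, obs_forcing, met_forecasts, horizon=9):
    """Open-loop forecasting sketch: simulate with observations, then
    forecast with meteorological forecasts, never feeding forecast
    states back into the simulation.

    model_step(states, forcing) -> (new_states, flow); obs_forcing[t] is
    the observed forcing for day t; met_forecasts[t] is the list of
    `horizon` daily forecast forcings issued on day t.
    """
    forecasts = []
    for t in range(len(obs_forcing)):
        # (1) bring the model one day forward with observations
        states, _ = model_step(states, obs_forcing[t])
        # (2) issue the hydrological forecast from the current states
        fc_states, hydrograph = states, []
        for lead in range(horizon):
            fc_states, flow = model_step(fc_states, met_forecasts[t][lead])
            hydrograph.append(flow)
        forecasts.append(hydrograph)
    return forecasts
```

The simulation states are untouched by the forecast loop, which is what distinguishes the open loop from any updating scheme.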

Ensemble Kalman filter
The EnKF is a sequential data assimilation technique that uses a recursive Bayesian estimation scheme to provide an ensemble of possible model re-initializations. The model state variable vector X is updated according to its probability density function conditioned on the observations z, p(X_t | z_1:t), where the index t refers to time.
When an observation becomes available, model states are updated (X+, the a posteriori estimate) as a combination of the predicted states (X−, also called the a priori states) and the difference between the prior estimate of the variable of interest HX− and the corresponding observation z_t:

X+_t = X−_t + K_t (z_t − H X−_t),

where H is the observation model that relates the state vectors and observations, and K is the Kalman gain matrix that defines the relative importance given to the output error and prior estimate, respectively. The Kalman gain is defined with the model error covariance matrix P_t and the covariance of observation noise R_t as

K_t = P_t H^T (H P_t H^T + R_t)^−1.

A detailed explanation of the EnKF mathematical background and concepts can be found in Evensen (2003). In this study, the filter has been implemented in its traditional form following Mandel (2006).
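The analysis step can be sketched as a stochastic (perturbed-observation) EnKF in NumPy. The function name and the scalar-observation setup are our assumptions; the gain is estimated from the sampled ensemble covariance, as in the traditional implementation.

```python
import numpy as np

def enkf_update(X, z, H, r_var, rng):
    """Stochastic EnKF analysis step (sketch after Evensen, 2003).

    X: (n_states, n_members) a priori ensemble; z: scalar observation
    (e.g., streamflow); H: (1, n_states) observation operator; r_var:
    observation error variance R. Returns the a posteriori ensemble.
    """
    n = X.shape[1]
    # Ensemble anomalies around the mean.
    A = X - X.mean(axis=1, keepdims=True)
    P = A @ A.T / (n - 1)                          # model error covariance P_t
    S = np.atleast_2d(H @ P @ H.T + r_var)         # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain K_t
    # Perturbing the observation keeps the posterior spread consistent.
    z_pert = z + rng.normal(0.0, np.sqrt(r_var), size=n)
    return X + K @ (z_pert[None, :] - H @ X)
```

With a tight observation, the updated ensemble mean moves toward the observation and its spread shrinks, which is the behavior exploited for re-initialization here.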
The EnKF is able to decipher catchment initial condition uncertainty as it acts on variables after the spin-up time, i.e., at the very start of the hydrological forecast. Thus, it is frequently presented as a tool that describes the uncertainty of catchment descriptive states, such as soil moisture, but it also implicitly takes into account model parameter and structural uncertainty, as these are reflected in the model states and output errors. The forecasting system comprises inaccuracies at several levels, and consequently the error statistics that the EnKF uses to update state variables are due not only to the variability uncertainty (the uncertainty due to the inherent variability of the values of interest) but also to the epistemic uncertainty (the uncertainty related to the imperfect knowledge of the processes) that lies in the values of the state variables as well.
The EnKF performance is highly influenced by its setting, in particular by the required noise specification of inputs and outputs (Noh et al., 2014) and also by the choice of the updated state variables (Li et al.). This directly affects the spread of the ensemble and the corresponding uncertainty description (Thiboult and Anctil, 2015a). As the level of uncertainty varies with the model used and the catchment simulated, the optimal EnKF implementation also depends to a great extent on these aspects (Thiboult and Anctil, 2015a).
In practice, it is complex to untangle uncertainties through the use of the EnKF. The filter, in its traditional form, can decipher the overall predictive uncertainty but does not distinguish between input-output, structural, and parameter uncertainty. By artificially and deliberately overestimating the input uncertainty, it is possible to compensate for uncertainties that are not explicitly addressed and achieve reliability in simulation, and possibly during forecasts for the first lead times.
In this study, the EnKF is tuned to optimize reliability and accuracy per catchment and per model. The retained specifications are identified after extensive testing has been carried out. More precisely, two or three noise levels were tested for each input and output (a 25-50-75 % standard deviation of the mean value with a gamma law for precipitation, a 10-25-50 % standard deviation of the mean value with a normal law for streamflow observations, and a 2-5 °C standard deviation with a normal law for temperature). Additionally, as the choice of updated state variables is also a key component of the EnKF, all possible combinations of updated state variables were tested with the 12 noise combinations described above. The retained EnKF settings were based on a two-step criterion: first, the three settings that presented the best reliability were kept, and then the one among them that led to the lowest bias. Therefore, the optimal settings may use unrealistically high perturbations that compensate partially for the structural error. A detailed description of the EnKF optimization with the 20 models is provided in Thiboult and Anctil (2015a). In this study, where the EnKF is meant to be combined with the multimodel approach, dual state-parameter updating was not considered, since it is expected that the multimodel accounts for structural and parameter uncertainty simultaneously (Poulin et al., 2011), relieving the need to modify (update) model time-invariant parameters.

Scores
The continuous ranked probability score (CRPS; Matheson and Winkler, 1976) is a common verification tool for probabilistic forecasts that assesses accuracy and resolution. A cumulative distribution function is built from the raw predictive ensemble, i.e., the collection of deterministic forecasts, and then compared to the observation. It is defined as

CRPS_t = ∫ [F_t(x) − H(x − x_obs)]² dx,

where F_t(x) is the cumulative distribution function at time t, x is the predicted variable, and x_obs is the corresponding observed value. The function H is the Heaviside function, which equals 0 for predicted values smaller than the observed value and 1 otherwise. The CRPS shares the same unit as the predicted variable x.
As the CRPS assesses the forecast at a single time step, the MCRPS is defined as the average CRPS over the entire period. The MCRPS reduces to the mean absolute error (MAE) if a single member is considered, which allows one to compare deterministic and probabilistic forecasts (Hersbach, 2000; Gneiting and Raftery, 2007). Finally, a value of 0 indicates a perfect forecast and there is no upper bound.
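For a finite ensemble, the CRPS can be evaluated without numerical integration through the kernel identity CRPS = E|X − x_obs| − ½ E|X − X′| (Gneiting and Raftery, 2007), which equals the integral definition when F is the empirical CDF of the members. A minimal sketch (the function name is ours):

```python
def crps_empirical(members, obs):
    """CRPS of an ensemble forecast against a single observation.

    Evaluates E|X - obs| - 0.5 * E|X - X'| over the empirical ensemble
    distribution. Shares the unit of the forecast variable; reduces to
    the absolute error (MAE contribution) for a one-member ensemble.
    """
    m = len(members)
    term1 = sum(abs(x - obs) for x in members) / m
    term2 = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return term1 - term2
```

Averaging `crps_empirical` over all verification time steps gives the MCRPS used throughout the paper.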
The reliability diagram (Stanski et al., 1989) is a graphical method to assess the reliability of a predictive ensemble by plotting forecasted against observed event frequencies. A perfectly reliable forecast is represented by a 45° line, which indicates that forecasted and observed frequencies are equal. If the joint distribution curve differs from the perfect reliability line, it indicates that the spread of the ensemble does not perfectly match its predictive skill. A curve situated above the perfect reliability line denotes an overdispersion of the ensemble, and an underdispersion in the opposite case.
The notion of reliability is twofold. Since the reliability curve assesses the dispersion relative to the predictive skill of the ensemble, it is possible to have a perfectly reliable system with a low predictive capability if the dispersion is very high. For disambiguation, the ensemble spread is added to the plots.
Practically, one can define the deviation from perfect reliability by estimating a measure of distance between the forecast reliability curve and the perfect reliability line by computing the MAE or mean square error (MSE; Brochero et al., 2013).This dimensionless score allows one to reduce the measure of reliability to a scalar.In the case where the MAE is used, it can be easily interpreted as the average distance between forecasted frequencies and the observed frequencies over all quantiles of interest.This verification score is henceforth referred to as the mean absolute error of the reliability diagram, MaeRD.
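Reducing the diagram to the MaeRD scalar amounts to averaging the vertical distances to the 1:1 line. A sketch, assuming the reliability curve is available as paired forecasted and observed frequencies, one pair per quantile of interest (the function name is ours):

```python
def mae_rd(forecast_freqs, observed_freqs):
    """Mean absolute distance between a reliability curve and the
    perfect reliability (1:1) line.

    forecast_freqs / observed_freqs: paired points of the reliability
    diagram, both in [0, 1], one pair per quantile. Dimensionless;
    0 indicates perfect reliability.
    """
    pairs = list(zip(forecast_freqs, observed_freqs))
    return sum(abs(f - o) for f, o in pairs) / len(pairs)
```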
Additional information about reliability can be obtained from the spread-skill plot (SSP; Fortin et al., 2014). It compares the root-mean-square error (RMSE) with the square root of the average ensemble variance, which is a measure of the ensemble spread. The reliability is thus somehow decomposed into an accuracy part and a spread component. Ideally, the spread should match the RMSE.
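The two quantities compared in the SSP can be sketched as below. This is a simplified version (the function name is ours, and the small-ensemble correction factor discussed by Fortin et al., 2014, is omitted):

```python
import math

def spread_and_skill(forecasts, observations):
    """Components of a spread-skill comparison (simplified sketch).

    forecasts: one list of ensemble members per time step;
    observations: matching observed values. Returns (rmse, spread):
    RMSE of the ensemble mean, and the square root of the average
    ensemble variance. For a reliable ensemble they should match.
    """
    n = len(observations)
    sq_err, variances = 0.0, 0.0
    for members, obs in zip(forecasts, observations):
        m = len(members)
        mean = sum(members) / m
        sq_err += (mean - obs) ** 2
        variances += sum((x - mean) ** 2 for x in members) / (m - 1)
    return math.sqrt(sq_err / n), math.sqrt(variances / n)
```

A spread larger than the RMSE signals overdispersion; a smaller spread signals the overconfidence discussed for system E below.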

Results
Table 3 summarizes the specificities of the nine variants of the hydrometeorological forecasting framework according to the three "forecasting tools": multimodel, EnKF, and ensemble meteorological forcing. Each of these switches may be activated or not and is marked accordingly as on/off in the table.
The multimodel switch dictates whether the members issued by the 20 individual models are pooled together to create a single probabilistic forecast. In the case where the multimodel approach is not used, the models' outputs are kept individually and 20 distinct ensembles, one per model, are considered.
The EnKF switch indicates whether sequential data assimilation or the open-loop procedure is applied. When EnKF updating is used, an ensemble of 50 members is created from the 50 likely initial condition sets identified by the filter. Otherwise, a single set of state variable values determined from the simulation is provided to the forecasting step. Note that the H and H systems differ in the magnitude of the EnKF perturbations: H uses perturbations that aim to optimize the combined criterion, while H uses lower perturbations that are deemed more realistic.
Lastly, the meteorological forcing employed during the forecasting step can be either deterministic or probabilistic, using one randomly picked member or all 50 MEPS members.
These tools can be used alternatively or combined. For instance, if the EnKF and the meteorological ensemble forcing are used collectively, each of the 50 initial condition sets serves as a starting point for each of the 50 meteorological forecast members, creating a larger hydrometeorological ensemble that contains 2500 members.
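The cascading logic amounts to a Cartesian product of the member sets. A sketch, with `run_forecast` a hypothetical function that integrates one model run from one initial condition set under one meteorological member:

```python
def cascade_members(initial_condition_sets, met_members, run_forecast):
    """Cascade two ensemble techniques into one larger ensemble.

    Every initial condition set is paired with every meteorological
    member, so 50 EnKF sets x 50 MEPS members yield 2500 hydrological
    members; run_forecast(ic, met) -> hydrograph is a hypothetical
    single model integration.
    """
    return [run_forecast(ic, met)
            for ic in initial_condition_sets
            for met in met_members]
```

Adding the 20-model multimodel layer on top multiplies the pool again, which is how the largest systems in Table 3 are assembled.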
We chose to disregard more complex or "hybrid" cases in this study, where, for example, the final ensemble is composed of some models that benefit from EnKF state updating while others are used in an open-loop forecasting mode, as these setups do not add information about the role of the tools, increase the degrees of freedom of the system optimization, and would increase computational costs considerably.
The results for each of the nine systems applied to every catchment, lead time, and possibly every model are not systematically detailed and compared to each other. The following graphs are deemed sufficient to interpret the role and benefits of the system components for forecast quality. Additional graphs representing the resolution and reliability of each system are provided online for readers who are interested in a specific setup.
To picture an overview of the results, Fig. 2 represents the accuracy in terms of MCRPS (or MAE for system A, which is fully deterministic) and the MaeRD. For graphical convenience, the full distribution of performance according to various factors is not displayed, but only a single representative value. To reduce all of the results to a single scalar, the median performance has been considered. In the case where a multimodel approach is used, the median performance over the 20 catchments is displayed on the figure. Otherwise, when individual models are considered, the median performing model is first identified and then the median performance over the catchments is represented. This implies that the performance of the individual model systems (A, B, C, and D) may refer to a different model for each lead time.
The four radar plots situated on the top of the figure illustrate the MCRPS performance.As a reference, the center of the disk consists of the median MCRPS value of the climatology over the 20 catchments while the perimeter represents a perfect MCRPS equal to 0. The radius lines represent the nine systems described in Table 3 and are referred to by their corresponding letter.
The nine systems present varying performance but all decrease logically with lead time. System A, which is deterministic, undoubtedly performs worst for every lead time. It is challenged from the third day and is outperformed at medium range by the hydrological climatology. System B presents quite a similar behavior to system A but with a slower decrease of accuracy with lead time. System C may be considered competitive for shorter lead times but quickly loses its edge. These preliminary results tend to indicate that simpler HEPSs may not be appropriate to accurately forecast streamflows over a 9-day horizon. However, all versions, including the simplest ones (except system A), are more informative than the climatology for all lead times. Systems G, H, and H stand out from the others for all lead times.
The second row in Fig. 2 illustrates the reliability of each system. The center of the disk corresponds to a MaeRD equal to 0.5. System A is artificially placed at the center of the radar plot to denote that no reliability information is communicated since it is deterministic.
The reliability results share similarities with the accuracy assessment. Simpler systems face difficulties in providing a reliable forecast. Despite the use of the meteorological ensemble forcing, system B is far from providing the right dispersion. Systems C and D provide some information for short lead times but experience a substantial loss with increasing lead time. Once again, G, H, and H perform best.

Multimodel approach and structural uncertainty
To assess the gain related to the multimodel approach, Fig. 3 presents a comparison of the individual model MAE (A) and the MCRPS obtained by pooling all model outputs together (E). At this step, only the structural uncertainty is taken into account, as the meteorological forcing is kept deterministic and no initial condition uncertainty estimation is provided in either case. These systems are computationally cheap as they contain either 20 × 1 member or 20 members.
In Fig. 3, each box plot represents the distribution of performance (minimum, quantiles 0.25, 0.5, and 0.75, and maximum) of the 20 models, while the curve details the multimodel accuracy. On the x axis, the 20 test catchments are sorted according to increasing multimodel MCRPS for the first lead time. This allows one to notice that certain catchments exhibit a faster growing error.
The multimodel performs consistently better than the median performance of the models, but also better than any model in the large majority of cases. Exceptions can occasionally be observed for catchments 3 and 17, where only one or two models outperform the ensemble. However, the best performing models differ from one catchment to another, while the multimodel presents the advantage of being more robust than any of the models. This is explained by the varied individual model behaviors. Each model may grasp different specificities of the hydrograph by focusing more specifically on different (conceptual) hydrological processes. Consequently, the ensemble members, i.e., the models, have disparate errors. Whenever the mismatches between forecast members and observation are poorly correlated, the errors tend to cancel each other out.
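This error cancellation can be illustrated with an idealized toy example: two hypothetical models of opposite, perfectly anticorrelated bias each have a non-zero MAE, while their pooled mean is error-free (real models are of course only partially decorrelated, so the gain is smaller).

```python
obs = [10.0, 12.0, 11.0, 13.0]            # toy observed streamflows
model_a = [o + 1.0 for o in obs]          # consistently overestimates
model_b = [o - 1.0 for o in obs]          # consistently underestimates

def mae(sim):
    """Mean absolute error of a simulation against the toy observations."""
    return sum(abs(s - o) for s, o in zip(sim, obs)) / len(obs)

# The pooled (equal-weight) mean of the two biased models.
pooled = [(a + b) / 2.0 for a, b in zip(model_a, model_b)]
# mae(model_a) == mae(model_b) == 1.0, but mae(pooled) == 0.0:
# the anticorrelated errors cancel in the ensemble mean.
```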
Figure 4 presents the reliability of system E. Each curve refers to one of the 20 catchments. As mentioned, the structural uncertainty of the hydrological models is the only source explicitly taken into account, through the combination of the models.
System E is generally slightly overconfident for all lead times, and this trend becomes more apparent as the lead time increases. This is expected, as the meteorological forcing uncertainty increases with time while the deterministic forcing cannot convey that aspect. One can also notice that reliability depends on the catchment. For the first lead time, most of the catchments are close to reliability, while there are two outliers whose accuracy does not match the corresponding spread. In fact, these catchments exhibit a constant hydrological bias, partially explained by an inaccurate meteorological forcing, that is not captured by any of the models. Consequently, the models' errors are highly correlated, which prevents the members from behaving as an ensemble. This bias indicates that the aggregation of the other sources of uncertainty drives the system toward an inaccurate state.

Data assimilation and initial condition uncertainty
Figure 5 illustrates the increase in performance related to data assimilation by comparing systems E and G. System G improves upon E as it benefits from EnKF data assimilation to handle the initial condition uncertainty. The models' states are updated according to the last available observations, and an ensemble is created for each model based on the probabilistic estimation of the best initial conditions.
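A minimal sketch of a stochastic EnKF analysis step with perturbed observations is given below. The function signature and the linear observation operator `H` are illustrative assumptions standing in for the hydrological model state variables and streamflow observations used in the study:

```python
import numpy as np

def enkf_update(states, H, obs, obs_err_sd, rng):
    """One stochastic EnKF analysis step with perturbed observations.

    states : (n_members, n_state) ensemble of model state vectors
    H      : (n_obs, n_state) linear observation operator (assumption;
             in practice the model maps states to simulated streamflow)
    """
    n = states.shape[0]
    X = states - states.mean(axis=0)             # ensemble anomalies
    Y = X @ H.T                                  # anomalies in observation space
    Pyy = Y.T @ Y / (n - 1) + np.diag(np.atleast_1d(obs_err_sd) ** 2)
    Pxy = X.T @ Y / (n - 1)
    K = Pxy @ np.linalg.inv(Pyy)                 # Kalman gain
    # Perturb the observation for each member so the analysis ensemble
    # carries the correct posterior spread
    obs_pert = obs + rng.normal(0.0, obs_err_sd, size=(n, np.atleast_1d(obs).size))
    innovations = obs_pert - states @ H.T
    return states + innovations @ K.T
```

The update pulls each member toward the observation in proportion to the gain, which is exactly the "reinitialization" effect whose benefit decays with lead time in the results below.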
The EnKF provides a considerable gain over open-loop forecasts for all catchments and reduces the number of poorly performing catchments. This indicates that inaccuracies accumulated and stored in the state variables during the spin-up period, as the result of structural and forcing errors, can be significantly reduced by an adequate model reinitialization.
As the EnKF acts on model state variables right after the spin-up period, it is not surprising to see its efficiency decrease with lead time. This explains why the EnKF is beneficial for all lead times but its skill decreases faster than that of the open-loop scheme. Moreover, the EnKF provides a satisfactory initial condition distribution to minimize the error at the time the observation becomes available, but does not sample the posterior states so that they remain optimal when integrated through time.
Figure 6 details the reliability of system G. There is a considerable increase in spread in comparison to system E for shorter lead times, which goes beyond adequate dispersion and leads to a slightly overdispersed forecast for the first lead time. This was expected, as the EnKF was initially implemented to maximize individual model reliability for system G (see Sect. 2.3.2). As the EnKF also takes into account the parameter and structural uncertainties and is combined with a multimodel approach, there may be some redundancy in the error deciphering. The structural error, and the corresponding ensemble spread that should describe it, may be accounted for twice in that particular case. However, the overestimation of the ideal spread diminishes as the EnKF influence quickly fades away, and the system goes back toward better reliability for medium-range forecasts, and underdispersion from days 4-5.
To explain the rapid decrease in reliability, Fig. 7 displays the ensemble mean RMSE and the square root of the average ensemble variance. This individual spread-skill plot (one model and one catchment) is typical. The spread and the RMSE are close to a perfect match for the first day, indicating an appropriate dispersion, yet they diverge rapidly. The reliability deterioration of the system is twofold: the ensemble mean bias increases and the spread decreases. The loss of hydrological predictive skill is consistent with the fact that meteorological accuracy diminishes with increasing lead time. Concerning the second point, in most cases the initial conditions that the EnKF provides differ little from each other (by a few percent), indicating that the posterior distribution of each parameter is rather narrow (DeChant and Moradkhani, 2012; Abaza et al., 2015). These dissimilarities are not large enough to provoke a divergence in the behavior of the EnKF members during the forecasting step, as the models are resilient. The different initial conditions thus tend to merge toward a certain value, often close to the open-loop one, which may not be accurate. This behavior is attributed to the EnKF rather than to the model structures, as it has also been observed by others, for example with a 3-hour time step and a spatially distributed model in Abaza et al. (2014). Alternatives to the traditional EnKF (e.g., dual state-parameter estimation, additional direct perturbations of state variables) may contribute to maintaining the spread slightly longer, but they may not be consistent with the use of the multimodel, as they may imply taking the same source of uncertainty into account twice.
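The two quantities plotted in such a spread-skill diagram can be sketched as follows (illustrative implementation; the array shapes are assumptions):

```python
import numpy as np

def spread_skill(forecasts, observations):
    """Return (RMSE of the ensemble mean, sqrt of the mean ensemble variance).

    forecasts    : (n_times, n_members) ensemble forecasts at one lead time
    observations : (n_times,) matching observations
    For a reliable ensemble the two quantities should match on average.
    """
    mean_fc = forecasts.mean(axis=1)
    rmse = np.sqrt(np.mean((mean_fc - observations) ** 2))
    spread = np.sqrt(forecasts.var(axis=1, ddof=1).mean())
    return rmse, spread
```

When the RMSE curve rises above the spread curve with lead time, as in Fig. 7, the ensemble has become underdispersive.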

Contribution of the meteorological ensemble forcing
One step further in terms of system complexity is taken as the MEPS forcing is introduced. In this study, the meteorological forcing was not pre-processed, as the investigation of such techniques was deemed out of scope. One would expect a successful pre-processing to enhance the MEPS forecast, and these improvements could possibly be cascaded through the hydrological components down to the final hydrological forecast. Counter-intuitively, recent attempts demonstrated that no or only minor improvements were obtained in the hydrological forecast (Kang et al., 2010; Verkade et al., 2013; Zalachori et al., 2012; Roulin and Vannitsem, 2015). Figure 8 compares the MCRPS of systems G and H. They differ only in their meteorological forcing, as the latter uses the 50-member probabilistic forecast. The difference between them is negligible until the seventh or eighth day, when an improvement in performance can be noticed for some catchments. For these longer lead times, the probabilistic forcing is slightly more efficient in terms of MCRPS, but the main difference lies in the reliability (Fig. 9). In fact, reliability is substantially improved for the longest lead times when the meteorological uncertainty is provided to the system. The influence of the season is rather weak, since comparing these systems with respect to seasonality leads to the same conclusions (see Supplement).
The ECMWF MEPS dispersion grows with lead time and logically contributes to the HEPS spread accordingly. This is confirmed by comparing the spreads of systems G and H, as they decrease at different paces. While they are almost identical, with values of 0.58 and 0.59 mm day−1 for day 3, the spread of G drops to 0.45 mm day−1 by day 9, while the use of the MEPS maintains the spread at 0.59 mm day−1. This also indicates that the tool contributing the most to the HEPS dispersion is the EnKF, since the raw MEPS forcing is not able to fully balance the decrease in spread induced by the EnKF. Further improvement in reliability could perhaps be achieved through bias removal and a suitable pre-processing technique.
The main sources of uncertainty (hydrological model structure, initial conditions, and meteorological forcing) are cascaded through the different components of the forecasting system to provide a better forecast than any of the systems previously described. Yet the system reliability is not perfect, as the forecasts for day 1 and day 9 are, respectively, slightly overdispersive and underdispersive, in addition to presenting some sensitivity to the catchments. To realistically represent the uncertainty of the system, the spread should grow with lead time, as the future is more uncertain. This suggests that further improvement of this setup and particular application could be obtained with a more dispersed meteorological forcing.

Simplification of the framework
A potential drawback for operational use of such a system is that it is computationally expensive, as 50 000 members are exploited to build it. The efficiency of a simpler system is assessed in Fig. 10. Eight typical catchments are displayed in the subplots to illustrate the conclusion. The box plots represent the MCRPS distribution of the 20 model results from system D, which benefits from EnKF state updating and MEPS forcing. Each of these models can be considered a sub-ensemble of the large ensemble H, driven by a single model instead of a multimodel approach. This is more consistent with the EnKF individual optimization, which is carried out to aim for reliability for each model one at a time. The numbers at the top of the subplots identify the models that outperform the multimodel for each lead time.
In Fig. 10, sub-ensembles are more skillful than the hydrological climatology for all lead times, but rarely outperform the multimodel forecast. More precisely, the median-performing sub-ensemble is always poorer than the multimodel, and only the best models among the 20 occasionally exhibit a lower MCRPS. The individual models that outperform the multimodel frequently differ from one catchment to another and from one lead time to another. This emphasizes the difficulty of choosing a single model a priori, as half of the 20 models never behave better than the multimodel, and only models 1, 5, and 17 perform better than the multimodel for several catchments. Choosing a sub-ensemble doubtlessly reduces the system's computational requirements and eases operational implementation, but relying on a single model may be misleading or, at least, lower the expectations that one can have of the HEPS.
Figure 11 assesses the reliability of the same system with the MaeRD score. As in the previous plots, the box plots contain the 20 ensembles that correspond to the 20 models and are sorted by catchment with increasing multimodel MaeRD. Note that the MaeRD does not provide precise information about dispersion, only about the distance from perfect reliability. Nevertheless, an individual model ensemble may be either slightly over- or underdispersive for the first lead time, but is systematically underdispersive for longer lead times. In contrast, system H can be either over- or underdispersive depending on the catchment. Overdispersive forecasts, as for catchment 20, can be recognized because they tend to become more reliable at longer lead times.
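As a hedged sketch only, a MaeRD-type score can be thought of as the mean absolute deviation of the reliability diagram from the 1:1 line, which is 0 for a perfectly reliable ensemble and reaches 0.5 in the worst case, consistent with the radar-plot bounds of Fig. 2. The exact formulation used in the study is given in its methods section; the function below is an illustrative assumption:

```python
import numpy as np

def maerd(forecasts, observations, levels=np.linspace(0.05, 0.95, 19)):
    """Sketch of a MaeRD-type score: mean absolute deviation of the
    reliability diagram from the 1:1 line (0 = perfect, 0.5 = worst).

    forecasts    : (n_times, n_members), observations : (n_times,)
    The observed frequency at level tau is the fraction of observations
    falling below the tau-quantile of the ensemble.
    """
    deviations = []
    for tau in levels:
        q = np.quantile(forecasts, tau, axis=1)     # tau-quantile per forecast
        observed_freq = np.mean(observations <= q)
        deviations.append(abs(observed_freq - tau))
    return float(np.mean(deviations))
```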
For the first lead time, the best individual model ensembles may be competitive with the multimodel, but they are already less efficient from day 3 and drastically underdispersive by day 9. Even if the EnKF takes the structural uncertainty into account at t = 0, it loses its efficiency during the forecast. The information about the structural uncertainty deciphered by the EnKF, which is contained in the updated state sets, vanishes as the sets converge toward a common value. The multimodel approach, by its nature, takes over the role of the EnKF by dynamically preserving the required diversity.

Required EnKF perturbations
If the different sources of uncertainty along the hydrometeorological modeling chain are not explicitly accounted for by dedicated tools, the EnKF has to compensate for them. One way to achieve reliability is to increase the level of perturbation of the inputs. However, there is no obvious way to know by how much the input uncertainty should be overestimated to compensate for the other uncertainties (Zhang et al., 2015). Thus, to ensure hydrological reliability, one needs to perform a tedious calibration of the EnKF hyper-parameters to identify the required noise magnitude (Thiboult and Anctil, 2015a).
System H′ is identical to system H except that it relies on a different optimization of the EnKF. Instead of maximizing the combined criterion for individual models (see Sect. 2.3.2), the EnKF noise specification is set to lower values that are more consistent with realistic estimates of the uncertainty of the observed climatological and streamflow data at the catchment scale. Namely, precipitation is perturbed with a gamma law with a standard deviation of 25 % of the mean value, temperatures with a normal law with a 2 °C standard deviation, and streamflow observations with a normal law with a 10 % standard deviation.
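These perturbation laws can be sketched as follows. The function name and the handling of zero precipitation are illustrative assumptions; only the noise magnitudes come from the text:

```python
import numpy as np

def perturb_inputs(precip, temp, flow, rng):
    """Generate one perturbed member of the EnKF inputs/outputs using the
    'realistic' noise levels quoted in the text.

    precip : gamma-distributed, sd = 25 % of the value
    temp   : normal, sd = 2 degrees C
    flow   : normal, sd = 10 % of the value
    """
    precip = np.asarray(precip, dtype=float)
    # Gamma with mean = precip and sd = 0.25 * precip:
    # shape = (mean/sd)^2 = 16, scale = mean/shape
    shape = (1.0 / 0.25) ** 2
    p_pert = np.where(precip > 0,
                      rng.gamma(shape, np.maximum(precip, 1e-12) / shape),
                      0.0)
    t_pert = temp + rng.normal(0.0, 2.0, size=np.shape(temp))
    q_pert = flow + rng.normal(0.0, 0.10 * np.abs(flow), size=np.shape(flow))
    return p_pert, t_pert, q_pert
```

The gamma parameterization keeps perturbed precipitation non-negative while matching the prescribed mean and standard deviation.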
These noise magnitudes are therefore meant to describe the real uncertainties in forcing and observations in the EnKF, but no longer implicitly account for model error. Also, in a perfect-model environment, i.e., without any model error, it has been shown that the EnKF spread is representative of the ensemble mean error with respect to a truth integration (Houtekamer et al., 2009). In other words, the implementation of the EnKF with realistic input and output perturbations would correspond to a "perfect" EnKF implementation if the total uncertainty could be reduced to the input and output errors and if these were perfectly identified, i.e., in a perfectly controlled environment with negligible model structural error. Consequently, with system H′, the structural error is theoretically deciphered only through the multimodel pooling. This needs to be qualified, as it is practically hard to untangle the sources of uncertainty within the actual configuration of the EnKF, but it reduces the risk that the tools' effects overlap. By choosing these perturbations, the user also avoids a tedious EnKF tuning by screening for adequate perturbations (e.g., Moradkhani et al., 2005; Thiboult and Anctil, 2015a) and hence simplifies the system implementation.
In Fig. 12, system H′ improves reliability for the first lead times by reducing the overdispersion, with a noticeable decrease in ensemble spread from 0.72 to 0.57 mm day−1 for day 1 and without any degradation of the MCRPS (except for two catchments; all results are shown in additional figures online). System H′ maintains a more constant spread and reliability with increasing lead time, as the main sources of uncertainty are more accurately deciphered by their corresponding tools, leading to an overall better forecast.
Finally, it is unreasonable to assume that uncertainties are invariant from one catchment to another. The comparison of the MEPS forecasts and meteorological observations showed that the quality remains similar across the 20 catchments, indicating that the misfit probably originates either from the structures composing the multimodel ensemble, which may be maladapted to simulate these particular catchments, or from doubtful streamflow measurements. This leads us to think that further improvements in very uncertain environments hinge on a preliminary accurate quantification of the errors.
Considerable efforts were also devoted to linking performance with the estimated time of concentration, catchment size, and river slope, without any clear results: the authors were not able to relate any catchment feature to particular outcomes.

Conclusions
This work investigates the contribution of three different probabilistic tools commonly used in the hydrometeorological sciences. They are used conjointly and alternatively to identify their effect on the hydrological predictive ensemble and to untangle the sources of uncertainty that are aggregated in the outputs.
Each of these tools is dedicated to capturing a certain aspect of the total uncertainty. A multimodel approach is used to quantify and explicitly reduce the hydrological model error, the ensemble Kalman filter (EnKF) to decipher the uncertainty related to initial conditions, and the meteorological ensemble to account for the forcing uncertainty.
The experiment shows that important gains may be achieved in terms of accuracy and reliability by adequately using these techniques. Their actions differ substantially in their means and in their ranges.
The EnKF provides an accurate quantification of the initial error but fails to maintain reliability, as its effect fades out quickly after model spin-up. The information about the structural uncertainty deciphered by the EnKF, which is contained in the posterior distribution of the state variables, is not propagated through time integration during the forecast step. However, the EnKF remains a key component of the system, as it is the one that provides the most dispersion for the first lead times. This also indicates that the accumulation of past errors in the initial conditions is a dominant source of uncertainty.
The multimodel approach is able to partially compensate for the EnKF's decreasing action by taking over the structural uncertainty. Moreover, the combination of independent models improves accuracy, as their errors may cancel each other out. Lastly, the use of an ensemble meteorological forecast contributes to the reliability of medium-range forecasts by representing the meteorological forcing errors.
Their actions are complementary, as they decipher uncertainties of different natures and locations by acting at particular stages of the forecasting process. When combined, they need to be set according to the tools they are juxtaposed with, to prevent overlapping actions. This is particularly the case for the EnKF, which has an important degree of freedom in its implementation. It can be tuned with more realistic input perturbations when coupled with the multimodel ensemble, which facilitates its implementation by relaxing the constraints of optimal perturbation screening.
Possible further improvements may be achieved through multimodel state updating rather than individual model updating, i.e., by treating the initial conditions in a single step as a whole. Lastly, the meteorological forecast has been shown to be a little underdispersed for this application and could possibly be improved by applying suitable pre-processing techniques.
The Supplement related to this article is available online at doi:10.5194/hess-20-1809-2016-supplement.

Figure 1. Spatial distribution of the catchments.

Figure 2. Synthetic results of the nine systems, referred to by their code letters (see Table 3). The four top radar plots illustrate the MCRPS, with the center indicating the climatology reference performance and the perimeter representing a perfectly accurate simulation. The four bottom plots describe the measure of distance from perfect reliability, with the center indicating MaeRD = 0.5 while the perimeter corresponds to perfect reliability.

Figure 3. Comparison of the individual model MAE and the multimodel MCRPS, sorted by increasing multimodel MCRPS for the first day (system A vs. E).
Figure 4. Reliability of system E (one curve per catchment).

Figure 10. Comparative examples of the MCRPS on eight catchments for the EnKF individual models and the EnKF multimodel, both using MEPS forcing (system D vs. H).

Figure 11. Comparison of the deviation from perfect reliability of the EnKF individual models and the EnKF multimodel, both using MEPS forcing, sorted by increasing EnKF multimodel MaeRD for the first day (system D vs. H).
Figure 12.

Table 1. Main characteristics of the 20 catchments. Q and P are, respectively, the observed streamflow and precipitation.

Table 3. Description of the nine forecasting systems.