Evaluation of a probabilistic hydrometeorological forecast system

. Medium range hydrological forecasts in mesoscale catchments are only possible with the use of hydrological models driven by meteorological forecasts, which in particular contribute quantitative precipitation forecasts (QPF). QPFs are accompanied by large uncertainties, especially for longer lead times, which are propagated within the hydrometeorological model system. To deal with this limi-tation of predictability, a probabilistic forecasting system is tested, which is based on a hydrological-meteorological ensemble prediction system. The meteorological component of the system is the operational limited-area ensemble prediction system COSMO-LEPS that downscales the global ECMWF ensemble to a horizontal resolution of 10 km, while the hydrological component is based on the semi-distributed hydrological model PREVAH with a spatial resolution of 500 m. Earlier studies have mostly addressed the potential bene-ﬁts of hydrometeorological ensemble systems in short case studies. Here we present an analysis of hydrological ensemble hindcasts for two years (2005 and 2006). It is shown that the ensemble covers the uncertainty during different weather situations with appropriate spread. The ensemble also shows advantages over a corresponding deterministic forecast, even under consideration of an artiﬁcial spread.


Introduction
Recent flood events (e.g., the Alpine flood of August 2005, see Bezzola and Hegg, 2007;Jaun et al., 2008;Hohenegger et al., 2008) showed the vulnerability of the infrastructure we Correspondence to: S. Jaun (simon.jaun@wsl.ch) depend on. To reduce flood damages by taking appropriate precautions, long lead times (several days) in hydrological forecasting are needed, which is only possible with the use of medium range weather forecasts in a coupled hydrometeorological model chain. Especially the large uncertainties in precipitation forecasting affect the accuracy and reliability of the resulting hydrological forecast. As it would be imprudent to simply ignore these uncertainties (Pappenberger and Beven, 2006), they have to be forecasted too. For this purpose, probabilistic forecasts can be applied (Ehrendorfer, 1997).
In meteorology, probabilistic ensemble forecasts have been established for operational forecasts some time ago. Meteorological ensemble prediction systems (EPSs) are operationally available at the global scale from, for e.g., the US National Center for Environmental Predictions (NCEP, Toth and Kalnay, 1997), the European Centre for Medium Range Weather Forecasts (ECMWF, Molteni et al., 1996) and the Meteorological Center of Canada (MSC, Houtekamer et al., 1996). From these ensemble forecasts, a measure of the forecast uncertainty can be gained in terms of the ensemble spread. The spread of the ensemble members represents mainly the initialization uncertainty of the meteorological model, the main source of uncertainty for large-scale atmospheric circulation patterns in forecasts up to about five days (Buizza, 2003).
A number of case studies were conducted, which directly use the output from a global scale EPS to drive hydrological models (e.g., Bartholmes and Todini, 2005;Pappenberger et al., 2005;Roulin and Vannitsem, 2005;Siccardi et al., 2005;Rousset et al., 2007;Komma et al., 2007). While demonstrating promising results, some of the case studies suffer from biases related to the coarse resolution of the meteorological model ( Table 1) upstream of the Rheinfelden gauge (published in Jaun et al., 2008). models are not accurate at modeling local weather, because local sub-grid scale features and dynamics are not resolved, especially in regions with complex topography. To overcome this limitation, global-scale EPS forecasts can be dynamically downscaled by use of a limited area numerical weather model (e.g., the COSMO model, nested into the ECMWF ensemble as described in the following section). Mass et al. (2002) showed that such a refinement from a grid spacing of 36 km to a grid spacing of 12 km results in better forecasts, as it allows the definition of the major topographical features of the region and their corresponding atmospheric circulation. A dynamical downscaling of the global meteorological forecasts is expensive in terms of computing resources, and thus it is not feasible to downscale the full ensemble for everyday operational applications. Therefore, the ensemble size is normally reduced and only a subset of the ensemble members is used . This approach has successfully been used for several hydrological case studies (e.g., Verbunt et al., 2007;Jaun et al., 2008). In contrast, statistical downscaling approaches, like the use of meteorological analogues, rely heavily on the availability of long historical data sets and do not appear to be suitable to provide useful information about the future small-scale streamflow by itself (Diomede et al., 2008), especially in the case of extreme events. The aforementioned publications on evaluation of hydrological ensemble forecast systems have been limited to flood case studies and/or single catchments. Only recently have larger data sets become available. Olsson and Lindström (2007) and Bartholmes et al. (2009) provide analysis of extended time series over large areas (Sweden and Europe, respectively), both of which use direct output from the global scale ECWMF EPS to drive the hydrological model. An extensive list of recent studies applying ensemble approaches for runoff forecasts can be found in Cloke and Pappenberger (2009), together with a review of ensemble techniques. This paper investigates the applicability of a highresolution meteorological-hydrological ensemble system, using the dynamical downscaling approach for two continuous years (2005 and 2006). The study area consists of the upper Rhine basin, encompassing an overall area of 34 550 km 2 . To account for inhomogeneities in topography, atmospheric processes and runoff regimes, the domain is divided into 23 subcatchments with a typical size of 900 to 1600 km 2 (cf. Fig. 1 and Table 1), based on the setup described in Verbunt et al. (2006). In addition to the analysis of selected catchments, the full extent of the study area is considered.
Besides the input uncertainty (uncertainty from the meteorological data used to drive the hydrological model), which is addressed by the use of the meteorological ensemble, two additional components affect the output uncertainty of a hydrological model: the initialization uncertainty (i.e., the initial state of the model) and the model uncertainty itself (uncertainty from parameters and the conceptualization, Vrugt et al., 2005). In this work, the main focus remains on the input uncertainty, as forecasted meteorological data is regarded as the most uncertain component (Todini, 2004).

Methods
The meteorologic-hydrologic model chain used is the same as described in Jaun et al. (2008), where it was used for a case study of the extreme event in August 2005. For the meteorological component, either an ensemble forecast system or a deterministic forecast system, providing a single model realization, is applied. Forecast setups and strategies were adopted and applied to the operationally available meteorological forecasts for the years 2005 and 2006.

Deterministic and probabilistic meteorological forecast systems
The deterministic meteorological forecasts are provided by the operational weather forecast model COSMO-7. This model is the MeteoSwiss implementation of the COSMO model (COnsortium for Small-scale Modeling, Steppeler et al., 2003), which is nested in the global deterministic forecast model from ECMWF. COSMO-7 uses a horizontal gridspacing of 0.0625 degrees (7 km) and 45 model levels. Six meteorological variables (temperature, precipitation, humidity, wind, sunshine duration derived from cloud cover, global radiation) are further downscaled to 500 m grid-spacing (using bilinear interpolation, temperature adjusted according to elevation by adopting a constant lapse rate of 0.65 • C/100 m), to meet the grid size requirements of the hydrological model. The global meteorological ensemble is provided by the operational global atmospheric EPS of ECMWF and consists of 51 members. The generation of this ensemble is based on singular vectors to create optimally perturbed initial states (Buizza and Palmer, 1995). This global ensemble is downscaled by the limited-area EPS COSMO-LEPS Montani et al., 2003). Due to computational constraints, the operational COSMO-LEPS refines a subsample of 10 (16 from February 2006) representative global ensemble members only, selected by a cluster analysis . Prior to the clustering analysis, the preceding global EPS simulation from the previous day is combined with the actual forecast. Hence the clustering is applied to a recombined ensemble consisting of 102 members. This procedure, using "old" forecast information, generally results in a widening of the spread of the reduced ensemble. The clustering identifies similar circulation patterns based on the analysis of wind, geopotential height and humidity on three pressure levels (500 hPa, 700 hPa, 850 hPa) for two lead times (96 h, 120 h). From the resulting 10 (16) clusters, the respective representative cluster members (RMs) are selected and dynamically downscaled over a domain covering central and southern Europe. These ensemble members are run on a rotated spherical grid with a horizontal grid-spacing of 0.09 • ×0.09 • , equivalent to about 10×10 km 2 , and with 32 (40 from February 2006) model levels. The meteorological variables of the resulting high-resolution meteorological ensemble are treated analogous to the COSMO-7 variables. The cluster sizes can optionally be used to weight the representative members of COSMO-LEPS.

The hydrological model
The semi-distributed hydrological model PREVAH (Viviroli et al., 2009) is then driven by COSMO-LEPS with hourly time steps. PREVAH (Preciptation Runoff EVApotranspiration Hydrotope) uses hydrologic response units (HRUs, Flügel, 1997) and the runoff generation module is based on the conception of the HBV-model (Bergström and Forsman, 1973;Lindström et al., 1997), adapted to a spatially distributed application. Further information on the model physics, structure, interpolation methods and parameterisations can be found in Gurtz et al. (1999), Gurtz et al. (2003) and Zappa et al. (2003). The initial conditions of the hydrological model are obtained from a continuous reference simulation driven by meteorological observations, subsequently referred to as HREF. No additional perturbations were realised at the level of the hydrological model, e.g., consideration of initialization uncertainties.
The use of the deterministic meteorological forecast variables as input to PREVAH results in a deterministic hydrological forecast subsequently referred to as HDET. The coupling of PREVAH with COSMO-LEPS provides probabilistic hydrological forecasts in terms of a hydrological EPS (HEPS).

Set-up of simulations
Hindcasts were conducted daily for both years, 2005 and 2006, and for the deterministic forecasts as well as the ensemble set-up. The deterministic forecasts from COSMO-7 provide a forecast range of 72 h (3 days) and are initialized at 00:00 UTC.
The meteorological EPS forecasts are initialized at 12:00 UTC and span 132 h (120 h until June 2005). The first 12 h are not considered for the hydrological coupling, which is initialized at 00:00 UTC, resulting in a forecast range of 120 h (108 h) for HEPS. This cutoff considers the temporal availability of the operational ensemble forecasts and eases comparison to the deterministic forecast. To ensure consistency for the differing HEPS forecast ranges over the considered time period, the analysis of HEPS was restricted to 96 h (4 days).
For the quantitative analysis we focus on daily runoff values. The hindcasts are chained for the respective forecast ranges (0-24 h, 24-48 h, 48-72 h, 72-96 h), resulting in four (three for HDET) daily time series, which are accounted for separately and compared to each other. In the case of HEPS, we therefore get four time series consisting of daily ensembles derived from the summed up hourly values of the respective individual ensemble member within the forecast ranges. All calculations, e.g., the estimation of the ensemble interquartile range (IQR) are based on these daily values. Examples of chained daily runoff hindecasts are shown in Fig. 2a to Fig. 3.

Validation methodology
To evaluate the skill of the forecasts, score measures are applied. These are complemented by the evaluation of general ensemble properties to verify the statistical appropriateness of the probabilistic forecast (Laio and Tamea, 2007).
Yearly discharges were estimated for all catchments, in order to assess the representation of runoff volumes by the model chain. To test the general performance of the ensemble, a method used by the ECMWF for meteorological EPS verification was adopted (Lalaurette et al., 2005). Assuming a perfect probabilistic forecast with symmetric error quantiles, the following spread skill relation should be found: the absolute difference between the ensemble median and the verifying simulation should exceed half the interquartile range (referred to as spread) in exactly 50% of the cases. Therefore, for a theoretical perfect probabilistic forecast, averaging over spread categories should result in a diagonal relationship. Evaluating HEPS, deviations from this diagonal relationship will show whether the ensemble produces too high/low spread to cover the associated ensemble median error. As the assumption that error quantiles are symmetrical is not met in this application, positive and negative errors are accounted separately (cf. Lalaurette et al., 2005). Other methods for spread-skill evaluations, e.g. conducted by Scherrer et al. (2004), are based on the use of a skill score, e.g. the ensemble RMSE or the Brier skill score, which is compared to a measure of spread. The resulting relationship is then interpreted with respect to the relationship which results from a "perfect" forecast (e.g. from a toy model).
In addition to the spread-skill evaluation, the rank histogram (Anderson, 1996) of the probabilistic runoff forecast is evaluated, to check whether the ensembles include the observations being predicted as equiprobable members (consistency condition). If rank uniformity is not met, this can reveal deficiencies in ensemble calibration, or reliability (Wilks, 2006). In difference to the spread-skill relation, the rank histogram allows a distinction between bias and under-/overdispersion, but does not account for relative ensemble error.
To perform a probabilistic verification of the time series within the time window considered, we use the ranked probability skill score (RPSS) described in Wilks (2006). This score is widely used for the evaluation of probabilistic forecasts in meteorological sciences (e.g., Weigel et al., 2007a;Ahrens and Walser, 2008).The RPSS is based on the ranked probability score (RPS). The RPS is a squared measure that compares the cumulative density function of a probabilistic forecast with that of the corresponding observation over a given number of discrete probability categories. Thus the RPS measures how well the probabilistic forecast predicts the category in which the observation is found. For a given forecast-observation pair, the RPS is defined as where K is the number of forecast categories, py j is the predicted probability in forecast category j , and po j the observation in category j (0=no, 1=yes). The RPS is bounded by zero and K−1. While a perfect forecast would result in RPS=0, less accurate forecasts receive higher sores. By averaging the RPS over a number of forecast-observation pairs, these can be jointly evaluated, resulting in the mean RPS . The RPSS is finally obtained by relating the RPS of the forecast to the RPS ref of a reference forecast according to The RPSS can take values in the range −∞≤RPSS≤1.
Whereas RPS>0 indicates an improvement over the reference forecast, a forecast with RPSS≤0 lacks skill with respect to the reference forecast.
In this paper we chose climatological quantiles derived from 10 years of runoff data as catchment specific thresholds for the RPSS categories. Apart from the quartiles (0.25, 0.5, 0.75) we additionally selected the 0.95 quantile to better resolve higher runoff occurrences. For RPS ref we use the climatological probabilities of the mentioned quantiles.
For the two years of forecasts considered (2005 and 2006), we are faced with different ensemble sizes (10 and 16, respectively). From Müller et al. (2005), it is known that the RPSS is negatively biased for ensemble prediction systems with small ensemble sizes. The influence of the differing ensemble sizes is assessed by the additional use of a debiased version of the RPSS (RPSSd, after Weigel et al., 2007a).
Other probabilistic evaluation methods such as the Brier skill score (BSS, the probabilistic equivalent to the mean squared error), the reliability diagram or the relative operating characteristic (ROC), as described in Wilks (2006), are not considered as they require a single evaluation threshold, whereas the categorical evaluation by the RPSS allows a better judgement of the evolution of the hydrograph over an extended time period.
For evaluation of the deterministic hydrological forecast HDET, the Nash-Sutcliffe coefficient (E) (Nash and Sutcliffe, 1970) is applied, which is widely used for hydrological verification purposes (Legates and McCabe, 1999). The usual formulation of E is given by where y t and o t denote the forecasted and observed time series, respectively, andō the mean of the observations over the forecast period. E can take values in the range −∞≤E≤1, with E>0 indicating an improvement over a forecast with the observed mean discharge, while E≤0 shows no additional skill. E can also be interpreted as the coefficient of determination, representing the fraction of variability in o t that is contained in y t .
Direct comparison of the performance of HDET and HEPS is difficult to achieve. An evaluation of HDET by means of the RPSSd is not carried out, as the RPSSd does not directly quantify whether a specific forecast is more skillful, but rather is a measure for the gain in potentially usable ensemble information (Weigel et al., 2007b). The RPSS in turn suffers most from its negative bias in the deterministic case ("one member ensemble"). An evaluation based on a deterministic skill score like E implies the conversion of the probabilistic forecast into a deterministic one, e.g., by use of the ensemble median. As such a conversion implies a loss of valuable forecast information and can bias the ensemble performance in specific cases, it should not be applied for the sake of a simplified forecast interpretation (Wilks, 2006). Nevertheless we carry out a comparative deterministic evaluation of the HEPS median against HDET. If this evaluation reveals that the HEPS median performs equal or better than HDET over an extended time period, a first indication of an added value of the ensemble forecast system is given.
To further challenge the ensemble forecast system, it was tested against an artificial ensemble (HART) by means of the RPSS. HART is based on the climatological properties of HEPS, assuming a linear correlation between the ensemble median and the individually sorted ensemble members (separately for the different catchments and lead times). That means, for a specific catchment and lead time, as an example, the daily forecasts are sorted by runoff values in ascending order. Correlating the lowest member (second lowest,..., highest) of all daily forecasts with the according HEPS median results in a linear relationship. Applying these linear correlations, the daily artificial members are then constructed by use of the HEPS median. Consequently, spread and range of the artificial ensemble mainly depend on the actual runoff quantity of the HEPS median.
This evaluation reveals whether the ensemble forecast performance is better than a deterministic forecast with climatological ensemble spread. If HEPS shows no advantage over HART, the value of the ensemble forecast is at least questionable with regard to a deterministic forecast system that considers some sort of uncertainty information. Please note that this evaluation only reveals the minimal added value, as the median of the ensemble is used as base for HART. Consequently, HART contains ensemble information that is not available in the case of a deterministic forecast.
If the HEPS median outperforms HDET (in terms of E) and HEPS outperforms HART (in terms of RPSS), we can confidently state an added value of the ensemble forecast system, provided that the probabilistic evaluation of HEPS, including the general ensemble performance, shows positive results.
Apart from direct evaluations against runoff observations, we substitute the runoff observations by the reference simu-lation HREF to eliminate the additional uncertainties introduced by the hydrological model.

Analysis of a selected catchment
Figures 2a to c allow us to discuss important features of probabilistic hydrologic forecasts. Daily hindcasts (using HEPS) were conducted for the years 2005 and 2006, which were then chained for selected forecast ranges. As an example, graphs of chained daily hindcasts for the ranges 72-96 (Fig. 2a), 48-72 (Fig. 2b) and 24-48 (Fig. 2c) hours are shown at the Brugg (Aare) gauge for 2005. The ensemble IQR generally encompasses the reference simulation driven by observed meteorological inputs and also the measured runoff. Also, the ensemble range is much larger during flood peaks, representing the additional uncertainties during unstable weather situations.
Comparisons against the area-mean precipitation of the ensemble members (used as input for the hydrologic model, not shown) show the expected reduction in variability and amplitude due to hydrological processes. Note that there does not appear to be a problem with an overprediction of flood events (e.g., an event with a return period of two years is forecasted with some probability a few times in 2005) or a constantly large spread. For decreasing lead times, the HEPS full range and HEPS interquartile range decrease gradually and constrict around the reference simulation as expected. Figure 3 shows the chained daily hindcasts with HART for the same catchment as in Fig. 2a for the forecast lead time of 72-96 h. Compared to the corresponding Fig. 2a, peaks in spread and range are less distinctive. While the uncertainty seems to be well covered during runoff peaks, it remains constantly high during recession periods, as the simple synthetic ensemble construction cannot distinguish between inclining and declining phases of the runoff peaks. Figure 4 shows the yearly discharges for two example catchments (C12, C21) and catchment C23, which captures the out-flow from all catchments and can therefore be used as an indicator for the entire study area. Yearly discharge sums of daily range, IQR, and median for different lead times (0-24 h, 24-48 h, 48-72 h, 72-96 h) are compared to the respective values of HDET, HREF and measured runoff values. Figure 4 summarizes Fig. 2a to Fig. 2c and allows for simple and straightforward comparison between catchments and lead times.

Evaluation of yearly discharge
While HEPS IQR nicely encompasses HREF and HDET, the HEPS IQR does not contain the observations ideally. The yearly bias in volume from HREF to OBS visible in Fig. 4 (c) HDET is additionally marked in black. HQ 2 marks the two year recurrence period.  overall bias (median over catchments C1 to C23) is −6%, which is reflected in the bias of catchment C23 (−6%). HDET and the HEPS median show very similar performance when compared to HREF, with a slightly pronounced tendency of HDET to overforecast for the lead time 48-72 h.
Comparing 2005 and 2006 gives qualitatively very similar results, differing mainly in the observed yearly runoff sum (median increase of 3.5% from 2005 to 2006), changes in the bias of HREF to OBS (reflecting the skill of the hydrological model itself), a wider total range of the 16 member HEPS compared to the 10 member HEPS, but very similar IQR. As the relative positions of HREF, HDET and HEPS do not change between 2005 and 2006, the change of the HREF bias should not be neglected for inter-annual comparison of the forecast performance (compensation/amplification through overforecasting when compared to measured runoff).
For catchment C23, spread of HEPS shows a distinct reduction in ensemble spread and error for all lead times (2005 and 2006), which is visible in Fig. 4 as well as after normalization by measured runoff (not shown). On the one hand this indicates the overall decrease in uncertainty for forecasts over larger areas (i.e., differences in forecasts for small catchments even out over larger areas). On the other hand this is a result of the increase in the time of concentration. In the case of a forecasted large scale event (or a local event in the northern part of Switzerland), the contributing catchment area for C23 grows quickly and shows up in the (small) ensemble spread. A forecast of a local event in an alpine catchment will not be reflected in the ensemble spread of C23 (for short lead times due to the time of concentration, for longer lead times due to averaging). However, this is the real forecast situation and as we treat all forecasts the same way, none of them should benefit.

Verification of general ensemble properties
The scatter diagram for runoff comparing the ensemble spread and absolute error of the HEPS median is given in Fig. 5a for observed runoff and in Fig. 5b (Lalaurette et al., 2005), the coherence between ensemble spread and error exists for both choices of runoff reference (HREF and OBS). Large day-to- day variations occur within the shown relation, but the statistical relationship that should exist, when gathering a large sample of cases with similar spread (81 spread categories, each containing 100 daily values from the 23 considered catchments), holds and the distribution of errors within each spread category is centered around the diagonal. This evaluation shows that additional uncertainty is reasonably represented by an increase in spread (also cf. Fig. 2a).
A closer examination reveals that Fig. 5a shows a tendency to underestimate spread in the forecasts (especially for small and negative median errors). This is clearly reduced in Fig. 5b, reflecting the bias of HREF with regard to observed runoff (cf. Fig. 4). However, even with HREF, large negative errors are still not met by a sufficiently wide ensemble spread for longer lead times. This underestimation of spread disappears with shorter lead times and results from a single event (August 2005). Excluding the period of this event also removes the underestimation of spread for large negative errors. The analysis for the year 2006 yields very similar results for the lead time of 72-96 h, although spread is overestimated for large negative errors with HREF. The overestimation in spread remains with shorter lead times, which might be a result of the changed ensemble configuration.
Note that the considered period for evaluation of general ensemble properties is to short to allow for robust statistics. As shown above, features of the relation can be dominated by single events or catchments. Therefore the spread-skill relation should not be interpreted on its own. Figure 6a shows that the ensemble forecasts for the year 2005 do not satisfy the consistency condition (i.e., the ensembles do not reflect equiprobability of observations within their distributions). This is also the case for the year 2006 and the following remarks apply to both years considered. The U-shaped rank histogram indicates an under-dispersion (over-confidence) of the ensembles, as the observations are too frequently falling into the low and high ranks, resulting in an overpopulation of the extreme ranks. Evaluation of the rank histogram of HREF and HEPS (Fig. 6b) reveals that the overpopulation of high ranks is mostly a result of the uncertainty introduced by the hydrological model. This bias is also visible in Fig. 4 for C21 and C23. Still, the overpopulation of low ranks remains. We argue, that this is not only due to the slight overestimation of HEPS over HREF (Fig. 4), but also a result of the characteristics of the verified variable. We are considering a variable that is non-normally distributed, as the hydrograph and its evolution is bound by the baseflow. Runoff forecasts that lead to an increase in runoff are therefore less constrained, which explains the overforecasting bias in Fig. 6b. This hypothesis is supported by the skewness of HEPS towards lower runoff values in Fig. 4. For shorter lead times than those shown (72-96 h), the rank histograms for HREF and observations show an increase in frequencies at the highest rank. Indeed, the spread of the ensemble narrows with reduced lead times. The narrowing in spread mostly represents the increase in predictability. The indicated overconfidence of the ensemble for shorter lead times is probably an effect of the ensemble generation method, focusing on optimal spread for midrange forecasts. It should be noted that the relative error of the ensemble is not accounted for by the rank histogram. While the ensemble with a lead time of 0-24 h is actually overconfident, this has little effect for practical application, since the forecasted ensemble runoff for all members is almost identical to HREF and shows small relative errors. Indeed the associated scatter diagram (not shown) for error and spread shows that values group towards the left center for the shorter lead times.
The rank histogram for HART with HREF (not shown) features two superimposed characteristics: on the one hand, we see the same overpopulation of low ranks as for HEPS, on the other hand the central ranks are overpopulated too. This indicates an additional overestimation in spread for certain classes of runoff occurrences. Reviewing the chained plots for HART (cf. Fig. 3), this can be traced back to constantly high spread for median runoff occurrences as already mentioned in section 3.1. While spread for HART in Fig. 3 seems sufficiently wide to cover the peaks, although significantly narrower than HEPS spread, the combined spreadskill evaluation for all catchments indicates an underestimation of HART spread for high runoff occurrences.

Verification of time series
For further performance evaluation, the ranked probability skill scores (RPSS, 1: perfect skill, 0: no skill) of the ensemble hindcasts against the reference simulation and observed runoff were calculated, as described in Wilks (2006), separately for different lead times. This allows the temporal evolution of the hydrograph to be considered.
As the ensemble size differs for the two years considered (2005 and 2006), the influence of the change in ensemble size is assessed by an inter-comparison of the two years. This inter-comparison is restricted to the evaluation against HREF in order to exclude the varying performance of the hydrological model itself (changing biases of HREF from OBS as stated in Sect. 3.2). Furthermore the debiased ranked probability skill score (RPSSd) is used in addition to the RPSS. To test the direct effect of the change in ensemble size, the RPSS was calculated without the debiasing separately for the two years considered. Again, resulting score differences are minor and cannot be clearly associated to the increase in ensemble size for the year 2006. Considering the minor score differences and the fact that a separate evaluation of the two years yields the same conclusions, we assume that it is valid to evaluate the two years jointly.
The debiasing of the RPSS is only used for the inter-annual comparison but not for the evaluation of the full time period, as in the latter case we are primarily interested in the actual skill of the model system and not the theoretically obtainable skill score with a perfectly calibrated ensemble (represented by the RPSSd, Weigel et al., 2007b). In Table 2 we show results for the forecast system over the period [2005][2006]. The skill scores show differing results depending on the catchments, but in general, the RPSS is decreasing with increasing lead time (cf. Table 2). The decrease of RPSS is consistent with the results of the yearly analysis regarding HEPS range and HEPS IQR for different lead times and quantifies the additional uncertainty that is associated with longer lead times.
In contrast to, e.g., Olsson and Lindström (2007), where the use of the global EPS necessitates an additional bias correction, we find high score values using HREF as reference. This shows the suitability of the substitution of observed meteorological variables with forecasted ones and therefore the applicability of the coupled forecast system. Evaluation against observed runoff (OBS) results in RPSS values which clearly indicate improved skill of the forecast system over a climatological forecast. Nevertheless, the difference in skill between the evaluation against HREF and OBS leaves room for further improvements of the forecast system. While the uncertainty introduced with the use of meteorological forecasts is well covered by the ensemble approach, the score differences shown represent the bias of HREF against OBS, i.e., the model uncertainty of the hydrological model and the uncertainty in its initial state, which arises from uncertainties in the observed meteorological input (e.g., from interpolation uncertainties, cf. Ahrens and Jaun, 2007). Note that HREF evolves freely throughout the evaluation period and is not nudged against measured runoff. As HREF is used to generate the initial conditions for the forecast runs, a nudging against measured runoff would lead to a substantial increase in skill score values for HEPS relative to OBS.
While the median of an ensemble should not be applied for evaluation of single events, it was used for a deterministic evaluation of two continuous years (2005 and 2006). The comparison of the HEPS median to HDET by means of the deterministic skill score E is performed with regard to HREF. It reveals almost identical numbers for the shortest lead time (cf. Table 2). For longer lead times, the skill of the HEPS median decreases less rapidly than that of HDET. While the result of this evaluation should not be interpreted on its own (as the probabilistic information of the ensemble is lost), it gives a clear indication that the ensemble does not fall behind HDET in performance, even though the underlying meteorological model features a coarser numerical grid. This may not remain true for evaluations based on a shorter (hourly) timescale within the first 24 h.
The results of the evaluation of HEPS against its climatological correspondent HART are shown in Table 2. It reveals that HEPS performs better in terms of RPSS. For the longest lead time (72-96 h), HEPS shows a slight decrease in advantage over HART. This is consistent with the expectation that the relative performance of a climatological forecast should increase with longer lead times.
Taking into account the positive results of the probabilistic evaluation and the verification of the general ensemble performance, we can confidently state an added value of HEPS with regard to HDET: apart from better "deterministic results" for longer lead times, the ensemble is better than its own median forecast with climatological spread information and therefore shows the importance of temporal variability in the ensemble range and spread.
Weighting of the ensembles with the cluster sizes shows only marginal effects for all applied evaluation methods and can be neglected for the analysis of the probabilistic hydrological forecast series. This is consistent with the findings for precipitation verification by Marsigli et al. (2001), but it should be noted that weighting can improve the skill of the hydrological ensemble for specific cases and higher temporal resolution as showed in Jaun et al. (2008).

Conclusions
Using two years (2005 and 2006) of continuous daily hindcasts for the upper Rhine catchment, we find a good hindcast performance of the applied hydrometeorological forecast system. Statistical analysis shows that general ensemble requirements are reasonably met. The high RPSS values resulting from the evaluation against HREF demonstrate the applicability of the proposed coupled forecast system. It was shown that the chosen approach works for a wide band of weather conditions and that the ensemble spread represents the additional uncertainties during weather situations with low predictability. As expected, the IQR and the full spread of the ensemble increase systematically with lead time. Compared to the deterministic forecast HDET, the HEPS median shows higher skill for longer lead times. In addition, the evaluation of HEPS against its climatological correspondent HART shows the importance of temporal variability in the ensemble range and spread. As the ensemble forecast is better in both aspects, it is safe to assume that the use of the ensembles is superior to the deterministic alternative, especially with regard to the additional provision of probabilistic forecast information.
Although the evaluation against HREF reveals that the input uncertainty, introduced by the use of meteorological weather predictions, is well covered by the ensemble approach, the forecast system would probably profit from an additional ensemble calibration (Hamill et al., 2004). The reforecast series required for such a calibration recently became available for COSMO-LEPS (Fundel et al., 2009).
The evaluation against observed runoff shows further potential for improvements of the model system. Consequently, future work is planned to include the remaining uncertainties as adopted by, e.g., Pappenberger et al. (2005). Special attention will be payed to the initialization uncertainty of the hydrological component of the forecast system. Efforts towards an operational application of probabilistic forecasts, using similar setups as the described forecast system, are ongoing and were first demonstrated quasi-operationally within the framework of MAP D-PHASE (Zappa et al., 2008).