An intercomparison of approaches for improving operational seasonal streamflow forecasts

For much of the last century, forecasting centers around the world have offered seasonal streamflow predictions to support water management. Recent work suggests that the two major avenues to advance seasonal predictability are improvements in the estimation of initial hydrologic conditions (IHCs) and the incorporation of climate information. This study investigates the marginal benefits of a variety of methods using IHCs and/or climate information, focusing on seasonal water supply forecasts (WSFs) in five case study watersheds located in the US Pacific Northwest region. We specify two benchmark methods that mimic standard operational approaches – statistical regression against IHCs and model-based ensemble streamflow prediction (ESP) – and then systematically intercompare WSFs across a range of lead times. Additional methods include (i) statistical techniques using climate information either from standard indices or from climate reanalysis variables and (ii) several hybrid/hierarchical approaches harnessing both land surface and climate predictability. In basins where atmospheric teleconnection signals are strong, and when watershed predictability is low, climate information alone provides considerable improvements. For those basins showing weak teleconnections, custom predictors from reanalysis fields were more effective in forecast skill than standard climate indices. ESP predictions tended to have high correlation skill but greater bias compared to other methods, and climate predictors failed to substantially improve these deficiencies within a trace weighting framework. Lower complexity techniques were competitive with more complex methods, and the hierarchical expert regression approach introduced here (hierarchical ensemble streamflow prediction – HESP) provided a robust alternative for skillful and reliable water supply forecasts at all initialization times. Three key findings from this effort are (1) objective approaches supporting methodologically consistent hindcasts open the door to a broad range of beneficial forecasting strategies; (2) the use of climate predictors can add to the seasonal forecast skill available from IHCs; and (3) sample size limitations must be handled rigorously to avoid over-trained forecast solutions. Overall, the results suggest that despite a rich, long heritage of operational use, there remain a number of compelling opportunities to improve the skill and value of seasonal streamflow predictions.

Abstract.For much of the last century, forecasting centers around the world have offered seasonal streamflow predictions to support water management.Recent work suggests that the two major avenues to advance seasonal predictability are improvements in the estimation of initial hydrologic conditions (IHCs) and the incorporation of climate information.This study investigates the marginal benefits of a variety of methods using IHCs and/or climate information, focusing on seasonal water supply forecasts (WSFs) in five case study watersheds located in the US Pacific Northwest region.We specify two benchmark methods that mimic standard operational approaches -statistical regression against IHCs and model-based ensemble streamflow prediction (ESP) -and then systematically intercompare WSFs across a range of lead times.Additional methods include (i) statistical techniques using climate information either from standard indices or from climate reanalysis variables and (ii) several hybrid/hierarchical approaches harnessing both land surface and climate predictability.In basins where atmospheric teleconnection signals are strong, and when watershed predictability is low, climate information alone provides considerable improvements.For those basins showing weak teleconnections, custom predictors from reanalysis fields were more effective in forecast skill than standard climate indices.ESP predictions tended to have high correlation skill but greater bias compared to other methods, and climate predic-tors failed to substantially improve these deficiencies within a trace weighting framework.Lower complexity techniques were competitive with more complex methods, and the hierarchical expert regression approach introduced here (hierarchical ensemble streamflow prediction -HESP) provided a robust alternative for skillful and reliable water supply forecasts at all initialization times.Three key findings from this effort are (1) objective approaches supporting methodologically consistent hindcasts open the door to a broad range of beneficial forecasting strategies; (2) the use of climate predictors can add to the seasonal forecast skill available from IHCs; and (3) sample size limitations must be handled rigorously to avoid over-trained forecast solutions.Overall, the results suggest that despite a rich, long heritage of operational use, there remain a number of compelling opportunities to improve the skill and value of seasonal streamflow predictions.

Introduction
The operational hydrology community has long grappled with the challenge of producing skillful seasonal streamflow forecasts to support water supply operations and planning.Proactive water management has become critical for many regions in the world that are susceptible to water stress asso-Published by Copernicus Publications on behalf of the European Geosciences Union.ciated with the intensification of the water cycle.Paradoxically, although we have seen important technological advances, including increased computing power, the broader availability to climate reanalysis, forecasts and reforecasts, and more complex process-based hydrologic models (Pagano et al., 2016), the skill of operational seasonal runoff predictions in the US, termed water supply forecasts (WSFs), has shown little or no improvement over time (e.g., Pagano et al., 2004;Harrison and Bales, 2016).Hence, there is both a scientific and practical need to understand the potential of new datasets, modeling resources, and methods to accelerate progress towards more skillful and reliable operational seasonal streamflow forecasts.
There is general consensus in the research community on the main opportunities to improve seasonal streamflow prediction skill (e.g., Maurer et al., 2004;Wood and Lettenmaier, 2008;Yossef et al., 2013).These include improving knowledge of (i) the amount of water stored in the catchment, hereinafter referred to as initial hydrologic conditions (IHCs), and (ii) weather and climate outcomes during the forecast period.Our ability to leverage the first predictability source (i.e., hydrologic predictability) depends on the accuracy of watershed observations and models, including model input forcings (e.g., precipitation and temperature), process representations, and the effectiveness of hydrologic data assimilation (DA) methods.Our ability to leverage the second source (climate predictability) depends both on how well we can characterize and predict the state of the climate and on how effectively we can incorporate this information into streamflow forecasting methods.This idea has been explored in different frameworks using standard indices, e.g., Niño3.4,the Pacific Decadal Oscillation (PDO), and/or custom (i.e., watershed-specific) climate indices derived from climate reanalyses (e.g., Grantz et al., 2005;Bradley et al., 2015), or using seasonal climate forecasts to run hydrologic model simulations (e.g., Wood et al., 2005;Yuan et al., 2013).
Despite generally promising findings from this body of work and from a number of agency development efforts (Weber et al., 2012;Demargne et al., 2014), the use of largescale climate information for real-time seasonal streamflow forecasting in the US remains rare.In the western United States, where snowmelt commonly dominates the annual cycle of runoff, official WSFs are produced via two main approaches: (i) statistical models leveraging in situ watershed moisture measurements such as snow water equivalent (SWE), accumulated precipitation, and streamflow (Garen, 1992;Pagano et al., 2004); and (ii) outputs from the National Weather Service (NWS) ensemble streamflow prediction method (ESP; Day, 1985), which is based on watershed modeling.For the overwhelming majority of forecast locations, these approaches rely solely on the predictability from IHCs (measured or modeled).A small number of locations can be found, however, where climate indices also serve as predictors in the statistical framework, and the NWS has recently implemented techniques through which climate model forecasts may eventually be applied to ESP (Demargne et al., 2014).
This paper presents an assessment of several seasonal streamflow prediction approaches in harnessing both watershed and climate-related predictability.The methods are applied to seasonal WSFs and span a range of complexity, from purely statistical to purely dynamical and hybrid statistical/dynamical approaches.In this paper, "increased complexity" indicates a gradient from purely data-driven techniques (e.g., linear regression) to the use of dynamical watershed models (Plummer et al., 2009), the outputs of which may be further processed using additional statistical approaches.Although most of the techniques evaluated here are not new, the intercomparison offers new insights for researchers and developers in the operational community because (1) the experiment is broader than prior efforts and benchmarks alternative methods against current operational ones; and (2) the methods are chosen to be operationally feasible, avoiding the use of data that cannot be obtained in real time.In addition, the work uses a hindcast/verification framework and follows more rigorous standards for cross validation than were used in some of the prior studies.
The remainder of this paper is organized as follows.Section 2 describes prior methodological work and context for statistical, dynamical, and hybrid approaches to seasonal streamflow forecasting.The study domain is described in Sect.3. Datasets, experimental design, individual methods, and forecast verification measures are detailed in Sect. 4. Results and discussion are presented in Sect.5, followed by the main conclusions of this study (Sect.6).

Background
Seasonal streamflow forecasting methods can be categorized as dynamical, statistical, or hybrid, and span different degrees of complexity and information requirements.Dynamical methods use time-stepping simulation models to represent hydrologic processes.They describe future climate using either historical meteorology or inputs derived from seasonal climate forecasts (e.g., Beckers et al., 2016).On the other hand, statistical or purely data-driven methods rely on empirical relationships between seasonal streamflow volumes, and large-scale climate variables and/or in situ watershed observations.Several statistical approaches can be found in the literature, encompassing different degrees of complexity (e.g., Garen, 1992;Piechota et al., 1998;Grantz et al., 2005;Tootle et al., 2007;Pagano et al., 2009;Wang et al., 2009;Moradkhani and Meier, 2010).Other studies have tested multi-model combination techniques for purely statistical seasonal forecasts, using objective performance criteria (e.g., Regonda et al., 2006), both performance and predictor state information (Devineni et al., 2008), and Bayesian model averaging (e.g., Mendoza et al., 2014), among others.
Hybrid methods strive to combine the strengths from both dynamical and statistical techniques.For instance, uncertainties in dynamical predictions indicate that dynamical forecasts can benefit from statistical post-processing (e.g., Wood and Schaake, 2008).One line of research has examined the potential benefits of using simulated watershed state variables -either from hydrologic or land surface modelsas predictors for statistical models (e.g., Rosenberg et al., 2011;Robertson et al., 2013).Another popular technique consists of incorporating climate information within ESP frameworks, either deriving input sequences of mean areal precipitation and temperature from current climate or climate forecast considerations (e.g., Werner et al., 2004;Wood and Lettenmaier, 2006;Luo and Wood, 2008;Gobena and Gan, 2010;Yuan et al., 2013) -referred to as pre-ESP -or ESP weighting (also referred to as post-ESP) based on climate information (e.g., Smith et al., 1992;Werner et al., 2004;Najafi et al., 2012;Bradley et al., 2015).Werner et al. (2004) found that the post-ESP method (termed "trace weighting") was more effective than pre-ESP for improving forecast skill.
The combination of outputs from different models has also been shown to benefit seasonal hydroclimatic forecasting (e.g., Hagedorn et al., 2005).Although several studies have demonstrated that statistical multi-model techniques applied on dynamical models tend to outperform the "best" single model (e.g., Georgakakos et al., 2004;Duan et al., 2007), fewer insights have been gained on combining statistical or dynamical models in seasonal streamflow forecasting.Recently, Najafi and Moradkhani (2015) tested multi-model combination techniques of different complexities from both statistical and dynamical forecasts, concluding that model combination generally outperforms the best individual forecast model.Many sophisticated seasonal forecasting frameworks can be found in the literature, some of which incorporate DA techniques (e.g., DeChant and Moradkhani, 2011), a topic not discussed here.For this reason, the hydrology community may benefit from a broad assessment of the marginal benefits of choices made in a range of seasonal streamflow forecasting frameworks.

Study domain
Our test domain is the US Pacific Northwest (PNW) region (Fig. 1), which relies heavily on winter snow accumulation and spring snowmelt to meet water needs during spring and summer (e.g., Mote, 2003;Maurer et al., 2004;Wood et al., 2005).We select catchments contributing to five reservoirs: Dworshak (DWRI1), Howard Hanson (HHDW1), Hungry Horse (HHWM8), Libby (LYDM8), and Prineville (PRVO).Two of them -the Hungry Horse and Prineville reservoirsare owned and operated by the US Bureau of Reclamation (USBR), while the rest are operated by the US Army Corps of Engineers (USACE).The main physical and hydroclimatic characteristics of the case study basins are summarized in Table 1.These basins cover a wide range of runoff ratios (from 0.13 at Prineville to 0.78 at Howard Hanson) and dryness indices (from 0.63 at Howard Hanson to 3.83 at Prineville).Relatively high basinaveraged elevations condition a pronounced seasonal temperature pattern, with minimum values below the freezing point between December and February, and maximum temperatures during June-September (not shown).These topographic and hydroclimatic features favor snowpack development in the months of October-April, stressing the seasonal behavior of other water storages and fluxes.This is illustrated in Fig. 2, including model precipitation (i.e., observed precipitation with a snow correction factor, SCF) and monthly averages of hydrologic variables simulated with the Sacramento Soil Moisture Accounting (SAC-SMA; Burnash et al., 1973) and SNOW-17 (Anderson, 1973) watershed models (see Sect. 4).Although seasonal precipitation patterns may differ, water starts accumulating in October as snow water equivalent (SWE) and/or soil moisture (SM) in all basins.Increases in SM and runoff in most basins are driven by snowmelt at the beginning of spring with the exception of Howard Hanson, where the bulk of annual streamflow occurs in November-May.Among these basins, Dworshak, Hungry Horse, and Libby share similar SWE, soil moisture, and runoff cycles, although precipitation is relatively uniform in the last one throughout the year.The hydroclimatology of the PNW region is affected by a number of large-scale climate teleconnections.The warm (cold) phase of El Niño-Southern Oscillation (ENSO) is typically associated with above (below) average temperatures and below (above) average precipitation during winter (e.g., Redmond and Koch, 1991) and therefore decreased (increased) snowpack (Clark et al., 2001) and spring/summer runoff (e.g., Piechota et al., 1997).The Pacific Decadal Oscillation (PDO; Mantua et al., 1997) -which reflects the dominant mode in decadal variability of sea surface temperatures (SSTs) -has also been found a relevant driver for the hydroclimatology of the PNW (e.g., McCabe and Dettinger, 2002).The joint influence of ENSO and PDO on North American climate conditions, snowpack, and spring/summer runoff has been also well recognized and documented (e.g., Hamlet and Lettenmaier, 1999).As a consequence, many authors have explored the incorporation of large-scale climate information for seasonal streamflow forecasting in the PNW -using either standard indices (e.g., Hamlet and Lettenmaier, 1999;Maurer et al., 2004), custom indices from reanalysis fields (e.g., Opitz-Stapleton et al., 2007;Tootle et al., 2007), both (e.g., Moradkhani and Meier, 2010), or downscaled climate forecasts (e.g., Wood et al., 2005) -finding improved predictability for lead times longer than 2 months and particularly in years of strong anomalies in climate oscillations such as ENSO.

Experimental design
We use several decades of seasonal streamflow hindcasts to assess a suite of methods (Fig. 3), focusing on April-July streamflow (runoff) volume, the most common western US water supply forecast predictand.Probabilistic (ensemble) WSFs for this period are generated the first day of each month from October to April, in every year of the hindcast period 1981-2015.For the methods involving statistical prediction, we use a leave-three-out cross validation at all stages of the forecast process.This procedure is repeated for consecutive 3-year periods (e.g., 1981-1983, 1984-1986), except for the last time window (2014)(2015).
The techniques assessed here are categorized as follows.The first group, IHC-based methods, includes two approaches (referred to as benchmark methods) -ESP and IHC-based statistical -currently used operationally in the western US (both harnessing only IHC information), and a simple ESP post-processor to reduce systematic biases.A second group, climate-only methods, includes statistical techniques harnessing climate information from two different sources -standard indices (e.g., Niño3.4,PDO, AMO) or variables extracted from the Climate Forecast System Reanalysis (CFSR; Saha et al., 2010).A third group of hybrid or hierarchical methods includes subgroups of techniques that (i) combine watershed predictors (IHCs) and climate predictors (either indices or CFSR variables) within a statistical framework, (ii) use climate information to post-process outputs from a dynamical method (i.e., ESP), or (iii) combine purely climate-based ensembles with purely watershedbased ensembles.
In operational practice, ESP produces an ensemble of streamflow estimates, whereas statistical water supply forecasting yields a statistical distribution.In this study, we generate ensembles of the final predictand for all methods.An ensemble size 500 is used -wherein the members are generated through a resampling (in some cases weighted) of the predictive distributions -except for the ESP and biascorrected ESP methods, for which 32 members are generated (i.e., 35 total historical years less the three test years left out).In the statistical approaches, seasonal flows are logtransformed, and predictor and predictand data are normalized before training statistical method parameters or weights (i.e., z scores are computed using z = (x − µ)/σ , where x represents the original variable, and µ and σ represent the mean and standard deviation of x, respectively).The statis- tical models were applied in log-standard-normal space for forecast generation, then predictands were transformed from z scores to log space (i.e., apply x = zσ +µ, with x = log(Q)), and finally transformed back to streamflow space.In practice, forecasters use a variety of transforms such as linear, square root, cube root, log, and log-sinh (Wang et al., 2012).We did not explore alternative transforms, using the log consistently throughout, but recognize that the choice of transform can affect the quality of the forecast.

Ensemble streamflow prediction
The traditional ensemble streamflow prediction (ESP) method (Day, 1985) relies on deterministic hydrologic model simulations forced with observed meteorological inputs up to the initialization time of the forecast.The approach assumes that meteorological data and model are perfect -i.e., there are no errors in IHCs -and that historical meteorological conditions during the simulation period can be used to represent climate forecast conditions.For hindcast verification purposes, the meteorological input traces associated with forecast years must be excluded.
The hydrology models used in this study were the NWS Snow-17, SAC-SMA, and a unit-hydrograph routing model, all implemented in lumped fashion with 2-3 snow elevation zones per watershed.The models were calibrated via an automated multi-objective parameter estimation to reproduce observed daily streamflow.Hydrologic model forcings were drawn from a 1/16 • real-time implementation of the ensemble forcing generation method described in Newman et al. (2015).Naturalized flow data were obtained from a combination of sources, including the Bonneville Power Administration (BPA, 2011), the USBR Hydromet historical data access system, and the USACE Data Query System.
Figure 4 shows simulated and observed monthly time series of streamflow for the period October 1990-September 2000.In this paper, results are reported in nonmetric units because of their greater familiarity to readers from the US water management community.With the exception of Prineville, where neither meteorology nor flow are well measured, all basins show values of the Nash-Sutcliffe efficiency (NSE) and r higher than 0.76 and 0.87, respectively.Further, the climatological seasonality of streamflow is reproduced well in all basins.

Statistical forecasting using initial hydrologic conditions (Stat-IHC)
This method mimics the approach of the US Natural Resources Conservation Service (NRCS) but differs in using model-simulated basin-averaged SWE and SM as surrogates for ground-based observations of SWE, precipitation, and streamflow used operationally by the NWS and NRCS (as demonstrated in Rosenberg et al., 2011).A linear regression equation is developed between normalized log-transformed seasonal runoff and IHCs represented by the sum of simu-

Bias-corrected ensemble streamflow prediction
ESP predictions often exhibit a systematic bias due to inadequate model parameters and/or other sources or error (e.g., input forcing selection, model structure).If the ESP approach provides a consistent hindcast, as it does here, postprocessing in the form of a simple bias-corrected ensemble streamflow prediction (BC-ESP) can be applied.This is achieved by multiplying the raw ESP forecasts by a mean scaling factor that is obtained by computing the ratio between the mean of observed seasonal runoff volumes (i.e., the predictand) and the mean of ESP forecast median volumes for each initialization time.Each scaling factor calculation and application is cross validated.

Statistical forecasting harnessing only climate information
Multiple linear regression using standard climate indices (Stat-Ind) This method evaluates 12 standard climate indices as candidate predictors (Table 2).For each initialization time (e.g., 1 November) and climate index (e.g., Niño3.4), the 3-month time window that maximizes the correlation coefficient between a preceding seasonal (e.g., August-October) predictor average and seasonal streamflow volume over the training period is selected.Once this procedure is repeated for all potential predictors, the best possible time series are obtained for the 12 climate indices, and ensemble forecasts are produced for a given initialization through the following steps: 1. Several combinations of predictors are selected subject to the constraint that no pairs of predictors with an intercorrelation larger than C thresh = 0.3 should be included.

2.
Stepwise multiple linear regression (MLR) models are fit for all combinations of predictors identified in step 1, and the set of predictors that minimizes the Bayesian information criterion (BIC) score (Akaike, 1974) over the training period is selected.
3. An ensemble forecast is generated (as for Stat-IHC) with the MLR model from step 2.
We choose MLR over more parameterized regression methods (e.g., local polynomial regression) since these were found to perform poorly in cross validation, mainly due to the limited sample sizes available in the seasonal hydrologic prediction context.

Partial least squares regression using reanalysis fields (Stat-CFSR)
The teleconnections captured in off-the-shelf climate indices are not influential everywhere.Therefore, we also assess the potential of custom climate predictor indices derived from reanalysis data.Following Tootle et al. (2007), we use partial least squares regression (PLSR; Wold, 1966) to extract information from climate fields.PLSR decomposes a set of independent variables X and dependent variables Y into a small number of principal components that explain as much covariance as possible between the two variable sets (Abdi, 2010).PLSR components are formed from CFSR 700 mb geopotential height (Z700) and SSTs over the domain 20 2. A regression model is fitted to the resulting PLSR components (predictors), accepting each additional component only when its mean partial correlation with volume runoff is above a threshold.We used a threshold of 0.30 throughout the study after finding that nearby valuese.g., 0.25, 0.35 -did not substantially change the results.The small sample size and low predictability supported at most two components.
3. A mean runoff volume forecast is computed using the regression model obtained in step 2, and an ensemble is generated by adding 500 Gaussian random numbers with zero mean and a standard deviation equal to the root mean squared error of prediction (RMSEP) obtained from leave-three-out cross validation within the training period.
4. Ensemble forecasts are transformed from z scores to log space, and finally exponentiated for conversion to flow space.
The main implication of developing PLSR components and the subsequent estimation of regression coefficients in cross validation -as conducted here -is that climate information P. A. Mendoza et al.: Intercomparison of seasonal streamflow forecasting approaches from the target prediction period is not used at all, as is the case in real-time systems.This is a key methodological difference versus past studies that used all historical available information to define custom reanalysis predictor fields (e.g., Grantz et al., 2005;Regonda et al., 2006;Bracken et al., 2010;Mendoza et al., 2014)

ESP trace weighting scheme
A well-known strategy for incorporating climate information into ESP forecasts is called trace weighting (Smith et al., 1992;Werner et al., 2004), where forecasted flow probabilities are corrected by weighting each ensemble member according to the similarity between a climate-related feature of the current year (e.g., PDO) and the meteorological year used to generate that member.Here, for a given basin and forecast period, either climate indices or CFSR-based components are selected based on their training period performance (i.e., RMSE) and used to weight each trace obtained from BC-ESP (see Appendix A for further details).

Equally weighted ensembles and RMSE-weighted ensembles
An equally weighted ensemble (EWE) combines the bestperforming climate-only hindcast (i.e., Stat-Ind or Stat-CFSR, based on RMSE over the training period) with the best watershed-only hindcast (either Stat-IHC or BC-ESP), resampling ensemble members equally from each source to form a new 500-member ensemble forecast.A variation of this combination approach, an RMSE-weighted ensemble (RWE), instead performs a weighted resampling from the two forecast sources according to their skill during the training period.That is, two weights RMSE −1 are obtained, where RMSE is the root mean squared error of the forecast ensemble median.These weights are normalized to make them sum up to 1, and finally obtain the fraction of the new 500-member ensemble coming from each forecast source.For example, if the resulting normalized weights are 0.4 and 0.6 for the best climate-only and best watershed-only forecasts, respectively, the RWE will contain 200 and 300 members from each prediction.

Bayesian model averaging and quantile model averaging
These methods combine the best-performing climate-only hindcast with the best-performing watershed-only hindcast.While Bayesian model averaging (BMA; Raftery et al., 2005) attempts to provide a weighted average of forecast probability densities, quantile model averaging (QMA; Schepen and Wang, 2015) applies a weighted average to forecast values (quantiles) for a given cumulative probability.A notable difference between the two approaches is that QMA produces smoother and consistently unimodal distributions compared to potentially bimodal BMA outputs (Schepen and Wang, 2015).More details on these techniques are provided in Appendix B.

Forecast evaluation
Forecast method performance was evaluated using the metrics listed in Table 3.These include some standard metrics used in hydrology, such as correlation coefficient (r), root mean squared error (RMSE), and percent bias, and also probabilistic measures to assess skill and reliability.Skill is obtained using the continuous ranked probability score (CRPS; Hersbach, 2000), which measures the temporal average error between forecast CDF and that from the observation.
Deterministic metric that varies [−1,1] with a perfect score of 1.It measures the linear association between forecasts and observations independent of the mean and variance of the marginal distributions.
%Bias Percent bias %Bias × 100 Deterministic metric that varies (−∞, ∞) with a perfect score of 0. It measures the difference between the mean of the forecasts and the mean of observations.RMSE Root mean squared error Deterministic metric that varies [0,∞) with a perfect score of 0.

CRPSS
Continuous ranked probability skill score Probabilistic metric that varies (−∞,1], with perfect score of 1.It measures the skill of CRPS relative to a reference forecast (Hersbach, 2000).CRPS quantifies the difference between the cumulative distribution (CDF) function of a forecast (F ) and the corresponding CDF of the observations (F o ).
It quantifies the closeness between the empirical CDF of sample p values with the CDF of a uniform distribution.A value of 0 is the worst, and 1 reflects perfect reliability (Renard et al., 2010).
q m,i : forecast ensemble median for year i. q m : temporal average over forecast ensemble medians.o i : observation for year i. o: temporal average of observations.P i (o i ): non-exceedance probability of o i using ensemble forecasts at year i.U i (o i ): non-exceedance probability of o i using the uniform distribution U [0,1].
spread to represent the uncertainty in observations -is evaluated using an index from the predictive quantile-quantile (QQ) plot (Renard et al., 2010).QQ plots compare the empirical CDF of forecast p values (i.e., P i (o i ), where P i and o i are the forecast CDF and observation at year i) with that of a uniform distribution U [0,1] (Laio and Tamea, 2007).
Confidence intervals for the verification statistics are created using bootstrapping with replacement.In each resampling step, N pairs of ensemble forecasts and observations were resampled from the original joint distribution (N is the total number of events for which probabilistic forecasts are available).This process is repeated 1000 times, and all statistics are then computed for each realization and ranked in order to obtain 95 % confidence limits.

Deterministic evaluation
We first compare methods using the WSF median, a critical predictand for many water decisions (e.g., Lake Powell releases on the Colorado River in the western US).Fig-ure 5 displays correlation coefficients (r) between forecast median and observed April-July runoff volumes for the five case study basins.As expected, near-zero or negative r values were obtained for 1 October and 1 November WSFs with the IHC-based methods.Negative correlation scores arise in very low-skill situations as an artifact of cross validation (e.g., leaving a high predictand out of a training sample biases the resulting prediction in the opposite direction).The seasonality of SM and SWE in the basins of interest (Fig. 2) does not yield watershed moisture accumulations with predictive power until December or January.In contrast, r values as high as 0.48 for Dworshak and 0.49 for Hungry Horse could be attained on 1 October using only information from climate indices (Stat-Ind).Generally, but not everywhere, methods harnessing predictability from the climate (with the exception of the trace weighting scheme -TWS) enhance skill in comparison to IHC-based methods at initializations early in the water year.TWS is unable to shift the parent ESP distribution sufficiently to impart much climate skill at this time of year.
After January, the hydrologic model begins to capture a useful moisture variability signal from the watershed; thus, IHCs start to become a dominant source of predictability in all basins.Indeed, watershed information is particularly relevant at Libby and Prineville (Fig. 5d and e), where correlations within the range 0.39-0.47 are achieved as early as 1 December with the three IHC-based techniques.In these basins, standard climate indices do not provide useful longlead predictability, although CFSR-based predictors do support a consistent improvement.For example, the correlation from Stat-Ind for Libby (Prineville) on 1 December is −0.23 (0.02), while the r value from Stat-CFSR is 0.19 (0.30).These differences between Stat-Ind and Stat-CFSR remain at these basins for subsequent monthly initializations.
Figure 5 reveals several notable outcomes that are evident in many of the results plots.First, a linear regression against IHCs can provide similar r values than the more computationally expensive ESP method, especially at late initializations (i.e., 1 March or 1 April).Likewise, straightforward ensemble combination techniques (e.g., EWE or RWE) may outperform more complex methods such as BMA (e.g., 1 February-1 April) at all basins.From a correlation skill perspective, on the other hand, ESP generally outperforms the rest of the methods in late winter and spring.For example, ESP provides the highest r values for Dworshak (0.82) and Howard Hanson (0.67) on April 1.Notably, EWE was found to be the best method on 1 April for Hungry Horse (r = 0.88) and Prineville (r = 0.79) based on correlation.This indicates that, although simple post-processing can provide substantial forecast improvement, the small sample size available for training during the cross-validation process results in noisy parameter estimates that can undermine the potential correlation skill achievable with techniques that are more complex.
RMSEs for ensemble forecast medians (Fig. 6) show that despite some discrepancies between techniques, intermethod differences are not as large as for correlation.In most basins, errors can be reduced at earlier initializations (i.e., 1 October and 1 November) by introducing climate information.For instance, on 1 October, Stat-Ind and Stat-Ind-IHC generate respective reductions in RMSE of 10 and 13 % at Dworshak, 23 and 16 % at Howard Hanson, and 14 and 12 % at Hungry Horse, relative to the best IHC-based method in each basin.These benefits are seen in most initializations and catchments except at Libby, where the best results were mostly achieved using ESP (1 October) and Stat-IHC (1 December and 1 February-1 April).In agreement with Beckers et al. (2016), this study was unable to find encouraging climate teleconnections at Libby despite its relative proximity to Hungry Horse.
Figure 6 underscores that from a median error perspective, intuitive ensemble combination approaches (i.e., EWE and RWE, shown in dark green) can be effective for reducing forecast errors once the watershed begins to provide useful predictability (i.e., after 1 January).For instance, EWE was the best-performing method in Hungry Horse and Prineville for forecasts initialized on 1 March and 1 April.Further, Figure 6 illustrates that the best (or worst) techniques when looking at RMSE vary with each basin, although it is clear that TWS and climate-only methods perform poorly at early and late initializations, respectively.The joint inspection of Figs. 5 and 6 shows that inter-method agreement in correlation does not necessarily translate into similar forecast median errors.For example, while ESP and HESP provide close r values at Dworshak (0.74 and 0.73) on 1 March, larger discrepancies are obtained in RMSE, with values of 0.58 million acre feet -equivalent to 0.72 billion m 3 (BCM) -and 0.79 MAF (0.97 BCM) for ESP and HESP, respectively.
Another interesting result is that no substantial reductions in RMSE were achieved at Howard Hanson between 1 October and 1 April, in contrast to the gradual growth of hydrologic predictability to support forecast skill in other basins.Indeed, the best-performing techniques for 1 October (Stat-Ind) and 1 April (BC-ESP) forecasts provide similar RMSE values (∼ 0.064 MAF (0.079 BCM) and 0.065 MAF (0.08 BCM), respectively).This outcome can be attributed to the relatively more rainfall-dominated hydrograph of Howard Hanson in comparison to the rest of the catchments (Table 1; Fig. 2), and sustained runoff variability generated by seasonally high SM and fall-winter precipitation.
Figure 7 (forecast median bias) shows that raw ESP outputs have the largest biases through most initializations at Howard Hanson, Libby and Prineville.In particular, absolute biases at Prineville -which is the worst simulated basin in the group -increase to 53 % on 1 October before decreasing to 20 % on 1 April.Further, relatively large biases (in comparison to the rest of techniques) were obtained at late initializations in Dworshak and Hungry Horse.Excepting Prineville, inter-method differences were not substantial, and none of the methods exceeded a 16 % bias at any initialization.The simple bias correction applied in this study was able to reduce absolute biases to less than ±3 % at Prineville and less than ±1 % at the rest of the basins.Hence, from a bias reduction perspective, BC-ESP was the best technique for most basins/initializations, with the exceptions of Dworshak on 1 February and Prineville on 1 March and 1 April, for which Stat-CFSR-IHC and TWS provided the best results.

Probabilistic verification
Figure 8 displays continuous ranked probability skill scores computed with mean climatology as a reference (CRPSS clim ).Consistent with the correlation analysis results (Fig. 5), better skill values are obtained for long lead times (i.e., 1 October and 1 November) if climate predictors are incorporated in the forecasting framework.For example, Stat-Ind-IHC augments skill by 56 % in HHDM1 and 7 % in Hungry Horse with respect to Stat-IHC (i.e., the best benchmark in terms of CRPSS clim ) when forecasts are initialized on 1 November.The skill of IHC-based methods generally increases from 1 October to 1 April.Nevertheless, at late initializations it is still possible to outperform these techniques in some basins (e.g., Stat (CFSR + IHC) and EWE in Hungry Horse provide skill increases of 7 and 5 % in 1 April forecasts over the best IHC-based technique).For late season initializations -when IHC predictability is strong -it is expected that climate-only forecasts are not suitable and underperform other methods.This progression of relative predictabilities from climate and watershed moisture conditions (Figs. 5 and 8) is consistent with previous findings for the region (e.g., Pagano and Garen, 2006).The results from Fig. 8 corroborate several findings alluded to in Sect.5.1.Climate predictors applied to lowskilled (BC-)ESP forecasts in a TWS framework are less effective than when applied in a separate statistical method.Additionally, less complex multi-model schemes can perform better than more complex approaches (e.g., BMA), supporting previous findings by Najafi and Moradkhani (2015).Among the three hybrid regression methods (Fig. 3), Stat-CFSR-IHC was in most cases the worst performer.This result may be determined by the relative strength of standard (in particular ENSO) indices for the PNW region.When used in combination with other stronger predictors, the parameter estimation cost of the CFSR-PLSR relative to an off-the-shelf index may be more exposed (leading to greater shrinkage of skill after cross validation).The skill results in this study are subject to large uncertainties due to limited sample size (i.e., only 35 years for forecast generation and verification).
Overall, the results presented in Figs. 5 and 8 suggest a division of the study basins into two groups showing different relative predictabilities -i.e., driven by watershed conditions versus climate -from October to January.The first group is formed by Dworshak, Howard Hanson, and Hun-gry Horse, where the state of the climate is the dominant source of predictability from 1 October to 1 December, and IHCs start providing useful information on 1 January.The second group is formed by Libby and Prineville, where little or no skill can be found from any source until 1 December, when some predictability can be harnessed from IHCs.This is illustrated in Fig. 9, where time series with cross-validated seasonal streamflow forecasts -initialized on 1 December during the period 1981-2015 -are shown for two IHC-based methods (BC-ESP and Stat-IHC), and two climate-based statistical methods (i.e., Stat-Ind and Stat-CFSR).At such initialization, there is not enough information in the watershed (IHCs) to predict interannual variations in April-July streamflow at Dworshak (Fig. 9a) or Howard Hanson (Fig. 9b); nevertheless, climate predictors are more successful, a result that is also reflected in positive correlation results (Fig. 5) and skill scores (e.g., CRPSS clim increases from 0.23 with Stat-IHC to 0.39 with Stat-Ind at Howard Hanson).For the particular case of Hungry Horse (Fig. 9c), some predictability is provided by watershed information alone (i.e., BC-ESP), although with smaller correlation and skill than Stat-Ind or Stat-CFSR.Finally, the ensemble forecast time series displayed for Libby (Fig. 9d) and Prineville (Fig. 9e) portray the relative predictive power of IHCs in these basins compared to climate predictors alone.Indeed, at the 1 December initialization in these basins, watershed information alone supports r values of 0.43 (Libby) and 0.39 (Prineville) from BC-ESP, and r values of 0.47 from Stat-IHC.
Forecast reliability can be critical to support risk-based decision making in which actions may be tied to the forecast distribution rather than the median.The reliability index α (Fig. 10), which measures the closeness between the empirical CDF of forecast p values with a theoretical CDF of U [0,1] (Table 3) shows that -although (BC-)ESP forecast correlation (Fig. 5) and skill (Fig. 8) generally increase during the year, forecast reliability from the ESP methods degrades (i.e., toward lower α) as the initializations approach 1 April.For such lead times, the uncertainty in ESP streamflow forecasts is underestimated due to reliance on a single modeled IHC that does not account for modeling errors (Wood and Schaake, 2008), such that forecast spread derives only from uncertainty represented by the ensemble of future forcings.Because TWS is constrained by ESP spread, it cannot provide substantial enhancements to poor late-season reliability indices obtained with (BC-)ESP.
In general, forecasts involving statistical calibration (which helps to improve spread and bias) are most reliable.Indeed, regression-based forecasting methods (e.g., Stat-IHC, Stat-Ind, Stat-Ind-IHC) stand out in all basins, suggesting that the ensemble generation approach used in this paper (based on the standard error of the cross-validated hindcasts) is capable of providing statistically consistent ensembles.Multi-model techniques appear to inherit this characteristic, with only small discrepancies apparent between them (green lines in Fig. 10).Similar inter-method differ- ences across multiple initializations were found when looking at the ε reliability index (not shown) defined by Renard et al. (2010).
Although HESP was only found to be the "most reliable" method in a limited number of cases (e.g., α = 0.95 at Dworshak on 1 October; α = 0.96 at Libby on 1 April), relatively high α values were consistently attained in all basins and forecast lead times.This suggests -in conjunction with the results shown in Figs.5-8 -that HESP has strong potential for operational streamflow forecasting at all initialization dates, since it is capable of flexibly harnessing seasonally varying sources of predictability.Figure 11 illustrates this idea through time series of cross-validated ensemble forecasts obtained with HESP for three initialization times (1 October, 1 January, and 1 April).Forecasts issued on 1 October provide positive skill with respect to climatology in Dworshak, Howard Hanson, and Hungry Horse, and although CRPSS relative to ESP does not necessarily improve, the associated correlation coefficients (0.42, 0.37, and 0.47, respectively) are a clear enhancement over negative r values obtained from IHC-based methods.The lower probabilistic skill and near-zero correlation in Libby and Prineville reflect the lack of predictability from either the watershed or climate conditions at such a long lead time.Higher values of CRPSS clim for ensemble forecasts initialized on 1 January and 1 April reflect the increasing power of IHCs, while smaller (and sometimes negative) CRPSS esp values in some basins reflect the increasing difficulty to outperform ESP as IHCs provide more forecast signal.Overall, HESP provides positive skill with respect to mean climatology in all cases, relatively high r values, and statistically consistent forecast ensembles.

Wet/dry year forecasts
Summary statistics provide an overview of forecast performance, but additional insights can be gained from exploring extreme years in the record -in which forecasts can have disproportionate value to help water managers negotiate atypical challenges -and from visualizing the behavior of the forecasting methods as individual seasons progress.We therefore performed a retrospective comparison of all techniques for two regionally wet (1997 and 2011) and dry (1987 and 2001) water years at Hungry Horse (Fig. 12), one of the most teleconnected basins in our study domain.Figure 12 illustrates how SWE and SM, the primary sources of predictability for IHC-based methods, progressively gain influence on ensemble forecasts (e.g., HESP and TWS outputs) as the beginning of the snowmelt season approaches (i.e., 1 April).These single-year forecast evolution plots highlight the contrast for late season (i.e., 1 February onwards) between overconfident predictions exhibiting poor reliability (e.g., ESP, BC-ESP, TWS), and underconfident forecasts (e.g., EWE and RWE).
Figure 12a, b show that climate information is required to reduce forecast errors in wet years at very long lead times (i.e., 1 October and 1 November), either alone or combined with watershed information through hybrid approaches.For example, the technique that provided the smallest forecast median error on 1 October 1997 was TWS.For shorter lead times (i.e., forecasts initialized on 1 March or 1 April) and WY 1997, the incorporation of IHCs helps to provide a better match with observations compared to methods that only use climate information.Interestingly, reanalysis fields at Hungry Horse provide considerable predictive power for WY 2011 (Fig. 12b) at short lead times (e.g., Stat-CFSR provides a forecast median error of 2.7 % on 1 March).
In the two dry years, Fig. 12c illustrates that climate predictors alone had considerable predictive power at long lead times (i.e., 1 October and 1 November) in WY 1987.However, this was not the case for WY 2001 (Fig. 12d), when the method providing smallest forecast median volume errors at all initialization times (i.e., either BC-ESP or TWS) always required knowledge on watershed moisture conditions.This was also the case for other pilot study basins (not shown).
The above results suggest that despite the value of largescale climate information for this study domain, enhanced hydrologic predictability is critical for accurate streamflow Hydrol.Earth Syst.Sci., 21, 3915-3935, 2017 www.hydrol-earth-syst-sci.net/21/3915/2017/ volumes in snowmelt-dominated regions under extreme climatic conditions, especially during dry years.Past and ongoing efforts that aimed to improve basin-scale meteorological forcing datasets, pursue realistic process representations in hydrologic models, advance parameter calibration, and improve DA techniques for better IHC estimates have built a robust platform to accelerate progress in this area.However, a long-term retrospective implementation (that is consistent with the real-time deployment) of these various modeling decisions and sources of information is critical to understand their performance and benchmark methodological choices.

Conclusions
Generating accurate water supply forecasts is an ongoing challenge for improving water resources operations and plan-ning.Despite substantial work on seasonal streamflow forecasting methods applied worldwide, the marginal value of increased complexity and combining different sources of information via different strategies has not been systematically assessed.In this paper, we compare a range of techniques that leverage predictability from watershed hydrologic conditions and/or large-scale climate information.The forecast intercomparison showed that hybrid techniques that leverage hindcasts to combine both sources of predictability could lead to improved skill compared to current operational approaches.Additional key findings that may be relevant beyond the study domain -due to the inclusion of both teleconnected and non-teleconnected basins -are as follows: -In basins showing strong teleconnections between large-scale climate and local meteorology, the use of large-scale climate information can be an effective strat- egy for improving seasonal streamflow predictability, potentially providing skillful forecasts at times when watershed predictability is limited.
-Standard climate indices provide useful information, and custom climate predictors from reanalyses were also an effective complementary strategy for extracting the signal from climate fields (e.g., SST and geopotential height).
-The relative importance of watershed IHC versus climate information to predict streamflow was found to vary even within a small region, depending on subdomain catchment hydroclimatological characteristics.
-The ESP trace weighting method only provided promising results at forecast lead times where ESP raw forecasts contained moderate skill, indicating that climate information cannot adequately shift the prior ESP forecast if it lacks forecast resolution or contains significant bias.
-Increasing methodological complexity does not necessarily translate into better ensemble forecast quality (e.g., Stat-IHC versus BC-ESP; EWE versus BMA), in part because the small sample sizes associated with seasonal hindcasts preclude reliable parameter estimation for more elaborate methods.There can be a trade-off between improving one forecast characteristic (e.g., bias) and degrading another (e.g., correlation skill).
-Cross validation is an essential part of seasonal forecast development and implementation, particularly where multiple predictions may be combined based on their purported relative strengths and predictive uncertainty must be accurately estimated.In the small-sample context of seasonal streamflow prediction, cross validation reveals significant limitations in the supportable complexity of statistical forecasting elements.
The often equivocal comparison of methods through multiple verification metrics (e.g., correlation, reliability) for individual wet and dry years, and for different basins, starkly illustrated the challenge of selecting a single method that will provide optimal results for all forecast initialization dates.There is a significant tension between optimizing forecast qualities through a mixture of methods and data sources that vary seasonally and across basins, and an oft-stated preference from forecasters and users for a consistent forecasting methodology.With this in mind, we developed HESP as a flexible data-driven framework to harness skill across varying predictability regimes, although it admittedly departs from the constraint of predictor uniformity.
A notable omission from this intercomparison study is the derivation of climate predictors from global climate model forecasts, a strategy that has also been pursued in this context (e.g., see Crochemore et al. 2016).The experiment summarized here did assess the skill of CFSv2 9-month climate forecasts at an earlier stage, but such evaluation has been excluded from this paper because the results did not show significantly higher skill from the CFSv2 forecasts than the CFSR-based empirical predictions, as is consistent with prior skill assessments (e.g., Yuan et al., 2011).Nonetheless, the topic of augmenting hydrologic predictability from dynamical climate forecasts remains an appealing area for future study and comparison, as does the potential for including IHC data assimilation to enhance watershed modelbased predictability (e.g., DeChant and Moradkhani, 2011;Huang et al., 2017).Future work can also explore alternative methodological choices such as multiple hydrological models, different climate datasets, or smaller details such as alternative variable transformations in statistical approaches (e.g., Wang et al., 2012).

Figure 1 .
Figure 1.Location map with the pilot basins included in this study.

Figure 2 .
Figure 2. Corrected precipitation P (i.e., observed precipitation multiplied by a SCF) and simulated water balance variables -active SM, SWE, and runoff (RO) -for the five study basins: (a) Dworshak reservoir inflow (DWRI1), (b) Howard Hanson reservoir inflow (HHDW1), (c) Hungry Horse reservoir inflow (HHWM8), (d) Libby dam inflow (LYDM8), and (e) Prineville reservoir inflows (PRVO).For model SM, we subtract the lowest mean monthly value of the year so that the plotted values show only the active range of variation.

Figure 3 .
Figure 3. Schematic figure showing all seasonal streamflow forecasting methods included in the intercomparison framework.The benchmark methods are operationally implemented in the western United States, and they are solely based on hydrologic predictability.
• S-80 • N, 130 • E-10 • W. For dates beyond 2010, we merged the 1979-2010 CFSR data with monthly analysis fields from the Climate Forecast System version 2 (CFSv2; Saha et al., 2014), aggregating the latter product to 2.0 • × 2.0 • horizontal resolution.Similar to the Stat-Ind method, we use 3-month averages of these variables.The seasonal forecasts are generated for each initialization by following these steps: 1.The principal components are computed from the combined SST and Z700 gridded values for each training sample and the left-out prediction years.

Figure 6 .
Figure 6.The same as Fig. 5 but for RMSE -in million acre feet (MAF) -of ensemble forecast medians versus observations.See text for further details.

Figure 7 .
Figure 7.The same as Fig. 5 but for percent bias (% bias) in forecast ensemble medians versus observations.See text for further details.
Figure 8. Continuous ranked probability skill score of the forecast ensembles with respect to mean observed climatology (CRPSS clim ).See text for further details.

Figure 9 .
Figure 9.Time series with cross-validated hindcasts initialized on 1 December, obtained with two watershed-based methods (BC-ESP and Stat-IHC) and two climate-based techniques (Stat-Ind and Stat-CFSR) for the five case study locations (a-e).The verification metrics CRPSS clim and CRPSS esp denote continuous ranked probability skill scores using the mean climatology and raw ESP output as the reference, respectively.Black dashed lines represent 10, 50, and 90 % flows from the observed climatology, and boxplots show the 10th, 30th, 50th, 70th, and 90th hindcast percentiles.The red line represents the observed flow volumes.

Figure 12 .
Figure 12.April-July water supply forecasts obtained at the Hungry Horse reservoir (HHWM8) with different methods for two wet years -(a) 1997 and (b) 2011 -and two dry years -(c) 1987 and (d) 2001.The red dashed line represents the observed flow, while black dashed lines represent 10, 50, and 90 % flows from observed climatology, and boxplots show the 10th, 30th, 50th, 70th, and 90th hindcast percentiles.

Table 1 .
List of basin characteristics.Hydrologic variables correspond to the period October 1980 to September 2015.P , R, PE, RR, and DI denote basin-averaged mean annual values of precipitation, runoff, potential evapotranspiration, runoff ratio, and dryness index, respectively.

Table 2 .
List of climate indices included as potential predictors.
, yielding a moderate yet erroneous boost in predictability.We applied two statistical methods that combine climate and dynamical watershed model predictors: Stat-Ind-IHC (which uses climate indices and IHCs) and Stat-CFSR-IHC (which uses CFSR-based PLSR components and IHCs).These approaches are implemented in identical fashion to Stat-Ind, except that IHCs are added to the potential suite of climate predictors.
Hierarchical ensemble streamflow predictionThe underlying idea of hierarchical ensemble streamflow prediction (HESP) is that the two main sources of predictability -watershed IHCs and climate -may best be addressed sequentially to ensure that only climate uncertainty is related to climate predictors.This may not be the case if a climate variable enters first into a regression model that attempts to explain streamflow variance from both IHCs and climate, possibly leading to a sub-optimal predictor selection.HESP is thus a hierarchical regression approach in which streamflow is first related to IHCs by fitting Q = f (IHC predictors) +ε climate , given sufficient IHC predictor strength.The residual uncertainty is then related to climate predictors (again if possible) by fitting ε climate = g(climate predictors) +ε residual , such that the final forecast equation takes the form Q = f (IHC predictors)+g(climate predictors)+ε residual .(1)

Table 3 .
Performance metrics used to assess and compare seasonal streamflow forecasting methods.