Interactive comment on “An intercomparison of approaches for improving predictability in operational seasonal streamflow forecasting”

The manuscript compares different methods for seasonal water supply forecasts in several catchments in the Pacific Northwest region of the US. A large variety of models was applied: purely statistical methods, methods based on watershed modelling, as well as hybrid approaches using initial hydrologic conditions and/or climate information as input. Additionally, different post-processing and merging methods are tested. The snow-dominated test catchments cover a wide range of hydrometeorological conditions and different atmospheric teleconnection signals.

We are very pleased that this reviewer appreciates the contributions of this study.
Title: I don't think "predictability" is the right word for the title. It implies something that's immutable and intrinsic, in the sense of theoretical maximum predictability, which is not something that could be "improved". The predictive skill of certain techniques or a forecasting enterprise can be improved, however.
The reviewer makes a good point. To avoid confusion around the concept of "predictability", we have modified the title to "An intercomparison of approaches for improving operational seasonal streamflow forecasts" (L1-2).
Line 57 "current operational practice in the US still takes little to no advantage of large-scale climate information for real-time seasonal streamflow forecasting" and later lines 64-65 "these [operational] approaches rely solely on the predictability of [initial hydrologic conditions] and do not leverage any type of large-scale current or future climate information". From my experience as a forecaster, there were only very limited locations and lead times where the climate information provided substantial benefits. Things like El Nino indices were used in the Pacific Northwest and southwest US for early (i.e., January) and pre-season (i.e., October-December) forecasts. I think it's strong to say that there was no use at all of climate information.
We have modified the text to reflect the reviewer's experience on this topic, though in truth the vast majority of statistical forecasting locations in the western US do not use climate indices to our knowledge, even in the Pacific Northwest (PNW). The paragraph now reads (L59-69): "Despite generally promising findings from this body of work and from a number of agency development efforts (Weber et al., 2012; Demargne et al., 2014), the use of large-scale climate information for real-time seasonal streamflow forecasting in the US remains rare. In the western United States, where snowmelt commonly dominates the annual cycle of runoff, official WSFs are produced via two main approaches: (i) statistical models leveraging in situ watershed moisture measurements such as snow water equivalent (SWE), accumulated precipitation and streamflow (Garen, 1992; Pagano et al., 2004); and (ii) outputs from the National Weather Service (NWS) Ensemble Streamflow Prediction method (ESP; Day, 1985), which is based on watershed modeling. For the overwhelming majority of forecast locations, these approaches rely solely on the predictability from IHCs (measured or modelled). A small number of locations can be found, however, where climate indices also serve as predictors in the statistical framework, and the NWS has recently implemented techniques through which climate model forecasts may eventually be applied to ESP (Demargne et al., 2014)."

Line 89 Following the list of statistical water supply forecasting techniques. It may be useful to include in that list http://onlinelibrary.wiley.com/doi/10.1111/j.1752-1688.2009.00321.x/abstract because it also includes z-score regression and describes operational products.
Certainly! This was an oversight, as we are aware of that work; we have therefore included the aforementioned reference, following the reviewer's suggestion (L94).
Line 172 The universal use of the log transform on all the predictands. Operationally, forecasters use linear, square root, cube root and log transform statistical models, with log being the most extreme. The use of log everywhere wouldn't have been my first choice, and is probably responsible for "forecast blowouts" like 1993 in the Apr-1 / e panel on Figure 11 (far lower right corner, only the lower whisker is visible on the chart). But since it's applied the same everywhere, it means that the intercomparison is valid in a relative rather than absolute sense. You might reassure the reader that you tried other transforms and the results were insensitive.
We regret that we did not try other transformations, as we were focused on relative outcomes, though this would have been a reasonable thing to do. In truth, we did place a great deal of importance on the transformation when the work was done, though since then our interactions with CSIRO have opened our eyes to the variation in the effectiveness of different transformations (including, for instance, the log-sinh). We do not have the bandwidth to go back and explore this issue, but for now we highlight it for the readers based on the text of the comment. Hence, we have added the following sentences (L179-182): "In practice, forecasters use a variety of transforms such as linear, square root, cube root, log and log-sinh (Wang et al. 2012). We did not explore alternative transforms, using the log consistently throughout, but recognize that the choice of transform can affect the quality of the forecast."

Line 261 The use of stepwise approach to model building. I think what you're describing here is the case where El Nino is predicting fall precipitation, and by the time January 1 comes around the precipitation is "in the bank", so continued use of El Nino as a seasonal streamflow predictor after January 1 is redundant if the equation also includes IHC variables. This was a common operational challenge in the US and a frustration to forecasters.
The reviewer correctly identifies the motivation for the technique, in the sense that HESP intends to handle seasonally varying sources of predictability separately, applying the climate predictors only to the portion of the flow variation that has not already been explained by the IHCs, if possible. If the signal from watershed moisture conditions becomes strong and the climate signal is redundant, the residual e_climate = Q - f(IHC) cannot be explained robustly by climate information, and HESP simply defaults to Stat-IHC.
For many years, 1 February was the start of the operational forecasting season, and so there is little surprise that hydrologists were underwhelmed with what El Nino had to offer them. It wasn't until lead times were pushed back to 1 January, and then back to 1 October, that hydrologists became more operationally interested.

Indeed, our skill plots (Figures 5 and 8) align with the findings of Pagano and Garen (2006) and other researchers on this point, specifically the progression of seasonal streamflow forecast skill provided in their Figure 1. We thank the reviewer for this observation about the original initialization dates in February, which is encouraging if it indicates a trend toward even earlier start times, where climate information is relatively more important. We add the following sentence (L400-401): "This progression of relative predictabilities from climate and watershed moisture conditions (Figures 5 and 8) is consistent with previous findings for the region (e.g., Pagano and Garen 2006)."

Line 410 On explanations of why ESP is under-dispersive - The common way of explaining this is that NWS-style ESP does not consider parameter, data or model uncertainty, only uncertainty of future forcings.

This is a good point. Indeed, ESPs are particularly under-dispersive at late forecast initializations, when uncertainty in IHCs dominates the total streamflow forecast uncertainty (Wood and Schaake 2008). We thank the reviewer for this observation, and have modified the text accordingly (L432-435): "For such lead times, the uncertainty in ESP streamflow forecasts is underestimated due to reliance on a single modeled IHC that does not account for modeling errors (Wood and Schaake 2008), such that forecast spread derives only from uncertainty represented by the ensemble of future forcings."
Line 477 Generating custom climate indices beyond El Nino creates useful information. I feel like this contradicts the statements on lines 383-385, where you say that this technique was the worst performer.
The reviewer refers to a comparison between the three hybrid regression techniques (Stat-Ind-IHC, Stat-CFSR-IHC, and HESP) in terms of probabilistic skill. We did find that for some basins (e.g., Dworshak and Hungry Horse) Stat-CFSR provides higher skill than using standard climate indices (Stat-Ind), also outperforming benchmark techniques at early initializations. Nevertheless, our results also show that when custom indices are used in combination with stronger predictors, attempting to explain smaller amounts of variance, they are not as robust as standard climate indices. We add a sentence to help explain this context (L407-409): "When used in combination with other, stronger predictors, the parameter estimation cost of the CFSR-PLSR relative to an off-the-shelf index may be more exposed (leading to greater shrinkage of skill after cross-validation)."

The literature review of the most commonly used methods in seasonal streamflow forecasting is exhaustive and the results are nicely presented and compared. The real value of this study is the comparison of a large variety of methods based on a common hindcast/verification framework using rigorous three-years-out cross-validation. Using such a common framework, an objective comparison of the performance of the different methods is possible. The paper is well written and should be foreseen for publication in HESS after minor revisions.
We are very pleased that this reviewer appreciates the contributions of this study.

You should probably use SI units instead of KCFS (thousands of cubic feet per second) and MAF (million acre-feet), as is standard in HESS.
We appreciate this sentiment. However, we consider that the paper will have much more value if the results are presented in units that are familiar to water managers in the US, since this is fundamentally a water supply forecasting study. Therefore, we kindly ask the Editor and Publisher for permission to preserve the current flow units throughout the manuscript. We have added a short sentence to prepare readers for non-metric units (L199-L200): "In this paper, results are reported in non-metric units because of their greater familiarity to readers from the US water management community." Additionally, we have added SI units in parentheses where possible (L372-373, L377).
P 5 line 155: missing first three year period in the brackets could be confusing why it is missing, add period: "... (e.g., 1981-1983, 1984-1986, 1987-1989, 1990-1992, etc.), ..."

We have added the first three-year period (1981-1983), following the reviewers' suggestion, and taken out the part of the series after the first two (L160).

The reviewer is correct: z-scores are computed using z = (x - µ)/s, where x represents the original variable, and µ and s represent the mean and standard deviation of the population, respectively. We clarify this procedure in the revised manuscript (L174-179).
Explain why you have used log-transformation of the predictand data and no other transformation method (e.g., Box-Cox, ...).
We regret that we did not try other transformations, as we were focused on relative outcomes, though this would have been a reasonable thing to do. In truth, we did place a great deal of importance on the transformation when the work was done, though since then our interactions with CSIRO have opened our eyes to the variation in the effectiveness of different transformations (including, for instance, the log-sinh). We do not have the bandwidth to go back and explore this issue, but for now we highlight it for the readers based on the text of the comment. Hence, we have added the following sentences (L179-182): "In practice, forecasters use a variety of transforms such as linear, square root, cube root, log and log-sinh (Wang et al. 2012). We did not explore alternative transforms, using the log consistently throughout, but recognize that the choice of transform can affect the quality of the forecast."

P 6 line 201: If the predictand was normalized (subtracting its expected value and dividing the difference by its standard deviation) before, as stated above, the predictand has to be multiplied with the standard deviation and the mean has to be added before exponentiation. Is this correct? In this case the explanation of this procedure should be added.
The reviewer is correct, and the text has been modified to reflect this (L207, L211).

P 7 line 210: Replace MRL with MLR
We have corrected the text (L220), following the reviewer's suggestion. The reviewer is correct: MLR is applied to log-transformed streamflow, and then both predictand (flow in log space) and predictors (e.g., climate indices) are normalized. The general procedure used in this paper for statistical approaches is clarified, following this and a previous comment from this reviewer (L174-179): "In the statistical approaches, seasonal flows are log-transformed, and predictor and predictand data are normalized before training statistical method parameters or weights (i.e., z-scores are computed using z = (x - µ)/s, where x represents the original variable, and µ and s represent the mean and standard deviation of x, respectively). The statistical models were applied in log-standard-normal space for forecast generation, then predictands are transformed from z-scores to log space (i.e., apply x = zs + µ, with x = log(Q)), and finally transformed back to streamflow space".

P 8 line 246: Re-transformation of the predicted streamflow should be added as an additional step.
The step suggested by the reviewer is actually performed, and it has therefore been added to the method description (L257-258): "Ensemble forecasts are transformed from z-scores to log space, and finally exponentiated for conversion to flow space".
P 9 line 284: Please explain shortly how the weighted resampling using the weights 1/RMSE works.
RWE performs a weighted resampling from the two forecast sources (i.e., the best climate-only and best watershed-only forecasts) according to their skill during the training period. That is, two weights of 1/RMSE are obtained, where RMSE is the root mean squared error of the ensemble median. These weights are normalized to sum to 1, and the result gives the fraction of the new 500-member ensemble coming from each forecast source. For example, if the resulting normalized weights are 0.4 and 0.6 for the best climate-only and best watershed-only forecasts, respectively, the RWE ensemble will contain 200 and 300 members from those predictions, respectively. This explanation has been added to the text in the revised manuscript (L296-L301).
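For concreteness, the merging step described above can be sketched as follows; this is a minimal Python illustration in which the function name and ensemble inputs are ours, not from the study's codebase:

```python
import numpy as np

rng = np.random.default_rng(42)

def rwe_merge(fcst_climate, fcst_watershed, rmse_climate, rmse_watershed,
              n_members=500):
    """Merge two ensemble forecasts by inverse-RMSE weighted resampling.

    Weights 1/RMSE (RMSE of the ensemble median over the training period)
    are normalized to sum to 1; each weight gives the fraction of the
    merged ensemble drawn from the corresponding source.
    """
    w = np.array([1.0 / rmse_climate, 1.0 / rmse_watershed])
    w /= w.sum()  # normalize weights to sum to 1
    n_climate = int(round(w[0] * n_members))
    return np.concatenate([
        rng.choice(fcst_climate, size=n_climate, replace=True),
        rng.choice(fcst_watershed, size=n_members - n_climate, replace=True),
    ])

# Worked example from the text: RMSEs of 1.5 and 1.0 give normalized
# weights 0.4 and 0.6, i.e., 200 and 300 of the 500 members.
ens = rwe_merge(np.zeros(100), np.ones(100),
                rmse_climate=1.5, rmse_watershed=1.0)
```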

Introduction
The operational hydrology community has long grappled with the challenge of producing skillful seasonal streamflow forecasts to support water supply operations and planning. Proactive water management has become critical for many regions of the world that are susceptible to water stress associated with the intensification of the water cycle. Paradoxically, despite important technological advances, including increased computing power, the broader availability of climate reanalyses, forecasts and reforecasts, and more complex process-based hydrologic models (Pagano et al., 2016), the skill of operational seasonal runoff predictions in the US, termed water supply forecasts (WSFs), has shown little or no improvement over time (e.g., Pagano et al., 2004; Harrison and Bales, 2016). Hence, there is both a scientific and practical need to understand the potential of new datasets, modeling resources and methods to accelerate progress towards more skillful and reliable operational seasonal streamflow forecasts.
There is general consensus in the research community on the main opportunities to improve seasonal streamflow prediction skill (e.g., Maurer et al., 2004; Wood and Lettenmaier, 2008; Yossef et al., 2013). These include improving knowledge of: (i) the amount of water stored in the catchment, hereinafter referred to as initial hydrologic conditions (IHCs), and (ii) weather and climate outcomes during the forecast period. Our ability to leverage the first predictability source (i.e., hydrologic predictability) depends on the accuracy of watershed observations and models, including model input forcings (e.g., precipitation and temperature), process representations, and the effectiveness of hydrologic data assimilation (DA) methods. Our ability to leverage the second source (climate predictability) depends both on how well we can characterize and predict the state of the climate and on how effectively we can incorporate this information into streamflow forecasting methods. This idea has been explored in different frameworks using standard indices, e.g., Niño3.4 and the Pacific Decadal Oscillation (PDO), and/or custom (i.e., watershed-specific) climate indices derived from climate reanalyses (e.g., Grantz et al., 2005; Bradley et al., 2015), or using seasonal climate forecasts to run hydrologic model simulations (e.g., Wood et al., 2005; Yuan et al., 2013).
Despite generally promising findings from this body of work and from a number of agency development efforts (Weber et al., 2012; Demargne et al., 2014), the use of large-scale climate information for real-time seasonal streamflow forecasting in the US remains rare. In the western United States, where snowmelt commonly dominates the annual cycle of runoff, official WSFs are produced via two main approaches: (i) statistical models leveraging in situ watershed moisture measurements such as snow water equivalent (SWE), accumulated precipitation and streamflow (Garen, 1992; Pagano et al., 2004); and (ii) outputs from the National Weather Service (NWS) Ensemble Streamflow Prediction method (ESP; Day, 1985), which is based on watershed modeling. For the overwhelming majority of forecast locations, these approaches rely solely on the predictability from IHCs (measured or modelled). A small number of locations can be found, however, where climate indices also serve as predictors in the statistical framework, and the NWS has recently implemented techniques through which climate model forecasts may eventually be applied to ESP (Demargne et al., 2014).
This paper presents an assessment of several seasonal streamflow prediction approaches in harnessing both watershed and climate related predictability.The methods are applied to seasonal WSFs and span a range of complexity, from purely statistical to purely dynamical and hybrid statistical/dynamical approaches.In this paper, 'increased complexity' indicates a gradient from purely data-driven techniques (e.g., linear regression) to the use of dynamical watershed models (Plummer et al., 2009), the outputs of which may be further processed using additional statistical approaches.Although most of the techniques evaluated here are not new, the intercomparison offers new insights for researchers and developers in the operational community because: (1) the experiment is broader than prior efforts and benchmarks alternative methods against current operational ones; and (2) the methods are chosen to be operationally feasible, avoiding the use of data that cannot be obtained in real-time.In addition, the work uses a hindcast/verification framework and follows more rigorous standards for cross-validation than were used in some of the prior studies.
The remainder of this paper is organized as follows.Section 2 describes prior methodological work and context for statistical, dynamical and hybrid approaches to seasonal streamflow forecasting.The study domain is described in Section 3. Datasets, experimental design, individual methods, and forecast verification measures are detailed in Section 4. Results and discussion are presented in Section 5, followed by the main conclusions of this study (Section 6).

Background
Seasonal streamflow forecasting methods can be categorized as dynamical, statistical, or hybrid, and span different degrees of complexity and information requirements. Dynamical methods use time-stepping simulation models to represent hydrologic processes. They describe future climate using either historical meteorology or inputs derived from seasonal climate forecasts (e.g., Beckers et al., 2016). On the other hand, statistical or purely data-driven methods rely on empirical relationships between seasonal streamflow volumes and large-scale climate variables and/or in situ watershed observations. Several statistical approaches can be found in the literature, encompassing different degrees of complexity (e.g., Garen, 1992; Piechota et al., 1998; Grantz et al., 2005; Tootle et al., 2007; Pagano et al., 2009; Wang et al., 2009; Moradkhani and Meier, 2010). Other studies have tested multimodel combination techniques for purely statistical seasonal forecasts, using objective performance criteria (e.g., Regonda et al., 2006), both performance and predictor state information (Devineni et al., 2008), and Bayesian model averaging (e.g., Mendoza et al., 2014), among others.
Hybrid methods strive to combine the strengths of both dynamical and statistical techniques. For instance, uncertainties in dynamical predictions indicate that dynamical forecasts can benefit from statistical post-processing (e.g., Wood and Schaake, 2008). One line of research has examined the potential benefits of using simulated watershed state variables, either from hydrologic or land surface models, as predictors for statistical models (e.g., Rosenberg et al., 2011; Robertson et al., 2013). Another popular technique consists of incorporating climate information within ESP frameworks, either by deriving input sequences of mean areal precipitation and temperature from current climate or climate forecast considerations (e.g., Werner et al., 2004; Wood and Lettenmaier, 2006; Luo and Wood, 2008; Gobena and Gan, 2010; Yuan et al., 2013), referred to as pre-ESP, or by ESP weighting based on climate information, also referred to as post-ESP (e.g., Smith et al., 1992; Werner et al., 2004; Najafi et al., 2012; Bradley et al., 2015). Werner et al. (2004) found that the post-ESP method (termed 'trace weighting') was more effective than pre-ESP at improving forecast skill.
The combination of outputs from different models has also been shown to benefit seasonal hydroclimatic forecasting (e.g., Hagedorn et al., 2005), and several studies have demonstrated that statistical multimodel techniques applied to dynamical models tend to outperform the 'best' single model (e.g., Georgakakos et al., 2004; Duan et al., 2007).

Study Domain
Our test domain is the U.S. Pacific Northwest (PNW) region (Figure 1), which relies heavily on winter snow accumulation and spring snowmelt to meet water needs during spring and summer (e.g., Mote, 2003; Maurer et al., 2004; Wood et al., 2005). We select catchments contributing to five reservoirs: Dworshak (DWRI1), Howard Hanson (HHDW1), Hungry Horse (HHWM8), Libby (LYDM8) and Prineville (PRVO). Two of them, Hungry Horse and Prineville reservoirs, are owned and operated by the U.S. Bureau of Reclamation (USBR), while the rest are operated by the U.S. Army Corps of Engineers (USACE).
The main physical and hydroclimatic characteristics of the case study basins are summarized in Table 1.These basins cover a wide range of runoff ratios (from 0.13 at Prineville to 0.

Experimental Design
We use several decades of seasonal streamflow hindcasts to assess a suite of methods (Figure 3), focusing on April-July streamflow (runoff) volume, the most common western US water supply forecast predictand.
Probabilistic (ensemble) WSFs for this period are generated on the first day of each month from October to April, in every year of the hindcast period 1981-2015. For the methods involving statistical prediction, we use a leave-three-out cross-validation at all stages of the forecast process. This procedure is repeated for consecutive 3-year periods (e.g., 1981-1983, 1984-1986, 1987-1989, 1990-1992, etc.), except for the last time window (2014-2015).
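As a minimal sketch (function name ours, not from the study), the leave-three-out windows can be enumerated as:

```python
def three_year_folds(start=1981, end=2015):
    """Consecutive 3-year leave-out windows over the hindcast period;
    the final window is shorter when the record length is not a
    multiple of three (here, 2014-2015)."""
    folds = []
    y = start
    while y <= end:
        folds.append(list(range(y, min(y + 3, end + 1))))
        y += 3
    return folds

folds = three_year_folds()  # [[1981, 1982, 1983], ..., [2014, 2015]]
```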
The techniques assessed here are categorized as follows. In operational practice, ESP produces an ensemble of streamflow estimates whereas statistical water supply forecasting yields a statistical distribution. In this study, we generate ensembles of the final predictand for all methods. An ensemble size of 500 is used, wherein the members are generated through a resampling (in some cases weighted) of the predictive distributions, except for the ESP and bias-corrected ESP methods, for which 32 members are generated (i.e., 35 total historical years less the three out-of-sample test years). In the statistical approaches, seasonal flows are log-transformed, and predictor and predictand data are normalized before training statistical method parameters or weights (i.e., z-scores are computed using z = (x - µ)/s, where x represents the original variable, and µ and s represent the mean and standard deviation of x, respectively). The statistical models are applied in log-standard-normal space for forecast generation; predictands are then transformed from z-scores to log space (i.e., apply x = zs + µ, with x = log(Q)), and finally transformed back to streamflow space. In practice, forecasters use a variety of transforms such as linear, square root, cube root, log and log-sinh (Wang et al., 2012). We did not explore alternative transforms, using the log consistently throughout, but recognize that the choice of transform can affect the quality of the forecast.
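The transform pipeline above can be sketched as follows; this is a minimal illustration whose function names are ours, not from the study's codebase:

```python
import numpy as np

def fit_transform(q):
    """Log-transform seasonal flow volumes, then normalize to z-scores
    (z = (x - mu)/s, with x = log(Q))."""
    x = np.log(q)
    mu, s = x.mean(), x.std()
    return (x - mu) / s, mu, s

def inverse_transform(z, mu, s):
    """Back-transform: z-scores -> log space (x = z*s + mu) -> flow space."""
    return np.exp(z * s + mu)

# Round trip recovers the original flows
q = np.array([10.0, 20.0, 40.0])
z, mu, s = fit_transform(q)
q_back = inverse_transform(z, mu, s)
```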

Ensemble Streamflow Prediction (ESP)
The traditional ESP method (Day, 1985) relies on deterministic hydrologic model simulations forced with observed meteorological inputs up to the initialization time of the forecast. The approach assumes that the meteorological data and model are perfect (i.e., there are no errors in IHCs), and that historical meteorological conditions during the simulation period can be used to represent climate forecast conditions. For hindcast verification purposes, the meteorological input traces associated with the forecast years must be excluded. In this paper, results are reported in non-metric units because of their greater familiarity to readers from the US water management community. With the exception of Prineville, where neither meteorology nor flow are well measured, all basins show values of NSE and r higher than 0.76 and 0.87, respectively. Further, the climatological seasonality of streamflow is reproduced well in all basins.

Statistical forecasting using initial hydrologic conditions (Stat-IHC)
This method mimics the approach of the U.S. Natural Resources Conservation Service (NRCS), but differs in

The predictions are then transformed from z-scores to log space, and finally exponentiated.

Bias Corrected Ensemble Streamflow Prediction (BC-ESP)
ESP predictions often exhibit a systematic bias due to inadequate model parameters and/or other sources of error (e.g., input forcing selection, model structure). If the ESP approach provides a consistent hindcast, as it does here, post-processing in the form of a simple bias correction (BC-ESP) can be applied. This is achieved by multiplying the raw ESP forecasts by a mean scaling factor, obtained by computing the ratio between the mean of observed seasonal runoff volumes (i.e., the predictand) and the mean of ESP forecast median volumes, for each initialization time. Each scaling factor calculation and application is cross-validated.
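A minimal sketch of this scaling (function name ours; the cross-validation loop is omitted):

```python
import numpy as np

def bc_esp(esp_ensembles, obs_volumes):
    """Bias-correct raw ESP forecasts for one initialization time by a
    mean scaling factor: the ratio of the mean observed seasonal volume
    to the mean of the ESP forecast medians."""
    medians = np.median(esp_ensembles, axis=1)  # one ensemble median per year
    factor = obs_volumes.mean() / medians.mean()
    return esp_ensembles * factor, factor

# Toy example: 2 years x 3 members; ESP medians average 3.0, observations
# average 4.5, so the scaling factor is 1.5.
esp = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
corrected, factor = bc_esp(esp, np.array([3.0, 6.0]))
```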

Multiple linear regression (MLR) using standard climate indices (Stat-Ind)
This method evaluates 12 standard climate indices as candidate predictors (Table 2). For each initialization time (e.g., November 1) and climate index (e.g., Niño3.4), the 3-month time window that maximizes the correlation coefficient between a preceding seasonal (e.g., August-October) predictor average and seasonal streamflow volume over the training period is selected. Once this procedure is repeated for all potential predictors, the best possible time series are obtained for the 12 climate indices, and ensemble forecasts are produced for a given initialization through the following steps: 1. Several combinations of predictors are selected, subject to the constraint that no pair of predictors with an inter-correlation larger than C_thresh = 0.3 is included.

2.
Stepwise MLR models are fit for all combinations of predictors identified in Step 1, and the set of predictors that minimizes the Bayesian Information Criterion (BIC) score (Schwarz, 1978) over the training period is selected.
3. An ensemble forecast is generated (as for Stat-IHC) with the MLR model from Step 2.
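Steps 1-2 can be sketched as follows; this is a minimal illustration under our own naming, using exhaustive enumeration with the inter-correlation constraint and a BIC criterion (the study's exact stepwise implementation may differ):

```python
import itertools
import numpy as np

def select_predictors(X, y, c_thresh=0.3, max_p=3):
    """Enumerate predictor subsets whose pairwise |correlation| stays below
    c_thresh, fit OLS to each, and keep the subset minimizing BIC."""
    n, p = X.shape
    corr = np.corrcoef(X, rowvar=False)
    best, best_bic = None, np.inf
    for k in range(1, max_p + 1):
        for combo in itertools.combinations(range(p), k):
            if any(abs(corr[i, j]) > c_thresh
                   for i, j in itertools.combinations(combo, 2)):
                continue  # violates the inter-correlation constraint
            A = np.column_stack([np.ones(n), X[:, combo]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(((y - A @ beta) ** 2).sum())
            bic = n * np.log(rss / n) + (k + 1) * np.log(n)
            if bic < best_bic:
                best, best_bic = combo, bic
    return best, best_bic

# Toy example with three near-orthogonal candidate predictors; y depends
# (almost) only on the first, so BIC should pick that single predictor.
t = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
X = np.column_stack([np.sin(t), np.cos(t), np.sin(2.0 * t)])
y = 2.0 * X[:, 0] + 0.01 * np.sin(5.0 * t)
best, bic = select_predictors(X, y)
```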
We choose MLR over more heavily parameterized regression methods (e.g., local polynomial regression) since these were found to perform poorly in cross-validation, mainly due to the limited sample sizes available in the seasonal hydrologic prediction context.

Partial Least Squares Regression using reanalysis fields (Stat-CFSR)
The teleconnections captured in off-the-shelf climate indices are not influential everywhere. Therefore, we also

The main implication of developing PLSR components and the subsequent estimation of regression coefficients in cross-validation, as conducted here, is that climate information from the target prediction period is not used at all, as is the case in real-time systems. This is a key methodological difference versus past studies that used all historically available information to define custom reanalysis predictor fields (e.g., Grantz et al., 2005; Regonda et al., 2006; Bracken et al., 2010; Mendoza et al., 2014), yielding a moderate yet erroneous boost in predictability.

Stepwise MLRs using IHCs and climate predictors
We applied two statistical methods that combine climate and dynamical watershed model predictors: Stat-Ind-IHC (which uses climate indices and IHCs), and Stat-CFSR-IHC (which uses CFSR-based PLSR components and IHCs).These approaches are implemented in identical fashion to Stat-Ind, except that IHCs are added to the potential suite of climate predictors.

Hierarchical Ensemble Streamflow Prediction (HESP)
The underlying idea of HESP is that the two main sources of predictability -watershed

Forecast evaluation
Forecast method performance was evaluated using the metrics listed in Table 3. These include standard metrics used in hydrology, such as the correlation coefficient (r), root mean squared error (RMSE), and percent bias, as well as probabilistic measures of skill and reliability. Skill is obtained using the continuous ranked probability score (CRPS; Hersbach, 2000), which measures the average error between the forecast CDF and that of the observation over time. Forecast reliability, i.e., the adequacy of the forecast ensemble spread for representing the uncertainty in observations, is evaluated using an index from the predictive quantile-quantile (QQ) plot (Renard et al., 2010). QQ plots compare the empirical CDF of forecast p-values (i.e., P_i(o_i), where P_i and o_i are the forecast CDF and the observation in year i) with that of a uniform distribution U[0,1] (Laio and Tamea, 2007).
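For reference, a sample-based CRPS for a single ensemble forecast can be computed from the identity CRPS = E|X - y| - 0.5 E|X - X'|, where X and X' are independent draws from the forecast ensemble and y is the observation (a minimal sketch; the function name is ours):

```python
import numpy as np

def crps_ensemble(members, obs):
    """Sample-based CRPS for one ensemble forecast:
    E|X - obs| - 0.5 * E|X - X'|."""
    m = np.asarray(members, dtype=float)
    term1 = np.abs(m - obs).mean()
    term2 = np.abs(m[:, None] - m[None, :]).mean()
    return term1 - 0.5 * term2

# Two-member ensemble [0, 2] against observation 1:
# E|X - y| = 1, 0.5 * E|X - X'| = 0.5, so CRPS = 0.5
crps = crps_ensemble([0.0, 2.0], 1.0)
```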
Confidence intervals for the verification statistics are created using bootstrapping with replacement. In each resampling step, N pairs of ensemble forecasts and observations are resampled from the original joint distribution (N is the total number of events for which probabilistic forecasts are available). This process is repeated 1000 times, and all statistics are then computed for each realization and ranked in order to obtain 95% confidence limits.
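The bootstrap procedure can be sketched as follows (names ours; `stat` stands in for any verification metric computed from resampled forecast/observation pairs):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(forecasts, obs, stat, n_boot=1000, alpha=0.05):
    """Confidence limits by resampling N forecast/observation pairs with
    replacement and recomputing the statistic for n_boot realizations."""
    n = len(obs)
    vals = [stat(forecasts[idx], obs[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    lo, hi = np.quantile(vals, [alpha / 2.0, 1.0 - alpha / 2.0])
    return lo, hi

# Trivial check: a statistic with no sampling variability yields a
# zero-width interval.
f = np.arange(10.0)
lo, hi = bootstrap_ci(f, f, stat=lambda a, b: float(np.mean(a - b)))
```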

Deterministic evaluation
We first compare methods using the WSF median, a critical predictand for many water decisions (e.g., Lake Powell releases on the Colorado River in the western US). After January, the hydrologic model begins to capture a useful moisture variability signal from the watershed, and IHCs start to become a dominant source of predictability in all basins. Indeed, watershed information is particularly relevant at Libby and Prineville (Figures 5d and 5e), where correlations in the range 0.39-0.47 are achieved as early as December 1 with the three IHC-based techniques. In these basins, standard climate indices do not provide useful long-lead predictability, although CFSR-based predictors do support a consistent improvement.
For example, the correlation from Stat-Ind for Libby (Prineville) on December 1 is -0.23 (0.02), while the r value from Stat-CFSR is 0.19 (0.30). These differences between Stat-Ind and Stat-CFSR persist at these basins for subsequent monthly initializations. Notably, EWE was found to be the best method on April 1 for Hungry Horse (r = 0.88) and Prineville (r = 0.79) based on correlation. This indicates that, although simple post-processing can provide substantial forecast improvement, the small sample size available for training during the cross-validation process results in noisy parameter estimates that can undermine the potential correlation skill achievable with more complex techniques.
Root mean squared errors (RMSE) for ensemble forecast medians (Figure 6) show that, despite some discrepancies between techniques, inter-method differences are not as large as for correlation. In most basins, errors can be reduced at earlier initializations (i.e., Oct 1 and Nov 1) by introducing climate information.
Figure 7 (forecast median bias) shows that raw ESP outputs have the largest biases through most initializations at Howard Hanson, Libby and Prineville. In particular, absolute biases at Prineville - the worst-simulated basin in the group - increase to 53% on October 1 before decreasing to 20% on April 1. Further, relatively large biases (in comparison to the rest of the techniques) were obtained at late initializations at Dworshak and Hungry Horse.
Except for Prineville, inter-method differences were not substantial, and none of the methods exceeded a 16% bias at any initialization. The simple bias correction applied in this study reduced absolute biases to less than +/-3% at Prineville, and less than +/-1% at the remaining basins. Hence, from a bias reduction perspective, BC-ESP was the best technique for most basins/initializations, with the exceptions of Dworshak on Feb 1 and Prineville on Mar 1 and Apr 1, for which Stat.CFSR+IHC and TWS provided the best results.

Probabilistic verification
Figure 8 displays continuous ranked probability skill scores computed with mean climatology as a reference (CRPSS clim), broadly consistent with the correlation analysis results (Figure 5). This progression of relative predictabilities from climate and watershed moisture conditions (Figures 5 and 8) is consistent with previous findings for the region (e.g., Pagano and Garen, 2006).
The results in Figure 8 corroborate several findings alluded to in Section 5.1. Climate predictors applied to low-skilled (BC-)ESP forecasts in a TWS framework are less effective than when applied in a separate statistical method. Additionally, less complex multi-model schemes can perform better than more complex approaches (e.g., BMA), supporting previous findings by Najafi and Moradkhani (2015). Among the three hybrid regression methods (Figure 3), Stat-CFSR-IHC was in most cases the worst performer. This result may be determined by the relative strength of standard (in particular ENSO) indices for the PNW region: when these are combined with other, stronger predictors, there is less of an opportunity for custom predictor components to fill a climate influence gap, and the parameter estimation cost of the CFSR-PLSR relative to an off-the-shelf index may be more exposed (leading to greater shrinkage of skill after cross-validation). The skill results in this study - especially those making use of ESP output - are subject to large uncertainties due to limited sample size (i.e., only 35 years for forecast generation and verification).
Overall, the results presented in Figures 5 and 8 suggest a division of the study basins into two groups showing different relative predictabilities -i.e., driven by watershed conditions versus climate -from October to January.
The first group is formed by Dworshak, Howard Hanson and Hungry Horse, where the state of the climate is the dominant source of predictability from Oct 1 to Dec 1, and IHCs start providing useful information on Jan 1. The second group is formed by Libby and Prineville, where little or no skill can be found from any source until Dec 1, when some predictability can be harnessed from IHCs. This is illustrated in Figure 9, where time series of cross-validated seasonal streamflow forecasts - initialized on December 1, period 1981-2015 - are shown for two IHC-based methods (BC-ESP and Stat-IHC) and two climate-based statistical methods (i.e., Stat-Ind and Stat-CFSR). At this initialization, there is not enough information in the watershed (IHCs) to predict interannual variations in April-July streamflow at Dworshak (Figure 9a) or Howard Hanson (Figure 9b); climate predictors are more successful, a result that is also reflected in positive correlations (Figure 5) and skill scores (e.g., CRPSS clim increases from 0.23 with Stat-IHC to 0.39 with Stat-Ind at Howard Hanson). For the particular case of Hungry Horse (Figure 9c), some predictability is provided by watershed information alone (i.e., BC-ESP), although with smaller correlation and skill than Stat-Ind or Stat-CFSR. Finally, the ensemble forecast time series displayed for Libby (Figure 9d) and Prineville (Figure 9e) portray the relative predictive power of IHCs in these basins compared to climate predictors alone. Indeed, at the December 1 initialization in these basins, watershed information alone supports r values of 0.43 (Libby) and 0.39 (Prineville) from BC-ESP, and r values of 0.47 from Stat-IHC.
Forecast reliability can be critical for supporting risk-based decision making, in which actions may be tied to the forecast distribution rather than the median. The reliability index α (Figure 10), which measures the closeness between the empirical CDF of forecast p-values and the theoretical CDF of U[0,1] (Table 3), shows that although (BC-)ESP forecast correlation (Figure 5) and skill (Figure 8) generally increase during the year, forecast reliability from the ESP methods degrades (i.e., toward lower α) as the initializations approach Apr 1. For such lead times, the uncertainty in ESP streamflow forecasts is underestimated due to reliance on a single modelled IHC that does not account for modelling errors (Wood and Schaake, 2008), such that forecast spread derives only from the uncertainty represented by future forcings (i.e., resampled historical meteorology). Because TWS is constrained by ESP spread, it cannot provide substantial enhancements to the poor late-season reliability indices obtained with (BC-)ESP.
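For reference, one common formulation of the α index from the predictive QQ plot is sketched below; the plotting position i/(N+1) is an assumption here and may differ from the convention used in the paper:

```python
import numpy as np

def alpha_reliability(p_values):
    """Reliability index from forecast p-values P_i(o_i), following the
    predictive QQ-plot idea of Renard et al. (2010): alpha = 1 when the
    sorted p-values lie on the U[0,1] 1:1 line, lower values indicate
    disagreement. Plotting position i/(N+1) is an assumption of this sketch."""
    p = np.sort(np.asarray(p_values, dtype=float))
    n = len(p)
    theo = np.arange(1, n + 1) / (n + 1)          # theoretical U[0,1] quantiles
    return 1.0 - 2.0 * np.mean(np.abs(p - theo))  # area-based deviation from 1:1

def forecast_pvalue(ens, obs):
    """P_i(o_i): fraction of ensemble members not exceeding the observation."""
    return np.mean(np.asarray(ens) <= obs)
```

An overconfident forecast system, whose observations keep falling in the tails of the ensemble, produces p-values piled near 0 and 1 and hence a low α.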
In general, forecasts involving statistical calibration (which helps to improve spread and bias) are most reliable.
Indeed, regression-based forecasting methods (e.g., Stat-IHC, Stat-Ind, Stat.Ind+IHC) stand out in all basins, suggesting that the ensemble generation approach used in this paper (based on the standard error of the cross-validated hindcasts) is capable of providing statistically consistent ensembles. Multi-model techniques appear to inherit this characteristic, with only small discrepancies apparent between them (green lines in Figure 10). Similar inter-method differences across multiple initializations were found when examining the ε reliability index (not shown) defined by Renard et al. (2010).
Although HESP was only found to be the 'most reliable' method in a limited number of cases (e.g., α = 0.95 at Dworshak on Oct 1; α = 0.96 at Libby on Apr 1), relatively high α values were consistently attained in all basins and forecast lead times. This suggests - in conjunction with the results shown in Figures 5-8 - that HESP has strong potential for operational streamflow forecasting at all initialization dates, since it is capable of flexibly harnessing seasonally varying sources of predictability. CRPSS esp values in some basins reflect the increasing difficulty of outperforming ESP as IHCs provide more forecast signal. Overall, HESP provides positive skill with respect to mean climatology in all cases, relatively high r values, and statistically consistent forecast ensembles.

Wet/dry year forecasts
Summary statistics provide an overview of forecast performance, but additional insights can be gained from exploring extreme years in the record - in which forecasts can have disproportionate value in helping water managers negotiate atypical challenges - and from visualizing the behavior of the forecasting methods as individual seasons progress. We therefore performed a retrospective comparison of all techniques for two regionally wet (1997 and 2011) and dry (1987 and 2001) water years at Hungry Horse (Figure 12), one of the most teleconnected basins in our study domain. Figure 12 illustrates how SWE and SM, the primary sources of predictability for IHC-based methods, progressively gain influence on ensemble forecasts (e.g., HESP and TWS outputs) as the beginning of the snowmelt season approaches (i.e., April 1). These single-year forecast evolution plots highlight the late-season (i.e., Feb 1 onwards) contrast between overconfident predictions exhibiting poor reliability (e.g., ESP, BC-ESP, TWS) and underconfident forecasts (e.g., EWE and RWE).
Figures 12a and 12b show that climate information is required to reduce forecast errors in wet years at very long lead times (i.e., Oct 1 and Nov 1), either alone or combined with watershed information through hybrid approaches. For example, the technique that provided the smallest forecast median error on October 1 of WY 1997 was TWS. For shorter lead times (i.e., forecasts initialized on March 1 or April 1) in WY 1997, the incorporation of IHCs provides a better match with observations compared to methods that use climate information only. Interestingly, reanalysis fields at Hungry Horse provide considerable predictive power for WY 2011 (Figure 12b) at short lead times (e.g., Stat-CFSR yields a forecast median error of 2.7% on March 1).
In the dry years, Figure 12c illustrates that climate predictors alone had considerable predictive power at long lead times (i.e., Oct 1 and Nov 1) in WY 1987. However, this was not the case for WY 2001 (Figure 12d), when the methods providing the smallest forecast median volume errors at all initialization times (i.e., either BC-ESP or TWS) always required knowledge of watershed moisture conditions. This was also the case for the other pilot study basins (not shown).
The above results suggest that, despite the value of large-scale climate information for this study domain, enhanced hydrologic predictability is critical for accurate streamflow volumes in snowmelt-dominated regions under extreme climatic conditions, especially during dry years. Past and ongoing efforts aimed at improving basin-scale meteorological forcing datasets, pursuing realistic process representations in hydrologic models, advancing parameter calibration, and improving DA techniques for better IHC estimates have built a robust platform to accelerate progress in this area. However, a long-term retrospective implementation (consistent with the real-time deployment) of these various modeling decisions and sources of information is critical for understanding their performance and benchmarking methodological choices.

Conclusions
Generating accurate water supply forecasts is an ongoing challenge for improving water resources operations and planning. Despite substantial work on seasonal streamflow forecasting methods applied worldwide, the marginal value of increased complexity, and of combining different sources of information via different strategies, has not been systematically assessed. In this paper, we compare a range of techniques that leverage predictability from watershed hydrologic conditions and/or large-scale climate information. The forecast intercomparison showed that hybrid techniques that leverage hindcasts to combine both sources of predictability could improve skill compared to current operational approaches. Additional key findings that may be relevant beyond the study domain - due to the inclusion of both teleconnected and non-teleconnected basins - are as follows:
• In basins showing strong teleconnections between large-scale climate and local meteorology, the use of large-scale climate information can be an effective strategy for improving seasonal streamflow predictability, potentially providing skillful forecasts at times when watershed predictability is limited.
• Standard climate indices provide useful information, and custom climate predictors from reanalyses were also an effective complementary strategy for extracting the signal from climate fields (e.g., SST and geopotential height).
• The relative importance of watershed IHC versus climate information to predict streamflow was found to vary even within a small region, depending on sub-domain catchment hydroclimatological characteristics.
• The ESP trace weighting method only provided promising results at forecast lead times where ESP raw forecasts contained moderate skill, indicating that climate information cannot adequately shift the prior ESP forecast if it lacks forecast resolution or contains significant bias.
• Increasing methodological complexity does not necessarily translate into better ensemble forecast quality (e.g., Stat-IHC versus BC-ESP; EWE versus BMA), in part because the small sample sizes associated with seasonal hindcasts preclude reliable parameter estimation for more elaborate methods.There can be a tradeoff between improving one forecast characteristic (e.g., bias) and degrading another (e.g., correlation skill).
• Cross-validation is an essential part of seasonal forecast development and implementation, particularly where multiple predictions may be combined based on their purported relative strengths and predictive uncertainty must be accurately estimated.In the small-sample context of seasonal streamflow prediction, cross-validation reveals significant limitations in the supportable complexity of statistical forecasting elements.
The often equivocal comparison of methods through multiple verification metrics (e.g., correlation, reliability), for individual wet and dry years, and for different basins, starkly illustrated the challenge of selecting a single method that will provide optimal results for all forecast initialization dates. There is a significant tension between optimizing forecast qualities through a mixture of methods and data sources that vary seasonally and across basins, and an oft-stated preference from forecasters and users for a consistent forecasting methodology. With this in mind, we developed HESP as a flexible data-driven framework to harness skill across varying predictability regimes, although it admittedly departs from the constraint of predictor uniformity.

List of Figures
CRPSS: Probabilistic metric that varies over (-∞, 1], with a perfect score of 1. It measures the skill of CRPS relative to a reference forecast (Hersbach, 2000). CRPS quantifies the difference between the cumulative distribution function (CDF) of a forecast (F) and the corresponding CDF of the observations (F_o).
q_m,i: Forecast ensemble median for year i.
q̄_m: Temporal average over forecast ensemble medians.
o_i: Observation for year i.

78 at Howard Hanson) and dryness indices (from 0.63 at Howard Hanson to 3.83 at Prineville). Relatively high basin-averaged elevations condition a pronounced seasonal temperature pattern, with minimum values below the freezing point between December and February, and maximum temperatures during June-September (not shown). These topographic and hydroclimatic features favor snowpack development during October-April, stressing the seasonal behavior of other water storages and fluxes. This is illustrated in Figure 2, which includes model precipitation (i.e., observed precipitation with a snow correction factor, SCF) and monthly averages of hydrologic variables simulated with the Sacramento Soil Moisture Accounting (SAC-SMA; Burnash et al., 1973) and SNOW-17 (Anderson, 1973) watershed models (see Section 4). Although seasonal precipitation patterns may differ, water starts accumulating in October as snow water equivalent (SWE) and/or soil moisture (SM) in all basins. Increases in SM and runoff in most basins are driven by snowmelt at the beginning of spring, with the exception of Howard Hanson, where the bulk of annual streamflow occurs in November-May. Among these basins, Dworshak, Hungry Horse and Libby share similar SWE, soil moisture and runoff cycles, although precipitation at Libby is relatively uniform throughout the year.
The hydrology models used in this study were the NWS SNOW-17, SAC-SMA and a unit-hydrograph routing model, all implemented in lumped fashion with 2-3 snow elevation zones per watershed. The models were calibrated via automated multi-objective parameter estimation to reproduce observed daily streamflow. Hydrologic model forcings were drawn from a 1/16th-degree real-time implementation of the ensemble forcing generation method described in Newman et al. (2015). Naturalized flow data were obtained from a combination of sources, including the Bonneville Power Administration (BPA, 2011), the USBR Hydromet historical data access system, and the USACE Data Query System.

Figure 4
Figure 4 shows simulated and observed monthly time series of streamflow for the period Oct/1990 -Sep/2000.
using model-simulated basin-averaged SWE and SM as surrogates for the ground-based observations of SWE, precipitation and streamflow used operationally by the NWS and NRCS (as demonstrated in Rosenberg et al., 2011). A linear regression equation is developed between normalized log-transformed seasonal runoff and IHCs, represented by the sum of simulated basin-averaged SWE and SM. The training-period equations are used to issue a deterministic runoff volume prediction for each left-out year, and ensembles are generated by adding 500 Gaussian random numbers with zero mean and a standard deviation equal to the standard error of the individual prediction.
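A minimal sketch of this regression-plus-Gaussian-ensemble idea follows (illustrative names; the normalization step is omitted and the regression standard error is used as a stand-in for the per-prediction standard error described above):

```python
import numpy as np

def stat_ihc_forecast(ihc_train, logq_train, ihc_target, n_ens=500, seed=0):
    """Sketch of the Stat-IHC idea: regress log-transformed seasonal runoff on
    an IHC predictor (e.g., simulated SWE + SM), then build an ensemble by
    adding Gaussian noise scaled by the regression standard error.
    Assumes an overdetermined fit (more training years than parameters)."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones_like(ihc_train), ihc_train])
    beta, res, *_ = np.linalg.lstsq(X, logq_train, rcond=None)
    dof = len(logq_train) - 2
    sigma = np.sqrt(res[0] / dof)           # standard error of the regression
    mean = beta[0] + beta[1] * ihc_target   # deterministic prediction (log space)
    ens_log = mean + rng.normal(0.0, sigma, size=n_ens)
    return np.exp(ens_log)                  # back-transform to flow space
```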
assess the potential of custom climate predictor indices derived from reanalysis data. Following Tootle et al. (2007), we use Partial Least Squares Regression (PLSR; Wold, 1966) to extract information from climate fields. PLSR decomposes a set of independent variables X and dependent variables Y into a small number of components that explain as much covariance as possible between the two variable sets (Abdi, 2010). PLSR components are formed from CFSR 700 mb geopotential height (Z700) and sea surface temperatures (SSTs) over the domain 20°S-80°N; 130°E-10°W. For dates beyond 2010, we merged the 1979-2010 CFSR data with monthly analysis fields from the Climate Forecast System version 2 (CFSv2; Saha et al., 2014), aggregating the latter product to 2.0° × 2.0° horizontal resolution. Similar to the Stat-Ind method, we use 3-month averages of these variables. The seasonal forecasts are generated for each initialization by following these steps:
1. Compute PLSR components from the combined SST and Z700 gridded values for each training sample and the left-out prediction years.
2. Fit a regression model to the resulting PLSR components (predictors), accepting each additional component only when its mean partial correlation with runoff volume is above a threshold. We used a threshold of 0.30 throughout the study after finding that nearby values (e.g., 0.25, 0.35) did not substantially change the results. The small sample size and low predictability supported at most two components.
3. Compute a mean runoff volume forecast using the regression model obtained in Step 2, and generate an ensemble by adding 500 Gaussian random numbers with zero mean and a standard deviation equal to the root mean squared error of prediction (RMSEP) obtained from leave-three-out cross-validation within the training period.
4. Transform ensemble forecasts from z-scores to log space, and finally exponentiate for conversion to flow space.
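Step 1 can be illustrated with a minimal NIPALS-style PLS routine for a single response (a sketch, not the authors' code; in practice a library implementation such as scikit-learn's PLSRegression would typically be used, and the component-acceptance test of Step 2 is not implemented here):

```python
import numpy as np

def plsr_scores(X_train, y_train, X_all, n_comp=2):
    """Extract PLS component scores: each component is a linear combination of
    grid cells (columns of X) chosen to maximize covariance with the
    training-period runoff y. Left-out years (rows of X_all not in X_train)
    are only projected onto the fitted directions, so no information from the
    target prediction period leaks into the fit."""
    Xc = X_train - X_train.mean(axis=0)   # center using training years only
    yc = y_train - y_train.mean()
    Xa = X_all - X_train.mean(axis=0)
    scores = np.zeros((len(X_all), n_comp))
    for k in range(n_comp):
        w = Xc.T @ yc
        w = w / np.linalg.norm(w)         # weight vector (covariance direction)
        t = Xc @ w                        # training scores
        p = Xc.T @ t / (t @ t)            # loadings
        t_all = Xa @ w                    # project ALL years onto this direction
        scores[:, k] = t_all
        Xc = Xc - np.outer(t, p)          # deflate training matrix
        Xa = Xa - np.outer(t_all, p)      # deflate projected matrix consistently
        yc = yc - t * (yc @ t) / (t @ t)  # deflate response
    return scores
```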
Figure 5 displays correlation coefficients (r) between forecast medians and observed April-July runoff volumes for the five case study basins. As expected, near-zero or negative r values were obtained for October 1 and November 1 WSFs with the IHC-based methods. Negative correlation scores arise in very low-skill situations as an artifact of cross-validation (e.g., leaving a high predictand out of a training sample biases the resulting prediction in the opposite direction). The seasonality of SM and SWE in the basins of interest (Figure 2) does not yield watershed moisture accumulations with predictive power until December or January. In contrast, r values as high as 0.48 for Dworshak and 0.49 for Hungry Horse could be attained on October 1 using only information from climate indices (Stat-Ind). Generally, but not everywhere, methods harnessing predictability from the climate (with the exception of TWS) enhance skill in comparison to IHC-based methods at initializations early in the water year. TWS is unable to shift the parent ESP distribution sufficiently to impart much climate skill at this time of year.
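The cross-validation artifact behind those negative correlations can be reproduced with a toy experiment: leave-one-out prediction with a mean-only ("climatology") model on a skill-free record yields r = -1 exactly, because the left-out mean is a decreasing affine function of the held-out value:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=35)                 # a skill-free 35-year "runoff" record

# Leave-one-out "climatology" forecast: the mean of the remaining 34 years
loo = (x.sum() - x) / (len(x) - 1)

# loo decreases linearly as x increases, so the correlation is exactly -1
r = np.corrcoef(loo, x)[0, 1]
```

Real cross-validated regressions are not this extreme, but the same mechanism drags correlations below zero whenever the predictors carry little signal.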

Figure 5
Figure 5 reveals several notable outcomes that are evident in many of the results plots. First, a linear regression against IHCs can provide r values similar to those of the more computationally expensive ESP method, especially at late initializations (i.e., March 1 or April 1). Likewise, straightforward ensemble combination techniques (e.g., EWE or RWE) may outperform more complex methods such as BMA (e.g., February 1 - April 1) at all basins. From a correlation skill perspective, on the other hand, ESP generally outperforms the rest of the methods in late winter and spring. For example, ESP provides the highest r values for Dworshak (0.82) and Howard Hanson (0.67) on April 1.
Figure 11 illustrates this idea through time series of cross-validated ensemble forecasts obtained with HESP for three initialization times (Oct 1, Jan 1 and Apr 1). Forecasts issued on Oct 1 provide positive skill with respect to climatology at Dworshak, Howard Hanson and Hungry Horse, and although CRPSS relative to ESP does not necessarily improve, the associated correlation coefficients (0.42, 0.37 and 0.47, respectively) are a clear enhancement over the negative r values obtained from IHC-based methods. The lower probabilistic skill and near-zero correlation at Libby and Prineville reflect the lack of predictability from either watershed or climate conditions at such a long lead time. Higher values of CRPSS clim for ensemble forecasts initialized on Jan 1 and Apr 1 reflect the increasing power of IHCs, while smaller (and sometimes negative)

P_i(o_i): Non-exceedance probability of o_i using ensemble forecasts at year i.
U(o_i): Non-exceedance probability of o_i under the uniform distribution U[0,1].

Figure 1 :
Figure 1: Location map with the pilot basins included in this study.

Figure 2 :
Figure 2: Corrected precipitation P (i.e., observed precipitation multiplied by a snow correction factor, SCF) and simulated water balance variables - active SM, SWE, and runoff (RO) - for the five study basins: (a) Dworshak reservoir inflow (DWRI1), (b) Howard Hanson reservoir inflow (HHDW1), (c) Hungry Horse reservoir inflow (HHWM8), (d) Libby dam inflow (LYDM8), and (e) Prineville reservoir inflow (PRVO). For model SM, we subtract the lowest mean monthly value of the year so that the plotted values show only the active range of variation.

Figure 3 :
Figure 3: Schematic of all seasonal streamflow forecasting methods included in the intercomparison framework. The benchmark methods are operationally implemented in the western United States and are based solely on hydrologic predictability.

Figure 6 :
Figure 6: Same as in Figure 5, but for root mean squared error (RMSE) - in million acre feet (MAF) - of ensemble forecast medians versus observations. See text for further details.

Figure 7 :
Figure 7: Same as in Figure 5, but for percent bias (% bias) in forecast ensemble medians versus observations. See text for further details.

Figure 8 :
Figure 8: Continuous Ranked Probability Skill Score of the forecast ensembles with respect to mean observed climatology (CRPSS clim). See text for further details.

Figure 9 :
Figure 9: Time series of cross-validated hindcasts initialized on December 1, obtained with two watershed-based methods (BC-ESP and Stat-IHC) and two climate-based techniques (Stat-Ind and Stat-CFSR) for the five case study locations (a-e). The verification metrics CRPSS clim and CRPSS esp denote continuous ranked probability skill scores using the mean climatology and raw ESP output as references, respectively. Black dashed lines represent the 10%, 50% and 90% flows from the observed climatology, and boxplots show the 10th, 30th, 50th, 70th and 90th hindcast percentiles. The red line represents the observed flow volumes.

Figure 10 :
Figure 10: The α reliability index for the hindcast ensembles at the five case study locations. See text for further details.

Figure 12 :
Figure 12: April-July water supply forecasts obtained at the Hungry Horse reservoir (HHWM8) with different methods for two wet years - (a) 1997 and (b) 2011 - and two dry years - (c) 1987 and (d) 2001. The red dashed line represents the observed flow, black dashed lines represent the 10%, 50% and 90% flows from observed climatology, and boxplots show the 10th, 30th, 50th, 70th and 90th hindcast percentiles.

"An intercomparison of approaches for improving predictability in operational seasonal streamflow forecasting"
Pablo A. Mendoza, Andrew W. Wood, Elizabeth Clark, Eric Rothwell, Martyn P. Clark, Bart Nijssen, Levi D. Brekke, and Jeffrey R. Arnold

We thank this reviewer for his time in commenting on our paper. We provide responses to each individual point below. For clarity, comments are given in italics, and our responses are given in plain text.
Table 1: In the table and in the main section the abbreviation RE (runoff efficiency) is used, while in the table caption runoff ratio RR is used; please harmonize.

We have replaced runoff efficiency (RE) with runoff ratio (RR) in Table 1. Thanks for noting this.
fewer insights have been gained on combining statistical or dynamical models in seasonal streamflow forecasting. Recently, Najafi and Moradkhani (2015) tested multi-model combination techniques of different complexities for both statistical and dynamical forecasts, concluding that model combination generally outperforms the best individual forecast model. Many sophisticated seasonal forecasting frameworks can be found in the literature, some of which incorporate DA techniques (e.g., Dechant and Moradkhani, 2011), a topic not discussed here. For this reason, the hydrology community may benefit from a broad assessment of the marginal benefits of choices made in a range of seasonal streamflow forecasting frameworks.
The hydroclimatology of the PNW is influenced by a number of large-scale climate teleconnections. The warm (cold) phase of El Niño-Southern Oscillation (ENSO) is typically associated with above (below) average temperatures and below (above) average precipitation during winter (e.g., Redmond and Koch, 1991), and therefore decreased (increased) snowpack (Clark et al., 2001) and spring/summer runoff (e.g., Piechota et al., 1997). The Pacific Decadal Oscillation (PDO; Mantua et al., 1997) - which reflects the dominant mode of decadal variability in SSTs - has also been found to be a relevant driver of PNW hydroclimatology (e.g., McCabe and Dettinger, 2002). The joint influence of ENSO and PDO on North American climate conditions, snowpack and spring/summer runoff has also been well recognized and documented (e.g., Hamlet and Lettenmaier, 1999). As a consequence, many authors have explored the incorporation of large-scale climate information for seasonal streamflow forecasting in the PNW - using either standard indices (e.g., Hamlet and Lettenmaier, 1999; Maurer et al., 2004), custom indices from reanalysis fields (e.g., Opitz-Stapleton et al., 2007; Tootle et al., 2007), both (e.g., Moradkhani and Meier, 2010), or downscaled climate forecasts (e.g., Wood et al., 2005) - finding improved predictability for lead times longer than 2 months, particularly in years of strong anomalies in climate oscillations such as ENSO.
Here, the predictor pool used to explain ε_climate may include standard climate indices or reanalysis PLSR components, depending on the performance obtained during the training period. Absent IHC predictability, HESP is equivalent to Stat-Ind or Stat-CFSR, whereas without climate predictability it defaults to Stat-IHC; lacking both, the forecast reverts toward climatology. The residual of the IHC regression, ε_climate, is modeled as g(climate predictors) + ε_residual, such that the final forecast equation takes the form:

Q = f(IHC predictors) + g(climate predictors) + ε_residual    (1)

Equally weighted ensembles (EWE) and RMSE-weighted ensembles (RWE)
EWE combines the best-performing climate-only hindcast (i.e., Stat-Ind or Stat-CFSR, based on RMSE over the training period) with the best watershed-only hindcast (either Stat-IHC or BC-ESP), resampling ensemble members equally from each source to form a new 500-member ensemble forecast. A variation of this combination approach (RWE) instead performs a weighted resampling from the two forecast sources according to their skill during the training period: each source receives a weight equal to 1/RMSE, where RMSE is the root mean squared error of its ensemble median. These weights are normalized to sum to 1, and determine the fraction of the new 500-member ensemble drawn from each forecast source. For example, if the resulting normalized weights are 0.4 and 0.6 for the best climate-only and best watershed-only forecasts, respectively, the RWE ensemble will contain 200 and 300 members from the two predictions.

Bayesian Model Averaging (BMA) and Quantile Model Averaging (QMA)
These methods also combine the best-performing climate-only hindcast with the best-performing watershed-only hindcast. While BMA (Raftery et al., 2005) provides a weighted average of forecast probability densities, QMA (Schepen and Wang, 2015) applies a weighted average to forecast values (quantiles) for a given cumulative probability. A notable difference between the two approaches is that QMA produces smoother and consistently unimodal distributions compared to potentially bimodal BMA outputs (Schepen and Wang, 2015). More details on these techniques are provided in Section 7.2.
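The RWE resampling described above can be sketched as follows (illustrative function and variable names):

```python
import numpy as np

def rwe_combine(ens_climate, ens_watershed, rmse_climate, rmse_watershed,
                n_out=500, seed=0):
    """RMSE-weighted ensemble (RWE) merge: weights proportional to 1/RMSE
    (computed over the training period), normalized to sum to 1, set how many
    of the n_out output members are resampled from each source ensemble."""
    rng = np.random.default_rng(seed)
    w = np.array([1.0 / rmse_climate, 1.0 / rmse_watershed])
    w = w / w.sum()                          # normalized weights
    n_clim = int(round(w[0] * n_out))        # members drawn from climate forecast
    return np.concatenate([
        rng.choice(ens_climate, size=n_clim, replace=True),
        rng.choice(ens_watershed, size=n_out - n_clim, replace=True),
    ])

# With RMSEs of 1.5 (climate) and 1.0 (watershed), the normalized weights are
# 0.4 and 0.6, so 200 of the 500 members come from the climate forecast and
# 300 from the watershed one, matching the example in the text.
```

Setting both RMSEs equal recovers the EWE case (250 members from each source).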
For instance, on October 1, Stat-Ind and Stat.Ind+IHC generate respective reductions in RMSE of 10% and 13% at Dworshak, 23% and 16% at Howard Hanson, and 14% and 12% at Hungry Horse, relative to the best IHC-based method in each basin. These benefits are seen in most initializations and catchments except at Libby, where the best results were mostly achieved using ESP (Oct 1) and Stat-IHC (Dec 1, and Feb 1 - Apr 1). In agreement with Beckers et al.
(2016), this study was unable to find encouraging climate teleconnections at Libby, despite its relative proximity to Hungry Horse. Figure 6 underscores that, from a median-error perspective, intuitive ensemble combination approaches (i.e., EWE and RWE, shown in dark green) can be effective for reducing forecast errors once the watershed begins to provide useful predictability (i.e., after January 1). For instance, EWE was the best-performing method in Hungry Horse and Prineville for forecasts initialized on March 1 and April 1. Further, Figure 6 illustrates that the best (or worst) techniques in terms of RMSE vary with each basin, although it is clear that TWS and climate-only methods perform poorly at early and late initializations, respectively. The joint inspection of Figures 5 and 6 shows that inter-method agreement in correlation does not necessarily translate into similar forecast median errors. For example, while ESP and HESP provide close r values at Dworshak (0.74 and 0.73) on March 1, larger discrepancies are obtained in RMSE, with values of 0.58 million acre-feet (MAF), equivalent to 0.72 billion cubic meters (BCM), and 0.79 MAF (0.97 BCM) for ESP and HESP, respectively. Another interesting result is that no substantial reductions in RMSE were achieved at Howard Hanson between October 1 and April 1, in contrast to the gradual growth of hydrologic predictability that supports forecast skill in other basins. Indeed, the best-performing techniques for October 1 (Stat-Ind) and April 1 (BC-ESP) forecasts provide similar RMSE values (~0.064 MAF [0.079 BCM] and 0.065 MAF [0.08 BCM], respectively). This outcome can be attributed to the relatively more rainfall-dominated hydrograph of Howard Hanson in comparison to the rest of the catchments (Table 1). In terms of probabilistic skill, better skill values are obtained for long lead times (i.e., Oct 1 and Nov 1) if climate predictors are incorporated in the forecasting framework. For example, Stat-(Ind+IHC) augments skill by 56% in HHDM1 and 7% in Hungry Horse with
respect to Stat-IHC (i.e., the best benchmark in terms of CRPSS_clim) when forecasts are initialized on November 1. The skill of IHC-based methods generally increases from October 1 to April 1. Nevertheless, at late initializations it is still possible to outperform these techniques in some basins (e.g., Stat-(CFSR+IHC) and EWE in Hungry Horse provide skill increases of 7% and 5%, respectively, in April 1 forecasts over the best IHC-based technique). For late-season initializations, when IHC predictability is strong, climate-only forecasts are expected to be unsuitable and to underperform other methods.
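As a reference for how the CRPSS_clim values discussed above are typically obtained, here is a minimal sketch assuming the standard sample-based CRPS estimator for ensembles (function names are ours; the manuscript may use a different estimator or cross-validation scheme):

```python
import numpy as np

def crps_ensemble(members, obs):
    """Sample-based CRPS for one ensemble forecast and one scalar observation:
    mean |X - y| minus half the mean absolute pairwise member difference."""
    members = np.asarray(members, dtype=float)
    term1 = np.abs(members - obs).mean()
    term2 = 0.5 * np.abs(members[:, None] - members[None, :]).mean()
    return term1 - term2

def crpss_clim(fc_ensembles, climo_ensembles, observations):
    """Skill score vs. climatology: 1 - mean CRPS_forecast / mean CRPS_climatology."""
    crps_fc = np.mean([crps_ensemble(f, y) for f, y in zip(fc_ensembles, observations)])
    crps_cl = np.mean([crps_ensemble(c, y) for c, y in zip(climo_ensembles, observations)])
    return 1.0 - crps_fc / crps_cl
```

A score of 0 indicates no improvement over a climatological ensemble, and values approaching 1 indicate increasingly sharp and accurate forecasts; the percentage gains quoted above compare such scores between forecasting methods.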

Table 1: List of basin characteristics. Hydrologic variables correspond to the period October 1980 to September 2015. P, R, PE, RR, and DI denote basin-averaged mean annual values of precipitation, runoff, potential evapotranspiration*, runoff ratio, and dryness index, respectively.
*Potential evapotranspiration using the Priestley-Taylor method