For much of the last century, forecasting centers around the world have offered seasonal streamflow predictions to support water management. Recent work suggests that the two major avenues to advance seasonal predictability are improvements in the estimation of initial hydrologic conditions (IHCs) and the incorporation of climate information. This study investigates the marginal benefits of a variety of methods using IHCs and/or climate information, focusing on seasonal water supply forecasts (WSFs) in five case study watersheds located in the US Pacific Northwest region. We specify two benchmark methods that mimic standard operational approaches – statistical regression against IHCs and model-based ensemble streamflow prediction (ESP) – and then systematically intercompare WSFs across a range of lead times. Additional methods include (i) statistical techniques using climate information either from standard indices or from climate reanalysis variables and (ii) several hybrid/hierarchical approaches harnessing both land surface and climate predictability. In basins where atmospheric teleconnection signals are strong and watershed predictability is low, climate information alone provides considerable improvements. For basins showing weak teleconnections, custom predictors from reanalysis fields yielded greater forecast skill than standard climate indices. ESP predictions tended to have high correlation skill but greater bias compared to other methods, and climate predictors failed to substantially improve these deficiencies within a trace weighting framework. Lower complexity techniques were competitive with more complex methods, and the hierarchical expert regression approach introduced here (hierarchical ensemble streamflow prediction – HESP) provided a robust alternative for skillful and reliable water supply forecasts at all initialization times.
Three key findings from this effort are (1) objective approaches supporting methodologically consistent hindcasts open the door to a broad range of beneficial forecasting strategies; (2) the use of climate predictors can add to the seasonal forecast skill available from IHCs; and (3) sample size limitations must be handled rigorously to avoid over-trained forecast solutions. Overall, the results suggest that despite a rich, long heritage of operational use, there remain a number of compelling opportunities to improve the skill and value of seasonal streamflow predictions.
The operational hydrology community has long grappled with the challenge of producing skillful seasonal streamflow forecasts to support water supply operations and planning. Proactive water management has become critical for many regions in the world that are susceptible to water stress associated with the intensification of the water cycle. Paradoxically, although we have seen important technological advances, including increased computing power, the broader availability of climate reanalyses, forecasts, and reforecasts, and more complex process-based hydrologic models (Pagano et al., 2016), the skill of operational seasonal runoff predictions in the US, termed water supply forecasts (WSFs), has shown little or no improvement over time (e.g., Pagano et al., 2004; Harrison and Bales, 2016). Hence, there is both a scientific and practical need to understand the potential of new datasets, modeling resources, and methods to accelerate progress towards more skillful and reliable operational seasonal streamflow forecasts.
There is general consensus in the research community on the main opportunities to improve seasonal streamflow prediction skill (e.g., Maurer et al., 2004; Wood and Lettenmaier, 2008; Yossef et al., 2013). These include improving knowledge of (i) the amount of water stored in the catchment, hereinafter referred to as initial hydrologic conditions (IHCs), and (ii) weather and climate outcomes during the forecast period. Our ability to leverage the first predictability source (i.e., hydrologic predictability) depends on the accuracy of watershed observations and models, including model input forcings (e.g., precipitation and temperature), process representations, and the effectiveness of hydrologic data assimilation (DA) methods. Our ability to leverage the second source (climate predictability) depends both on how well we can characterize and predict the state of the climate and on how effectively we can incorporate this information into streamflow forecasting methods. This idea has been explored in different frameworks using standard indices, e.g., Niño3.4, the Pacific Decadal Oscillation (PDO), and/or custom (i.e., watershed-specific) climate indices derived from climate reanalyses (e.g., Grantz et al., 2005; Bradley et al., 2015), or using seasonal climate forecasts to run hydrologic model simulations (e.g., Wood et al., 2005; Yuan et al., 2013).
Despite generally promising findings from this body of work and from a number of agency development efforts (Weber et al., 2012; Demargne et al., 2014), the use of large-scale climate information for real-time seasonal streamflow forecasting in the US remains rare. In the western United States, where snowmelt commonly dominates the annual cycle of runoff, official WSFs are produced via two main approaches: (i) statistical models leveraging in situ watershed moisture measurements such as snow water equivalent (SWE), accumulated precipitation, and streamflow (Garen, 1992; Pagano et al., 2004); and (ii) outputs from the National Weather Service (NWS) ensemble streamflow prediction method (ESP; Day, 1985), which is based on watershed modeling. For the overwhelming majority of forecast locations, these approaches rely solely on the predictability from IHCs (measured or modeled). A small number of locations can be found, however, where climate indices also serve as predictors in the statistical framework, and the NWS has recently implemented techniques through which climate model forecasts may eventually be applied to ESP (Demargne et al., 2014).
This paper presents an assessment of several seasonal streamflow prediction approaches in harnessing both watershed and climate-related predictability. The methods are applied to seasonal WSFs and span a range of complexity, from purely statistical to purely dynamical and hybrid statistical/dynamical approaches. In this paper, “increased complexity” indicates a gradient from purely data-driven techniques (e.g., linear regression) to the use of dynamical watershed models (Plummer et al., 2009), the outputs of which may be further processed using additional statistical approaches. Although most of the techniques evaluated here are not new, the intercomparison offers new insights for researchers and developers in the operational community because (1) the experiment is broader than prior efforts and benchmarks alternative methods against current operational ones; and (2) the methods are chosen to be operationally feasible, avoiding the use of data that cannot be obtained in real time. In addition, the work uses a hindcast/verification framework and follows more rigorous standards for cross validation than were used in some of the prior studies.
The remainder of this paper is organized as follows. Section 2 describes prior methodological work and context for statistical, dynamical, and hybrid approaches to seasonal streamflow forecasting. The study domain is described in Sect. 3. Datasets, experimental design, individual methods, and forecast verification measures are detailed in Sect. 4. Results and discussion are presented in Sect. 5, followed by the main conclusions of this study (Sect. 6).
Seasonal streamflow forecasting methods can be categorized as dynamical, statistical, or hybrid, and span different degrees of complexity and information requirements. Dynamical methods use time-stepping simulation models to represent hydrologic processes. They describe future climate using either historical meteorology or inputs derived from seasonal climate forecasts (e.g., Beckers et al., 2016). On the other hand, statistical or purely data-driven methods rely on empirical relationships between seasonal streamflow volumes and large-scale climate variables and/or in situ watershed observations. Several statistical approaches can be found in the literature, encompassing different degrees of complexity (e.g., Garen, 1992; Piechota et al., 1998; Grantz et al., 2005; Tootle et al., 2007; Pagano et al., 2009; Wang et al., 2009; Moradkhani and Meier, 2010). Other studies have tested multi-model combination techniques for purely statistical seasonal forecasts, using objective performance criteria (e.g., Regonda et al., 2006), both performance and predictor state information (Devineni et al., 2008), and Bayesian model averaging (e.g., Mendoza et al., 2014), among others.
Hybrid methods strive to combine the strengths from both dynamical and statistical techniques. For instance, uncertainties in dynamical predictions indicate that dynamical forecasts can benefit from statistical post-processing (e.g., Wood and Schaake, 2008). One line of research has examined the potential benefits of using simulated watershed state variables – either from hydrologic or land surface models – as predictors for statistical models (e.g., Rosenberg et al., 2011; Robertson et al., 2013). Another popular technique consists of incorporating climate information within ESP frameworks, either deriving input sequences of mean areal precipitation and temperature from current climate or climate forecast considerations (e.g., Werner et al., 2004; Wood and Lettenmaier, 2006; Luo and Wood, 2008; Gobena and Gan, 2010; Yuan et al., 2013) – referred to as pre-ESP – or ESP weighting (also referred to as post-ESP) based on climate information (e.g., Smith et al., 1992; Werner et al., 2004; Najafi et al., 2012; Bradley et al., 2015). Werner et al. (2004) found that the post-ESP method (termed “trace weighting”) was more effective than pre-ESP for improving forecast skill.
The combination of outputs from different models has also been shown to benefit seasonal hydroclimatic forecasting (e.g., Hagedorn et al., 2005). Although several studies have demonstrated that statistical multi-model techniques applied on dynamical models tend to outperform the “best” single model (e.g., Georgakakos et al., 2004; Duan et al., 2007), fewer insights have been gained on combining statistical or dynamical models in seasonal streamflow forecasting. Recently, Najafi and Moradkhani (2015) tested multi-model combination techniques of different complexities from both statistical and dynamical forecasts, concluding that model combination generally outperforms the best individual forecast model. Many sophisticated seasonal forecasting frameworks can be found in the literature, some of which incorporate DA techniques (e.g., DeChant and Moradkhani, 2011), a topic not discussed here. For this reason, the hydrology community may benefit from a broad assessment of the marginal benefits of choices made in a range of seasonal streamflow forecasting frameworks.
Our test domain is the US Pacific Northwest (PNW) region (Fig. 1), which relies heavily on winter snow accumulation and spring snowmelt to meet water needs during spring and summer (e.g., Mote, 2003; Maurer et al., 2004; Wood et al., 2005). We select catchments contributing to five reservoirs: Dworshak (DWRI1), Howard Hanson (HHDW1), Hungry Horse (HHWM8), Libby (LYDM8), and Prineville (PRVO). Two of them – the Hungry Horse and Prineville reservoirs – are owned and operated by the US Bureau of Reclamation (USBR), while the rest are operated by the US Army Corps of Engineers (USACE).
Location map with the pilot basins included in this study.
List of basin characteristics. Hydrologic variables correspond to
the period October 1980 to September 2015.
The main physical and hydroclimatic characteristics of the case study basins are summarized in Table 1. These basins cover a wide range of runoff ratios (from 0.13 at Prineville to 0.78 at Howard Hanson) and dryness indices (from 0.63 at Howard Hanson to 3.83 at Prineville). Relatively high basin-averaged elevations drive a pronounced seasonal temperature pattern, with minimum values below the freezing point between December and February, and maximum temperatures during June–September (not shown). These topographic and hydroclimatic features favor snowpack development in the months of October–April, accentuating the seasonal behavior of other water storages and fluxes. This is illustrated in Fig. 2, including model precipitation (i.e., observed precipitation with a snow correction factor, SCF) and monthly averages of hydrologic variables simulated with the Sacramento Soil Moisture Accounting (SAC-SMA; Burnash et al., 1973) and SNOW-17 (Anderson, 1973) watershed models (see Sect. 4). Although seasonal precipitation patterns may differ, water starts accumulating in October as snow water equivalent (SWE) and/or soil moisture (SM) in all basins. Increases in SM and runoff in most basins are driven by snowmelt at the beginning of spring with the exception of Howard Hanson, where the bulk of annual streamflow occurs in November–May. Among these basins, Dworshak, Hungry Horse, and Libby share similar SWE, soil moisture, and runoff cycles, although precipitation at Libby is relatively uniform throughout the year.
The hydroclimatology of the PNW region is affected by a number of large-scale climate teleconnections. The warm (cold) phase of El Niño–Southern Oscillation (ENSO) is typically associated with above (below) average temperatures and below (above) average precipitation during winter (e.g., Redmond and Koch, 1991) and therefore decreased (increased) snowpack (Clark et al., 2001) and spring/summer runoff (e.g., Piechota et al., 1997). The Pacific Decadal Oscillation (PDO; Mantua et al., 1997) – which reflects the dominant mode in decadal variability of sea surface temperatures (SSTs) – has also been found to be a relevant driver for the hydroclimatology of the PNW (e.g., McCabe and Dettinger, 2002). The joint influence of ENSO and PDO on North American climate conditions, snowpack, and spring/summer runoff has also been well recognized and documented (e.g., Hamlet and Lettenmaier, 1999). As a consequence, many authors have explored the incorporation of large-scale climate information for seasonal streamflow forecasting in the PNW – using either standard indices (e.g., Hamlet and Lettenmaier, 1999; Maurer et al., 2004), custom indices from reanalysis fields (e.g., Opitz-Stapleton et al., 2007; Tootle et al., 2007), both (e.g., Moradkhani and Meier, 2010), or downscaled climate forecasts (e.g., Wood et al., 2005) – finding improved predictability for lead times longer than 2 months and particularly in years of strong anomalies in climate oscillations such as ENSO.
Schematic figure showing all seasonal streamflow forecasting methods included in the intercomparison framework. The benchmark methods are operationally implemented in the western United States, and they are solely based on hydrologic predictability.
We use several decades of seasonal streamflow hindcasts to assess a suite of methods (Fig. 3), focusing on April–July streamflow (runoff) volume, the most common western US water supply forecast predictand. Probabilistic (ensemble) WSFs for this period are generated on the first day of each month from October to April, in every year of the hindcast period 1981–2015. For the methods involving statistical prediction, we use a leave-three-out cross validation at all stages of the forecast process. This procedure is repeated for consecutive 3-year periods (e.g., 1981–1983, 1984–1986), except for the last time window, which covers only 2 years (2014–2015).
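The leave-three-out scheme described above can be sketched as follows. This is an illustrative helper (the function name and return structure are our own), but the window logic matches the text: consecutive 3-year test periods, with a shorter final window.

```python
def leave_three_out_splits(years):
    """Yield (train, test) year lists for consecutive 3-year test windows.

    The final window may be shorter, e.g., 2014-2015 in a 1981-2015 record.
    """
    years = list(years)
    for start in range(0, len(years), 3):
        test = years[start:start + 3]
        train = [y for y in years if y not in test]
        yield train, test

# 35-year hindcast period, 1981-2015
splits = list(leave_three_out_splits(range(1981, 2016)))
```

For the 1981–2015 record this produces twelve test windows, eleven of 3 years and a final one of 2 years, each paired with a training set containing all remaining years.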
The techniques assessed here are categorized as follows. The first group, IHC-based methods, includes two approaches (referred to as benchmark methods) – ESP and IHC-based statistical – currently used operationally in the western US (both harnessing only IHC information), and a simple ESP post-processor to reduce systematic biases. A second group, climate-only methods, includes statistical techniques harnessing climate information from two different sources – standard indices (e.g., Niño3.4, PDO, AMO) or variables extracted from the Climate Forecast System Reanalysis (CFSR; Saha et al., 2010). A third group of hybrid or hierarchical methods includes subgroups of techniques that (i) combine watershed predictors (IHCs) and climate predictors (either indices or CFSR variables) within a statistical framework, (ii) use climate information to post-process outputs from a dynamical method (i.e., ESP), or (iii) combine purely climate-based ensembles with purely watershed-based ensembles.
In operational practice, ESP produces an ensemble of streamflow estimates,
whereas statistical water supply forecasting yields a statistical
distribution. In this study, we generate ensembles of the final predictand
for all methods. An ensemble size of 500 is used – wherein the members are
generated through a resampling (in some cases weighted) of the predictive
distributions – except for the ESP and bias-corrected ESP methods, for
which 32 members are generated (i.e., 35 total historical years less the
three test years left out). In the statistical approaches, seasonal flows are log-transformed, and predictor and predictand data are normalized (i.e., rescaled to zero mean and unit variance) before statistical method parameters or weights are trained.
The traditional ensemble streamflow prediction (ESP) method (Day, 1985) relies on deterministic hydrologic model simulations forced with observed meteorological inputs up to the initialization time of the forecast. The approach assumes that meteorological data and model are perfect – i.e., there are no errors in IHCs – and that historical meteorological conditions during the simulation period can be used to represent climate forecast conditions. For hindcast verification purposes, the meteorological input traces associated with forecast years must be excluded.
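The ESP trace-generation logic can be sketched as follows. Here `model` is a hypothetical stand-in for the initialized SNOW-17/SAC-SMA simulation (a toy function in the illustration below), and the exclusion of the forecast and cross-validation years mirrors the hindcast requirement stated above.

```python
def esp_forecast(model, state_init, met_library, forecast_year, leave_out):
    """Generate ESP traces: run `model` from the initialized watershed state
    forward under each historical meteorological sequence, excluding the
    forecast year and any cross-validation years from the trace library.

    `model(state, met)` is a hypothetical simulator returning a seasonal
    runoff volume for one meteorological sequence.
    """
    traces = {}
    for year, met in met_library.items():
        if year == forecast_year or year in leave_out:
            continue  # hindcast verification: exclude test-period meteorology
        traces[year] = model(state_init, met)
    return traces

# Toy illustration: the "model" just adds half the forecast-period precipitation
toy_model = lambda state, met: state + 0.5 * sum(met)
met_lib = {1981 + i: [10.0 + i, 20.0] for i in range(35)}
ens = esp_forecast(toy_model, state_init=5.0, met_library=met_lib,
                   forecast_year=1990, leave_out={1989, 1991})
```

With 35 historical years and 3 excluded, this yields the 32-member ensembles described above.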
The hydrology models used in this study were the NWS Snow-17, SAC-SMA, and a
unit-hydrograph routing model, all implemented in lumped fashion with 2–3 snow
elevation zones per watershed. The models were calibrated via an
automated multi-objective parameter estimation to reproduce observed daily
streamflow. Hydrologic model forcings were drawn from an observation-based gridded meteorological dataset.
Figure 4 shows simulated and observed monthly time
series of streamflow for the period October 1990–September 2000. In this paper,
results are reported in non-metric units because of their greater
familiarity to readers from the US water management community. With the
exception of Prineville, where neither meteorology nor flow are well
measured, all basins show acceptable values of the Nash–Sutcliffe efficiency (NSE).
Monthly streamflow simulations (red) and observations (black) for the period October 1980–September 2000. Left panels display monthly time series, with NSE values shown for each basin.
This method mimics the approach of the US Natural Resources Conservation
Service (NRCS) but differs in using model-simulated basin-averaged SWE and
SM as surrogates for ground-based observations of SWE, precipitation, and
streamflow used operationally by the NWS and NRCS
(as demonstrated in Rosenberg et al., 2011). A
linear regression equation is developed between normalized log-transformed
seasonal runoff and IHCs represented by the sum of simulated basin-averaged
SWE and SM. The training period equations are used to issue a deterministic
runoff volume prediction for each year left out, and ensembles are generated
by adding 500 Gaussian random numbers with zero mean and a standard
deviation equal to the standard error of the individual prediction. The
predictions are then back-transformed from normalized log space into seasonal runoff volumes.
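The Stat-IHC procedure can be sketched as below. We approximate the standard error of the individual prediction with the regression's residual standard error (the full expression would include a leverage term), and omit the normalization and back-transformation steps for brevity; names are our own.

```python
import numpy as np

def stat_ihc_forecast(x_train, y_train, x_new, n_members=500, seed=0):
    """Fit y = a + b*x on (normalized, log-space) training data, then issue
    a 500-member ensemble by adding Gaussian noise scaled by the regression's
    residual standard error -- a sketch of the Stat-IHC ensemble generation."""
    x_train = np.asarray(x_train, float)
    y_train = np.asarray(y_train, float)
    b, a = np.polyfit(x_train, y_train, 1)
    resid = y_train - (a + b * x_train)
    se = np.sqrt(np.sum(resid ** 2) / (len(y_train) - 2))
    rng = np.random.default_rng(seed)
    return (a + b * x_new) + se * rng.standard_normal(n_members)
```

In practice, `x_train` would be the sum of simulated basin-averaged SWE and SM for the training years, and the returned ensemble would be back-transformed to volumes.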
ESP predictions often exhibit a systematic bias due to inadequate model parameters and/or other sources of error (e.g., input forcing selection, model structure). If the ESP approach provides a consistent hindcast, as it does here, post-processing in the form of a simple bias-corrected ensemble streamflow prediction (BC-ESP) can be applied. This is achieved by multiplying the raw ESP forecasts by a mean scaling factor that is obtained by computing the ratio between the mean of observed seasonal runoff volumes (i.e., the predictand) and the mean of ESP forecast median volumes for each initialization time. Each scaling factor calculation and application is cross validated.
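The BC-ESP scaling factor reduces to a single ratio per initialization time, computed over the training years, as in this sketch (function name and toy numbers are illustrative):

```python
import numpy as np

def bc_esp_scale(obs_volumes, esp_medians):
    """Mean scaling factor for BC-ESP: ratio of the mean observed seasonal
    volume to the mean of ESP forecast medians over the training years
    (computed per initialization time and cross validated in practice)."""
    return float(np.mean(obs_volumes) / np.mean(esp_medians))

obs = [1.2, 0.8, 1.0, 1.4]
esp_med = [1.5, 1.0, 1.25, 1.75]   # ESP runs ~25 % high in this toy example
factor = bc_esp_scale(obs, esp_med)  # multiply every raw ESP trace by this
```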
This method evaluates 12 standard climate indices as candidate predictors
(Table 2). For each initialization time (e.g.,
1 November) and climate index (e.g., Niño3.4), the 3-month time window
that maximizes the correlation coefficient between a preceding seasonal
(e.g., August–October) predictor average and seasonal streamflow volume over
the training period is selected. Once this procedure is repeated for all
potential predictors, the best possible time series are obtained for the 12 climate
indices, and ensemble forecasts are produced for a given
initialization through the following steps:
Several combinations of predictors are selected subject to the constraint that no pair of predictors exceeds a specified intercorrelation threshold.
Stepwise multiple linear regression (MLR) models are fit for all combinations of predictors identified in
step 1, and the set of predictors that minimizes the Bayesian information
criterion (BIC) score (Schwarz, 1978) over the training period
is selected. An ensemble forecast is generated (as for Stat-IHC) with the MLR model from
step 2.
We choose MLR over more parameterized regression methods (e.g., local
polynomial regression) since these were found to perform poorly in
cross validation, mainly due to the limited sample sizes available in the
seasonal hydrologic prediction context.
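The predictor screening and BIC-based MLR selection in steps 1–2 can be sketched as follows, using the least-squares form BIC = n ln(RSS/n) + k ln(n). The exhaustive subset search and the intercorrelation threshold value below are illustrative, not the paper's exact implementation.

```python
import itertools
import numpy as np

def bic(y, yhat, n_params):
    """Bayesian information criterion for a least-squares fit:
    BIC = n*ln(RSS/n) + k*ln(n)."""
    n = len(y)
    rss = float(np.sum((y - yhat) ** 2))
    return n * np.log(rss / n) + n_params * np.log(n)

def select_predictors(X, y, names, max_intercorr=0.3):
    """Search predictor subsets whose pairwise absolute correlations stay
    below `max_intercorr`; return the subset minimizing BIC over the
    training sample (threshold value is illustrative)."""
    n, p = X.shape
    best_score, best_set = np.inf, ()
    for k in range(1, p + 1):
        for combo in itertools.combinations(range(p), k):
            if k > 1:
                corr = np.corrcoef(X[:, combo], rowvar=False)
                if np.max(np.abs(corr - np.eye(k))) > max_intercorr:
                    continue  # skip strongly intercorrelated predictor pairs
            A = np.column_stack([np.ones(n), X[:, combo]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            score = bic(y, A @ coef, k + 1)
            if score < best_score:
                best_score, best_set = score, tuple(names[i] for i in combo)
    return best_set
```

The intercorrelation screen prevents near-duplicate indices from entering the same model, while the BIC penalty guards against over-fitting given the short training samples discussed above.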
List of climate indices included as potential predictors.
The teleconnections captured in off-the-shelf climate indices are not
influential everywhere. Therefore, we also assess the potential of custom
climate predictor indices derived from reanalysis data. Following
Tootle et al. (2007), we use partial least squares
regression (PLSR; Wold, 1966) to extract information from
climate fields. PLSR decomposes a set of independent variables (here, gridded SST and 700-hPa geopotential height, Z700, fields) into components that maximize covariance with the predictand. The principal components are computed from the combined SST and Z700 gridded values
for each training sample and the left-out prediction years. A regression model is fitted to the resulting PLSR components (predictors),
accepting each additional component only when its mean partial correlation
with runoff volume is above a threshold. We used a threshold of 0.30
throughout the study after finding that nearby values – e.g., 0.25, 0.35 –
did not substantially change the results. The small sample size and low
predictability supported at most two components. A mean runoff volume forecast is computed using the regression model obtained in
step 2, and an ensemble is generated by adding 500 Gaussian random numbers with
zero mean and a standard deviation equal to the root mean squared error of
prediction (RMSEP) obtained from leave-three-out cross validation within the
training period. Ensemble forecasts are transformed from
The main implication of developing PLSR components and the subsequent
estimation of regression coefficients in cross validation – as conducted
here – is that climate information from the target prediction period is not
used at all, as is the case in real-time systems. This is a key
methodological difference versus past studies that used all historical
available information to define custom reanalysis predictor fields
(e.g., Grantz et al., 2005;
Regonda et al., 2006; Bracken et al., 2010; Mendoza et al., 2014), yielding
a moderate yet spurious boost in apparent predictability.
We applied two statistical methods that combine climate and dynamical watershed model predictors: Stat-Ind-IHC (which uses climate indices and IHCs) and Stat-CFSR-IHC (which uses CFSR-based PLSR components and IHCs). These approaches are implemented in identical fashion to Stat-Ind, except that IHCs are added to the potential suite of climate predictors.
The underlying idea of hierarchical ensemble streamflow prediction (HESP) is that the two main sources of predictability
– watershed IHCs and climate – may best be addressed sequentially to
ensure that only climate uncertainty is related to climate predictors. This
may not be the case if a climate variable enters first into a regression model
that attempts to explain streamflow variance from both IHCs and climate,
possibly leading to a sub-optimal predictor selection. HESP is thus a
hierarchical regression approach in which streamflow is first related to
IHCs by fitting a first-stage regression model; climate predictors are then regressed against the residuals of that model, so that they explain only the variance left unexplained by the watershed state.
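The sequential logic can be sketched as a two-stage regression (a simplified single-predictor-per-stage illustration of the HESP idea; the paper's implementation is more elaborate):

```python
import numpy as np

def hesp_fit_predict(ihc, clim, y, ihc_new, clim_new):
    """Hierarchical sketch: stage 1 regresses runoff on the IHC predictor;
    stage 2 regresses the stage-1 residuals on the climate predictor, so the
    climate term explains only variance the watershed state cannot."""
    b1, a1 = np.polyfit(ihc, y, 1)          # stage 1: watershed signal
    resid = y - (a1 + b1 * np.asarray(ihc))
    b2, a2 = np.polyfit(clim, resid, 1)     # stage 2: climate on residuals
    return (a1 + b1 * ihc_new) + (a2 + b2 * clim_new)
```

Fitting climate predictors to residuals rather than raw flows avoids the predictor-competition problem described above, in which a climate variable entering first can crowd out the IHC signal.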
A well-known strategy for incorporating climate information into ESP forecasts is called trace weighting (Smith et al., 1992; Werner et al., 2004), where forecasted flow probabilities are corrected by weighting each ensemble member according to the similarity between a climate-related feature of the current year (e.g., PDO) and the meteorological year used to generate that member. Here, for a given basin and forecast period, either climate indices or CFSR-based components are selected based on their training period performance (i.e., RMSE) and used to weight each trace obtained from BC-ESP (see Appendix A for further details).
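One simple way to derive trace weights from climate similarity is a distance kernel over the climate feature, as in this sketch (the Gaussian kernel and its bandwidth are illustrative choices, not the weighting used in Appendix A):

```python
import numpy as np

def trace_weights(index_by_year, current_index, bandwidth=1.0):
    """Weight each ESP trace by similarity between the climate index of its
    source meteorological year and the current year's value, using a
    Gaussian kernel (an illustrative post-ESP weighting scheme)."""
    years = sorted(index_by_year)
    d = np.array([index_by_year[y] - current_index for y in years])
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    w /= w.sum()                       # weights sum to one
    return dict(zip(years, w))
```

Years whose climate state resembles the current year receive larger weights, so the weighted ensemble's distribution shifts toward analog conditions.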
An equally weighted ensemble (EWE) combines the best-performing climate-only hindcast (i.e., Stat-Ind or
Stat-CFSR, based on RMSE over the training period) with the best
watershed-only hindcast (either Stat-IHC or BC-ESP), resampling ensemble
members equally from each source to form a new 500-member ensemble forecast.
A variation of this combination approach, an RMSE-weighted ensemble (RWE), instead performs a weighted
resampling from the two forecast sources according to their skill during the
training period. That is, the two sampling weights are made proportional to the inverse training-period RMSE of each source and normalized to sum to one.
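Both combination schemes reduce to a weighted resampling of two source ensembles; the sketch below covers EWE (equal weights) and one plausible reading of the RWE inverse-RMSE weighting (names and kernel of the scheme are our own):

```python
import numpy as np

def combine_ensembles(ens_clim, ens_wshed, rmse_clim=None, rmse_wshed=None,
                      n_members=500, seed=0):
    """Resample a combined 500-member ensemble from a climate-only and a
    watershed-only forecast. With no RMSEs given, sources are sampled
    equally (EWE); otherwise sampling probabilities are proportional to
    inverse training-period RMSE (an illustrative RWE weighting)."""
    if rmse_clim is None:
        p_clim = 0.5
    else:
        inv = np.array([1.0 / rmse_clim, 1.0 / rmse_wshed])
        p_clim = inv[0] / inv.sum()
    rng = np.random.default_rng(seed)
    pick_clim = rng.random(n_members) < p_clim
    return np.where(pick_clim,
                    rng.choice(np.asarray(ens_clim, float), n_members),
                    rng.choice(np.asarray(ens_wshed, float), n_members))
```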
These methods combine the best-performing climate-only hindcast with the best-performing watershed-only hindcast. While Bayesian model averaging (BMA; Raftery et al., 2005) attempts to provide a weighted average of forecast probability densities, quantile model averaging (QMA; Schepen and Wang, 2015) applies a weighted average to forecast values (quantiles) for a given cumulative probability. A notable difference between the two approaches is that QMA produces smoother and consistently unimodal distributions compared to potentially bimodal BMA outputs (Schepen and Wang, 2015). More details on these techniques are provided in Appendix B.
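The QMA averaging step can be sketched directly: aligning members by cumulative probability amounts to sorting each ensemble and averaging quantile-by-quantile (weights below are illustrative; the paper derives them from training-period skill).

```python
import numpy as np

def quantile_model_average(ens_a, ens_b, w_a=0.5):
    """QMA sketch: sort each ensemble so members align by cumulative
    probability, then form the weighted average of the two quantile
    functions member-by-member, yielding a single merged ensemble."""
    qa = np.sort(np.asarray(ens_a, float))
    qb = np.sort(np.asarray(ens_b, float))
    assert len(qa) == len(qb), "resample to a common ensemble size first"
    return w_a * qa + (1 - w_a) * qb
```

Averaging quantiles rather than densities is what keeps the merged distribution smooth and unimodal, in contrast to BMA's mixture of densities.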
Performance metrics used to assess and compare seasonal streamflow forecasting methods.
Forecast method performance was evaluated using the metrics listed in
Table 3. These include some standard metrics used
in hydrology, such as the correlation coefficient, RMSE, and percent bias of ensemble medians, as well as probabilistic measures such as the CRPSS and a reliability index.
Confidence intervals for the verification statistics are created using bootstrapping with replacement. In each resampling step, forecast–observation pairs are drawn with replacement and the verification metric is recomputed; the resulting distribution of metric values defines the confidence limits.
We first compare methods using the WSF median, a critical predictand for
many water decisions (e.g., Lake Powell releases on the Colorado River in
the western US). Figure 5 displays correlation
coefficients of forecast ensemble medians versus observations for all methods and initialization dates.
Correlation coefficients of forecast ensemble medians
versus observations obtained from all methods at different initialization
dates. The error bars define 95 % confidence limits obtained through
bootstrapping with replacement. Results are displayed for each case study basin.
After January, the hydrologic model begins to capture a useful moisture
variability signal from the watershed; thus, IHCs start to become a dominant
source of predictability in all basins. Indeed, watershed information is
particularly relevant at Libby and Prineville
(Fig. 5d and e), where correlations within the
range 0.39–0.47 are achieved as early as 1 December with the three IHC-based
techniques. In these basins, standard climate indices do not provide useful
long-lead predictability, although CFSR-based predictors do support a
consistent improvement. For example, on 1 December the Stat-Ind correlations at Libby and Prineville are substantially lower than their Stat-CFSR counterparts.
Figure 5 reveals several notable outcomes that are
evident in many of the results plots. First, a linear regression against
IHCs can provide correlation skill comparable to that of the more complex model-based methods.
RMSEs for ensemble forecast medians (Fig. 6) show that despite some discrepancies between techniques, inter-method differences are not as large as for correlation. In most basins, errors can be reduced at earlier initializations (i.e., 1 October and 1 November) by introducing climate information. For instance, on 1 October, Stat-Ind and Stat-Ind-IHC generate respective reductions in RMSE of 10 and 13 % at Dworshak, 23 and 16 % at Howard Hanson, and 14 and 12 % at Hungry Horse, relative to the best IHC-based method in each basin. These benefits are seen in most initializations and catchments except at Libby, where the best results were mostly achieved using ESP (1 October) and Stat-IHC (1 December and 1 February–1 April). In agreement with Beckers et al. (2016), this study was unable to find encouraging climate teleconnections at Libby despite its relative proximity to Hungry Horse.
The same as Fig. 5 but for RMSE – in million acre feet (MAF) – of ensemble forecast medians versus observations. See text for further details.
Figure 6 underscores that from a median error
perspective, intuitive ensemble combination approaches (i.e., EWE and RWE,
shown in dark green) can be effective for reducing forecast errors once the
watershed begins to provide useful predictability (i.e., after 1 January).
For instance, EWE was the best-performing method in Hungry Horse and
Prineville for forecasts initialized on 1 March and 1 April. Further,
Figure 6 illustrates that the best (or worst)
techniques when looking at RMSE vary with each basin, although it is clear
that TWS and climate-only methods perform poorly at early and late
initializations, respectively. The joint inspection of Figs. 5 and 6 shows
that inter-method agreement in correlation does not necessarily translate
into similar forecast median errors. For example, while ESP and HESP provide
closely matched correlation values, their forecast median errors can differ considerably.
Another interesting result is that no substantial reductions in RMSE were
achieved at Howard Hanson between 1 October and 1 April, in contrast to the
gradual growth of hydrologic predictability to support forecast skill in
other basins. Indeed, the best-performing techniques for 1 October
(Stat-Ind) and 1 April (BC-ESP) forecasts provide similar RMSE values.
Figure 7 (forecast median bias) shows that raw ESP
outputs have the largest biases through most initializations at Howard
Hanson, Libby and Prineville. In particular, absolute biases at Prineville
– which is the worst simulated basin in the group – increase to 53 % on
1 October before decreasing to 20 % on 1 April. Further, relatively large
biases (in comparison to the rest of techniques) were obtained at late
initializations in Dworshak and Hungry Horse. Excepting Prineville,
inter-method differences were not substantial, and none of the methods
exceeded a 16 % bias at any initialization. The simple bias correction
applied in this study substantially reduced the absolute biases of raw ESP at all initialization times.
The same as Fig. 5 but for percent bias (% bias) in forecast ensemble medians versus observations. See text for further details.
Figure 8 displays continuous ranked probability
skill scores computed with mean observed climatology as the reference (CRPSS).
Continuous ranked probability skill score of the
forecast ensembles with respect to mean observed climatology
(CRPSS).
The results from Fig. 8 corroborate several findings alluded to in Sect. 5.1. Climate predictors applied to low-skilled (BC-)ESP forecasts in a TWS framework are less effective than when applied in a separate statistical method. Additionally, less complex multi-model schemes can perform better than more complex approaches (e.g., BMA), supporting previous findings by Najafi and Moradkhani (2015). Among the three hybrid regression methods (Fig. 3), Stat-CFSR-IHC was in most cases the worst performer. This result may reflect the relative strength of standard (in particular ENSO) indices for the PNW region. When used in combination with other stronger predictors, the parameter estimation cost of CFSR-PLSR relative to an off-the-shelf index may be more exposed, leading to greater shrinkage of skill under cross validation. The skill results in this study are subject to large uncertainties due to limited sample size (i.e., only 35 years for forecast generation and verification).
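One common empirical formulation of the ensemble CRPS is CRPS = E|X - y| - 0.5 E|X - X'|, with the skill score taken against a climatological reference. The sketch below represents the reference as an ensemble of historical observed volumes; the paper's exact CRPSS formulation may differ in detail.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS for one ensemble forecast against a scalar observation:
    CRPS = E|X - y| - 0.5 * E|X - X'| over the ensemble members."""
    x = np.asarray(members, float)
    term1 = np.mean(np.abs(x - obs))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return float(term1 - term2)

def crpss(fcst_list, clim_members, obs_list):
    """Skill score versus a climatological reference ensemble:
    CRPSS = 1 - mean(CRPS_forecast) / mean(CRPS_climatology)."""
    f = np.mean([crps_ensemble(m, o) for m, o in zip(fcst_list, obs_list)])
    c = np.mean([crps_ensemble(clim_members, o) for o in obs_list])
    return float(1.0 - f / c)
```

CRPSS is positive when the forecast ensembles are sharper and better centered than climatology, and reaches 1 for a perfect deterministic forecast.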
[Figure 9: Time series with cross-validated hindcasts initialized on 1 December, obtained with two watershed-based methods (BC-ESP and Stat-IHC) and two climate-based techniques (Stat-Ind and Stat-CFSR) for the five case study locations.]
Overall, the results presented in Figs. 5 and 8 suggest a division of the
study basins into two groups showing different relative predictabilities –
i.e., driven by watershed conditions versus climate – from October to
January. The first group is formed by Dworshak, Howard Hanson, and Hungry
Horse, where the state of the climate is the dominant source of
predictability from 1 October to 1 December, and IHCs start providing useful
information on 1 January. The second group is formed by Libby and Prineville,
where little or no skill can be found from any source until 1 December, when some
predictability can be harnessed from IHCs. This is illustrated in Fig. 9, where time series with cross-validated
seasonal streamflow forecasts – initialized on 1 December during the period 1981–2015
– are shown for two IHC-based methods (BC-ESP and Stat-IHC) and two
climate-based statistical methods (i.e., Stat-Ind and Stat-CFSR). At this
initialization, there is not enough information in the watershed (IHCs) to
predict interannual variations in April–July streamflow at Dworshak
(Fig. 9a) or Howard Hanson (Fig. 9b); nevertheless, climate predictors are
more successful, a result that is also reflected in positive correlation
results (Fig. 5) and skill scores (CRPSS; Fig. 8).
Forecast reliability can be critical to support risk-based decision making
in which actions may be tied to the forecast distribution rather than the
median. Reliability index results for the different methods are summarized in Fig. 10.
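One common formulation of a reliability index, based on comparing the probability integral transform (PIT) values of the observations with a uniform distribution, can be sketched as follows; the exact index used in the study may differ, so this is an illustrative stand-in:

```python
import numpy as np

def reliability_index(fcst_ensembles, observations):
    """1 minus twice the mean deviation of the observations' PIT values
    from the uniform distribution; 1 = perfectly reliable, 0 = worst case."""
    # PIT value: fraction of ensemble members at or below the observation
    pit = np.array([np.mean(np.asarray(f) <= o)
                    for f, o in zip(fcst_ensembles, observations)])
    pit_sorted = np.sort(pit)
    n = len(pit_sorted)
    theoretical = (np.arange(1, n + 1) - 0.5) / n  # uniform plotting positions
    return 1.0 - 2.0 * np.mean(np.abs(pit_sorted - theoretical))
```

For a statistically consistent ensemble, the sorted PIT values track the uniform quantiles and the index approaches 1.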
In general, forecasts involving statistical calibration (which helps to
improve spread and bias) are most reliable. Indeed, regression-based
forecasting methods (e.g., Stat-IHC, Stat-Ind, Stat-Ind-IHC) stand out in
all basins, suggesting that the ensemble generation approach used in this
paper (based on the standard error of the cross-validated hindcasts) is
capable of providing statistically consistent ensembles. Multi-model
techniques appear to inherit this characteristic, with only small
discrepancies apparent between them (green lines in Fig. 10). Similar inter-method differences were found across multiple initializations.
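The ensemble generation approach described above (a regression median dressed with the standard error of the cross-validated hindcasts) might be sketched as follows; sampling in log space and the member count are illustrative assumptions:

```python
import numpy as np

def regression_ensemble(median_fcst, cv_residual_std, n_members=500, seed=0):
    """Build an ensemble around a regression forecast by spreading members
    with the standard error of the cross-validated hindcast residuals.

    Assumption for illustration: residuals are normal in log space, so
    volumes are sampled as log-normal around the forecast median.
    """
    rng = np.random.default_rng(seed)
    log_members = np.log(median_fcst) + rng.normal(0.0, cv_residual_std,
                                                   n_members)
    return np.exp(log_members)

ens = regression_ensemble(1000.0, 0.2)  # median volume 1000, 20 % log spread
```

Because the noise is added in log space, all members are positive and the ensemble median stays close to the regression forecast.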
[Figure: Time series with cross-validated hindcasts obtained with the HESP approach, initialized on (left) 1 October, (center) 1 January, and (right) 1 April. Results are displayed for the five case study locations.]
Although HESP was only found to be the “most reliable” method in a limited
number of cases, it provided a robust alternative for skillful and reliable
forecasts at all initialization times.
Summary statistics provide an overview of forecast performance, but additional insights can be gained from exploring extreme years in the record – in which forecasts can have disproportionate value to help water managers negotiate atypical challenges – and from visualizing the behavior of the forecasting methods as individual seasons progress. We therefore performed a retrospective comparison of all techniques for two regionally wet (1997 and 2011) and two regionally dry (1987 and 2001) water years at Hungry Horse (Fig. 12), one of the most teleconnected basins in our study domain. Figure 12 illustrates how SWE and SM, the primary sources of predictability for IHC-based methods, progressively gain influence on ensemble forecasts (e.g., HESP and TWS outputs) as the beginning of the snowmelt season approaches (i.e., 1 April). These single-year forecast evolution plots highlight the contrast for the late season (i.e., 1 February onwards) between overconfident predictions exhibiting poor reliability (e.g., ESP, BC-ESP, TWS) and underconfident forecasts (e.g., EWE and RWE).
[Figure 12: April–July water supply forecasts obtained at the Hungry Horse reservoir (HHWM8) with different methods for two wet years (1997 and 2011) and two dry years (1987 and 2001).]
Figure 12a and b show that climate information is required to reduce forecast errors in wet years at very long lead times (i.e., 1 October and 1 November), either alone or combined with watershed information through hybrid approaches. For example, the technique that provided the smallest forecast median error on 1 October 1997 was TWS. At shorter lead times (i.e., forecasts initialized on 1 March or 1 April) in WY 1997, the incorporation of IHCs provides a better match with observations than methods that only use climate information. Interestingly, reanalysis fields at Hungry Horse provide considerable predictive power for WY 2011 (Fig. 12b) at short lead times (e.g., Stat-CFSR yields a forecast median error of 2.7 % on 1 March).
In the two dry years, Fig. 12c illustrates that climate predictors alone had considerable predictive power at long lead times (i.e., 1 October and 1 November) in WY 1987. However, this was not the case in WY 2001 (Fig. 12d), when the method providing the smallest forecast median volume errors at all initialization times (i.e., either BC-ESP or TWS) always required knowledge of watershed moisture conditions. This was also the case for the other pilot study basins (not shown).
The above results suggest that, despite the value of large-scale climate information for this study domain, enhanced hydrologic predictability is critical for accurate streamflow volumes in snowmelt-dominated regions under extreme climatic conditions, especially during dry years. Past and ongoing efforts to improve basin-scale meteorological forcing datasets, pursue realistic process representations in hydrologic models, advance parameter calibration, and improve data assimilation (DA) techniques for better IHC estimates have built a robust platform to accelerate progress in this area. However, a long-term retrospective implementation (consistent with the real-time deployment) of these various modeling decisions and sources of information is critical to understand their performance and benchmark methodological choices.
Generating accurate water supply forecasts is an ongoing challenge for
improving water resources operations and planning. Despite substantial work
on seasonal streamflow forecasting methods applied worldwide, the marginal
value of increased complexity and combining different sources of information
via different strategies has not been systematically assessed. In this
paper, we compare a range of techniques that leverage predictability from
watershed hydrologic conditions and/or large-scale climate information. The
forecast intercomparison showed that hybrid techniques that leverage
hindcasts to combine both sources of predictability could lead to improved
skill compared to current operational approaches. Additional key findings
that may be relevant beyond the study domain – due to the inclusion of both
teleconnected and non-teleconnected basins – are as follows:
- In basins showing strong teleconnections between large-scale climate and local meteorology, the use of large-scale climate information can be an effective strategy for improving seasonal streamflow predictability, potentially providing skillful forecasts at times when watershed predictability is limited.
- Standard climate indices provide useful information, and custom climate predictors from reanalyses were also an effective complementary strategy for extracting the signal from climate fields (e.g., SST and geopotential height).
- The relative importance of watershed IHCs versus climate information for predicting streamflow was found to vary even within a small region, depending on subdomain catchment hydroclimatological characteristics.
- The ESP trace weighting method only provided promising results at forecast lead times where raw ESP forecasts contained moderate skill, indicating that climate information cannot adequately shift the prior ESP forecast if it lacks forecast resolution or contains significant bias.
- Increasing methodological complexity does not necessarily translate into better ensemble forecast quality (e.g., Stat-IHC versus BC-ESP; EWE versus BMA), in part because the small sample sizes associated with seasonal hindcasts preclude reliable parameter estimation for more elaborate methods. There can be a trade-off between improving one forecast characteristic (e.g., bias) and degrading another (e.g., correlation skill).
- Cross validation is an essential part of seasonal forecast development and implementation, particularly where multiple predictions may be combined based on their purported relative strengths and predictive uncertainty must be accurately estimated. In the small-sample context of seasonal streamflow prediction, cross validation reveals significant limitations in the supportable complexity of statistical forecasting elements.
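As an illustration of the cross-validation point, a minimal leave-one-out hindcast loop for a single-predictor regression, verified on the held-out years, could look like this (a sketch; names are illustrative):

```python
import numpy as np

def loocv_skill(x, y):
    """Leave-one-out cross-validated correlation for a one-predictor
    linear regression: each year is predicted by a model fit to all
    other years, mimicking a methodologically consistent hindcast."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        # Ordinary least squares fit on all years except year i
        A = np.column_stack([np.ones(keep.sum()), x[keep]])
        coefs, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
        preds[i] = coefs[0] + coefs[1] * x[i]
    # Verify the out-of-sample predictions against observations
    return np.corrcoef(preds, y)[0, 1]
```

With only ~35 years of record, cross-validated skill computed this way shrinks quickly as predictors are added, which is the over-training risk the conclusions emphasize.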
The often equivocal comparison of methods through multiple verification
metrics (e.g., correlation, reliability) for individual wet and dry years,
and for different basins, starkly illustrated the challenge of selecting a
single method that will provide optimal results for all forecast
initialization dates. There is a significant tension between optimizing
forecast qualities through a mixture of methods and data sources that vary
seasonally and across basins, and an oft-stated preference from forecasters
and users for a consistent forecasting methodology. With this in mind, we
developed HESP as a flexible data-driven framework to harness skill across
varying predictability regimes, although it admittedly departs from the
constraint of predictor uniformity.
A notable omission from this intercomparison study is the derivation of climate predictors from global climate model forecasts, a strategy that has also been pursued in this context (e.g., Crochemore et al., 2016). The experiment summarized here did assess the skill of CFSv2 9-month climate forecasts at an earlier stage, but that evaluation has been excluded from this paper because the CFSv2 forecasts did not show significantly higher skill than the CFSR-based empirical predictions, which is consistent with prior skill assessments (e.g., Yuan et al., 2011). Nonetheless, the topic of augmenting hydrologic predictability from dynamical climate forecasts remains an appealing area for future study and comparison, as does the potential for including IHC data assimilation to enhance watershed model-based predictability (e.g., DeChant and Moradkhani, 2011; Huang et al., 2017). Future work can also explore alternative methodological choices such as multiple hydrological models, different climate datasets, or finer details such as alternative variable transformations in statistical approaches (e.g., Wang et al., 2012).
Finally, this work is part of a larger project that explores the potential of an automated (i.e., “over-the-loop”) forecasting workflow as a viable strategy for operational streamflow prediction that can open the door to potential scientific and technical advances in streamflow forecasting (Pagano et al., 2016). In this context, a critical lesson is that the entire study, in particular the assessment of approach alternatives, depends on the automation of the forecast workflow to enable the generation of hindcasts that are consistent with real-time forecasts. Demonstrating that such over-the-loop methods – all of which were implemented in real time by the authors during the study period (2015–2017) – can yield credible predictions should be regarded as a strong argument for exploring this objective paradigm in real-world operational agency settings.
Daily streamflow data used in this study can be obtained from the Bonneville Power Administration.
The trace weighting scheme used here involves the following steps
(Werner et al., 2004):

1. Compute a vector of distances between the forecast-year climate predictor value and its values in the historical years.
2. Sort the vector.
3. Compute weights as a decreasing function of the sorted distances.
4. Normalize the weights and construct a cumulative distribution function (CDF) based on these values and the ESP hindcast.
5. Resample from the CDF obtained in step 4 using 500 uniform random numbers.
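These steps can be sketched as follows; because the original weighting equation is not reproduced above, simple inverse-distance weights are used as an illustrative stand-in:

```python
import numpy as np

def trace_weighted_sample(esp_traces, index_hist, index_now,
                          n_samples=500, seed=0):
    """Trace weighting sketch: distance, sort, weight, CDF, resample.
    Inverse-distance weighting is an illustrative stand-in for the
    original equation of Werner et al. (2004)."""
    d = np.abs(np.asarray(index_hist) - index_now)  # step 1: distances
    order = np.argsort(d)                           # step 2: sort
    w = 1.0 / (d[order] + 1e-6)                     # step 3: stand-in weights
    w /= w.sum()                                    # step 4: normalize ...
    cdf = np.cumsum(w)                              # ... and build the CDF
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n_samples)                 # step 5: uniform draws
    picks = np.minimum(np.searchsorted(cdf, u), len(cdf) - 1)
    return np.asarray(esp_traces)[order][picks]
```

Historical ESP traces whose climate index values lie closest to the current state are resampled most often, tilting the ensemble toward analogous years.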
The principle of BMA (Raftery et al., 2005) is that, given an
ensemble forecast with several competing models, the predictive PDF is a
weighted average of the conditional PDFs associated with the individual
models, with weights reflecting each model's relative skill over a training
period.
In this paper, the weights for the two models (best climate-based and best
watershed-based models) are estimated by maximum likelihood, assuming that the
conditional PDFs of log(Q) are approximated by a normal distribution. The
likelihood is maximized using the expectation-maximization (EM) algorithm
(Dempster et al., 1977), which is implemented in the R package
ensembleBMA.
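A simplified, self-contained version of the two-model EM weight estimation (with a fixed mixture standard deviation, whereas ensembleBMA also estimates the spread) might look like this:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution, used for the conditional PDFs."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bma_weights_em(f1, f2, obs, sigma=0.3, n_iter=200):
    """EM estimation of weights for a two-member BMA mixture of normals
    centered on each model's (log-space) forecast. Holding sigma fixed is
    a simplifying assumption for illustration."""
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each model for each training year
        p1 = w[0] * normal_pdf(obs, f1, sigma)
        p2 = w[1] * normal_pdf(obs, f2, sigma)
        z1 = p1 / (p1 + p2)
        # M-step: new weights are the mean responsibilities
        w = np.array([z1.mean(), 1.0 - z1.mean()])
    return w
```

In practice the weights converge toward the model whose forecasts sit closest to the observations over the training period.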
The quantile model averaging (QMA) forecast values are obtained from the weighted average of forecast quantiles from all models. Schepen and Wang (2015) recently found that nearly identical skill results can be obtained with BMA and QMA, and that very similar performance can be achieved either by calibrating QMA weights or by using BMA weights within a QMA framework. Therefore, we obtain the QMA forecast using the same weights obtained from the BMA calibration by sorting the ensemble members from the best climate and best watershed forecast approaches, and computing the weighted average of equally ranked ensemble members from the two sources.
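Given BMA weights, the QMA combination described above reduces to a weighted average of equally ranked (sorted) ensemble members; a minimal sketch, assuming both ensembles have the same size:

```python
import numpy as np

def qma_combine(climate_ens, watershed_ens, w_climate):
    """Quantile model averaging: sort each ensemble and take the weighted
    average of equally ranked members, reusing the BMA weight for the
    climate-based model (w_climate) and its complement for the other."""
    a = np.sort(np.asarray(climate_ens, dtype=float))
    b = np.sort(np.asarray(watershed_ens, dtype=float))
    return w_climate * a + (1.0 - w_climate) * b

out = qma_combine([3, 1, 2], [30, 10, 20], 0.5)  # -> [ 5.5, 11. , 16.5]
```

Because the averaging acts quantile by quantile, the combined ensemble preserves the shape information of both source forecasts.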
The authors declare that they have no conflict of interest.
This article is part of the special issue “Sub-seasonal to seasonal hydrological forecasting”. It is a result of the HEPEX workshop on seasonal hydrological forecasting, in Norrköping, Sweden, on 21–23 September 2015.
This work was supported through a contract with the US Army Corps of Engineers and through a cooperative agreement with the US Bureau of Reclamation.

Edited by: Ilias Pechlivanidis
Reviewed by: Thomas Pagano and Bastian Klein