Evaluation of snow data assimilation using the ensemble Kalman filter for seasonal streamflow prediction in the western United States

In this study, we examine the potential of snow water equivalent data assimilation (DA) using the ensemble Kalman filter (EnKF) to improve seasonal streamflow predictions. There are several goals of this study. First, we aim to examine some empirical aspects of the EnKF, namely the observational uncertainty estimates and the observation transformation operator. Second, we use a newly created ensemble forcing dataset to develop ensemble model states that provide an estimate of model state uncertainty. Third, we examine the impact of varying the observation and model state uncertainty on forecast skill. We use basins from the Pacific Northwest, Rocky Mountains, and California in the western United States with the coupled Snow-17 and Sacramento Soil Moisture Accounting (SAC-SMA) models. We find that most EnKF implementation variations result in improved streamflow prediction, but the methodological choices in the examined components impact predictive performance in a non-uniform way across the basins. Finally, basins with relatively higher calibrated model performance (> 0.80 NSE) without DA generally have lesser improvement with DA, while basins with poorer historical model performance show greater improvements.

Abstract. In this study, we examine the potential of snow water equivalent data assimilation (DA) using the ensemble Kalman filter (EnKF) to improve seasonal streamflow predictions. There are several goals of this study. First, we aim to examine some empirical aspects of the EnKF, namely the observational uncertainty estimates and the observation transformation operator. Second, we use a newly created ensemble forcing dataset to develop ensemble model states that provide an estimate of model state uncertainty. Third, we examine the impact of varying the observation and model state uncertainty on forecast skill. We use basins from the Pacific Northwest, Rocky Mountains, and California in the western United States with the coupled Snow-17 and Sacramento Soil Moisture Accounting (SAC-SMA) models. We find that most EnKF implementation variations result in improved streamflow prediction, but the methodological choices in the examined components impact predictive performance in a non-uniform way across the basins. Finally, basins with relatively higher calibrated model performance (> 0.80 NSE) without DA generally have lesser improvement with DA, while basins with poorer historical model performance show greater improvements.

Introduction
In the snow-dominated watersheds of the western US, spring snowmelt is a major source of runoff (Barnett et al., 2005;Clark and Hay, 2004;Singh and Kumar, 1997;Slater and Clark, 2006). In such basins, the initial conditions of the basin, primarily in the form of snow water equivalent (SWE), drive predictability out to seasonal timescales (Wood et al., 2005;Wood and Lettenmaier, 2008;Mahanama et al. 2012;Staudinger and Seibert, 2014;Wood et al., 2016). Thus, better estimates of basin mean initial SWE should lead to better seasonal streamflow predictions (Arheimer et al., 2011;Clark and Hay, 2004;Slater and Clark, 2006;Wood et al., 2016). For various reasons (e.g., the uncertainty in model parameters, forcing data, model structures), simulated SWE in hydrological models can be very different from reality (Pan et al., 2003). Fortunately, a variety of snow observations (including point gauge and spatial satellite data) contain valuable information (Andreadis and Lettenmaier, 2006;Barrett, 2003;Engeset et al., 2003;Mitchell et al., 2004;Su et al., 2010;Sun et al., 2004).
Many studies have explored the role of snow data assimilation in different modeling frameworks (Moradkhani, 2008;Takala et al., 2011;McGuire et al., 2006;Wood and Lettenmaier, 2006). Of particular focus here are papers that have examined the impact of SWE data assimilation (DA) on runoff modeling and prediction (e.g., Bergeron et al., 2016;Griessinger et al., 2016;Wood and Lettenmaier, 2006;Franz et al., 2014;Jörg-Hess et al., 2015;Moradkhani, 2008;Slater and Clark, 2006). Among the major challenges facing SWE-based DA are that the time-space resolution of remote sensing SWE data are too coarse or period limited for many watershed-scale hydrological applications in mountainous regions (Dietz et al., 2012;Jörg-Hess et al., 2015), and point gauge snow data have sparse and uneven spatial coverage (Slater and Clark, 2006). For point measurements, spatial interpolation of SWE measurements is typically used Published by Copernicus Publications on behalf of the European Geosciences Union.
Here, we use the ensemble Kalman filter (EnKF) method for DA using an implementation that allows for seasonally varying estimates of observation and model error variances (Evensen, 1994(Evensen, , 2003Evensen et al., 2007). The EnKF framework has been successfully implemented in research basins in several previous studies (Clark et al., 2008;Franz et al., 2014;Moradkhani et al., 2005;Slater and Clark, 2006;Vrugt et al., 2006). The EnKF provides an objective analytical framework to optimize the update of model states based on observed values and their corresponding uncertainties. While the EnKF approach has a formal theory, its overall objectivity in an application (contrasting with an arbitrary DA approach such as direct insertion) nonetheless depends on several methodological choices that are often empirical when applied to SWE DA.
Following Slater and Clark (2006), this study uses two slightly different approaches to estimate ensemble SWE observations with point gauge SWE data from surrounding gauge sites for study basins. When using calibrated hydrologic modeling systems, model SWE states may exhibit systematic biases from observed SWE estimates for a number of reasons -e.g., all hydrologic models must simplify real watershed physics and structure, and model parameter estimation (calibration) may result in SWE behavior that in part compensates for forcing or model errors (e.g., Slater and Clark, 2006). Therefore, transformation of snow observations to model space is needed before they are used to update the model states to ensure that the model ingests SWE estimates that are as close to unbiased, relative to the model climatology, as possible. We explore two variations on an approach using cumulative density function (CDF) transformations of observations to model space (following Wood and Lettenmaier, 2006, among others). Additionally, we undertake a sensitivity analysis to highlight the importance of robust observations and model uncertainty estimates. We focus on the impacts of updates made just once per snow accumulation season, noting that an important choice that is not examined as a result is the selection of DA dates and frequency. For a given generally optimal selection of the EnKF approach, the ensemble streamflow prediction (ESP) approach is used to test the impact of SWE DA on subsequent streamflow forecasts.
For context, operational seasonal streamflow forecasts in the US currently do not use formalized DA. If the initial states of the model are suspected to contain error (He et al., 2012), DA is performed through subjective forecaster intervention. Manual adjustments (termed MODs; e.g., Anderson, 2002) to model states (e.g., SWE) are applied repeatedly throughout the water year, and particularly before initializing seasonal forecasts. This manual nature of the correction hinders the ability to scale up DA procedures to many basins, to benchmark DA performance, and quantify improvements to the forecast system, as skill depends on the forecaster's experience (Seo et al., 2003).
The central motivating aim of this study is thus to assess the potential benefits of objective, automated SWE DA against a reference model configuration to identify forecast improvement opportunities. We apply the EnKF DA approach to nine river basins in the western US that have a range of basin features and environmental conditions, over a period of multiple decades. This experimental scope differs from many previous studies that focus on one or two basins (e.g., Clark et al., 2008;Franz et al., 2014;He et al., 2012;Moradkhani et al., 2005) or assess DA performance over shorter periods. We also use ensemble simulations driven by a new probabilistic forcing dataset (Newman et al., 2015b, c) as a basis for estimating model SWE uncertainty, in contrast to prior studies that relied on more arbitrary distributional assumptions. This range of basins permits us to explore the question of "In what types of basins might automated SWE DA improve seasonal streamflow forecasts?".
Additionally, as discussed throughout the introduction, the EnKF approach has several empirical components that require tuning. We therefore examine performance sensitivities related to three elements: (1) the estimation of watershed mean SWE from surrounding point measurements; (2) the transformation operator that relates watershed mean SWE to The following sections discuss the study basins and datasets, and the model and EnKF DA approach, before the presenting study results and discussion, and a summary.

Study basins and data
In this study, nine basins across the western US are selected for SWE DA evaluation. They are in the Pacific Northwest, California (Sierra Nevada Mountains), and central Rocky Mountains. We focus on these three areas as they span a range of snow accumulation and melt conditions of the western US and are in areas with active seasonal streamflow prediction and water resource management. We do not examine rain-driven low-lying basins because they do not have significant SWE contributions to runoff. The locations of the basins and nearby SWE gauge sites are shown in Fig. 1, illustrating that all of the study watersheds have SWE measurements distributed in and/or around the basins. The main features of these basins are shown in Table 1. The basin areas range from 16 to 1163 km 2 , and the mean elevations of the basins range from 998 to 3459 m with a large spread in basin mean slopes (as estimated from a fine-resolution digital elevation model) and forest percentage.
Two sources of SWE observations are used in this study: (1) the widely used snow telemetry network (SNOTEL) for Natural Resources Conservation Service (NRCS), which covers most of the western US (NRCS, https://www.wcc. nrcs.usda.gov/snow/); and (2) the California Department of Water Resources (DWR, denoted as CADWR sites hereafter), which maintains a snow pillow network for California (CADWR, http://cdec.water.ca.gov/snow/). The SWE data from CADWR sites have frequent missing data and some unrealistic extreme values; thus, extensive manual quality control was required before using the CADWR data in the study.

Models and calibration
The Snow-17 temperature index snow model is coupled to the Sacramento Soil Moisture Accounting (SAC-SMA) conceptual hydrologic model (Anderson, 2002(Anderson, , 1973Burnash and Singh, 1995;Burnash et al., 1973;Franz et al., 2014;Newman et al., 2015a) to simulate streamflow in this study. This model combination has been in operational use by US National Weather Service (NWS) River Forecast Centers (RFCs) since the 1970s (Anderson, 1972(Anderson, , 1973. The Snow-17 model is a conceptual snow pack model that employs an air temperature index to partition precipitation into rain and snow and parameterize energy exchange and snowpack evolution processes. The only required forcing inputs are nearsurface air temperature and precipitation. The output rainplus-snowmelt (RAIM) time series from Snow-17 is part of the forcing input of the SAC-SMA model. SAC-SMA is a conceptual hydrologic model that uses five moisture zones to describe the movement of water through watersheds. The required forcing input is the potential evaporation and the surface water input from Snow-17.
Daily streamflow data from United States Geological Survey (USGS) National Water Information System server (http: //waterdata.usgs.gov/usa/nwis/sw) are used to calibrate 20 parameters of Snow-17 and SAC-SMA model. The calibration is obtained using the shuffled complex evolution global search algorithm (SCE; Duan et al., 1992) via minimizing daily simulation root mean square error (RMSE). USGS streamflow data are also used to verify the model predictions.
Model uncertainty arises from model parameter and structural uncertainty (e.g., Clark et al., 2008) and forcing input uncertainty (e.g., Carpenter and Georgakakos, 2004). Focusing on the latter, we drive the hydrology models with 100 equally likely members of the meteorological data ensemble generated as described in Newman et al. (2015b, c), producing an 100-member ensemble of model moisture states, including SWE, and streamflow. The daily varying spread of the ensemble model states serves as the estimate of model uncertainty. Because this method estimates SWE uncertainty without also considering sources other than forcing input uncertainty, and therefore may underestimate model uncertainty in initial SWE (e.g., Franz et al., 2014), we also include a sensitivity analysis to explore the sensitivity of DA results to variations in the estimated observation and model uncertainty magnitudes.

Generating ensembles of estimated observed watershed SWE
Since the SWE gauge observations are point measurements that do not represent the watershed mean conditions and have observation error, observation uncertainty needs to be robustly estimated to ensure reasonable DA performance. In this study, we follow Slater and Clark (2006) to generate ensemble estimated catchment SWE from gauge observations using a multiple linear regression in which the predictors are the attributes of SWE gauge sites (longitude, latitude, and elevation). The observation uncertainty is estimated by leaveone-out (LOO) cross validation; i.e., each station is left out of the regression training and then its SWE is predicted and verified against its actual measurement. For reducing interpolation uncertainty caused by spatial heterogeneity of SWE gauge sites, the SWE values are transformed into percentiles or Z scores (e.g., standard normal deviates) before the regression is performed, and the corresponding inverse transformations are used to convert them back to SWE values. These two approaches are denoted as percentile and Z score interpolation, respectively, and detailed descriptions for them are as follows. . Evaluation metrics for April-July ensemble mean streamflow from the percentile-based interpolation method for the nine case basins using perfect forcing. The verification metrics from upper left to lower right are as follows: R-RMSE is the relative (normalized) root mean squared error, R is the linear (Pearson) correlation coefficient, NSE is the Nash-Sutcliffe efficiency, bias is the same as mean error, and CRPS is the continuous ranked probability skill score.

Percentile interpolation
First, the non-exceedance percentile p o y (k) of each SWE observation (observation-based values noted with superscript o) at gauge site k on the DA date in year y is calculated based on its rank, or percentile, within a sample of all SWE observations in all years at the same site within a time window of ±n days centered on the date of the observation in each year.
Then, we use the percentiles to do linear regression on geographic features latitude, longitude, and elevation to estimate the SWE percentile for the target basin:p o y , where the hat indicates the basin mean estimate. By LOO cross validation, the interpolation error of the linear regression is estimated aŝ e o y . We sample from normal distribution N (p o y ,ê o y ) to get the ensemble percentiles {p o y (j )}, where j = 1, . . . , 100 represents the ensemble member.
Finally, we take the correspondingp o y (j ) percentile from the full ensemble model SWE within the time window of ±n days centered on the DA date each year in all years, denoted asŜ f y (j ). The final ensemble SWE observations on the DA date in year y for the target basin are {Ŝ f y (j )}, where j = 1, . . . , 100.

Z score interpolation
First, we use the observed SWE at gauge site k on the DA date in year y to calculate the Z score: where S o (k) and σ (S o (k)) are the long-term mean and standard deviation of a sample of all non-zero SWE observations at the same site within a time window of ±n days centered on the date of the observation, respectively. Here, we use the Z score in the linear regression and again use LOO cross validation to estimate the mean and interpolation error of the Z score for a target basin. Then, we sample from the normal distribution to get ensemble Z scores for the target basin, denoted as {Ẑ score o y (j )}, where j = 1, . . . , 100 represents the ensemble member. Finally, we use the following equation to transform Z score back to SWE values: where S f (k) and σ S f (k) are the long-term non-zero mean and standard deviation of the full ensemble model SWE within the time window of ±n days centered on the DA date each year in all years, respectively. The final ensemble SWE . Evaluation metrics for April-July ensemble mean streamflow from the Z score interpolation for the nine case basins using perfect forcing. The verification metrics from upper left to lower right are as follows: R-RMSE is the relative (normalized) root mean squared error, R is the linear (Pearson) correlation coefficient, NSE is the Nash-Sutcliffe efficiency, bias is the same as mean error, and CRPS is the continuous ranked probability skill scores.
observations on the DA date in year y for the target basin are {Ŝ o y (j )}, where j = 1, . . . , 100. Both percentile and Z score transformations normalize the original SWE values to decrease their spatial variability (Slater and Clark, 2006;Wood and Lettenmaier, 2006). The latter ensures the ensemble observations have the same mean as the ensemble model SWE and the variance of ensemble observations is proportional to ensemble model SWE variance. The former emphasizes the shape of the observation time series. SWE observations in and near a watershed but at different elevations may have greatly varying values, but their percentile and Z score statistics will show reduced variation because they arise from similar relative weather conditions with respect to conditions in other years. Using normalized statistics significantly reduces the interpolation uncertainty and systematic biases relative to the watershed's SWE climatology.

EnKF approach and experimental design
For evaluating the relative performance of DA and for reinitializing the soil moisture of DA runs at the beginning of each water year (WY), an open-loop or "control" retrospective simulation (denoted No DA) is performed using the calibrated model parameters with ensemble forcing data. This control run is one continuous simulation per ensemble mem-ber for the entire hindcasting and evaluation period (1981-201X) for each basin, where "201X" is the last simulation year available between 2010 and 2014. Because this study focuses on assessing variations in methodological aspects of the DA approach rather than differences in performance throughout a forecasting season, we apply DA updates only once per year, using the date on which the SWE correlation with future runoff is highest for the study basin, but no later than 1 April, a common date for initiation of spring seasonal runoff forecasts.
The EnKF method used in this study is a time-discrete forecast and linear observation system described by two relationships (generally following the notation of Ide et al., 1997 andWu et al., 2012): where i is the time step; M is the coupled Snow-17 and SAC-SMA model; x is the state variable and y is the observation variable (in this study both x and y are the one-dimensional vectors containing basin mean SWE for the target watershed across all ensemble members); the superscripts t and o stand for truth and observed, respectively; η and ε are the model and observation errors, respectively; and h is the observation operator that maps the model states to the observation variable. In this study, h is simply the identity vector, as we re- gard the SWE estimates that have been transformed to model space as observation y as a preprocessing step. The SWE DA approach is implemented via the following procedure: 1. The watershed model is run once for each ensemble forcing member from the beginning of a WY until the DA date with initial states x 0 taken from the retrospective control runs, producing the ensemble forecast states x f i . The superscript f denotes forecast.
2. The ensemble analysis states are calculated as follows: where superscript a means analysis, o and s are the observed and model simulation error variances (estimated by the variance of ensemble observations and model states, respectively), and the innovation vector (residual) is calculated as 3. The Snow-17 SWE states are updated with the analysis states to use for initialization of forecasts through the end of the WY.
Steps 1-3 are repeated for all WYs available in the hindcast period (1981-201X). Soil states are re-initialized using the states from the retrospective (No DA) run at the start of every WY (October 1), when there is no SWE. To summarize, we calculate an analysis via Eq. (5) and use that analysis to update the Snow-17 SWE states. We then run the model with the updated states until the end of the WY.

Model and observation error variance
In this study, only the uncertainty of the forcing data is taken into account in our model uncertainty, and uncertainty that arises from model structural and parameter errors could cause the true model error to be larger. Thus, we assess the impacts of inflating model error variance to evaluate the relative size of observed and forecast error variance. We simply set the model SWE error variance to half of and 2 times the original size to see how the DA performances change. If increasing the model error variance results in DA performance improvements, it would indicate that the model error variance is underestimated, and vice versa. This sensitivity analysis underscores the importance of a careful effort to properly estimate both model and observational uncertainty when using the EnKF -a challenge that is well known in the DA community. Figure 6. Incremental change in evaluation statistics for ESP and perfect forcing forecasts using percentile-based interpolation for the nine case basins. R is the linear (Pearson) correlation coefficient, NSE is the Nash-Sutcliffe efficiency, and CRPS is the continuous ranked probability skill score.

Seasonal ensemble streamflow prediction
Although the impacts of the SWE DA on forecast accuracy can be assessed through verification of post-adjustment simulations using "perfect" future forcing, we demonstrate the performance of SWE DA by initializing seasonal ESP forecasts for a streamflow forecast product that is widely used in water management, the snowmelt period runoff volume from April through July. ESP uses historical climate data to represent the future climate conditions each year from the starting point of the forecast period to predict streamflow. Two typical ESP applications are tested in this study. Because we have an ensemble of historical forcing instead of the traditional application in which only a single historical forcing time series is available, there are different ways to construct an ESP. We adopt two: (1) we construct the ESP forcing en- semble by randomly selecting 1 year of the historical ensemble forcing data for each historical member of the ESP; and (2) we use all historical years of ensemble mean forcing data for each ESP historical year member, yielding a 30 × 100member ensemble for an ESP based on meteorology from 1981 to 2010 (variations are noted as ens forcing and ens mean forcing, respectively, in subsequent figures discussing ESP results).

Verification metrics
In this study, five frequently used statistics are calculated for April through July seasonal streamflow volume expressed as runoff (mm) for evaluating the two DA approaches. The bias, correlation coefficient (R), relative root mean squared error (R-RMSE), and Nash-Sutcliffe efficiency (NSE) are based on the ensemble averages. The continuous ranked probability score (CRPS) is a measurement of error for probabilistic prediction (Murphy and Winkler, 1987). It is defined as the integrated squared difference between the CDF of forecasts and observations: where F f and F o are CDFs for forecasts and observations of streamflow, respectively. Small CRPS values mean more   Table 1) for the case basins is shown in Fig. 2. The distributions of SWE from the model ensemble and from the percentile and Z score interpolation methods differ in ways that are not consistent across all watersheds. The variance of the estimated observed SWE for both methods is generally largest for the 1-year window, an effect that is more pronounced for the Z score interpolation. However, we also note that the ensemble observations of 7-day window can have a larger variance than the 3-month window, and as large as the 1-year window in some cases. For more details, see the percentile interpolation for the Payette River for the 7-day window in Fig. 2 where the 7-day window interquartile range is about 250 mm and the 1-year window range is 300 mm, while the 3-month window is only about 120 mm. This is likely due to the more limited sample size for the regression, which can reduce the positive impact of DA performance. For example, the SF Payette River and the Greys River have positive DA impact for both the 7-day and 3-month windows but for the 7-day window the positive impact is reduced by roughly half in both basins for most metrics (Tables S1 and S3 in the Supplement). Increased estimated observation variance decreases the weight of the observations in an EnKF approach and thus decreases the impact of the observations. In this study, a 3-month window of SWE observations generally gives the best performance. However, in some basins, a different window length may bring larger improvements. Longer windows mean that the transformation is more statistically representative of the long-term model-observation climatology. Shorter time windows imply that the model SWE values used for transformation are more relevant to a specific seasonal time period, avoiding aliasing for seasonality, but have much smaller sample sizes and may not properly represent the relationship between model and observation climatologies. The window length must be a balance between these two considerations. Therefore, a 3-month window is recommended for both approaches. The evaluation statistics for simulated streamflow using perfect forcing after DA with ensemble SWE observations estimated by the percentile and Z score interpolation ap-proaches for the 3-month window are shown in Figs. 3 and 4. They are also compiled in Tables S1-6. In those tables, the second column shows the forecast error variance used to calculate analysis states, where "No DA" means the open-loop control run (see Sect. 3.3), and the P , 1/2P and 2P refer to the DA runs with the model error variance estimated by 1, one-half, and 2 times the original size of the ensemble model variance. Both percentile and Z score interpolation approaches exhibit enhanced DA performance among the case basins, indicating that both approaches are effective in adding observation-based information to the model simulations. Overall, using the original model variance estimate (case P ), the mean improvement for the percentile interpolation method (Z score method) is a reduction in relative RMSE (R-RMSE) of about 11 % (12 %) and an increase in NSE of 0.03 (0.05). The percentile interpolation and Z score interpolation methods vary in performance across the basins with both performing better in some basins and not others (e.g., percentile interpolation performs slightly better than Z score interpolation in the Greys River using NSE as the evaluation metric (0.94 versus 0.93) and slightly worse than that in the SF Tolt River (0.82 versus 0.88)). Using NSE, percentile interpolation performs better in the Greys River, while Z score interpolation performs better in the Vallecito, south fork of the Tolt, Merced, and Smith rivers. To the hundredth NSE value (0.01), both methods are equivalent in the south fork of the Payette River, and General and Blackwood creeks.
The results of forecast error variance inflation shows that for both percentile and Z score interpolation, 2P has better performance than P in most of the case basins -i.e., increasing the model error variance leads the assimilation to trust observations more and improves the DA performance (circles in both figures generally have improved evaluation metrics than squares or triangles). Using NSE, the percentile (Z score) interpolation 2P case is on average another 0.01 (0.01) better than the P case across the nine basins. This sensitivity analysis of model uncertainty impacts on DA performance suggests that either the forcing-alone-based estimation of model errors underestimates the total model error variance, or the observed SWE error estimation approaches (interpolation plus the SWE regression) tend to overestimate observation uncertainty, or both. It is likely we are underestimating model uncertainty because we have not taken model structural and parameter uncertainty into consideration. Both approaches bring incremental enhancements to the ensemble mean streamflow hindcast in most basins when evaluated across the R-RMSE, R, and NSE metrics; however, DA does not help correct forecast biases in these simulations. Postprocessing procedures (e.g., bias correction) could be used to further enhance the forecast performance, but this is not a focus of this study. These figures also show that forecasts without DA ("No DA" in figures, "NoDA" in text) that have relatively better performance, mostly due to better simulations of forecast initial conditions, benefit less from DA. Three of the basins have a NoDA seasonal runoff NSE of less than 0.8, with an average improvement of 0.05 for the percentile regression and 0.12 for the Z score regression versus 0.03 and 0.05 across all nine basins. Four basins have seasonal runoff NSE values of at least 0.89 and the two DA methods result in minimal improvement, 0.02 for both methods. With a sample size of nine, little statistical significance can be attached to these results, but they do suggest DA is more beneficial in poorly calibrated basins. Future work will examine the potential for DA based on NoDA (open-loop) model performances and the characteristics of nearby observed SWE data. Figure 5 summarizes the ESP evaluation statistics. For simplicity, only the percentile interpolation approach with a 3-month window is shown without forecast error inflation. It shows that for both ESP forcing methodologies used (Sect. 3.5) in all the case study watersheds, SWE DA enhances seasonal runoff prediction skill, including the probabilistic prediction metric CRPS. Again, higher skill NoDA watersheds saw smaller DA improvements. The DA evaluation metric improvement increment versus the corresponding NoDA evaluation metric score for the case basins is shown in Fig. 6. The DA improvements in all evaluation metrics have a generally weak negative correlation with NoDA perfor-mance, which again highlights that better-simulated basins benefit less from SWE DA.

Broader DA potential
In general, the incremental DA improvements are relatively smaller where the NoDA model performance is relatively better. However, specific basin performance is dependent on many factors, including (1) representativeness of nearby observations to basin conditions; (2) quality of observations; and (3) specific basin characteristics of the calibrated hydrologic model. Because we use calibrated, watershed-scale hydrologic models, transferability of performance characteristics of the DA approach without implementation in each basin is limited. That being said, Fig. 7 displays the difference between the rank correlation of SWE and runoff for the calibrated model (NoDA) and highest correlated observation site (from the nearest 10 sites). It highlights the same general spatial patterns seen in the nine basins simulated here. The potential for larger DA improvement appears to be in the Pacific Northwest (upper left of figure). Basins in the Dakotas (upper right basins) are far from SNOTEL sites and have little areal SWE; basins along the far southern US have little SWE and runoff as well. Throughout the central Rockies (central basins), model-observation correlation differences are small, potentially indicating reduced DA improvement potential, in agreement with the results seen above.

Case study analyses
To provide a more in-depth examination of the SWE DA impacts to the watershed model states and fluxes, time series of runoff and SWE are shown in Figs. 8, 9, and 10 for three example basins, one for each region (the same figures for the other six basins are included in the Supplement), and for one hindcast year. The feedback from the change of SWE on the DA date to seasonal runoff is readily apparent. Increasing the ensemble model SWE through DA will lead to increased model runoff, and vice versa. For basins with a strong seasonal cycle of streamflow (e.g., Greys and Merced rivers), SWE DA may improve daily runoff forecasts in years when seasonal volume forecast improvements are seen, although this is not true in every watershed (e.g., Tolt River). For example, the daily NSE for the Greys River in 1997 after DA was improved from 0.53 to 0.80 in the perfect forcing example, and this is via bias reduction, as the daily flow timing is essentially unchanged. In Fig. 9, the NSE of the daily flow prediction of the Tolt River is essentially unchanged (0.54 for DA, 0.53 for NoDA) even though the seasonal volume prediction is improved (1990mm observed, 1968mm DA, 1534. In this case, improvements to bias did not improve NSE as the bias improvements did not improve the squared daily flow differences (e.g., RMSE: 7.76 versus 7.88 for DA versus NoDA). Figures 11,12,and 13 show several scatterplots of forecast period runoff for the ESP ensemble forcing and perfect forcing forecasts versus observed runoff, in the three case basins for all of the hindcast years. The left two columns show the comparison for NoDA-and DA-simulated seasonal runoff versus observed runoff for perfect (top row) and ESP ensemble forcing (bottom row), respectively. The 1 : 1 lines are shown as grey dashed lines, and regression lines for the results are shown as green solid lines. The results after DA have higher correlation and are generally closer to the 1 : 1 line, which indicates that for both forcing types SWE DA improves seasonal runoff simulation and prediction skill. The rightmost columns in these three figures show the scatterplots of SWE increment (i.e., SWE analyses states minus model SWE without DA) versus runoff error (i.e., the simulated seasonal runoff without DA minus the observed seasonal runoff). If the runoff errors are positive (the seasonal Figure 13. Scatterplots for seasonal runoff and SWE on the DA date for the Merced River (same as Fig. 12). runoff is overestimated), we would expect the SWE increment to be negative in order to decrease the model seasonal runoff (counteract model error) and vice versa. Thus, the ideal results are that the points fall onto different sides of the y = 0 and x = 0 lines (shown as grey dashed lines in this panel); i.e., the points all fall into the second (upper left) and fourth (lower right) quadrants. This is generally the case for our case basins for both perfect and ESP forcing, which again shows that the SWE DA approach is successful in reducing model and forecast error.
For the three basins highlighted here, there are years where the DA SWE increment is not in the second or fourth quadrants. In these years, the increment decreases subsequent forecast skill. Overall, there are 11 of 28 (39 %), 4 of 24 (17 %), and 12 of 26 (46 %) years for the Greys, Tolt, and Merced rivers where this is the case using perfect forcing. These years generally correspond to small SWE increments relative to that year's SWE and runoff in all basins, except for 5 years in the Merced River, where the SWE increment is larger than 10 % of that year's streamflow production and is incorrect. In the Greys River, all incorrect increments are less than 10 % of the observed runoff for that year and also in years where the NoDA runoff error is less than 10 % of observed. A small increment implies that the estimated observed and model SWE are very similar, and thus, in years with small model error, the model SWE climatology closely matches observed climatology after transformation for this basin. Figure 14 highlights an example WY in the Merced River where the SWE increment and runoff error are both negative, indicating that DA increased the model forecast error.
The Merced River is the only basin to use California SWE observations, and these may be of lower quality as evidenced by the large amount of manual quality control we had to perform on the data and the discussion of these data in Lundquist et al. (2015). This suggests that observed SWE data need to be of higher quality (or information content) than the calibrated model SWE to have the positive impact in the DA approach. The calibrated Merced model has −19 % April-July runoff bias with 23 (88 %) of years having a negative runoff error. EnKF SWE increments are negative in 15 (58 %) and positive in 11 (42 %) of the years. This indicates that the observed SWE transformation to model space is largely unbiased, but the calibrated model bias impacts SWE DA performance. Calibration of the model specifically for seasonal flow to ensure minimal bias, or hydrologic parameter estimation within the EnKF approach (e.g., He et al., 2012), would likely improve hydrologic model performance and thus seasonal SWE DA forecasts in the Merced. Finally, examination of El Niño/La Niña signals (not shown) revealed no clear pattern with degradation of DA forecast skill.
Lastly, there are years where the NoDA runoff error is large, but the SWE increment is small in all three basins. This is not unexpected, as spring SWE is not perfectly correlated with subsequent runoff. This may also hint at a level of data loss in the EnKF approach, and future work should compare streamflow hindcasts using this type of DA approach with traditional statistical methods using SWE as a primary input. It also suggests that improved model calibration, or in combination with model parameter estimation in the EnKF approach (e.g., He et al., 2012), may improve DA performance across all basins, not just the Merced.

Summary and conclusions
This study tests variants of EnKF SWE DA approaches in nine case basins in the western US. These basins have seasonal runoff representative of basins used for water resource management across the western US and have at least six close SWE gauge sites with 20 or more years of observation history. Two approaches of constructing SWE ensemble observations, percentile and Z score interpolation, are examined in this study in an effort to reduce the spatial variability and decrease the interpolation uncertainty while also trans-forming the observations to model space (e.g., the range of the model climatology). A 3-month window of SWE observations generally gives the best performance for these two approaches in this study , Tables S1-6). However, in some basins a different window length may bring larger improvements. A suitable window length needs to include sufficient samples for transformation as well as including the most relevant samples (i.e., a specific seasonal time period). Sensitivity analyses of model uncertainty impacts on DA performance suggest that either the forcing-alone-based estimation of model errors underestimates the total model error variance, or the observed SWE error estimation approaches (interpolation plus the SWE regression) tend to overestimate observation uncertainty, or both (Figs. 3-4, Tables S1-6). Future work should examine this in more detail, as this work clearly indicates that uncertainty scaling approaches (for the model and/or the observations) are likely to be a valuable step for further DA improvements.
Encouragingly, the ESP-based assessment of automated SWE DA in the case study watersheds shows clearly the potential for SWE DA to enhance seasonal runoff forecasts, which is notable as the objective incorporation of observed SWE has been a long-standing challenge in operational forecasting. We show at least minor improvement in seasonal runoff forecasts in all nine basins . A notable finding is also that the benefits of SWE DA are linked to the quality of the model simulations of the basin, which can help to target the application of DA to locations where it will have the most benefit . For the basins with poor NoDA simulations (e.g., the SF Tolt River; Fig. 12), the SWE DA can potentially have greater model performance impacts. The Pacific Northwest and California were found to have the greatest potential for DA improvements to seasonal forecasting in this study (Fig. 7). This stems from weaker NoDA model performance; the NoDA model run will have more years with larger runoff errors. However, there are still individual years where DA may not improve the forecast. This likely stems from hydrologic model bias that leads to SWE state corrections enhancing rather than reducing runoff errors (e.g., Merced River; Figs. 13-14).
We chose a DA update frequency of once per year, on the date of climatological maximum correlation of modeled and observed runoff. In operational practice, updates would be applied more frequently, pointing to an area for future research. We note also that this study was conducted using conceptual lumped watershed models, similar to those used in operational practice in the US. As a result, this study does not shed light on how to address additional challenges that may be associated with using SWE DA in spatially distributed models, or with spatially continuous datasets (e.g., satellite and remote sensing SWE estimates) that are increasingly being developed or applied in streamflow forecasting contexts. SWE DA has been implemented in distributed models in prior experimental contexts across large domains (e.g., Wood and Lettenmaier, 2006), but a systematic examination of EnKF DA in spatially distributed hydrological models, coupled with a thoughtful accounting for model parameter and structural errors, remains a potentially fruitful area of research and development.

Data availability
All data used in this study are publicly available. The watershed shapefiles and basin information are described in Newman et al. (2015a) at doi:10.5065/D6MW2F4D (Newman et al., 2014). The forcing ensemble is described in Newman et al. (2015b) and is available at doi:10.5065/D6TH8JR2 (Newman et al., 2015c). The streamflow data are available through the USGS via http://waterdata.usgs.gov/usa/nwis/sw and in doi:10.5065/D6MW2F4D (Newman et al., 2014). The SNO-TEL observations are available at www.wcc.nrcs.usda.gov/ snow/ while the California SWE observations are available at http://cdec.water.ca.gov/snow.
The Supplement related to this article is available online at doi:10.5194/hess-21-635-2017-supplement.