A Retrospective Streamflow Ensemble Forecast for an Extreme Hydrologic Event: a Case Study of Hurricane Irene and on the Hudson River basin

This paper investigates the uncertainties in hourly streamflow ensemble forecasts for an extreme hydrological event using a hydrological model forced with short-range ensemble weather prediction models. A state-of-the art, automated, shortterm hydrologic prediction framework was implemented using GIS and a regional scale hydrological model (HEC-HMS). The 10 hydrologic framework was applied to the Hudson River Basin (~36,000 km2) in the United States using gridded precipitation data from the National Centers for Environmental Prediction (NCEP) North American Regional Reanalysis (NARR) and was validated against streamflow observations from the United States Geologic Survey (USGS). Finally, 21 precipitation ensemble members of the latest Global Ensemble Forecast System (GEFS/R) were forced into HEC-HMS to generate a retrospective streamflow ensemble forecast for an extreme hydrological event, Hurricane Irene. The work shows that ensemble stream 15 discharge forecasts provide improved predictions and useful information about associated uncertainties, thus improving the assessment of risks when compared with deterministic forecasts. The uncertainties in weather inputs may result in false warnings and missed river flooding events, reducing the potential to effectively mitigate flood damage. The findings demonstrate how errors in the ensemble median streamflow forecast and time of peak, as well as the ensemble spread (uncertainty) are reduced 48 hours pre-event utilizing the ensemble framework. The methodology and implications of this 20 work benefit efforts of short-term streamflow forecasts at regional scales, notably regarding the peak timing of an extreme hydrologic event when combined with a flood threshold exceedance diagram. Although the modelling framework was implemented on the Hudson River basin, it is flexible and applicable in other parts of the world where atmospheric reanalysis products and streamflow data are available. 25


Introduction
Riverine floods are known to adversely impact affected communities by causing casualties, inflicting damage to physical property, temporarily disrupting social and economic activities, and forcing a community to take emergency measures (IFRC, 2013).In the United States, for example, floods are recognized as the main natural disaster with USD 7.96 billion in flood-related damages per year and 82 fatalities per year, averaged over the past 30 years (NWS, 2014).It is reported that 78 % of emergencies are weather related (Weaver et al., 2014;Hoss and Fischbeck, 2016).
The increase in global averaged temperatures enhanced the potential for severe to extreme weather events (Becker and Grunewald, 2003;WMO, 2003).As the world warms, northern regions and mountainous areas are experiencing more precipitation falling as rain rather than snow, with a pronounced increase in precipitation being observed in the area of eastern North America (Karl, 2009).The special report on "Managing the Risks of Extreme Events and Disasters to Advance Climate Change Adaptation" of the Intergovernmental Panel on Climate Change (IPCC), critically assessed recent scientific literature on climate change and the impacts from extreme events.They reported that increased frequency and intensity of rainfall, based on climate mod-F.Saleh et al.: A retrospective streamflow ensemble forecast for an extreme hydrologic event els, substantially contributed in local flooding (Kundzewicz et al., 2014).Studies addressing flood damage in the United States show that the impact of floods to a community has increased over time as a result of both climate factors, such as precipitation, and societal factors, such as increased population and urban development (Pielke Jr., 2000;Changnon et al., 2001).Furthermore, studies report that associated monetary damages from flooding are also likely to go up in the 21st century and beyond (Milly et al., 2002;Allamano et al., 2009;Pall et al., 2011).The rise in the number of extreme weather events in recent years has spurred the need to better predict floods and mitigate flood damage.Such advanced warnings are not only important for economic losses but can mean the difference between life and death (NWS, 2012).Studies suggest that as little as 1 h of lead time can result in a 10 % reduction in flood damages if forecast information is communicated in a timely manner (McEnery et al., 2005).Flood modeling and prediction has greatly advanced in recent years with the advent of geographic information systems (GIS), high-resolution digital elevation models (DEMs), distributed hydrologic and weather models and better delivery systems on the internet (McEnery et al., 2005).However, despite the advancement in hydrological prediction systems, such systems remain plagued by uncertainty from numerical weather prediction models (Clark and Hay, 2004) and, hydrologic model and structure parameters (Krzysztofowicz, 2001a;Gupta, 2005).
In terms of atmospheric forcing, the main source of uncertainty in streamflow forecasts arises from precipitation forecast errors which include errors arising from parameterizations of physical processes in atmospheric models, resolution and initial conditions (Krzysztofowicz, 2001a; Bartholmes and Todini, 2005;Cuo et al., 2011).In this context, ensemble hydrological forecasts using every member in the ensemble are appealing to account for the uncertainty in numerical weather prediction (NWP) model forecasts (Buizza et al., 1999;Krzysztofowicz, 2001b;Bowler et al., 2008;Hamill et al., 2008;Cloke and Pappenberger, 2009).
Recent studies show the promise of adopting streamflow ensemble forecast techniques due to advantages over deterministic forecasts (Habets et al., 2004;Younis et al., 2008;Boucher et al., 2011;Schellekens et al., 2011;Verkade and Werner, 2011;Alfieri et al., 2013) as well as a way of accounting for uncertainties in hydrological forecasting (Chen and Yu, 2007;Demeritt et al., 2007;Davolio et al., 2008;Pappenberger et al., 2008;Reggiani and Weerts, 2008;Cloke and Pappenberger, 2009;Bao et al., 2011;Bogner and Pappenberger, 2011;Cuo et al., 2011;Schellekens et al., 2011;Alfieri et al., 2012;Amengual et al., 2015).Other advantages include the ability to distinguish between an extreme event forecast that is more or less likely to occur within the model's forecast horizon (Buizza, 2008;Golding, 2009) and better decision making with respect to operational hydrological concerns (Ramos et al., 2007;McCollor and Stull, 2008;Boucher et al., 2012).Furthermore, ensemble-based stream-flow forecasts tend to be more consistent between successive forecasts (Pappenberger et al., 2011).Fan et al. (2014) showed benefits in the use of ensembles, particularly for reservoir inflows on flooding events, and in comparison to the deterministic values given by the control member of the ensemble and by the ensemble mean.Komma et al. (2007) reported that for longer lead forecast times, the variability of the precipitation ensemble is amplified as it propagates through the catchment system as a result of non-linear catchment response.They also showed that the ensemble spread is a useful indicator to assess potential forecast errors for lead time greater than 12 h.
A numbers of existing techniques are currently used by the hydrologic community to account for uncertainty in hydrological forecast systems.For instance, Krzysztofowicz (2001a) implemented a method to combine uncertainties from both hydrological models and precipitation forecasts using a Bayesian forecasting system (BFS).This approach was based on decomposing the total uncertainty in the river stage into precipitation uncertainty and hydrologic uncertainty, which are quantified independently and then integrated into a predictive distribution of the river stage.Montanari and Grossi (2008) indirectly related the forecast error to the sources of uncertainty in the forecasting procedure through a probabilistic link with the current forecast issued by the hydrologic model, the past forecast error and the past rainfall.Olsson and Lindström (2008) performed analysis on separating the contributions of the precipitation forecast errors and the hydrological simulation errors.Weerts et al. (2011) used quantile regressions to assess the relationship between the hydrological forecast and the associated forecast error.Renard et al. (2010) addressed the total predictive uncertainty and separated it into input and structural components under different inference scenarios.They highlighted the inherent limitations of inferring inaccurate hydrologic models using rainfall-runoff data with large uncertainties.Brown (2015) quantified the total uncertainty in future streamflow as a combination of the meteorological forcing uncertainties and the hydrologic modeling uncertainties.He implemented a Meteorological Ensemble Forecast Processor (MEFP) to quantify the meteorological uncertainties and correct biases in the forcing inputs to the streamflow forecasts modeling.
In spite of the advancements and advantages in streamflow ensemble forecasts reported in the literature there are a number of key scientific questions that need better understanding.These include how the meteorological forecast uncertainty reflects in the ensemble streamflow forecast; how the degree of spread and hydrological response correlates with the lead time of the forecast and the scale of application; and how effective streamflow ensemble forecasting is during an extreme hydrological event.The present work investigates these questions by retrospectively forecasting streamflow of an extreme event (Hurricane Irene) in the Hudson River basin, USA.Hurricane Irene had strong hydrological effects from high moisture content that brought very heavy rainfall rates to the US east coast including the Hudson River basin (Coch, 2012) (Fig. 1).The total estimated damage from Hurricane Irene was around USD 15.8 billion.This includes about USD 7.2 billion from inland flooding and storm surges (Avila and Cangialosi, 2011).
In this paper, we first describe the case study area and context.We then summarize the main data sets that were used to implement the hydrologic framework.Subsequently, we provide a detailed quantitative analysis and discussion of the uncertainties in the streamflow forecasts associated with the forcing from weather models and demonstrate how uncertainties in streamflow forecast median, time of peak and spread are reduced approaching a given event.
2 Materials and methods

Study area and context
The study encompasses the Hudson River basin (USA) which originates from the Adirondack Mountains of upstate New York and drains into the Atlantic Ocean (Fig. 2).The drainage area of the basin is approximately 36 000 km 2 , covering 25 % of New York State and other portions of the states of New Jersey, Connecticut, Massachusetts and Vermont.The basin is considered one of the largest drainage areas in the eastern seaboard of the United States.According to a national water quality assessment study conducted by the United States Geological Survey (USGS), nearly 60 % of the water supplied in the basin is for commercial or industrial use.Several reservoirs within the Hudson River basin contribute to the New York City water-supply system, which supplies water to about 8 million people.
In 2011, Hurricane Irene caused severe damage and widespread destruction that affected the east coast of the United States.The storm made landfall as a strong tropical storm at Little Egg Inlet in New Jersey on 28 August 2011 (Fig. 1).The total precipitation accumulation from Hurricane Irene during 27-30 August 2011 was more than 300 mm in certain areas of the Hudson River basin (Fig. 4).It inundated streams throughout New Jersey resulting in peak stream flows exceeding the 100-year recurrence interval at many stream gages and causing heavy property and road damage.(Qgis, 2011), R and Python, and exported to the HEC-DSSvue storage system.Basin parameters were derived for the study area and the hydrological model was run using these inputs.
impacted with an estimated USD 1.5 billion in FEMA public assistance costs and 10 deaths (FEMA, 2011).In New Jersey, the property damage was estimated to be USD 1 billion.
In addition to high monetary damages, millions of people across the state were evacuated and seven deaths were reported (Watson et al., 2013;NJOEM, 2014).

Modeling framework description
The operational framework diagram is shown in Fig. 3 and the data sets used in constructing the Hudson River basin regional scale hydrological model are depicted in Fig. 2.
The framework was validated using the National Centers for Environmental Prediction (NCEP) North American Regional Reanalysis (NARR) precipitation data (Mesinger et al., 2006).A retrospective forecast of Hurricane Irene utilizing the 21 ensemble members from NOAA's Global Ensemble Forecast System Reforecast (GEFS/R) was then used (Hamill et al., 2015).Apart from this work, the framework is currently operational and fully automated on the Pharos (Greek for "lighthouse") Linux supercomputer at Stevens Institute of Technology.It produces four forecast cycles of ensemble river discharge per day, simulated at hourly time steps, which feed into the New York Harbor Observing and Prediction System (NYHOPS) (Bruno et al., 2006;Georgas et al., 2007Georgas et al., , 2014)).NYHOPS was developed at Stevens Institute of Technology's Davidson Laboratory to generate forecasts of the Atlantic coast, New York Harbor and Hudson River region through in situ monitoring equipment and hydrodynamic modeling (Blumberg et al., 2015).

HEC-HMS model description
The Hudson River basin was modeled using the latest Hydrologic Engineering Center's Hydrologic Modeling System (HEC-HMS), version 4.1 (USACE, 2015).HEC-HMS, developed by the US Army Corps of Engineers, is a conceptual semi-distributed hydrological model that has been used extensively in rainfall-runoff modeling and other related hydrological studies (Anderson et al., 2002;Neary et al., 2004;Knebl et al., 2005;Amengual et al., 2009;Chu and Steinman, 2009;Halwatura and Najim, 2013;Meenu et al., 2013;Seyoum et al., 2013;Zhang et al., 2013;Yang and Yang, 2014).The model uses a number of adjustable empirically derived parameters that describe the overall structure of the basin including parameters for runoff, baseflow and river routing (Feldman, 2000).In this work, the modified Clark (ModClark) distributed method (Kull and Feldman, 1998) was used to account for the spatial variability and characteristics of the basin.The gridded precipitation inputs were used to enable spatially distributed infiltration calculations at all regions of the basin.Infiltration capacity in the model was quantified using the gridded curve number (CN) methodology derived by the Soil Conservation Service (SCS) (USDA, 1986;Mishra and Singh, 2013).The SCS CN method estimates precipitation excess as a function of cumulative precipitation, soil type, antecedent soil moisture content, land use, total length of the river and the elevation of the catchment area (Scharffenberg, 2015).The baseflow component of the model includes the initial flow and the recession constant to account for groundwater contributions to stream flow (Chow, 1959;Maidment, 1992;Feldman, 2000).The river flow routing was based on the Muskingum equations of the HEC-HMS model (USACE, 2015).

Hudson River basin hydrological model data sets
The ArcGIS HEC-GeoHMS 10.2 extension (Fleming and Doan, 2013) was used to prepare and import the geographical information system (GIS) data into HEC-HMS (Johnson et al., 2001).The regional model data sets shown in Fig. 2 Hydrol.Earth Syst.Sci., 20, 2649Sci., 20, -2667Sci., 20, , 2016 www.hydrol-earth-syst-sci.net/20/2649/2016/ New York  include topography obtained from the USGS National Elevation Dataset (NED) (Gesch et al., 2002), land surface cover obtained from the US Department of Agriculture National Resource Conservation Service (NRCS) and soil data for New York State and New Jersey gathered from the State Soil Geographic Database (STATSGO) (Miller and White, 1998).

Pennsylvania
Land use data sets were obtained from the USGS National Land Cover Dataset (NLCD) (Homer et al., 2012).The Hudson River basin was first delineated into subbasins based on flow direction and accumulation derived from a digital elevation model (DEM) using HEC-GeoHMS (Fleming and Doan, 2013), and then each sub-basin was subdivided into hydrologic response units, each of which has a gridded curve number representing its runoff response rate based on its unique combination of land use, soil type and slope (Gassman et al., 2007).The gridded SCS curve number was obtained by intersecting land use and land cover with the soil data using the CN-grid tool in HEC-GeoHMS.An example of the curve number grid GIS layer over a subbasin of the Hudson River is shown in Fig. 2. It is important to point out that model runoff estimation can be very sensitive to soil moisture, and using static SCS curve number parameters may introduce limitations when the model is run operationally.To overcome such limitations, the framework utilizes a look-up table for the initial abstraction parameters based on the hindcast and the continuous run of the model with the NARR data.Other future techniques that involve integrating machine learning techniques to automatically select the optimal initial abstraction parameters on the fly can be advantageous as well.The sub-basins' storage coefficients and imperviousness percentage were derived from the GIS data sets described earlier this section using HEC-GeoHMS 10.2 (Fig. 2).The observed streamflow data for 15 USGS gauging stations, located in the Hudson River basin (Fig. 2), were automatically retrieved using the R dataRetrieval package (Hirsch and De Cicco, 2015).The observed discharge data were recorded at 15 min time intervals and imported to the USACE Data Storage System Visual Utility Engine (HEC-DSSVue) (HEC, 2009).
The model's baseflow recession constants were derived using an automated base flow separation technique based on the R low flow statistics package "lfstat" (Gustard and Demuth, 2009;Koffler and Laaha, 2012).In total, 25 years of observed historical flow data were used to derive the baseflow recession constants for each sub-basin in order to better simulate the falling limb of the simulated discharge hydrograph.The optimal recession constant values for the sub-basins ranged from 0.67 to 0.90 depending on the sub-basin that was considered.The calculated recession constant values were consistent and in agreement with the ones reported in the literature (Pilgrim and Cordery, 1993;Feldman, 2000).
For the initial baseflow, we used observed conditions in the gauging stations to reduce uncertainties in the model as the model was intended to forecast short-term extreme events, within a 96 h forecast horizon, and not long-term simulations.The model was forced with gridded precipitation, discussed in detail in the next sections of this paper, to work with the ModClark transform method (Kull and Feldman, 1998).

North American Regional Reanalysis
NARR is a long-term, dynamically consistent, highresolution, high-frequency, atmospheric and land surface hydrology data set for the North American domain (Mesinger et al., 2006).NARR was developed as a major improvement upon the earlier National Centers for Environmental Prediction -National Center for Atmospheric Research Global Reanalysis 1 (NCEP-NCAR GR1).NARR data have successfully assimilated high-quality and detailed precipitation observations into the atmospheric analysis to create a long-term, consistent, high-resolution climate data set for the North American domain.The temporal resolution of the NARR data is 3 h and the spatial resolution is 32 km (Mesinger et al., 2006).The NARR precipitation data have been used in a number of hydrological studies.For instance, Choi et al. (2009) used the NARR data sets to successfully calibrate the semi-distributed land-use-based runoff processes (SLURP) model.Solaiman and Simonovic (2010) used the NARR data in a regional hydrological basin and reported satisfactory performance of such data in scarce regions.
In this work, NARR precipitation data, 26-31 August 2011, corresponding to Hurricane Irene was used in the hydrological model applied to the Hudson River basin in a 2 km common hydrologic gridded format using bicubic interpolation.Table 1 displays the rainfall accumulation totals extracted from the NARR data for selected sub-basins of the Hudson River.The hydrological simulation using this data set of precipitation was considered as the simulation of reference and was compared with the ensemble forecast that is reported in the next section.

Global Ensemble Forecast System Reforecast
(GEFS/R) The Global Ensemble Forecast System (GEFS) is a weather forecast model made up of 21 ensemble members (Hamill et al., 2013(Hamill et al., , 2015)).The GEFS accounts for the amount of uncertainty in a forecast by generating an ensemble of multiple forecasts, each minutely different, or perturbed, from the control forecast.The GEFS 1 • horizontal resolution data reforecasts used initial conditions obtained from high-quality reanalyses data and the same assimilation system that is used operationally.These reforecasts have been shown to be particularly useful for the calibration of relatively uncommon phenomena such as heavy precipitation (Hagedorn, 2008;Hamill et al., 2008).In relation to hydrology, reforecasts help produce quantitative probabilistic estimates of river streamflow that are as sharp and reliable as possible (Schaake et al., 2007).

Statistical criteria used to assess models' performance
The performance of the models was statistically evaluated using the criteria of Nash-Sutcliffe efficiency, referred to hereafter as NSE (Eq.1), and bias (in %) between simulations and observations, referred to hereafter as BIAS (Eq.2).The NSE measures the fraction of the variance of the observed flows explained by the model in terms of the relative magnitude of the residual variance (noise) to the variance of the flows (information); the optimal value is 1.0 and values should be larger than 0.0 to indicate minimally acceptable performance (Nash and Sutcliffe, 1970;O'Connell et al., 1970).
where N is the number of compared values, P i is the simulated (forecast) value, O i is the observed value and O i is the average of O i time series.
The BIAS measures the average tendency of the simulated values to be larger or smaller than the observed ones.The optimal value of BIAS is 0.0, with low-magnitude values indicating accurate model simulation.Positive values indicate overestimation, whereas negative values indicate underestimation (Yapo et al., 1996). ( Model BIAS is reported relative to the mean observation magnitude (Eq.2), in percentage (%).

HEC-HMS model calibration using NARR precipitation data
Upon implementing the model setup, we calibrated the hydrological model to assess its ability to reproduce the Hurricane Irene event using the NARR gridded precipitation data (Mesinger et al., 2006).The HEC-HMS model was run on a 2 × 2 km standard hydrologic grid resolution (SHG) (Maidment and Djokic, 2000) at hourly time steps.The simulated flow hydrographs were calibrated against hourly river flow observations to obtain optimal performance in terms of both runoff volume and peak flow.The hydrological parameters were modified to produce a best-fit model using a root mean square error (RMSE) objective function within the HEC-HMS model's Nelder-Mead optimization method (Barati, 2011;Seyoum et al., 2013), aiming at maximizing the fit between simulated streamflow and observations at 15 US Geological Survey (USGS) gauging stations (Fig. 2).The calibration was also carried out by visual and statistical comparison to produce an accurate simulation of discharge at the Hudson River gauging stations (Fig. 2).We also used the HEC-HMS uncertainty function which, through a variant of Latin hypercube sampling, varies sensitive model parameters within a defined range, thereby producing an estimate of best-fit parameters after multiple iterations (Mousavi et al., 2012).For example, the hydrologic Muskingum routing parameters were modified to include a greater ratio of attenuation to translation of runoff in the sub-basins which significantly improved the model results (Fig. 5).To assess and evaluate the parameter uncertainties we performed a Monte Carlo-based uncertainty analysis available in HEC-HMS 4.1 (Scharffenberg et al., 2015).Table 1 lists the NARR accumulated precipitation for each sub-basin of the Hudson River while the summary of the model fit represented by the NSE and BIAS criteria is shown in Fig. 4. The NARR hydrological model results (also referred to as the simulation of reference) showed reasonable fit between model and observations in all the selected subbasins.The lowest NSE was 0.75 and certain sub-basins had NSE values higher than 0.90 (Fig. 4).Amongst the 15 flow stations examined in the study, 13 stations had a BIAS below 10 % while 2 stations had a BIAS higher than 10 %.It was observed that the stations with a higher BIAS were located in the upstream parts of the basin, notably the Hoosic River subbasin (USGS ID 01334500) (Fig. 4).Importantly, the hourly hydrograph shape and timing of peaks accurately replicated the observations as illustrated in Fig. 5. Overall, the reference simulation exhibited a representative fit to observations.Thus, the developed calibrated framework showed promising results for generating streamflow forecasts with a 96 h lead time using ensemble member weather forecast forcing.

Ensemble river discharge retrospective forecast
We forced the HEC-HMS model with precipitation fields from the GEFS-retrospective forecast ensemble members to examine the variations in simulated discharge among the ensemble members.More specifically, we fed the Hudson River basin hydrological model with every single GEFS member of the 21 available members.The resulting sets of streamflow forecasts were then analyzed to better understand the uncertainty of the streamflow forecasts that arises from the hydrologic framework's response to uncertainties in the meteorological forcing.The spread of ensemble members is considered as a useful measure of forecast uncertainty (Pappenberger et al., 2005).
To assess the skill of the forecasts, we compared, at lead times of 72, 48 and 24 h, observations with results of the individual ensemble members, the median of all members, the ensemble control member (GEFSC00) and the NARR simulation of reference (Fig. 5).We chose to include the control member in these comparisons as a proxy for the single deterministic models used when ensemble forecasts are not considered.The hydrological parameters and baseflow conditions calibrated in the NARR simulation of reference were retained in all simulations.The 98th and 2nd percentiles of accumulated precipitation from the 21 ensemble members at each forecast are reported in Table 1.
In the reforecast issued 72 h prior to the event, the high uncertainty in the NWP ensemble inputs translated to a high uncertainty in the simulated streamflow when simulated by the hydrological model.For instance, in station ID 01391500 (Saddle River at Lodi, NJ), there was a 20-fold spread in the accumulated precipitation amongst the 21 ensemble members.For the same station the peak flow ranged from base flow (3 m 3 s −1 ) to 242 m 3 s −1 (exceeding the major flood threshold) (Fig. 5).At the Hackensack River at New Milford the peak flow was ranging anywhere between 20 and 525 m 3 s −1 .The magnitude of spread in other stations was similar (Fig. 5).At that point of the retrospective forecast, the streamflow simulated using the control member (GEFSC00) underestimated the observed flow in all stations except for station ID 01375000 (Croton River on Hudson, NY) where the flow was simulated correctly (Fig. 5).However, the spread between the individual ensemble members remained very high as reported above.In terms of estimated time of peak, one notes that the GEFSC00 control member correctly projected the peak time of the event in all stations with an error of ± 3 h when compared with observations.However, other individual members had an offset of up to 24 h between simulated and observed peaks in certain stations (Fig. 5-a1, -b1, -c1, -d1 and -e1).For instance, at station ID 01375000 (Croton River on Hudson, NY), one individual member was projecting an estimated peak on the evening of 28 August while another individual member was projecting it at noon of the following day.In this particular station, the observed peak was around the midnight of 28 August.This finding offers an interesting perspective in terms of precisely extracting hydrograph features as one may use the control member at this stage of the forecasts to precisely project the time of the peak with a temporal error of ± 3 h.In terms of statistical evaluation, Figs. 6 and 7 report the NSE and BIAS (%) for selected stations.For station ID 01375000, the control member (GEFSC00) predicted the flow accurately with a NSE of approximately 0.95, however the flow is underestimated by about 20 % in the forecast issued 72 h prior to the event.Overall, only seven ensemble members at this station had a NSE above 0.70 while more than 60 % of the members underestimated or overestimated the flow hydrograph by more than 30 % (Fig. 7).Although uncertainties in baseflow initial conditions and model hydrological parameters are not addressed in this work, it was noted that uncertainties from precipitation inputs have a substantial impact on the prediction compared to remaining uncertainties in the initial conditions and parameters of the calibrated hydrological model.
In the next reforecast, issued 48 h before the event, the spread (or the uncertainty envelope) was substantially reduced, notably in the projected time of the peak .This is primarily due to the decrease in accumulated precipitation ensemble spread (Table 1).The magnitude of the peak ensemble spread was between 104 and 229 m 3 s −1 for the Saddle River.The same river had a spread on the order of 20-fold in the prior forecast.This improvement was observed at other stations as well.At 48 h lead time, the control member systematically overestimated the flow at all stations (Fig. 5-a2, -b2, -c2, -d2 and -e2), but  it consistently predicted the time of peak, as was the case in the forecast issued 72 h before the event.The uncertainty in the time of the peak also decreased, with most of the ensemble members predicting the peaks within ± 3 h from the observed ones.This suggests that the peak ensemble spread should not be the only metric considered to quantify potential uncertainty, it suggests that other features, such as the peak timing and magnitude, should also be examined.The control member (GEFSC00), which predicted the flow correctly for Croton River on Hudson (ID 01375000) 72 h before the event, however, at 48 h lead time showed a marked overestimation of the flow 65 %.This suggests that the accuracy of the predictions also varies temporally and no single member can be relied upon consistently as a perfect forecast.Overall, there was a significant improvement in NSE for most of the ensemble members in the forecasts issued 48 h before the event.
In the final reforecast considered, issued 24 h before the event, the spread was further reduced with all the members predicting the rising and the falling limbs of the hydrograph more accurately.The control member presented an almost perfect forecast for Hackensack River at New Milford (Fig. 6).The peak ensemble spread was reduced by 57 %, 60 % and 48 % compared to the 72 h lead time predicted peak ensemble spread at the Hackensack River, Saddle River and Wallkill River stations, respectively.Also, the peaks were predicted to occur within ± 3 h of the observed event peak in all cases.The control member seemed to be consistent with the observations in terms of the peak time at this stage of the forecast and the uncertainties are less than that projected 48 h before the event.At this stage of the forecast 75 % of the en- Figure 7. Models' performance represented by BIAS (%) at lead times of 72, 48 and 24 h from the observed peak flow.The metrics are showing the NARR model outputs, the median of ensemble members and the GEFS control member.The GEFS perturbed members are shown in grey.
semble members had a NSE higher than 0.75 (Fig. 6).This is consistent with findings in recent ensemble streamflow studies that used different modeling frameworks (Thielen et al., 2009;Fan et al., 2014;Yang and Yang, 2014).The results also suggest that the magnitude of spread between the ensemble members depends significantly on the sub-basin drainage area.For example, at station ID 01391500, Saddle River at Lodi (140 km 2 ) a peak discharge of 242 m 3 s −1 corresponds to a maximum accumulated precipitation of 206 mm.Thus, there is peak flow of about 1 m 3 s −1 for 1 mm of accumulated precipitation.This ratio, however, increases for a basin with a larger area.For example, at Wallkill River at Gardiner, NY (ID 01371500, 1800 km 2 ) there was a peak flow of approximately 10 m 3 s −1 for 1 mm of accumulated precipitation as it propagates (non-linearly) through the drainage area, thus there is a correlation between the spread in the precipitation data and the area of the sub-basin.
To have an overall assessment of the forecast skill, we calculated the range and average NSE and BIAS (%) across all stations (Fig. 8) in the Hudson River basin for the different reforecast ensemble members 24 h before the event.The figure depicts how a member that has a good NSE and BIAS at one station can have a very poor performance in other parts of the basin that may partly be due to the statistical downscaling of precipitation from 1 • resolution to 2 km, and the associated uncertainty in the spatial distribution of precipitation.The median of all the members and the control member showed good performance with an average NSE of 0.75 compared to each of the members.The results show that there is no "one size fits all" solution for selecting an ensemble member, noting that each sub-basin has its own distinct set of characteristics manifested in local conditions, such as the size of the basin and land use.This finding calls for further work involving higher-resolution precipitation models to assess the effect of basin size and meteorological forcing resolutions.

Threshold exceedance persistence diagram
In addition to the statistical metrics and visual comparison of stream-flow, an assessment of the forecasts skill was carried out for Hurricane Irene using a threshold exceedance persistence diagram at 6 h time intervals of the forecasts streamflow time series (Fig. 9).The exceedance diagram used in this study is an adaptation of the operational European Flood Alert System (Bartholmes et al., 2009;Thielen et al., 2009).Such quantitative diagrams give an idea about the forecast persistence and support operational flood management decisions and human judgment.The major flood threshold for each station was obtained from the NOAA National Weather Service (2011).It is defined as the category of flood where extensive inundation of structures and roads is expected, which calls for significant evacuation of people and transfer of property to higher elevations (NWS, 2012).
The major thresholds for observed and simulated discharge were transformed into dichotomous time series of 1 (the major threshold is exceeded) and 0 (the major threshold is not exceeded).The information of each cell in the matrix diagram is the probability of exceeding the major flooding value, calculated using all the ensemble members at a given forecast.For instance, a probability of 100 % suggests that all 21 ensemble members used in the work are projecting a major flood.Figure 9 exhibits the exceedance diagram results at selected stations of the Hudson River basin in which observations exceeded the major flood threshold during this event.The results suggest that one may establish a highly reliable streamflow forecast 48 h prior to the event.For instance, the 72 h lead time reforecasts (issued on 26 August 2011, 00:00 GMT) for station ID 01381900 were projecting a 52 % probability (11 members out of 21) of having a major flood event 78 h out.However, the probability increased to 100 % in the 27 August 2011, 00:00 GMT (48 h before the event) reforecast for the same station.The average across all stations for the day 3 (72 h before the event) reforecast was showing a 60 % probability of having a major flood on 29 August 2011 (between 00:00 and 06:00 GMT), while the day 2 (48 h before the event) reforecasts had a 99 % average chance of exceeding the major flood threshold.This highlights the increase in reliability as the event approaches.The diagram also suggests that the persistence of the event occurrence among subsequent forecasts is a good indicator to trigger a major flood warning, especially when the proportion of members above the threshold exceeds 71 % (15 or more out of the 21 ensemble members).By contrast, in station ID 0137500, both observations and flow ensemble members did not exceed the major flood threshold (Fig. 9).This is particularly important for the reliability and validation of the operational forecast system.However, caution should be practiced when interpreting the persistence exceeding diagrams as time series will have to be examined in parallel to confirm any potential discrepancy between the models.

Summary and discussion
The first part of this work consisted of implementing a regional scale hydrological modeling framework on the Hudson River basin using the HEC-HMS model (USACE, 2015) forced with NARR gridded precipitation inputs (Mesinger et al., 2006).The second part investigated the use of GEFS (Hamill et al., 2013(Hamill et al., , 2015) ) ensemble inputs to retrospectively forecast an extreme hydrological event, Hurricane Irene, with a 96 h time horizon at hourly time steps.In total, 21 GEFS ensemble members were tested on the Hudson River basin with reforecasts issued 72, 48 and 24 h prior to the event.
In terms of assessment, we visually and statistically quantified the results of the GEFS individual members, the deterministic forecasts represented by the control member (GEFSC00) and the median of all members.We also used the persistence exceedance diagram of established thresholds to project the possibility of major floods based on all the individual members.The comparison showed that ensemble forecasts are advantageous in quantifying uncertainty of forecasts lead time and raising reliability from an operational perspective.This work did not address uncertainty in hydrological initial conditions and model parameters, which were Figure 9. Color-coded threshold exceedance diagram for Hurricane Irene forecasts at 6 h intervals using the major flood threshold for each USGS station.The flow values below the station ID are the major flood threshold for a given station.The x axis represents the 96 h forecast horizon from the simulation date shown on the left column.The time period, in which the major event hydrological record exceeds the equivalent alert threshold, is indicated using a dark red cell, while the cell values refer to the percentage of ensemble members that were projecting a major flood.For instance, if the value is 100 then all 21 flow ensemble members are projecting a major event within a given time interval, while 0 means none of the ensemble members are exceeding the major event threshold.The observed occurrence of the threshold is exhibited at the last row for each station, the red color code indicates an observed flow higher than the major flood threshold, while the green cell indicates flow below the major threshold.Reported time is in GMT; date format is mm/dd/yyyy.optimized in NARR-based calibration, but rather was focused on how dominant the precipitation inputs are on the results of the hydrological streamflow forecasts.The findings of this work confirm that using ensemble streamflow predictions has advantages over deterministic forecasts in terms of better representing the uncertainty, particularly in an extreme hydrologic event such as the one presented in this work.Furthermore, while the general perception is that hydrological uncertainty is reduced with lead time, there are no studies that quantitatively characterize this aspect using the GEFS retrospective data and in the event of an extreme flood event such as Hurricane Irene.
The work shows that streamflow forecasts are highly dependent on the meteorological inputs and reflect uncertainties associated with these inputs.Clearly, the relatively small area of the sub-basin and resolution of the weather model was of importance for the spread overestimation due to the precipitation inputs.In this context, higher-resolution weather models such as the European Centre for Medium-Range Weather Forecasts (ECMWF) (Molteni et al., 1996;Thiemig F. Saleh et al.: A retrospective streamflow ensemble forecast for an extreme hydrologic event et al., 2015) or the Coupled Ocean-Atmosphere Mesoscale Prediction System (COAMPS) ® (Pullen et al., 2015) may be of particular interest in smaller scale subcatchments in order to provide higher accuracy.The operational GEFS ensemble forecast data sets were upgraded to 0.5 • resolution on 2 December 2015.More detailed weather model forcing may lead to a reduction in the streamflow spread, and reduce the uncertainty of the forecast.
The modeling of the Hudson River basin demonstrated that a given streamflow ensemble member may produce streamflow that is highly in agreement with observations, accounting for uncertainty in precipitation and in the hydrological model parameters.However, there is no "one size fits all" solution for selecting an ensemble member, noting that each sub-basin has its own distinct set of characteristics manifested in local conditions such as the area of the drainage basin and land use.Furthermore, caution should be exercised in forecast models that only use the control member from the weather models or the average ensemble member as it may lead to considerable deviation in river discharge forecasts from observations resulting in false warnings and missed flooding events (Fig. 5).
The findings also suggest that higher confidence in the river discharge forecasts may be attained as we approach a major event by approximately 48 h.The outcomes of this work provide interesting perspectives for future ensemble postprocessing techniques and features extraction, notably regarding the peak timing of an extreme hydrologic event when combined with the major flood persistence diagram.
The operational framework presented in this work offers an improvement over the available NOAA's Advanced Hydrological Prediction System (AHPS) in this particular region (McEnery et al., 2005) and is operationally running 125 ensemble members including (in addition to GEFS) the ECMWF, ECMWF-HRES, the Short-Range Ensemble Forecast (SREF), the Canadian Meteorological Centre (CMC) and the North American Mesoscale Forecast System (NAM).The AHPS streamflow forecasts are at 6 h time intervals using one weather deterministic forecast as input and with a lead time that is less than 60 h in this region (Adams, 2015).
The framework is highly flexible and directly handles GRIB1, GRIB2 and NetCDF meteorological input formats.This operational flexibility in the framework allows it to use other sources of meteorological data instead of NARR, which is not available in areas outside North America.Examples of atmospheric reanalysis products that may be used include the European Centre for Medium-Range Weather Forecasting (ECMWF), National Centers for Environmental Prediction (NCEP) and National Center for Atmospheric Research (NCAR).As for framework validation, if applied to watersheds without gauging stations (or with only a few), it is possible to use remote-sensing river discharge data to calibrate and validate the modeling outputs.Despite that fact that such data have uncertainties, there have been many advancements in this field and there is potential for future applications (e.g., the Surface Water and Ocean Topography (SWOT) satellite mission; Saleh et al., 2012;Biancamaria et al., 2015).Thus, this framework shows promise for operational streamflow forecasting in other parts of the world.
The computational time required to run the Hudson River basin for the 21 GEFS ensemble members was approximately 30 min.This includes processing the GRIB input files and postprocessing of the ensemble outputs.The total time required to run the entire 125 ensemble members is approximately 5.5 h; this includes preprocessing inputs and updating the database and the Stevens Flood Advisory System (SFAS) website.The forecasts are updated every 6 h, with a lead time of 87 h, which is sufficient for issuing a flood warning.
The current work maybe potentially expanded to integrate a hydrodynamic model, such as HEC-RAS (Brunner, 2002), that has advantages in terms of simulating the water levels and flood extents in addition to streamflow.Such hydrodynamic modeling requires flow boundary conditions that can be obtained from the hydrologic forecasting model.This modeling will also require detailed representation of rivers' cross sections that may be obtained using lidar data at sitespecific locations; however, it is not always available at regional scale (Saleh et al., 2011).
In terms of applications, the Hudson River basin regional scale model may be used to continuously forecast the overall variability of the water resources (Saleh et al., 2011;Pryet et al., 2015;Saraiva Okello et al., 2015), predicting fate and transport of water quality such as nitrate (Schoonover and Lockaby, 2006;Wang et al., 2012;Bastola and Misra, 2015;Schuetz et al., 2016) and climate change scenarios (Ducharne et al., 2007(Ducharne et al., , 2010;;Graham et al., 2007;Quintana Seguí et al., 2010;Habets et al., 2013).Moreover, socio-economic analysis may be used to weigh how such improved forecasts can prevent loss of life and minimize the damage to property, with the aid of effective communication and social media.

Data availability
All underlying research data used to construct the hydrologic model are publicly available and accessible online.The sources of data used in this work are in the assets tab.

Figure 1 .
Figure 1.Geostationary Operational Environmental Satellite (GOES) east image of Hurricane Irene making landfall on 28 August 2011 (image source: National Oceanic and Atmospheric Administration (NOAA), 2011).

Figure 2 .
Figure 2. Map showing the Hudson River basin topography including basin divisions (thin black lines) and hydrographic network.Examples of land use, curve number and imperviousness data sets (zoomed) that were used in HEC-GeoHMS to construct the hydrological model are also shown.The upper right side of the figure exhibits an example of the HEC-HMS model structure and its subbasins using standard hydrologic grids (SHG (2 × 2 km)).

Figure 4 .
Figure 4. Summary of HEC-HMS performances at the USGS streamflow stations.The circles represent the criteria of Nash and Sutcliffe (NSE) while the squares represent the BIAS (%).The statistical criteria are computed at hourly time steps.The model performances are illustrated for the NARR precipitation forcing.The map also shows the total observed accumulated precipitation received in 48 h during 28-29 August 2011 (data from National Oceanic and Atmospheric Administration, 2011).The time series for selected flow stations are presented in Fig. 5.

Figure 6 .
Figure 6.Models' performance represented by Nash-Sutcliffe efficiency (NSE) at lead times of 72, 48 and 24 h from the observed peak flow.The metrics are showing the NARR model outputs, the median of ensemble members and the GEFS control member.The GEFS perturbed members are shown in grey.

Figure 8 .
Figure 8. Overall NSE and BIAS (%) across the stations for forecast issued 24 h before Hurricane Irene.