A Hydrological Prediction System Based on the SVS Land-Surface Scheme: efficient calibration of GEM-Hydro for streamflow simulation over the Lake Ontario basin

This work explores the potential of the distributed GEM-Hydro runoff modeling platform, developed at 15 Environment and Climate Change Canada (ECCC) over the last decade. More precisely, the aim is to develop a robust implementation methodology to perform reliable streamflow simulations with a distributed model over large and partly ungauged basins, in an efficient manner. The latest version of GEM-Hydro combines the SVS (Soil, Vegetation and Snow) land-surface scheme and the WATROUTE routing scheme. SVS has never been evaluated from a hydrological point of view, which is done here for all major rivers flowing into Lake Ontario. Two established hydrological models are confronted 20 to GEM-Hydro, namely MESH and WATFLOOD, which share the same routing scheme (WATROUTE) but rely on different land-surface schemes. All models are calibrated using the same meteorological forcings, objective function, calibration algorithm, and basin delineation. GEM-Hydro reveals competitive with MESH and WATFLOOD: NSE √ (NashSutcliffe criterion computed on the square-root of the flows) are for example equal to 0.83 for MESH and GEM-Hydro in validation on the Moira River basin, and to 0.68 for WATFLOOD. A computationally efficient strategy is proposed to 25 calibrate SVS: a simple unit hydrograph is used for routing instead of WATROUTE. Global and local calibration strategies are compared in order to estimate runoff for ungauged portions of the Lake Ontario basin. Overall, streamflow predictions obtained using a global calibration strategy, in which a single parameter set is identified for the whole basin of Lake Ontario, show skills comparable to the predictions based on local calibration: the average NSE √ in validation and over 7 subbasins is of 0.73 and 0.61, respectively for local and global calibrations. Hence, global calibration provides spatially consistent 30 parameter values, robust performance at gauged locations, and reduces the complexity and computation burden of the calibration procedure. This work contributes to the Great Lakes Runoff Inter-comparison Project for Lake Ontario (GRIP-O) which aims at improving Lake Ontario basin runoff simulations by comparing different models using the same input forcings. The main outcome of this study consists in a new generalizable methodology for implementing a distributed


Introduction
Given the continuous increase in precipitation forecast skill of numerical weather prediction (NWP) systems (Sukovich et al., 2014), it became possible to obtain skillful runoff forecasts directly from NWP model outputs, and streamflow forecasts by routing these gridded runoff fields.Indeed, modern NWP models tend to simulate to some extent the snow, vegetation, and soil processes that contribute to the generation of runoff and streamflow.In practice, however, many limitations are still associated with the representation of such processes in NWP systems, which were documented in Clark et al. (2015) and Davison et al. (2016).
Hydrological processes simulated by land-surface schemes (LSS) have been increasingly integrated into NWP models (Balsamo et al., 2009;Masson et al., 2013;Alavi et al., 2016;Wagner et al., 2016), as soil water content and snow water equivalent are recognized as key state variables for streamflow forecasting (Koster et al., 2004;Entekhabi et al., 2010).Environment and Climate Change Canada (ECCC), which provides operational weather and environmental forecasts within its boundary, is currently in the process of implementing a major upgrade to the LSS of the Global Environmental Multi-scale model (GEM), the national model.This new scheme, named SVS for soil, vegetation and snow, has been devised to assimilate space-based soil moisture retrievals as well as surface data, and has proven efficient at simulating soil moisture and brightness temperature (Alavi et al., 2016;Husain et al., 2016).SVS will be used to replace the Canadian version of the ISBA (Interaction Sol-Biosphère-Atmosphère) scheme that has been used in GEM since 2001(Bélair et al., 2003).One of this paper's objectives is to present the first evaluation of the capabilities of the new SVS scheme for streamflow prediction in Canada.
GEM's LSSs can be run either two-way coupled to the atmospheric model or offline, using GEM or other observed atmospheric forcing.The platform for running GEM offline is known as GEM-Surf (Bernier et al., 2011).Runoff obtained from the LSS can then be routed to the outlet of the basin using the WATROUTE routing scheme (Kouwen, 2010).This configuration is known as GEM-Hydro.
Although the SVS scheme typically performed well for soil moisture simulations (e.g., Alavi et al., 2016;Husain et al., 2016), the capabilities of SVS to predict streamflow within the framework of GEM-Hydro, especially for large basins with ungauged portions, have not yet been examined.In this work, we present the calibration and evaluation of GEM-Hydro based upon the SVS scheme for streamflow simulation over the Lake Ontario basin.
The Lake Ontario basin is chosen for the application of GEM-Hydro because the basin can favor the examination of GEM-Hydro (and SVS) performance for runoff simulation over a wide range of hydrological conditions (mixed vegetation/land cover, natural/regulated regimes, gauged and ungauged portions), and because there are a large amount of data available for model setup for this region.
Different cascades of interconnected models have been developed over the years to simulate the Great Lakes water levels and thermodynamics, as reported by Wiley et al. (2010), Deacu et al. (2012) and Gronewold et al. (2011), the latter describing the Advanced Hydrologic Prediction System (AHPS), a seasonal water supply and water level forecasting system developed by the National Oceanic and Atmospheric Administration (NOAA) Great Lakes Environmental Research Laboratory (GLERL) in the mid-1990s that has since been employed operationally by the USACE and regional hydropower authorities.Recently, ECCC has developed a short-term (84 h) operational water cycle prediction system (coupled atmospheric, hydrologic and hydrodynamic modeling) for the Great Lakes and St. Lawrence River (WCPS-GLS; see Durnford et al., 2017).The system uses the version of GEM-Hydro that relies on the simpler ISBA LSS.
To our knowledge, the AHPS and WCPS systems are the only two systems that can provide inflow forecasts for each of the Great Lakes on both sides of the Canada-US border, and neither relies on very sophisticated hydrological models.The need for improving simulations and forecasts of runoff to the Great Lakes has been recognized by both agencies (Gronewold and Fortin, 2012).Multiple additional hydrologic models are indeed available (Coon et al., 2011), but their spatial domains are typically constrained to either the US or Canada.Before embarking on an upgrade of operational systems, GLERL and ECCC agreed to perform a number of intercomparison studies under the umbrella of the Great Lakes Runoff Intercomparison Project (GRIP), in order to better understand the status of existing systems, and to set a benchmark for model performance against which future models could be evaluated.The first study was conducted on the Lake Michigan (GRIP-M) basin by Fry et al. (2014), who compared historical runoff simulations from dissimilar hydrologic models using different calibration frameworks and input data.Amongst the models compared were GLERL's Large Basin Runoff Model (LBRM; Croley and He, 2002) that is part of the AHPS, the NOAA National Weather Service model (NWS; Burnash, 1995), and ECCC's MESH distributed model (Modélisation Environnementale -Surface and Hydrology; Pietroniro et al., 2007;Haghnegahdar et al., 2014).A second configuration of MESH was also included, based on Deacu et al. (2012), from which evolved the configuration of GEM-Hydro used by Durnford et al. (2017) for the operational WCPS-GLS system.The NWS model performed best in terms of NSE, but was positively biased, perhaps be-cause of its typical use as a flood forecasting tool.Overall, it was difficult to attribute any difference in model results to the model structure, given that different forcing data and calibration procedures had been used by each contributor to the project.
The GRIP project was extended next to Lake Ontario (GRIP-O) by Gaborit et al. (2017), who compared two lumped models, namely LBRM and GR4J (modèle du Génie Rural à 4 paramètres Journalier; Perrin et al., 2003), based upon the exact same forcing data and calibration framework.Two precipitation datasets were used as input: the Canadian Precipitation Analysis (CaPA; Lespinas et al., 2015), and a Thiessen polygon interpolation of the Global Historical Climatology Network -Daily (GHCND; Menne et al., 2012).CaPA is a near-real-time quantitative precipitation estimate product from ECCC that is available on a 10 km grid for all of North America (http://collaboration.cmc.ec.gc.ca/cmc/cmoi/ product_guide/submenus/capa_e.html).
The main finding of the first GRIP-O study is that the performance of the models was very satisfactory, resulting in an average NSE √ (Nash-Sutcliffe criterion computed on the square root of the flows) in validation of 0.86 (over all subbasins and configurations), despite the fact that most tributaries have a regulated flow regime.This satisfactory performance justifies the use of CaPA as a precipitation forcing dataset in later studies, especially for distributed models which require gridded precipitation as input.The performance of lumped models also provides a reference level of performance when evaluating distributed hydrological models.
As an extension of the first GRIP-O study, the present work is focused on the evaluation of distributed hydrologic models for Lake Ontario basin runoff simulations.Distributed models typically have a broader range of applications than lumped ones.For example, GEM-Hydro can be utilized to estimate the Lake Ontario net basin supplies (or NBSs, the sum of lake tributary runoff, overlake precipitation, and overlake evaporation : Brinkmann, 1983).However, distributed models are more complicated to calibrate and more computationally intensive, especially for large basins.The present study mainly aims at developing a methodology to improve the calibration efficiency of the distributed GEM-Hydro model for streamflow modeling over the Lake Ontario basin, including its ungauged parts.The proposed methodology is transferable and can be applied to other sophisticated distributed models and large basins with ungauged parts.In order to assess the impact of the SVS land-surface scheme on runoff simulations, the GEM-Hydro model is compared with two other distributed models, which rely on the same routing scheme (WATROUTE) as used in GEM-Hydro but different land-surface schemes.The intercomparison of the three models could also provide insight into avenues to further improve GEM-Hydro and to capture structural uncertainty in runoff simulations using the multi-model approach.

Models
Three different platforms are compared in this study: MESH, WATFLOOD and GEM-Hydro.MESH and GEM-Hydro have in common a distributed representation of most hydrological processes occurring in a basin and a structure organized around two main components: a LSS for the representation of surface processes (evapotranspiration, infiltration, snow processes, water circulation in the soils), and a river routing scheme for simulating water transport in the streams, which consists of WATROUTE for all models.WATROUTE is a 1-D hydrologic routing model relying mainly on flow directions and elevation data (Kouwen, 2010).It routes to the basin outlet the surface runoff and recharge produced by the surface schemes.In WATROUTE, runoff directly feeds the streams while recharge can be provided to an optional lower zone storage (LZS) compartment, representing superficial aquifers, which releases water to the streams.WAT-FLOOD and GEM-Hydro make use of the LZS, whereas recharge from MESH feeds directly into the stream.WAT-FLOOD is not considered to include a LSS because it is not solving the energy balance, only the water balance, but it is distributed.
The version of MESH used in this study relies on version 3.6 of the Canadian LAnd Surface Scheme (CLASS).Each grid cell is subdivided in a number of tiles, and each tile is classified as belonging to one of the five grouped response units (GRUs), based on its land-use-soil-type combination.In this paper, we follow the local calibration strategy advocated by Haghnegahdar et al. (2014) for MESH (see section on calibration strategy).
GEM-Hydro is very similar to MESH, but is tied to the LSSs available in GEM: ISBA and SVS.A previous study on the same basin demonstrated the clear superiority of SVS over ISBA, especially in regard to the baseflow component of the streamflow (see Gaborit et al., 2016).We thus only use SVS with GEM-Hydro in this paper.
WATFLOOD (Kouwen, 2010) is a distributed model of intermediate complexity that only needs precipitation and temperature as forcing, as opposed to MESH and GEM-Hydro which need additional atmospheric variables (see Supplement).It relies on the GRUs concept and on many empirical equations.WATFLOOD has been employed by Pietroniro et al. (2007) over the Great Lakes basin.
GEM-Hydro is implemented with a 10 arcmin resolution for the LSS and 0.5 arcmin (≈ 1 km) for the routing.Sensitivity tests (Gaborit et al., 2016) revealed that 2 and 10 arcmin resolutions for SVS lead to quite similar performance in terms of streamflow at the outlet, while a substantial amount of computation time is saved when running the coarser resolution (see Fig. 1).The pre-processing time required by WA-TROUTE remains almost the same whatever the domain size (Fig. 1), which mitigates the interest of using a coarse resolu-   1 shows the datasets used for physiographic information.
As the GEM-Hydro suite (including WATROUTE) is quite demanding in terms of computation time, it was decided to test a stand-alone configuration of GEM-Hydro relying on text files only and in which WATROUTE is replaced by a Unit Hydrograph (UH).This version is henceforth referred to as GEM-Hydro-UH.Figure 1 gives an overview of the relationship between computation time of the different models and the dimension of their domain.Note that GEM-Surf (land-surface part of GEM-Hydro) was run on ECCC's supercomputer while GEM-Hydro-UH and WATROUTE were run on a machine with an AMD Athlon dual-core processor 4800+, because GEM-Hydro-UH and WATROUTE are not parallelized yet (their computation time would not change substantially if run on ECCC's supercomputer).
The computation time for the experiment setup described here and when splitting the domain in four on an ECCC supercomputer is about 1.5 min day −1 for GEM-Surf, provided that the pre-processing of the atmospheric variables was already done (which is the case in calibration: the pre-processing is done only once).WATROUTE (i.e., the routing part of GEM-Hydro) requires 25 s day −1 for the setup described here when running on a local machine.The WATROUTE pre-processing (i.e., preparation of the WA-TROUTE input files from the SVS outputs, which would need to be done for each new run in calibration) takes about 30 s day −1 and is quite constant whatever the domain size of the inputs fields.One simulation run over the GRIP-O period (4.5 years) therefore currently requires about 2 days with GEM-Hydro, which makes it impossible to perform any automatic calibration (which requires at least 400 runs; see below).GEM-Hydro-UH, based on a stand-alone version of SVS, requires only about 3 % of the GEM-Hydro computation time mainly because of the Input/Output processing time: the stand-alone version makes use of text files which are kept open during the simulation and it requires only 3 s day −1 on a local machine for this setup (1.2 h for the 4.5 years GRIP-O period or 20 days of calibration with 400 runs if running the whole domain).However, the computation time required by WATROUTE still had to be bypassed to perform automatic calibrations, which was done with the UH concept.The UH (see for example Sherman, 1932) allows the estimation of the streamflow at the basin outlet by partitioning the basin averages of runoff and recharge in time.The same WATROUTE LZS formulation is used in GEM-Hydro-UH in order to estimate stream recharge.The basin averages required for the UH are computed as a weighted average of the SVS grid cells located in the considered basin.The UH only requires a decay parameter corresponding to the lag or response time of the considered basin, which controls the delay between the rainfall event and the resulting streamflow peak.This is estimated with the Epsey method (Almeida et al., 2014), which requires the basin area, perime-  ter, and the maximum and minimum elevations along the basin main river.The UH lag time is also used as a free parameter during calibration (Table 2).This is inspired by the UH applied to the routing storage of GR4J (Perrin et al., 2003), but is employed here at an hourly time step.Finally, this framework allows a considerable reduction of computation time and therefore allows us to perform calibration.However, GEM-Hydro-UH is faster than GEM-Hydro as long as the domain size remains of the order of a few thousand points (see Fig. 1).Beyond that threshold, not only is calibration no longer feasible with GEM-Hydro-UH, but it is possible that it becomes even slower than GEM-Hydro since the latter can be parallelized.Hydrographs resulting from GEM-Hydro and GEM-Hydro-UH can be very similar (Fig. 3).Finally, the SVS parameters identified by calibrating GEM-Hydro-UH are next transferred to the full version of GEM-Hydro, which then only needs WATROUTE Manning coefficients to be adjusted (if needed) in order to mimic the optimal hydrographs obtained with GEM-Hydro-UH.This last adjustment can be done manually with a few offline WA-TROUTE runs.

Study area and data
The GRIP-O spatial framework is defined in Fig. 2. A more detailed description of the area is available in Gaborit et al. (2017).The Lake Ontario basin (Fig. 2) covers 83 000 km 2 , of which 19 000 km 2 is the lake surface.All upstream water ar-  (Wang et al., 2015).Dots (blue for natural flow regimes and red for regulated regimes) are the mostdownstream flow gauges (i.e., the main tributaries' gauges which are closest to Lake Ontario's shoreline) selected for model calibrations.
riving through the Niagara River is excluded to focus only on the lateral runoff component of Lake Ontario NBSs (see Introduction).The US-Canada border follows the Niagara River, the middle of Lake Ontario, and the St. Lawrence River down to the Moses-Saunders dam at Cornwall, On-É.Gaborit et al.: A hydrological prediction system based on the SVS land-surface scheme tario, the lake outlet.Apart from some major cities (e.g., Toronto), the basin is mostly rural (agriculture, pasture, forest), as shown in Danz et al. (2007).
Streamflow time series were selected based on their duration and proximity to the lake shoreline.Of the 30 selected sites (Fig. 2), 27 have no missing data, 2 are complete at 94 %, and 1 at 80 % over the GRIP-O period.Nearly 70 % of the total Lake Ontario basin is gauged by the selected sites.Most of the rivers are regulated in some way, mainly for hydropower and flood mitigation, but regulation generally consists of reservoirs with a simple weir at their outlet (i.e., static control).Therefore, this did not prevent lumped models from registering good performance in the former GRIP-O study of Gaborit et al. (2017).As a consequence, no effort was made to represent in a detailed manner the artificial structures of the region in WATROUTE.Moreover, the small diversions occurring to fill some canals in the region, or even the aquifers which can contribute significantly to baseflow (Singer et al., 2003;Kassenaar and Wexler, 2006), do not prevent lumped models from registering good performance, which is helpful to this study.
The physiographic data required by the distributed models under study consist of soil texture, land use/land cover, digital elevation model (DEM), and flow direction grids.Table 1 lists the datasets used to provide the physiographic and atmospheric inputs required by the models.There are 26 land cover classes are defined in GEM-Hydro.Soil textures are from the Global Soil Dataset for Earth system modeling (GSDE; Shangguan et al., 2014), which contains information down to 2.8 m.Soil texture was not calibrated for GEM-Hydro-UH, but some hydraulic parameters, which are derived from soil texture, were calibrated (Table 2).The maximum soil depth is calibrated in GEM-Hydro-UH (Table 2) -see Supplement for MESH and WATFLOOD configuration details.
Precipitation forcings consist of 24-hourly accumulations from the Canadian Precipitation Analysis (CaPA version 2.4b8).Over the period of interest, CaPA consists of precipitation fields modeled by the Canadian Regional Deterministic Prediction System (RDPS, ≈ 15 km resolution), corrected by local rain gauge observations (Lespinas et al., 2015).The daily CaPA accumulations were disaggregated on an hourly time step by following the temporal pattern of hourly precipitation from the RDPS (Carrera et al., 2010).The remaining atmospheric forcings are taken from RDPS outputs, using short-term forecasts having lead time of 6 to 18 h.

Calibration strategy
The GRIP-O experiment extends from 1 June 2004 to 26 September 2011.Calibrating a hydrologic model over a period of 4 to 5 years is generally deemed sufficient to achieve reasonable model robustness (e.g., Refsgaard and Knudsen, 1996).The calibration period thus ranges from 1 June 2007 to 26 September 2011 (4.5 years).Validation is from 1 June 2005 to 1 June 2007 (2 years, the last one being used as spin-up for calibration), and spin-up from 1 June 2004 to 1 June 2005 (1 year).Note that during the automatic calibrations, the spin-up year was simulated only once and for all subsequent runs.The objective function is the Nash-Sutcliffe criterion (Nash and Sutcliffe, 1970) computed on the square root of the observed and simulated time series (Eq.1), in order to avoid overemphasizing peak-flow events -henceforth referred to as "NSE √ ": (1) These decisions are consistent with the lumped modeling decisions made for GRIP-O in Gaborit et al. (2017).Other evaluation criteria used in this study consist in the common Nash-Sutcliffe criteria (NSE), the Nash criteria calculated over the log of the flows ("NSE Ln"), and a percent bias criteria (PBIAS, Eq. 2) assessing the simulation's overall water budget fit: a positive value denotes a general tendency to underestimate flows, and vice versa: • 100. (2) All metrics are evaluated at the daily time step.Calibration relies for all models on the dynamically dimensioned search (DDS) algorithm (Tolson and Shoemaker, 2007).Calibration cost did not allow models to be calibrated locally for all GRIP-O subbasins (Fig. 2), but only those shown in Fig. 5.One local calibration takes between 2 and 5 days of computation (400 model runs; see below).Table 2 lists the free parameters of GEM-Hydro.GEM-Hydro-UH was calibrated using multiplicative coefficients that adjust the spatially varying values of a given parameter, leading to a reasonable number of free parameters (16) while preserving spatial variability -see Supplement for MESH and WATFLOOD calibration details.
It is important to emphasize that the approach used to calibrate GEM-Hydro may result in unrealistic values for some parameters, as the multiplicative coefficients could bring them beyond the range of physical coherence.More precisely, soil water content thresholds and albedo (Table 2) cannot be higher than 1.Therefore, these values were constrained to realistic ranges after they were adjusted by the calibration algorithm by imposing them a minimum value of 0 and a maximum of 1.
The initial parameter values were set to default ones that generally provide satisfactory results for the model (GEM-Hydro-UH, Table 2).The number of maximum model runs allowed was set to 400 for GEM-Hydro-UH (Sect.2.2).This maximum appeared sufficient in the sense that the algorithm converged to a good quality final result before reaching 400.This is because the number of GEM-Hydro-UH free parameters is relatively low (16, Table 2).The DDS algorithm is  very efficient in the sense that it adjusts the search behavior to the maximum number of objective function evaluations (model runs) in order to converge to good quality solutions (Tolson and Shoemaker, 2007).The similarity of the performances obtained with GR4J and GEM-Hydro-UH (Fig. 5) supports the choice of the methodology used here, as GR4J was implemented with a maximum of 2000 model runs, three distinct calibration trials, and had an even lower number of free parameters (6; see Gaborit et al., 2017).
Even though the three models studied here were not calibrated using the same number of free parameters and the same maximum allowed model runs (see Supplement), it is assumed that the calibration strategies employed allow each model to come very close to its optimal performance for a given subbasin and the time period considered.Indeed, the strategy used for each of the three models always involves parameters affecting the whole range of the main hydrological processes.The most important methodological consistencies for achieving a fair comparison between models include, in our view, a common calibration algorithm and objective function, along with common physiographic and forcing data.
Finally, some subbasins in Fig. 2 have more than one major tributary flowing into Lake Ontario.In this case, the mostdownstream observed flows on independent tributaries are summed and then extrapolated to the whole subbasin using the area ratio method (ARM; Fry et al., 2014).The resulting "synthetic" flows were considered as observations for GEM-Hydro-UH calibration over the whole subbasin, including its ungauged parts.This methodology was applied to all subbasins with more than one most-downstream gauge (identified with the "n/a" mention for the station attribute in Table 3) for consistency with the calibration experiments performed in the first GRIP-O study (see Gaborit et al., 2017), and because lumped models (and GEM-Hydro-UH) can only estimate streamflow at one location.For these subbasins, the true gauged fraction is specified in Table 3.

Strategy for ungauged areas
The ultimate objective of the GRIP-O project consists in improving simulated Lake Ontario NBSs, which calls for estimating runoff from all ungauged areas.To do so, calibration was performed over the GRIP-O gauged area (which includes all GRIP-O gauged subbasins; see Fig. 2), and the resulting parameter set was used in the model implemented over the whole Lake Ontario basin, including its ungauged parts.The "GRIP-O gauged area" is actually gauged at 88.5 % due to the strategy used for subbasins with several major tributaries (see end of previous section).
For GR4J, a single (unique) model was used over each of these two areas, requiring a unique calibration and a straightforward parameter transfer.Hence, for GR4J, local calibration was used but with a unique model for the GRIP-O gauged area.
GEM-Hydro-UH was, however, implemented locally for each of the gauged GRIP-O subbasins, but a global calibration strategy (see further down) led to a unique calibrated parameter set which was then transferred to a GEM-Hydro model implemented over the whole Lake Ontario basin.
The approach based on calibration for the GRIP-O gauged area and parameter transfer to the whole Lake Ontario basin  was preferred to other possible alternatives mainly because it allows taking into account rainfall over the ungauged areas as well as rainfall over the gauged areas, or, in other words, to use the best approximation of rainfall.
The global calibration of GEM-Hydro-UH consists in finding a unique trade-off parameter set that allows for simultaneous improvement in performances for all subbasins (Ajami et al., 2004;Haghnegahdar et al., 2014;Gaborit et al., 2015), whereas local calibration consists in finding each subbasin's optimal parameter set.Local calibration logically leads to the optimal performances for a given subbasin, but global calibration may lead to temporal robustness (Gaborit et al., 2015) and spatial consistency of the parameter values, because they are either fixed or adjusted the same way over the whole area under study.Local calibration, on the other hand, because of equifinality and experiment imperfections (model processes, forcing data, observed flows, etc.), may compensate for simulation errors and lead to parameter sets that do not work well when transferred to other (even neighbor) subbasins, as suggested by the fact that very different parameter sets were obtained here with the local calibrations of GEM-Hydro-UH (Sect.2.1 and Table 4).Global calibration is not exempt from equifinality issues either, but to a lesser degree than local calibration.Indeed, the use of global parameters constrains parameter values across the basin to be equal and thus provides less freedom to achieve the same overall performance with different parameter sets.Moreover, the attention paid to the parameter ranges used (Table 2) allows us to be confident in the physical relevance of the final parameter values.
The objective function associated to global calibration of GEM-Hydro-UH is as follows: with Nloc i the NSE √ value calculated from the local calibration on subbasin i, and Nglob i the NSE √ calculated from the global calibration on subbasin i.This objective function aims at minimizing differences between performances obtained from global and local parameter sets.However, as Hydrol.Earth Syst.Sci., 21, 4825-4839, 2017 www.hydrol-earth-syst-sci.net/21/4825/2017/ GEM-Hydro-UH was not locally calibrated for all of the 14 GRIP-O subbasins (only those of Fig. 5 because of the computation cost), performances obtained with local GR4J calibrations (Gaborit et al., 2017) were used for missing ones, justifying the use of that model in this study.
With global calibration, the response time parameter controlling the UH duration (Table 2) was replaced with a multiplicative factor adjusting the default response times of all local subbasins.Models were finally implemented over the whole Lake Ontario basin (Fig. 2), and runoff simulations performed with the parameter sets calibrated over the GRIP-O gauged area.GEM-Hydro was selected for this task instead of GEM-Hydro-UH since it was more straightforward and a priori more realistic (see further) to use WATROUTE instead of the simple UH for the ungauged areas of the Lake Ontario basin.In GEM-Hydro, standard Manning coefficients were used in WATROUTE, while the lag time of GEM-Hydro-UH was adjusted during calibration.But it was assessed that simulations with GEM-Hydro (calibrated SVS and LZS parameters and standard Manning values) were very close, both in terms of hydrographs and performances at the gauged sites, to those from the calibrated GEM-Hydro-UH.Figure 4 summarizes the methodology described here for estimating runoff from the ungauged areas of the Lake Ontario basin with GEM-Hydro.

Results
The comparison between GEM-Hydro and GEM-Hydro-UH is first presented to demonstrate the relevance of the UH approach to save the computation time associated with running the routing model of GEM-Hydro.Score improvements obtained by calibrating GEM-Hydro-UH for several subbasins of Lake Ontario basin are then presented, followed by a performance comparison for all models.Finally, the methodology proposed with GEM-Hydro and the lumped GR4J model to simulate streamflows for the ungauged parts of the Lake Ontario basin is evaluated.
Figure 3 presents the hydrographs simulated for the Moira River (subbasin 11 in Fig. 2), with SVS default parameters, standard WATROUTE parameter values in the case of GEM-Hydro, and a UH lag time estimated with the Epsey method in the case of GEM-Hydro-UH.As can be seen from this figure, GEM-Hydro-UH is able to produce streamflow simulations which are very close to those obtained with GEM-Hydro, underlying the relevance of such an approach to save computation time.Between the uncalibrated GEM-Hydro and GEM-Hydro-UH performances and over the different GRIP-O subbasins, the average absolute difference in NSE √ was 8 %, with the worst difference being 21 % (GEM-Hydro being most of the time better than GEM-Hydro-UH).A complete GEM-Hydro run over the GRIP-O calibration period (4.5 years) takes about 48 h, while the GEM-Hydro-UH version requires only 1.2 h over the same period.

GEM-Hydro-UH local calibrations
This section presents GEM-Hydro-UH performances (Fig. 5) either with its default parameter values or after its local calibration on Lake Ontario subbasins, whose characteristics are given in Table 3.
As can be seen from Fig. 5, calibration provides substantial improvements in NSE √ values.Similar results were obtained for NSE and NSE Ln (although these results are not shown), and a lower improvement for PBIAS; Interestingly, all uncalibrated NSE √ are above zero (Fig. 5), and even satisfactory for subbasins 10 and 11.This is encouraging for ungauged subbasin applications.It can also be noticed in Fig. 5 that calibration sometimes inverts the sign of the PBIAS criteria (switching from over-to under-estimation or vice versa).
Calibration also improves GEM-Hydro-UH snow water equivalent (SWE) simulations but to a lesser degree than for the streamflow.For example, the NSE values for SWE simulations over the four consecutive winters of the GRIP-O period improved from −0.12 to 0.42 for the Genessee subbasin, and from 0.49 to 0.68 for the Black River subbasin, respectively before and after calibration (the SWE variable was not used in the computation of the objective function).SWE ob-Table 5. Performance for the GRIP-O gauged area and the whole Lake Ontario basin (Fig. 2) with GR4J and globally calibrated GEM-Hydro-UH and GEM-Hydro models.Cal., val.: calibration and validation periods, respectively.servations come from the SNow Data Assimilation System (SNODAS; see NOHRSC, 2004).Calibration does influence evapotranspiration, but no observations are available to evaluate this model output.For example, for the Moira River, the mean subbasin annual evapotranspiration (over the calibration period) is equal to 527 mm and to 647 mm, before and after calibration respectively.The robustness of the model is also deemed very good, since performances do not substantially deteriorate between calibration and validation (Table 5).Calibrated parameter values are quite different from one subbasin to the other (even for neighbor subbasins), which may be due to equifinality (different parameter sets can lead to similar simulations) but also to the anthropogenic streamflow regulations.Table 4 presents the ranges of the final parameter values obtained with local calibration.This strongly limits the potential for parameter transferability to ungauged subbasins (Razavi and Coulibaly, 2012;Parajka et al., 2013).As explained in Sect.1.4, global calibration can help overcome this by leading to a spatially coherent parameter set.Results of such an approach are presented in Sect.2.3.

GRIPO
Calibrated GEM-Hydro-UH performance values are generally very close to those obtained with GR4J and CaPA precipitation (Fig. 5): the mean absolute difference in NSE √ values is 6.1 %, with the maximum being 15 % (GR4J being generally better).This is very encouraging as the performance benchmark set by GR4J simulations is most of the time quite high and hard to attain for other models.

Inter-comparison of all models
This section aims at comparing MESH, WATFLOOD and GEM-Hydro-UH, but detailed results specific to MESH and WATFLOOD are only provided in the Supplement to this paper.When looking closely at the Moira River hydrographs, for the three calibrated models (Fig. 6), important differences arise.For instance, WATFLOOD has a more flashy behavior and tends to overestimate peak flow events, MESH generally underestimates flows, and GEM-Hydro-UH lies somewhere in between.Peak flow events (even for other subbasins) associated to the spring freshet are generally better represented by MESH, which may be due to a better representation by CLASS of various cold regions hydrological processes, such as snow accumulation and melt, snow interception by vegetation, as well as soil freezing and thawing.NSE √ in validation for this basin are respectively equal to 0.83, 0.68 and 0.83 for MESH, WATFLOOD and GEM-Hydro-UH.

Runoff estimation for the whole Lake Ontario basin
The parameter values identified from the global calibration are presented in Table 4, along with the ranges resulting from local calibrations.See Sect.1.4 for more information about methodology related to global calibration.It can be seen from Table 4 that final global parameters generally lay inside the intervals obtained from local calibration, highlighting the trade-off found by global calibration.Moreover, it was noticed (not shown here) that parameter values were very different between local and global calibration procedures, even for basins displaying very similar performances between the two strategies (such as subbasins 3, 5 and 8; see Fig. 7), highlighting the fact that local calibration is more prone to overcalibration (i.e., equifinality).GEM-Hydro-UH results are given first for each gauged subbasin, in order to compare global calibration, local calibration and default parameters (Fig. 7), followed by GR4J and GEM-Hydro results (with global calibration) for the GRIP-O gauged area and the whole Lake Ontario basin (Table 5 and Figs.8-9).
GEM-Hydro-UH performances are lower with global calibration than with local calibration, as expected, and sometimes even lower after global calibration than with the default parameters for some subbasins (notably 10 and 11, Fig. 7).However, performances are satisfactory for most of the 14 GRIP-O subbasins with a single parameter set, which confirms that global calibration fulfilled expectations.Given that it takes about 7 days to achieve a local calibration, global calibration, which was completed in 20 days for the 14 subbasins at once, allows us to save a substantial amount of computation time.Furthermore, global calibration favors the spatial consistency of parameters and facilitates parameter transfer to ungauged areas, whereas there is no a priori best manner to transfer parameter values obtained from local cal-Hydrol.Earth Syst.Sci., 21, 4825-4839, 2017 www.hydrol-earth-syst-sci.net/21/4825/2017/  ibration (Razavi and Coulibaly, 2012;Parajka et al., 2013).
In this study, the strategy related to parameter transfer to the ungauged subbasins is based on spatial proximity, which was already identified as among the best parameter transfer methods for this type of climate in Canada (Razavi and Coulibaly, 2012).Although a comprehensive assessment of the reliability of the methodology used here for parameter transfer would require the "leave-one-out" framework (see Razavi and Coulibaly, 2012), the satisfying performances and temporal robustness obtained for all GRIP-O subbasins with global calibration, along with the spatial consistency of the unique final parameter set, the homogeneity of the area under study and the spatial proximity of ungauged basins together justify the relevance and a priori reliability of the methodology employed in this study.This statement is moreover supported by the evaluation performed for the whole basin in what follows.
Performance evaluation for the total GRIP-O gauged area (Table 5) shows that GR4J is better than GEM-Hydro-UH in calibration, but worse in validation.GEM-Hydro-UH leads to a very satisfactory performance, but most importantly to a better streamflow simulation than GR4J in terms of dynamics (see Fig. 8).Note that the smoother GR4J behavior is not due to the single model approach for the whole area, as a similar behavior occurred when aggregating simulations from local GR4J models (Gaborit et al., 2017).This smooth behavior seems inherent to the lumped attribute and concepts of GR4J.As depicted in Table 5, performances for the GRIP-O gauged area obtained with GEM-Hydro are close to those obtained with GEM-Hydro-UH, despite being lower for the former, which comes from the standard (uncalibrated) Manning coefficients used with GEM-Hydro, whereas the UH lag time was adjusted during the calibration of GEM-Hydro-UH.WATROUTE coefficients could have been manually tuned in order for GEM-Hydro performance values to reach those of GEM-Hydro-UH in Table 5, but this was not deemed necessary given the already very satisfying performance values obtained with the uncalibrated Manning values.
Runoff simulations for the whole Lake Ontario basin, including its ungauged areas, are very promising (Table 5).Even if runoff observations actually consist in this case in estimations based on the ARM, computed performances are a priori reliable given that the true gauged fraction of the total area is equal to about 70 %, and that the ARM proves reliable starting from a 50 % gauged fraction (Fry et al., 2014).
It is therefore argued that the methodology proposed here (global calibration of GEM-Hydro-UH and parameter transfer to GEM-Hydro) is relevant, efficient and reliable, provided that a large enough fraction of the total area is gauged.It could moreover be applied in different climatic contexts, regions, and with different models.Finally, Lake Ontario monthly NBSs were estimated with the globally calibrated GEM-Hydro model, and results were compared both to the GLERL residual and component NBS estimates (Fig. 9).Residual NBSs rely on the lake observed change in storage and streamflows for the Niagara and St. Lawrence rivers (DeMarchi et al., 2009).Component NBSs used here are based on the GLERL Monthly Hydrometeorological Database (GLM-HMD; Hunter et al., 2015), which relies on observed data extrapolated with the ARM for runoff, on observed data interpolated with the Thiessen polygon method for overlake precipitation, and on the Large Lake Thermodynamics lumped Model (LLTM) for overlake evaporation.Component NBS estimates are updated on a regular basis.Data used in this work were updated on 2 August 2016.It is still unknown which of these two NBS estimation methods (i.e., residual or component method) is the most accurate (DeMarchi et al., 2009).
It can be seen that the cumulated NBS estimates derived from the calibrated GEM-Hydro model (using global calibration) stand between the component and residual NBS estimates, but are closer to the latter ones.It is, however, difficult to draw any conclusion regarding the bias of these estimation methods given the uncertainty associated with NBS estimates (DeMarchi et al., 2009).When comparing the GLM-HMD component NBS method to the calibrated GEM-Hydro simulation on a component-by-component basis, the main difference between the two occurs for overlake evaporation, with evaporation from the component method being significantly lower than GEM-Hydro evaporation (not shown).This mainly explains why the NBS estimates from the component method are higher than the other estimates in Fig. 9.But again, it is not possible to accurately evaluate overlake evaporation estimates given the lack of observations for this variable.The uncalibrated GEM-Hydro model results in cumulative NBS estimations which are below all other NBS estimations, which tends to suggest that they are underestimated.Therefore, the methodology proposed to calibrate GEM-Hydro seems to improve Lake Ontario NBS simulations.

Discussion and conclusion
This study explored for the first time the performance of SVS to estimate runoff for a large basin with ungauged portions.Our results indicate that the SVS LSS, as embedded in GEM-Hydro and GEM-Hydro-UH, led to reasonable streamflow simulations for the Lake Ontario basin.According to the intercomparison experiment conducted for three subbasins (see Supplement), GEM-Hydro-UH and GEM-Hydro are both competitive to MESH and WATFLOOD.GEM-Hydro has even proven able to produce decent, generally satisfactory runoff simulations with default parameter values, except for Hydrol.Earth Syst.Sci., 21, 4825-4839, 2017 www.hydrol-earth-syst-sci.net/21/4825/2017/ areas with a high urban cover fraction.This result is encouraging because SVS is expected to replace ISBA in ECCC operational models in the coming years.
The model intercomparison study also indicates that there is still room to further improve SVS.For example, adding the soil freeze-thaw processes to the current SVS may improve GEM-Hydro simulations of runoff peaks in spring.Additionally, a new snow module (ISBA-ES) is also being implemented in SVS, which currently relies on a simple forcerestore approach.Finally, work is under way to represent a surface of ponded water in each SVS grid cell, in order to represent subgrid-scale lakes and wetlands and to better account for the delay associated with surface runoff transfer into the streams.
The calibrated GEM-Hydro-UH performance values are close to GR4J ones (Gaborit et al., 2017).The potential benefits of global calibration have been demonstrated here, as for a previous Hydrotel application (Gaborit et al., 2015).It achieves satisfactory performances for a large area with a unique calibration and favors temporal robustness, spatial consistency and parameter transferability.Therefore, one of this study's main outcomes is the confirmation that global calibration is a very promising and efficient methodology to implement hydrologic models over large areas.It saves computation time and leads to a spatio-temporally robust parameter set that can be transferred to nearby (ungauged) areas.This outcome is important because parameter transfer methods derived from local calibration are still largely prone to failure.More studies still have to be performed with global calibration on other basins and with other hydrological models to confirm the value of this methodology, which worked well for the model and basin studied here.Global calibration of SVS is envisioned in future versions of the WCPS, to assess its benefits in improving weather forecasts, as a calibrated SVS could be coupled to the RDPS atmospheric model, and because a calibrated SVS version should improve surface flux representation.Calibrating a LSS based on streamflow and then using it in an atmospheric model to improve weather forecasts has not been reported in the literature so far, to our best knowledge.Another originality of this work which may be of interest to a broad audience is the way the distributed parameters were adjusted during calibration.Instead of regrouping the parameters by GRU as for SA-MESH (see Supplement), which led to 60 free parameters during calibration, GEM-Hydro-UH was calibrated with only 16 parameters, which consist mainly of multiplicative factors by which the associated actual parameter values were all multiplied the same way.This allows preserving the spatial variability and coherence of a given parameter, while minimizing the number of free parameters that still affect the whole domain.Of course, additive or exponent factors could be used too, if deemed more relevant.This strategy is moreover suited to using the DSS algorithm, which allows a very fast convergence (in less than 400 iterations) when a limited number of free parameters are used, and therefore contributes to the efficiency of the implementation methodology proposed here.Again, this could be applied to any distributed hydrologic model.Furthermore, in order to calibrate the GEM-Hydro model, its standard routing part was replaced by a simple UH during calibration of the landsurface scheme, the simpler setup requiring only 3 % of the original computation time.The routing component of GEM-Hydro can be run afterwards, and re-calibrated separately.Once again, the UH can be used with any LSS and on any basin, which allows us to calibrate a distributed model when the routing part is time-consuming, as for WATROUTE.
We developed a methodology (global calibration, multiplicative factors used in calibration, and the UH bypass of the routing component) to improve the calibration efficiency and performance of the distributed GEM-Hydro model for streamflow modeling over the Lake Ontario basin, including its ungauged parts.The proposed methodology is transferable and can be useful to the hydrologic community, especially for those who want to use distributed hydrologic models to simulate streamflow for large basins with ungauged parts.
Finally, this work presented the development of an efficient distributed hydrological modeling platform for the Lake Ontario basin, which can be used as a readily testing ground for distributed models.During the preparation and writing of this paper, using the proposed methodology in this study, GEM-Hydro was also applied to the Canadian Nelson, Churchill and MacKenzie River basins as well as the whole Hudson Bay basin, with satisfactory performance values.This is encouraging given the high degree of regulation involved in some of these basins.

Figure 1 .
Figure 1.Computation time for GEM-Surf (land-surface part of GEM-Hydro), GEM-Hydro-UH and WATROUTE.See text for details.The number of grid points in this study is 1276 (476 000) for GEM-Surf/GEM-Hydro-UH (WATROUTE).

Figure 2 .
Figure2.GRIP-O spatial framework: Lake Ontario subbasin delineation (GRIP-O subbasins).GLAHF subbasins are from the Great Lakes Aquatic Habitat Framework(Wang et al., 2015).Dots (blue for natural flow regimes and red for regulated regimes) are the mostdownstream flow gauges (i.e., the main tributaries' gauges which are closest to Lake Ontario's shoreline) selected for model calibrations.

Figure 4 .
Figure 4. Diagram summarizing the methodology employed to simulate Lake Ontario runoff with GEM-Hydro.

Figure 5 .
Figure 5. Uncalibrated and calibrated GEM-Hydro-UH performances over the calibration period.Results are presented as NSE √ (a) and PBIAS (b), for many GRIP-O subbasins.The gray dashed line shows perfect scores and GR4J reference is displayed with black markers.

Figure 7 .
Figure 7. GEM-Hydro-UH performances in validation for the 14 GRIP-O gauged subbasins (Fig. 2) with default, locally and globally calibrated parameter values.Perfect scores are shown.Results are presented as NSE √ (a) and PBIAS (b).

Figure 9 .
Figure 9. Cumulative Lake Ontario NBS (net basin supply) estimates.Months are shown on the x axis.See text for further details.

Table 1 .
Data sources.NAm: North America; n/a: not applicable.

Table 4 .
Final parameter values or ranges after calibration; for global calibration, HU_decay consists of a multiplicative coefficient.See Table2for parameter definition.n/a: not applicable.