Global evaluation of runoff from 10 state-of-the-art hydrological models

Observed streamflow data from 966 medium sized catchments (1000–5000 km2) around the globe were used to comprehensively evaluate the daily runoff estimates (1979– 2012) of six global hydrological models (GHMs) and four land surface models (LSMs) produced as part of tier-1 of the eartH2Observe project. The models were all driven by the WATCH Forcing Data ERA-Interim (WFDEI) meteorological dataset, but used different datasets for non-meteorologic inputs and were run at various spatial and temporal resolutions, although all data were re-sampled to a common 0.5 spatial and daily temporal resolution. For the evaluation, we used a broad range of performance metrics related to important aspects of the hydrograph. We found pronounced intermodel performance differences, underscoring the importance of hydrological model uncertainty in addition to climate input uncertainty, for example in studies assessing the hydrological impacts of climate change. The uncalibrated GHMs were found to perform, on average, better than the uncalibrated LSMs in snow-dominated regions, while the ensemble mean was found to perform only slightly worse than the best (calibrated) model. The inclusion of less-accurate models did not appreciably degrade the ensemble performance. Overall, we argue that more effort should be devoted on calibrating and regionalizing the parameters of macro-scale models. We further found that, despite adjustments using gauge observations, the WFDEI precipitation data still contain substantial biases that propagate into the simulated runoff. The early bias in the spring snowmelt peak exhibited by most models is probably primarily due to the widespread precipitation underestimation at high northern latitudes.

Abstract. Observed streamflow data from 966 medium sized catchments (1000-5000 km 2 ) around the globe were used to comprehensively evaluate the daily runoff estimates  of six global hydrological models (GHMs) and four land surface models (LSMs) produced as part of tier-1 of the eartH2Observe project. The models were all driven by the WATCH Forcing Data ERA-Interim (WFDEI) meteorological dataset, but used different datasets for non-meteorologic inputs and were run at various spatial and temporal resolutions, although all data were re-sampled to a common 0.5 • spatial and daily temporal resolution. For the evaluation, we used a broad range of performance metrics related to important aspects of the hydrograph. We found pronounced intermodel performance differences, underscoring the importance of hydrological model uncertainty in addition to climate input uncertainty, for example in studies assessing the hydrological impacts of climate change. The uncalibrated GHMs were found to perform, on average, better than the uncalibrated LSMs in snow-dominated regions, while the ensemble mean was found to perform only slightly worse than the best (calibrated) model. The inclusion of less-accurate models did not appreciably degrade the ensemble performance. Overall, we argue that more effort should be devoted on calibrating and regionalizing the parameters of macro-scale models. We further found that, despite adjustments using gauge obser-vations, the WFDEI precipitation data still contain substantial biases that propagate into the simulated runoff. The early bias in the spring snowmelt peak exhibited by most models is probably primarily due to the widespread precipitation underestimation at high northern latitudes.

Introduction
Hydrological models are indispensable tools for many purposes including, but not limited to, (i) flood and drought forecasting, (ii) water resources assessments, (iii) assessing the hydrological impacts of human activities, and (iv) increasing our understanding of the hydrological cycle. It is more than 50 years since the first attempts at hydrological modeling (Linsley and Crawford, 1960;Rockwood, 1964;Sugawara, 1967;Freeze and Harlan, 1969). Since then, a plethora of conceptual, physically based, and stochastic hydrological models has been developed, each with its own assumptions and characteristics (for non-exhaustive overviews; see Singh, 1995;Singh and Frevert, 2002;Rosbjerg and Madsen, 2006;Trambauer et al., 2013;Sooda and Smakhtin, 2015;Bierkens et al., 2015;Kauffeldt et al., 2016). Because all hydrological models are inevitably imperfect representations of real-ity, they produce highly uncertain estimates even if we would have access to perfect meteorological data (Beven, 1989).
One of the most useful variables for hydrological model evaluation is runoff, since it reflects the integrated response of a host of hydrological processes occurring in a catchment (Fekete et al., 2012) and because observations are readily available for many catchments across the globe . Table 1 lists, to our knowledge, all macro-scale (i.e., continental to global scale) studies evaluating the runoff estimates of multiple models that have been published so far. Out of these 20 studies, two focused on the conterminous USA, five focused on Europe, while 13 had a global scope. However, many of these studies used observations from a relatively small number (< 100) of large catchments ( 10 000 km 2 ). The use of a small number of basins limits confidence in the results and precludes a spatially detailed assessment, while the large size of the catchments makes it more difficult to distinguish between deficiencies in the forcing, the (sub-)surface component, or the river routing component of the modeling chain. Moreover, a large number of the studies only evaluated monthly mean runoff, precluding analysis of the shape of individual flow events, or used the Nash and Sutcliffe (1970) efficiency (NSE), which has been criticized in several previous studies for being overly sensitive to the timing and magnitude of peak flows (Schaefli and Gupta, 2007;Jain and Sudheer, 2008). Furthermore, many studies considered only a few hydrological models (≤ 5) or performance metrics (≤ 2), limiting the insights that can be gained.
As part of tier-1 of the eartH2Observe project, 10 state-ofthe-art hydrological models were run globally at a daily time step for the period 1979-2012 using the same forcing dataset, in an effort to develop a global reanalysis of water resources that supports efficient water management and decision making . Six of the models are global hydrological models (GHMs) while four of the models are land surface models (LSMs). GHMs have traditionally been designed to simulate (sub-)surface water fluxes and storages, while LSMs have traditionally been designed to simulate the soil-vegetation-atmosphere interactions within climate models (Haddeland et al., 2011;Bierkens, 2015). GHMs generally represent hydrological processes in a more conceptual way, solve only the water balance, commonly operate at daily time steps, and typically have a small number of soil layers (≤ 3 in the current study) and a single snow layer. Conversely, LSMs generally represent hydrological processes in a more physically based way, solve both the water and energy balances, typically operate at (sub-)hourly time steps, and tend to have many soil and snow layers (4-11 and 1-12, respectively, in the current study; for more details on the models, see Table 1 of Schellekens et al., 2016). The present study aims to comprehensively evaluate the runoff estimates of these 10 models across the globe in an effort to answer the following pertinent research questions: 7. Do all models show the early bias in runoff timing in snow-dominated catchments previously documented (e.g., Zaitchik et al., 2010) and what is the cause?

Forcing
The models were all driven by the daily 0.5 • WATCH Forcing Data ERA-Interim (WFDEI) meteorological dataset (1979-2012Weedon et al., 2014) with the precipitation (P ) data adjusted using the monthly 0.5 • gauge-based Climate Research Unit (CRU) TS3.1 dataset (Harris et al., 2013). Although the models all used the same P data, they used potential evaporation (PET) derived using diverse formulations, ranging from the temperature-based Hamon equation (PCR-GLOBWB) to various radiation-based approaches (WaterGAP3, SWBM, and HBV-SIMREG), the Penman-Monteith combination equation (HTESSEL, JULES, LIS-FLOOD, SURFEX, and W3RA), and a surface-energy balance approach (ORCHIDEE). The models also used different datasets for non-meteorologic inputs. For more details, see Schellekens et al. (2016). Table 2 lists the 10 state-of-the-art macro-scale hydrological models of which we evaluated the simulated daily unrouted runoff depths (mm d −1 ). The data used in this study have been named tier-1 and represent an initial run by all participating modeling groups . All data were acquired through the eartH2Observe Water Cycle Integrator (WCI; http://wci.earth2observe.eu), and for each model the sum of the subsurface and surface runoff com-ponents was calculated. Six of the models are GHMs (LIS-FLOOD, PCR-GLOBWB, SWBM, W3RA, WaterGAP3, and HBV-SIMREG) and four are LSMs (HTESSEL, JULES, ORCHIDEE, and SURFEX). The GHMs were all run at daily time steps and the LSMs at hourly and 15 min time steps. The models were run at a 0.5 • spatial resolution, with the exception of LISFLOOD and WaterGAP3, which were run at 0.1 • and 0.08 • , respectively. For the analysis, however, all model output was re-sampled to a common 0.5 • spatial and daily temporal resolution. Four of the models were subjected to varying degrees of calibration to improve their parameters (LISFLOOD, SWBM, WaterGAP3, and HBV-SIMREG; see Sect. 4.4 for specifics). More details concerning the models can be found in Table 1 of Schellekens et al. (2016).

Observed streamflow
Daily and monthly observed streamflow data were used in this study to evaluate the runoff estimates of the models. The observed streamflow and catchment boundary data used in this study originate from the same three sources as Beck et al. (2013Beck et al. ( , 2015Beck et al. ( , 2016, namely (i) the Global Runoff Data Centre (GRDC; http://www.bafg.de/GRDC/), (ii) the Geospatial Attributes of Gages for Evaluating Streamflow (GAGES)-II database (Falcone et al., 2010), and (iii) an Australian streamflow data compilation by Peel et al. (2000). The following seven criteria were used to select suitable catchments for our analysis:   (Stewart et al., 2005). A water year is defined as the 12-month period from October to September in the Northern Hemisphere and April to March in the Southern Hemisphere Seasonal flow distribution 34.36 BFI − Base flow index, the ratio of long-term baseflow to total runoff; the baseflow portion of the total runoff was computed following the procedure of Gustard et al. (1992), which takes the minima at 5-day non-overlapping intervals and subsequently connects the valleys in this series of minima to generate baseflow 1. The streamflow record length was required to be ≥ 5 years (not necessarily consecutive) during 1979-2012 (the temporal span of the simulated runoff data).
2. The catchment area had to be < 5000 km 2 , to minimize the effects of channel routing delays and to reduce the likelihood of significant anthropogenic water use. We could not use larger catchments and evaluate routed streamflow estimates since three of the models did not simulate river routing (JULES, SWBM, and HBV-SIMREG).
3. The catchment area had to be > 1000 km 2 , to prevent catchments unrepresentative of the 0.5 • grid cells (2182 km 2 at 45 • latitude) from confounding the results.
4. To reduce human influences, catchments were required to have < 2 % classified as urban (using the "artificial areas" class of the GlobCover version 2.3 map; 300 m resolution; Bontemps et al., 2011) and subject to irrigation (using version 5 of the Global Map of Irrigation Areas GMIA; 5 min resolution; Siebert et al., 2005).
5. We used the Global Reservoir and Dam (GRanD) database (v1.1; Lehner et al., 2011) to exclude catchments influenced by major reservoirs (defined by total reservoir capacity > 10 % of the observed mean annual streamflow).
6. Catchments with forest gain or loss > 20 % of the catchment area (the threshold at which changes in runoff can generally be detected; Bosch and Hewlett, 1982) were excluded using version 1.1 of the Landsat-based forest change dataset (30 m resolution; Hansen et al., 2013).
7. To further reduce the number of disinformative catchments, all streamflow records were visually screened for artifacts and anthropogenic influences (caused by, for example, diversions and impoundments). Furthermore, USA catchments flagged as "non-reference" in the GAGES-II database were discarded, and GRDC catchments for which the catchment boundaries could not be reliably determined were discarded (Lehner, 2012).
In total 966 catchments (median size 1970 km 2 ; median record length 19 years during 1979-2012) were found to be suitable for the analysis, of which 641 catchments have daily streamflow data and 325 catchments (mainly located in Russia) have only monthly streamflow data. The locations of the selected catchments will be shown in the Results section. All observed streamflow data were converted to runoff in mm d −1 using the provided catchment areas.

Model evaluation
The simulated runoff of the models were evaluated in five ways. First, for each catchment, we calculated the differences D (−) between simulated and observed values of several runoff signatures. Table 3 lists the six runoff signatures selected including their computation from the period with simultaneous simulated and observed runoff. The baseflow index (BFI), square-root-transformed 1st percentile exceedance flow (Q1), and square-root-transformed 99th percentile exceedance flow (Q99) require daily (rather than monthly) flow data. To compute the flow timing (T50) from monthly data, we first computed daily time series from monthly time series using linear interpolation. Some of the signature values were square-root transformed to give more weight to small values. D was computed according to where Y represent the values of the runoff signatures (−), the q subscript denotes the runoff signature, and the "sim" and "obs" subscripts refer to simulated and observed, respectively. The σ values (−) are constants that represent the spatial variability in the runoff signatures across the landscape and are used to normalize the D values (i.e., to make the D values of the different signatures intercomparable; see Table 3). The σ values were computed by taking the standard deviation of the observed values. Next, the mean D value over all catchments was computed (expressed by D). D and D values closer to zero correspond to better model performance (see Table 4). It should be noted that, although D provides a valuable estimate of the overall performance, a good D value may reflect an overestimation in one region that is compensated by an underestimation in another region. Second, to evaluate the temporal variability of the simulated runoff time series, we computed Pearson linear correlation coefficients (r) between daily, log-transformed daily, 5-day, monthly, monthly climatic, and annual time series of simulated and observed runoff (termed r dly , r dly log , r 5 day , r mon , r mon clim , and r yr , respectively). The r dly , r dly log , and r 5 day values were only computed for catchments with daily observations. If monthly data were not supplied by the data providers, monthly values were computed by simple averaging of the daily data only if > 25 non-missing values were available. Annual values were computed by simple averaging of the monthly data (either supplied or computed) only if > 10 non-missing values were available. We subsequently computed for each model and metric the mean r value over all catchments, expressed by r. The r and r values range from −1 to 1, with higher values corresponding to better model performance (see Table 4).
Third, to summarize the overall performance of each model, we computed for each catchment a summary performance statistic (termed OS) incorporating the previously mentioned metrics, and computed the mean value over all catchments (OS). The OS consists of two parts, of which the first (OS sig ) considers the performance in terms of runoff signatures and is defined as The second part (OS var ) evaluates the performance in terms of temporal variability, and is defined as OS var = mean r dly , r dly log , r 5 day , r mon , r mon clim , r yr . ( The summary score is subsequently computed following: The BFI, Q1, and Q99 components of Eq. (2) and the r dly and r dly log components of Eq.
(3) were omitted if daily observations were unavailable for a particular catchment. Higher OS values correspond to better model performance; the maximum attainable value is 1 (see Table 4). Fourth, to evaluate the ability of each model to simulate the variability among the catchments in the six previously mentioned runoff signatures, Spearman rank correlation coefficients (ρ) were computed between simulated and Table 4. Qualitative descriptions of intervals of the performance metrics to aid in interpreting the results.
observed values of the runoff signatures. Spearman rank correlation coefficients rather than Pearson linear correlation coefficients were used to minimize the influence of outliers.
The ρ values range from −1 to 1, with higher values corresponding to better model performance (see Table 4). Fifth, we computed trends in simulated and observed mean annual runoff time series (termed MAR trend) using the simple non-parametric approach of Sen (1968). We subsequently calculated the ρ between simulated and observed MAR trends (ρ MAR trend ), reflecting the agreement in spatial trend patterns.
Sixth and last, we produced density plots of grid cell values of aridity index (AI; ratio of long-term available energy to P ) versus RC (ratio of long-term simulated runoff to P ), revealing how the models behave in terms of RC under different climatic conditions. To estimate the available energy we used PET for four models (ORCHIDEE, PCR-GLOBWB, W3RA, and WaterGAP3) and net radiation for three models (HTESSEL, JULES, and SURFEX). For the remaining models estimates of the available energy were not available.
For the evaluation, we used for each catchment the simulated runoff time series of the 0.5 • grid cell with its center located within the catchment. However, if multiple grid cell centers were located within the catchment, we calculated the mean simulated runoff time series, and if no grid cell center was located within the catchment, we used the simulated runoff time series of the grid cell with its center located closest to the catchment centroid.

Multi-model ensembles
Ensemble modeling using the outputs from multiple models or from different realizations of the same model typically improves predictive accuracy and is widely used in atmospheric, climate, and hydrological sciences (Wandishin et al., 2001;Tebaldi and Knutti, 2007;Breuer et al., 2009;Viney et al., 2009). We tested two ways of combining the runoff estimates of the individual models into ensembles. First, for each 0.5 • grid cell and day with non-missing values for all models, the mean simulated runoff of the 10 models was calculated (i.e., equal weights were assigned to the models). The resulting runoff estimates will be referred to hereafter as "MEAN-All". Second, we computed the mean based on only the four models that performed best in terms of OS, to exam-ine the effect of excluding less-accurate models. These runoff estimates will be referred to hereafter as "MEAN-Best4".

Caveats
There are a number of caveats that should be kept in mind when interpreting the results. First, some of the models (notably the LSMs) were not traditionally developed to estimate daily runoff for such small catchments. Some of the GHMs, on the other hand, have runoff estimation in small catchments among their primary aims (e.g., LISFLOOD, Water-GAP3, W3RA, and HBV-SIMREG), and four GHMs were even explicitly calibrated against observations (LISFLOOD, SWBM, WaterGAP3, and HBV-SIMREG; see Sect. 4.4 for specifics). Second, a model performing poorly in one respect may well perform better for other hydrological variables, climates, catchments, or performance metrics. Third, a poor model performance could simply be the result of suboptimal parameter values. Fourth, some studies have found that less-accurate models may still lead to a better ensemble mean (Ajami et al., 2006;Viney et al., 2009), although this did not appear to be the case here (see Sect. 4.6). Fifth, we stress that while some models may perform well, they are inherently unsuitable for specific types of impact assessments. For example, SWBM and HBV-SIMREG do not account for physical differences among land cover types and hence cannot be used for studies assessing the hydrological impacts of changes in land cover. Sixth and finally, the forcing data quality has an important influence on the evaluation results that should not be overlooked.

Results and discussion
In this section we will answer the questions posed in the introduction.
4.1 How well do the different models simulate runoff? Table 5 shows, for the uncalibrated models, the calibrated models, and the ensembles: (i) the mean difference between simulated and observed values of the (normalized) runoff signatures (D), (ii) the mean temporal correlation between simulated and observed runoff time series (r), and (iii) the mean overall performance in terms of runoff signatures and temporal correlation coefficients (OS). HTESSEL obtained negative D values for the square-root-transformed RC and the square-root-transformed mean annual runoff (MAR), indicating it underestimates runoff. JULES performed moderately in terms of temporal correlation, as indicated by the low r values. Conversely, LISFLOOD performed good overall, particularly in terms of temporal correlation, although it tends to overestimate RC and MAR. ORCHIDEE appears to strongly underestimate runoff and performed fairly in terms of temporal correlation, whereas PCR-GLOBWB shows moderate to good scores for most metrics. Apart from Table 5. For the individual models and the ensembles, (i) the mean difference between simulated and observed values of the (normalized) runoff signatures (D), (ii) the mean temporal correlation between simulated and observed runoff time series (r), (iii) the mean overall performance in terms of runoff signatures and temporal correlation (OS), and (iv) the spatial correlation between simulated and observed values of the runoff signatures (ρ). See Fig. 1 for the locations of the catchments. a much too early bias in the flow timing (T50), SURFEX demonstrated moderate to good performance overall. Similar to SURFEX, W3RA exhibited a very early bias in T50, but generally obtained moderate to good scores. WaterGAP3 and particularly HBV-SIMREG outperformed the other models in most cases. JULES, ORCHIDEE, SURFEX, Water-GAP3, and especially SWBM displayed negative D values for the BFI and the square-root-transformed 99th flow percentile (Q99), and a positive D value for the square-roottransformed 1st flow percentile (Q1; Table 5), suggesting they consistently overestimate quickflow. Conversely, LIS-FLOOD and particularly PCR-GLOBWB exhibited positive D values for BFI and Q99, and a negative D value for Q1, indicating they tend to underestimate quickflow. Table 5 also presents, for the 10 models and the ensembles, the spatial correlation between simulated and observed values of the runoff signatures (ρ). HTESSEL, JULES, W3RA, WaterGAP3, and HBV-SIMREG performed good overall, while the remaining models performed moderately overall. PCR-GLOBWB, SURFEX, and WaterGAP3 performed poorly in terms of BFI, while SWBM obtained a poor score for Q99. WaterGAP3 performed good to excellent for all signatures except BFI, likely due to the empirical estimation of groundwater recharge and thus baseflow as a function of landscape characteristics (Döll and Fiedler, 2008). HBV-SIMREG attained good to excellent ρ values for all signatures. The models generally performed best for T50 and worst for BFI among the signatures. Table 5 also shows, for the 10 models and the ensembles, OS scores for the major Köppen-Geiger climate types. We used the newly produced Köppen-Geiger climate map from Beck et al. (2016), which is based on the high-quality World-Clim climatic dataset (Hijmans et al., 2005) supplemented with regional climatic datasets for the USA (Daly et al., 1994) and New Zealand (Tait et al., 2006). All four LSMs (HTESSEL, JULES, ORCHIDEE, and SURFEX) generally demonstrated fair performance in cold and polar climates. Conversely, PCR-GLOBWB demonstrated poor performance in tropical and arid climates, likely due to the overestimation of baseflow. SWBM performed moderately only in arid catchments, probably at least partly due to the lack of baseflow under these conditions (Pilgrim et al., 1988;Beck et al., 2013). Similarly,  found that SWBM performs well during dry periods for eight small Swiss catchments (60-392 km 2 ). Only LISFLOOD, WaterGAP3, and HBV-SIMREG exhibited at least moderate performance for all climates. Figure 1 presents, for the 10 models and the ensembles, maps of simulated minus observed MAR for the catchments, revealing the data underlying the MAR D and ρ values listed in Table 5. Maps of all other runoff signatures are presented in Supplement Figs. S1.2-1.8. HTESSEL and ORCHIDEE strongly underestimate runoff for most of the catchments, while LISFLOOD appears to strongly overestimate runoff for most of the globe with the exception of snow-dominated regions. All models showed negative MAR biases in snowdominated regions such as Alaska, the Rocky Mountains, and southern Russia, while they consistently showed positive MAR biases for the Great Plains (USA) and southern Australia. Figure 2 shows, for the 10 models and the ensembles, maps of the correlation between simulated and observed monthly flows (r mon ) for the catchments, showing the data underlying the r mon values presented in Table 5. Maps of all other temporal variability metrics are presented in Figs. S1.9-1.14. In general, the GHMs obtained good r mon values for most catchments, while the LSMs obtained moderate r mon values for most catchments. All LSMs showed poor to fair r mon values for snow-dominated catchments.
Although the NSE has been widely criticized for being overly sensitive to the magnitude and timing of peak flows (e.g., Schaefli and Gupta, 2007;Jain and Sudheer, 2008;Criss and Winston, 2008;Gupta et al., 2009), we did calculate NSE scores to allow the present results to be put in the context of previous macro-scale studies (see Supplement Table S1). For most models negative median NSE scores were obtained, similar to Zhang et al. (2016), who evaluated the monthly and annual runoff estimates from 14 (uncalibrated) macro-scale models in 644 large Australian catchments (> 2000 km 2 ). Our scores are, however, slightly lower than those obtained by Lohmann et al. (2004) and Xia et al. (2012), who evaluated the daily runoff estimates from four (uncalibrated) macro-scale models in about a thousand small-tomedium sized USA catchments (< 10 000 km 2 ), but this is probably attributable to the high quality of the USA forcing data (Wu et al., 2017). They are also somewhat lower than those obtained by Decharme and Douville (2007), who evaluated two (uncalibrated) macro-scale models in 80 large catchments (> 100 000 km 2 ) around the globe, but this can be explained by their much larger catchment sizes. Figure 3 shows, for the seven models with data on energy availability, density plots of grid cell values of aridity index (AI; ratio of long-term energy availability to P ) versus RC (ratio of long-term mean runoff to P ), revealing how the models respond in terms of RC to different climatic conditions. Also shown are the energy-limit line for which actual evaporation equals the available energy, the water-limit line for which runoff equals P , and the Budyko (1974) curve, the most well-known among several similar empirical relationships describing the competition between runoff and actual evaporation (Ol'dekop, 1911;Pike, 1964;Zhang et al., 2001;Porporato et al., 2004). Given its empirical nature, the Budyko curve should only be used for visual reference, and not to judge the performance of the different models. Besides the striking differences in behavior among the models, it can be seen that HTESSEL, JULES, PCR-GLOBWB, W3RA, and WaterGAP3 do not adhere to the water and/or energy limits (Fig. 3a, b, d, f, and g, respectively). For Wa-terGAP3, this may be due to the use of calibration factors, which have the potential to generate runoff that can go beyond the physical limits in an effort to compensate for errors  in the P , PET, or streamflow data. For the other models this could be indicative of issues with the runoff and/or evaporation routines. The larger spread found for the models for which we used net radiation to estimate the available energy (HTESSEL, JULES, and SURFEX; Fig. 3a, b, and e, respectively) is because the majority of the net radiation is converted to sensible heat rather than latent heat in cold climates (Kleidon et al., 2014).
It is generally difficult to gain insight into why a particular model performs as it does due to the large number of interacting model components, equations, and parameters. Nevertheless, the underestimation of runoff by HTESSEL probably reflects the excessive evaporation by HTESSEL previously reported by Haddeland et al. (2011). PCR-GLOBWB most likely suffers from suboptimal baseflow-related parameter values, since its structure is similar to that of LISFLOOD, which performs markedly better. SWBM clearly suffers from the absence of a baseflow routine outside (semi-)arid regions. Although W3RA and HBV-SIMREG use an identical snow routine, W3RA performs considerably worse in snowdominated regions, probably because HBV-SIMREG uses a snowfall gauge undercatch correction factor. The unsatisfactory performance demonstrated by the LSMs in snowdominated regions could be related to deficiencies in the snow routines or the energy balance estimates (see Sect. 4.3). WaterGAP3 and particularly HBV-SIMREG performed quite well overall, likely because of their comprehensive calibration (see Sect. 4.4). In any case, the pronounced inter-model performance spread found here suggests that model choice should be regarded as a critical step in any hydrological modeling study. Moreover, it underscores the importance of hydrological model uncertainty in addition to climate input uncertainty, as also emphasized in several other recent macroscale studies (Haddeland et al., 2011;Schewe et al., 2013;Prudhomme et al., 2014;Mendoza et al., 2015b;Giuntoli et al., 2015a). Currently, the large majority of studies assessing the hydrological impacts of climate change completely neglect hydrological model uncertainty (Teutschbein and Seibert, 2010).

How well do the models perform in terms of long-term runoff trends?
The models displayed very similar MAR trends (Fig. S1.8), meaning they respond similarly to climate variability, given that none of the models account for land use or land cover changes, urbanization, reservoir construction, or increasing atmospheric CO 2 . However, the models obtained rather low spatial (Spearman) correlation coefficients (ρ MAR trend ) ranging from 0.32 (SURFEX) to 0.42 (LISFLOOD ; Table 5), indicating that the simulated MAR trends correspond fairly to moderately well to the observed ones. These values are lower than the (Pearson) correlation coefficients ranging from 0.52 to 0.63 obtained by Stahl et al. (2012), who evaluated MAR trends from seven models using observations from 293 small European catchments (100-1000 km 2 ), presumably due to the better quality of the European meteorological forcing and observed streamflow data. Milly et al. (2005) evaluated MAR trends from a 12-model ensemble using observations from 165 large catchments (> 50 000 km 2 ) around the globe, obtaining a (Pearson) correlation coefficient of 0.34, which is similar to ours. These low correlations, which were somewhat unexpected given the relative ease with which MAR can be estimated (e.g., Westerberg and McMillan, 2015;Beck et al., 2015), may be indicative of changes in non-climatic drivers of hydrological change or drift errors in the forcing or observed streamflow data. We expect the inter-model variability in trends to be higher and the agreement with observations to be even lower for seasonal and monthly averages as well as runoff signatures sensitive to the shape of individual flow events (cf. Bastola et al., 2011;.
Overall, these results suggest that studies using global-scale datasets to assess the impacts of past climate change on runoff in small-to-medium-sized catchments should be interpreted with considerable caution.

How do the results of the GHMs differ, if at all, from those of the LSMs?
Similar to Haddeland et al. (2011), the LSMs were found to produce less runoff overall (Table 5 and Fig. 1), perhaps due to their use of physically based Richards-Darcy type equations, which neglect preferential flows. We further found that the GHMs perform, on average, worse than the LSMs in raindominated regions: the GHMs (excluding the comprehensively calibrated models WaterGAP3 and HBV-SIMREG; see Sect. 4.4) obtained mean OS scores of 0.28, 0.33, and 0.43 for tropical, arid, and temperate climates, respectively, while the same values for the LSMs are 0.39, 0.47, and 0.47, respectively (Table 5). However, the lower performance of the GHMs is primarily attributable to PCR-GLOBWB and SWBM. As mentioned before, PCR-GLOBWB probably suffers from a suboptimal baseflow-related parameterization, while SWBM suffers from the absence of a baseflow routine. The GHMs do appear to perform consistently better than the LSMs in snow-dominated regions: the GHMs (again excluding WaterGAP3 and HBV-SIMREG) obtained mean OS scores of 0.46 and 0.32 for cold and polar climates, respectively, while the same values for the LSMs are 0.31 and 0.25, respectively (Table 5). The performance of the LSMs appears to be mainly due to a very early bias in flow timing, a very low baseflow contribution, and a misrepresentation of the seasonal cycle (Figs. S1.4, S1.5, and S1.13, respectively). Our results are in agreement with Giuntoli et al. (2015b), who found five GHMs to outperform, on average, four LSMs using observations from 252 temperate and cold catchments (64 to 1 350 000 km 2 ) located in the central USA, and with Zhang et al. (2016), who found that two LSMs performed considerably worse than two GHMs in cold and polar regions using observations from 644 catchments Figure 3. Density plots of grid cell values of aridity index (AI) versus runoff coefficient (RC), for the seven models with data on the available energy. The green line represents the energy limit for which actual evaporation equals PET, the purple line represents the water limit for which runoff equals P , whereas the blue line represents the Budyko (1974) curve.
(> 2000 km 2 , upper limit not reported) around the globe. The poorer performance obtained by the LSMs is probably indicative of differences between the snow routines used by GHMs and LSMs. The GHMs use relatively simple conceptual temperature-index snow routines driven by air temperature, which can be estimated with relative ease, whereas the LSMs use more complex physically based energy balance snow routines driven by estimates of energy balance components, which are subject to considerable uncertainty, particularly in regions with complex topography (Ferguson, 1999). Although several previous studies have found that the two types of snow routines yield comparable performance (e.g., WMO, 1986;Franz et al., 2008;Zeinivand and De Smedt, 2009;Debele et al., 2010), these studies used a very small number of relatively well-instrumented catchments (six, two, one, and three, respectively), which may have led to less-generalizable conclusions. Overall, it appears that the energy balance estimates and snow routines used by the LSMs require re-evaluation (cf. Zhang et al., 2016).

Are calibration and regionalization important or even essential?
Calibration is important for both conceptual and physically based hydrological models to provide more accurate runoff estimates, to account for (i) the impossibility of measuring all required model parameters at the model application scale, (ii) lack of process understanding, (iii) possibly overly simplistic process representations, (iv) the spatiotemporal discretization of highly heterogeneous rainfall-runoff processes, and (v) errors in the forcing data (Beven, 1989;Blöschl and Sivapalan, 1995;Duan et al., 2001Duan et al., , 2006Mc-Donnell et al., 2007;Nasonova et al., 2009;Rosero et al., 2011;Minville et al., 2014). Yet, despite the development of numerous calibration techniques over the last 50 years (Dawdy and O'Donnell, 1965;Duan et al., 2004) and the current widespread availability of streamflow observations , macro-scale models generally tend to be uncalibrated (Sooda and Smakhtin, 2015;Bierkens, 2015;Kauffeldt et al., 2016). This is perhaps mainly due to (i) the substantial amount of work involved with calibration (e.g., Bock et al., 2015), (ii) the risk of obtaining unrealistic parameters due to equifinality and data issues (Andréassian et al., 2012), and (iii) the lack of a commonly accepted regionalization technique . In addition, the modeler may feel that since their model is physically based, it does not require calibration (Beven, 1989). LSMs in particular are rarely calibrated against runoff, likely because (i) runoff estimation is generally not among the primary aims of LSMs; (ii) for water transport in the soil, LSMs typically use Richards-Darcy type equations, which are computationally expensive and require a fine vertical and temporal soil discretization; and (iii) LSMs often do not account for river routing, confounding the calibration of large catchments. Instead, the parameters in macro-scale models are usually based on "expert opinion" and thus founded on the bold assumption that the modeler sufficiently understands the hydrological processes, feedbacks, and parameter interactions taking place within the model for any location on Earth.
Nevertheless, out of the 10 models considered in this study, four use parameters derived by calibration (LISFLOOD, SWBM, WaterGAP3, and HBV-SIMREG all GHMs). LISFLOOD was calibrated against observed streamflow for 24 large catchments (84 230 to 4 680 000 km 2 ) across the globe using the WFDEI forcing and an aggregate objective function incorporating bias, NSE, and log-transformed NSE computed from daily streamflow data. The calibration might have influenced the present evaluation; although we used much smaller catchments (1000 to 5000 km 2 ), 47 % of our catchments are located within the calibration catchments. SWBM uses a spatially uniform parameter set based on calibration using the E-OBS forcing (Haylock et al., 2008) against European data on such key hydrologic variables as soil moisture, total water storage, evaporation, and runoff . For the calibration against runoff, they used observations from 436 small European catchments (mostly < 1000 km 2 ), and considered daily and monthly correlations as well as bias. The calibrated parameter set was subsequently applied globally. Besides the addition of a baseflow routine, SWBM would probably benefit from regionalized parameters that vary according to landscape characteristics. WaterGAP3 has been calibrated using the WFDEI forcing in terms of bias for the interstation regions (the catchment of a station excluding the catchments of nested upstream stations) of 2071 stations (catchment size ranging from 2830 to 966 321 km 2 ) around the globe, some of which have also been used in the current evaluation. The calibrated parameters were subsequently regionalized to ungauged regions using multiple linear regression based on six predictors (Döll et al., 2003). The model does indeed perform very well for MAR and thus RC, but this did not necessarily translate into good performance for BFI (Table 5, and Figs. 1 and 2). HBV-SIMREG also uses regionalized parameter fields, produced by transferring calibrated parameters from 674 small-to-medium sized "donor" catchments (10 to 10 000 km 2 ) across the globe to "receptor" grid cells with similar climatic and physiographic characteristics . In their study, Beck et al. (2016) show that HBV using spatially uniform parameters performs within the range of the other models, confirming that the relatively good performance of HBV-SIMREG stems from the regionalization exercise. In addition, although Beck et al. (2016) did not use the WFDEI forcing for the calibration, they calibrated against several of the performance metrics also used here and used 179 of our catchments as parameter donors, further explaining the relatively good performance obtained by HBV-SIMREG (Table 5, and Figs. 1 and 2).
Overall, it appears that the calibration exercises for Wa-terGAP3, HBV-SIMREG, and possibly LISFLOOD have resulted in markedly improved performance. However, Water-GAP3 performed poorly in terms of ρ BFI (Table 5), meaning the calibration of MAR did not translate into better BFI performance. These results underscore the benefits of calibrated parameters over a priori parameters (cf. Duan et al., 2006;Hunger and Döll, 2008;Nasonova et al., 2009;Rosero et al., 2011;Greuell et al., 2015;Zhang et al., 2016) and highlight the importance of using an objective function for the calibration that incorporates a broad range of metrics related to various important aspects of the hydrograph (cf. Gupta et al., 2008;Vis et al., 2015;Shafii and Tolson, 2015). These results also emphasize the usefulness of regionalization techniques (Parajka et al., 2013), which typically enhance performance over the entire model domain and are thus of particular value for macro-scale modeling, given that the majority of the land surface is ungauged or poorly gauged (Sivapalan, 2003;Hannah et al., 2011). However, although there are numerous studies performing regionalization at a regional scale (see reviews by He et al., 2011;Hrachowitz et al., 2013;Razavi and Coulibaly, 2013;Parajka et al., 2013), only a few studies have attempted regionalization at a macro-scale (see review by Beck et al., 2016). We argue that more effort should be devoted to regionalizing the parameters of macro-scale models (cf. Bierkens, 2015;Döll et al., 2015).
It should be noted, however, that the potential performance improvement gained by calibration and regionalization will depend on the structure and flexibility of the model in question. Many current models have rigid structures and/or insufficient free parameters and thus cannot be calibrated successfully (Mendoza et al., 2015a). Moreover, for climate projections one should bear in mind that calibrated parameters become less valid when the model is subjected to climatic conditions it has never seen before (Knutti, 2008).

What is the impact of the forcing data on the simulated runoff?
There are not only strong inter-model differences in the performance patterns but also clear inter-model similarities, suggesting that the forcing data quality imparts a strong limit on the performance. This is most notable for the MAR metric: all models showed negative biases in MAR in snowdominated regions such as Alaska, the Rocky Mountains, and southern Russia, while they consistently showed positive biases in MAR for the Great Plains (USA) and southern Australia (Fig. 1). The high spatial correlation in the performance patterns suggests that these consistent performance patterns may be due to biases in the WFDEI P data, rather than biases in the streamflow observations, which are unlikely to be spatially correlated. It is conceivable that biases are present in the WFDEI P data, because (i) the monthly CRU dataset, which has been used to correct the WFDEI dataset, is based on only a subset of the available gauges and does not explicitly account for orographic effects; (ii) in sparsely gauged regions the correction using CRU is more likely to deteriorate rather than improve the P estimates; and (iii) the Adam and Lettenmaier (2003) gauge undercatch correction factors are based on interpolation of a very sparse sample of gauges and thus subject to considerable uncertainty. For the conterminous USA we quantified the biases in the WFDEI P data using the high-quality Parameter-elevation Relationships on Independent Slopes Model (PRISM) climatic dataset (Daly et al., 1994), which is based on considerably more gauges than CRU and includes sophisticated corrections for orography. Figure 4a shows the bias in mean annual P from WFDEI relative to that from PRISM, suggesting that the WFDEI P data are indeed subject to large biases. Figure 4b shows the bias in MAR from the MEAN-All ensemble relative to MAR from the observations, revealing a comparable bias pattern, thus confirming that the biases in the WFDEI P propagate in the simulated runoff. The correlation coefficient between the MAR and P bias values is 0.58, indicating a moderately strong relationship. These P biases appear to translate into even more pronounced runoff biases in (semi-)arid regions (notably the northern Great Plains; Fig. 4b and c) due to the highly nonlinear response behavior in these environments (Lidén and Harlin, 2000;Fekete et al., 2004;Van Dijk et al., 2013a). We were unable to quantify the P biases globally since no other independent, global-scale P dataset exists (the WorldClim and CHPclim datasets are likely to exhibit similar biases as the CRU TS3.1 dataset, given that they are based on similar sets of gauges). However, we expect the P biases to be at least similar, if not more severe, outside the wellinstrumented conterminous USA (cf. Fekete et al., 2004;Hijmans et al., 2005;Biemans et al., 2009;Zhou et al., 2012;Kauffeldt et al., 2013;Greuell et al., 2015). It should be noted that biases in PET are probably of secondary importance as compared with biases in P (Donohue et al., 2010;Sperna Weiland et al., 2011;Seiller and Anctil, 2015).
The global-scale quantification and reduction of these P biases should be a priority for future research. Satellitederived P offers unique opportunities in this regard (e.g., Funk et al., 2015) that extend beyond the tropics with the recent launch of the Global Precipitation Measurement (GPM) mission (Smith et al., 2007). Another little-explored way of reducing P uncertainty is by "doing hydrology backwards"; that is, to use information on other hydrological variables, for example, satellite-derived surface soil moisture (e.g., Brocca et al., 2014), streamflow observations (e.g., Adam et al., 2006;Beck et al., 2017), and snow-depth observations (e.g., Cherry et al., 2005) to reconstruct P through hydrological modeling. Arguably the most important obstacles to combining multiple data sources are the inconsistent temporal coverage and scale of different data sources and the general lack of error/uncertainty estimates.
Although the models all used the same P data, they used different formulations to compute PET, which has likely contributed to differences in simulated runoff among the models in energy-limited regions (Weiß and Menzel, 2008;Kingston et al., 2009;Haddeland et al., 2011;Weedon et al., 2011;Sperna Weiland et al., 2011). However, PET data were available for only four models, which is insufficient to examine whether the PET formulation has had a discernible influence on the simulated runoff, given the numerous other differences in structure and parameterization among the models.

How valuable are multi-model ensembles?
The multi-model ensemble MEAN-All incorporated all 10 models, while MEAN-Best4 incorporated only LISFLOOD, W3RA, WaterGAP3, and HBV-SIMREG (i.e., the four models that performed best in terms of OS; Table 5). MEAN-All and MEAN-Best4 were found to perform better than all individual models (with the exception of HBV-SIMREG, which has been comprehensively calibrated; Table 5, and Figs. 1 and 2). These results highlight the benefits of multi-model ensembles, in line with several previous studies (Ajami et al., 2006;Duan et al., 2007;Viney et al., 2009;Materia et al., 2010;Velázquez et al., 2010;Gudmundsson et al., 2012a;Xia et al., 2012;Yang et al., 2015). The similar OS scores obtained by MEAN-All and MEAN-Best4 (0.57 and 0.60, respectively; Table 5) suggests that the inclusion of lessaccurate models has only limited adverse effects. It may be worthwhile for future studies to examine the benefits of more sophisticated multi-model combination techniques involving bias correction or model weighting (e.g., Ajami et al., 2006;Duan et al., 2007;Bohn et al., 2010). These weights can subsequently be transferred from gauged to ungauged areas using regionalization techniques typically used for hydrological model parameters .
HBV-SIMREG differs from the other models because it represents a "multi-parameterization ensemble", which . For the conterminous USA, (a) the bias in mean annual P from WFDEI relative to PRISM, (b) the bias in MAR from the MEAN-All ensemble relative to the observations, and (c) the aridity index, the ratio of mean annual PET (computed from PRISM air temperature using Hargreaves et al., 1985) to P (PRISM; note the nonlinear color scale). Each data point in (b) represents a catchment centroid. The bias in (a) and (b) was computed following B = (X − R)/(X + R), where B is the bias, X the uncertain value, and R the reference value. B values range from −1 to 1. A 100 % overestimation results in B = 1/3, whereas a 50 % underestimation results in B = −1/3. means the model was run multiple (10) times globally using different (regionalized) parameter sets representing different catchment response behaviors . HBV-SIMREG obtained slightly better performance than both MEAN-All and MEAN-Best4 overall (Table 5), tentatively suggesting that a multi-parameterization ensemble for a single, sufficiently flexible model provides performance comparable to a multi-model ensemble (cf. Oudin et al., 2006;Yang et al., 2011;Coxon et al., 2014). If this is confirmed, it would negate the need to set up, run, and maintain multiple models, and incentivize the development of a single community hydrological model (cf. Weiler and Beven, 2015) as well as modeling systems allowing for the selection of alternative model structures (cf. Bierkens, 2015), such as the Framework for Understanding Structural Errors (FUSE; Clark et al., 2008), Noah Multi-Parameterization (Noah-MP;, and SUPERFLEX (Fenicia et al., 2011).

Do all models show the early bias in runoff timing
in snow-dominated catchments previously documented and what is the cause?
With the exception of ORCHIDEE and HBV-SIMREG, all models showed early T50 biases in snow-dominated regions ( Fig. S1.3), indicating that the models produce the spring snowmelt peak early, as has also been reported in several previous studies using different models and forcing data (Lohmann et al., 2004;Slater et al., 2007;Decharme and Douville, 2007;Balsamo et al., 2009;Zaitchik et al., 2010;Beck et al., 2015). The early runoff timing is probably pri- Figure 5. Scatterplot of the difference between simulated (MEAN-All) and observed-transformed RC (D RC ) versus the difference between simulated (MEAN-All) and observed T50 (D T50 ) for the catchments (n = 966).
marily due to P underestimation, which leads to insufficient snow accumulation that subsequently melts too quickly (Hancock et al., 2014). The fact that HBV-SIMREG performs well in this regard is probably attributable to the snowfall gauge undercatch correction factor of the model. Indeed, Fig. 5 tentatively shows that catchments in which the models strongly underestimate runoff (i.e., negative D RC ) generally tend to exhibit an early bias in T50 (i.e., negative D T50 ) and vice versa. The absence or misrepresentation of certain processes that delay snowmelt runoff in the models may have exacerbated the early runoff timing problem. Examples of such processes include the isothermal phase change of the snowpack, retainment of meltwater in the snowpack in pore spaces, infiltration of meltwater into the soil, meltwater refreezing during cold days and nights, and ice jams in rivers.
On the whole, more research is needed to ascertain the exact reasons of the early runoff timing.

Conclusions
The runoff estimates from 10 state-of-the-art macro-scale hydrological models, all forced with the WFDEI dataset, were evaluated using observations from 966 medium-sized catchments around the globe. With reference to the questions posed in the introduction, the following was found: 1. The performance differed markedly among models, underscoring the importance of hydrological model uncertainty in addition to climate input uncertainty, and suggesting that model choice should be regarded as a critical step in any hydrological modeling study.
2. The models displayed similar MAR trends, although they were in poor agreement with observed trends.
Model-based runoff trends in small-to-medium sized catchments should thus be interpreted with considerable caution.
3. Considering only the uncalibrated models, the GHMs performed similarly to the LSMs in rainfall-dominated regions but consistently better than the LSMs in snowdominated regions, perhaps due to the use of more datademanding snow routines or the misrepresentation of frozen soil and snowmelt processes by the LSMs.
4. The models that have been calibrated obtained higher scores for the performance metrics incorporated in the respective objective functions used for calibration.
5. The WFDEI P forcing data still appear to contain substantial biases, despite adjustments using gauge observations. These P biases translate into biases in the simulated runoff, which are amplified in (semi-)arid regions. In snow-dominated regions there appears to be a consistent underestimation in P and thus simulated runoff.
6. The multi-model ensembles obtained only slightly worse performance than the best (calibrated) model, and the inclusion of less-accurate models did not severely degrade the performance. A multi-parameterization ensemble for a single, sufficiently flexible model is easier to realize but we speculate may yield the same performance benefits as a multi-model ensemble.
7. Most models were indeed found to generate the spring snowmelt peak early, probably due to the previously mentioned P underestimation and the absence or misrepresentation of certain processes that delay snowmelt runoff in the models.
The Supplement related to this article is available online at https://doi.org/10.5194/hess-21-2881-2017supplement. Competing interests. The authors declare that they have no conflict of interest.