Comparative assessment of predictions in ungauged basins – Part 1: Runoff-hydrograph studies

The objective of this assessment is to compare studies predicting runoff hydrographs in ungauged catchments. The aim is to learn from the differences and similarities between catchments in different locations, and to interpret the differences in performance in terms of the underlying climate and landscape controls. The assessment is performed at two levels. The Level 1 assessment is a meta-analysis of 34 studies reported in the literature involving 3874 catchments. The Level 2 assessment consists of a more focused and detailed analysis of individual basins from selected studies from Level 1 in terms of how the leave-one-out cross-validation performance depends on climate and catchment characteristics as well as on the chosen regionalisation method. The results indicate that runoff-hydrograph predictions in ungauged catchments tend to be more accurate in humid than in arid catchments and more accurate in large than in small catchments. The dependence of performance on elevation differs by region and depends on how aridity varies with elevation and air temperature. The effect of the parameter regionalisation method on model performance differs between studies. However, there is a tendency towards a somewhat lower performance of regressions than other methods in those studies that apply different methods in the same region. In humid catchments spatial proximity and similarity methods perform best while in arid catchments similarity and parameter regression methods perform slightly better. For studies with a large number of catchments (dense stream gauge network) there is a tendency for spatial proximity and geostatistics to perform better than regression or regionalisation based on simple averaging of model parameters from gauged catchments. There was no clear relationship between predictive performance and the number of regionalised model parameters. The implications of the findings are discussed in the context of model building.


Introduction
Runoff hydrographs, i.e. the time series of river runoff, are the result of numerous interacting processes within the catchment: precipitation, runoff generation at the land surface, infiltration into the subsurface, uptake from vegetation and consequent transpiration, evaporation from the soil, water movement through various flow paths on the land surface, in the unsaturated zone and in the groundwater. Understanding the hydrograph will help understand how these processes combine. Predictions of runoff hydrographs are one way of testing our hypotheses on these processes. Predictions of runoff hydrographs are also needed for practical purposes such as obtaining design characteristics for spillways, culverts and embankments, for water resources management applications such as water allocation for irrigation, industry and human use, hydropower operation and environmental flow estimation. They are also useful for risk management such as in flood and drought forecasting. Finally, there is considerable interest in assessing the effects of environmental change (e.g. land use, hydraulic structures, climate) on the runoff hydrographs and water quality for which accurate runoff predictions are needed (Sachs and McArthur, 2005; Kovacs et al., 2012; Blöschl and Montanari, 2010). Clearly, predictions of runoff hydrographs are important for many purposes of societal relevance. However, in most catchments of interest no runoff data are available, so the hydrographs need to be predicted from other information within that catchment or from other catchments. This is the "Prediction in Ungauged Basins" or PUB problem.
In 2003, the International Association of Hydrological Sciences (IAHS) launched a concerted effort on investigating the PUB problem, the PUB initiative. The main focus of this initiative was to advance the knowledge and understanding of climatic and landscape controls on hydrologic processes occurring at all scales and to improve the ability to predict the fluxes of water in ungauged basins, along with their uncertainties (Sivapalan et al., 2003). One of the clear tasks that the PUB initiative set out to achieve was to address the fragmentation of modelling approaches through comparative evaluation: "Classify model performances in terms of time and space scales, climate, data requirements and type of application, and explore reasons for the model performances in terms of hydrological insights and climate-soil-vegetation-topography controls" (SSG: PUB Science Steering Group, 2003, p. 18).
The objective of this and two companion papers (Viglione et al., 2013) is to compare different approaches for runoff prediction in ungauged catchments. While the companion papers investigate the predictive performance of methods for extreme runoff estimation and compare statistical and process-based methods for predicting a range of runoff characteristics at different timescales, in this paper we compare studies predicting runoff hydrographs in ungauged catchments. The aim is to learn from the differences and similarities between catchments in different locations, and to interpret the differences in performance in terms of the underlying climate and landscape controls. In particular, the following research questions are addressed:
i. How good are the runoff predictions in different climates?
ii. Which parameter regionalisation method performs best?
iii. How does data availability impact performance?
iv. How does model complexity impact performance?
v. To what extent does runoff prediction performance depend on climate and catchment characteristics?

Method of comparative assessment
For the comparative assessment of runoff-hydrograph predictions in ungauged basins, a two-step process has been adopted in this paper:
Level 1 assessment: in a first step, a literature survey was performed. Publications in the international refereed literature were scrutinised for results of the predictive performance of runoff hydrographs. The Level 1 assessment is a meta-analysis of prior studies performed by the hydrological community. The advantage of this type of meta-analysis is that a wide range of environments, climates and hydrological processes can be covered that go beyond what can be reasonably achieved by a single study. It is a comparative assessment that synthesises the results from the available international literature. However, the level of detail of the information provided is often limited. The results in the literature were almost always reported in an aggregated way, i.e. as average or median performance over the study region or part of the study region.
Level 2 assessment: to complement the Level 1 assessment, a second assessment step was performed, termed Level 2 assessment. In this step, some of the authors of the publications from Level 1 were approached to provide data on their runoff-hydrograph predictions for individual ungauged basins. The data they provided included information on the catchment and climate characteristics, on the method used, the data availability, and predictive performance. The overall number of catchments involved was smaller than in the Level 1 assessment, so the spectrum of hydrological processes covered in the assessment was narrower. However, the amount and detail of information available in particular catchments was much higher. As in Level 1, the cross-validation performance for ungauged basins was analysed; however, information on individual catchments was now available. The cross-validation performance was estimated by a leave-one-out strategy, where each gauged catchment was in turn considered as ungauged and estimated runoff was compared with the observed runoff hydrographs. The predictive accuracy was then described by the Nash-Sutcliffe efficiency (NSE; Nash and Sutcliffe, 1970) of daily runoff.
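The leave-one-out procedure and the efficiency measure described above can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation used in any of the reviewed studies: the one-parameter runoff-coefficient model, the nearest-neighbour donor rule, all function names (`nse`, `leave_one_out`, `simulate`, `choose_donor`) and all data values are assumptions introduced here.

```python
from typing import Callable, Dict, List


def nse(observed: List[float], simulated: List[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 - sum of squared errors / sum of squared
    deviations of the observations from their mean."""
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance_sum = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - squared_error / variance_sum


def leave_one_out(catchments: Dict[str, dict],
                  choose_donor: Callable[[dict, Dict[str, dict]], str],
                  simulate: Callable[[dict, List[float]], List[float]]) -> Dict[str, float]:
    """Treat each gauged catchment in turn as ungauged and score the prediction."""
    scores = {}
    for name, target in catchments.items():
        # The target's own calibrated parameters are withheld.
        donors = {k: v for k, v in catchments.items() if k != name}
        donor_name = choose_donor(target, donors)
        simulated = simulate(donors[donor_name]["params"], target["precip"])
        scores[name] = nse(target["runoff"], simulated)
    return scores


# Hypothetical toy model: runoff is a fixed fraction c of precipitation.
def simulate(params, precip):
    return [params["c"] * p for p in precip]


# Hypothetical donor rule: nearest catchment along a single coordinate x.
def choose_donor(target, donors):
    return min(donors, key=lambda k: abs(donors[k]["x"] - target["x"]))


precip = [0.0, 10.0, 2.0, 0.0, 5.0]  # illustrative daily precipitation (mm)
catchments = {
    name: {"x": x, "params": {"c": c}, "precip": precip,
           "runoff": [c * p for p in precip]}
    for name, x, c in [("A", 0.0, 0.5), ("B", 1.0, 0.5), ("C", 5.0, 0.8)]
}

scores = leave_one_out(catchments, choose_donor, simulate)
```

In this toy setting catchments A and B donate to each other and are reproduced exactly (NSE = 1), whereas C receives the parameter of B and scores about 0.75.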
The comparative assessment conducted in this paper stratifies the analyses into three main groups:
1. Analysis of process controls on the model performance. A number of climate and catchment characteristics have been identified. A large number of catchments and modelling studies around the world have then been organised according to these climate and catchment characteristics, with a view to learning from their differences and similarities in performance in a general way.
2. Analysis of predictive performance for different types of methods. The methods for estimating the parameters of rainfall-runoff models in ungauged basins have been grouped into the classes discussed in Sect. 3. Rather than evaluating specific methods, the focus has been on types of methods, so as to be able to generalise beyond individual studies.
3. Analysis of data availability. The quality of runoff predictions in ungauged basins not only depends on the hydrological setting and the regionalisation method but also, importantly, on the data that are available for the regionalisation. The comparison therefore also examines the number of stream gauges available in a particular study as an index to characterize data availability.
Table 1 lists the 34 studies published in the last decade that are used in this paper. It includes summary information about the study region, regionalisation method applied and the predictive runoff model efficiency. The consistency of results differs between the studies. In some papers, the results are presented only as figures; in others they are summarized by the median or range of runoff model performance. Several studies compare different hydrologic models and/or regionalisation approaches, which results in a total of 75 assessments of predictive performance. These results are the basis for the Level 1 assessment, which represents a total of 3874 catchments (Table 2). Nine study authors out of the Level 1 assessment provided detailed information about climate and catchment characteristics in a consistent way and reported the regionalisation performance for each catchment (Level 2 assessment). This dataset combines data from 1832 catchments. Three catchment characteristics are analysed: aridity index, mean elevation and catchment area.
Aridity index (the ratio of potential evaporation and precipitation on a long-term basis, averaged across the catchment) is an indicator of the competition between energy and water affecting the water balance. Elevation (average topographic elevation within the catchment) is a composite indicator including a range of processes, such as long-term precipitation and hence soil moisture availability and air temperature. In some environments there is a relationship between elevation and aridity, and between elevation and snow processes. Catchment area is an indicator of the degree of aggregation of catchment processes related to scale effects; an indicator of storage within the catchment; and an indicator of the amount of rainfall data that is available for runoff estimation in ungauged basins, since larger catchments tend to contain a large number of rain gauges. With increasing area, the estimation variance of areal rainfall also decreases, and areal rainfall might be biased by an increasing number of stations located in the lower parts of the catchment (Lebel et al., 1987).
Figure caption (fragment): ... Table 1. Line indicates the study (Petheram et al., 2012) where the same method was applied across different climatic regions. Boxes show 25-75 % quantiles.
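As a small worked example of the aridity index defined above, the following Python sketch computes the ratio of long-term potential evaporation to precipitation and assigns a climate class. The threshold of 1 follows the convention stated in the caption of Table 3 of this paper (arid: aridity index > 1; humid: ≤ 1); the function names and the input values are hypothetical.

```python
def aridity_index(mean_annual_pet_mm: float, mean_annual_precip_mm: float) -> float:
    """Long-term potential evaporation divided by long-term precipitation."""
    if mean_annual_precip_mm <= 0:
        raise ValueError("precipitation must be positive")
    return mean_annual_pet_mm / mean_annual_precip_mm


def climate_class(index: float, threshold: float = 1.0) -> str:
    """Arid if the aridity index exceeds the threshold, humid otherwise."""
    return "arid" if index > threshold else "humid"


# Hypothetical catchments: (potential evaporation, precipitation) in mm per year.
wet_catchment = aridity_index(600.0, 1200.0)
dry_catchment = aridity_index(1400.0, 500.0)
classes = [climate_class(wet_catchment), climate_class(dry_catchment)]
```

Here the first catchment has an aridity index of 0.5 (humid) and the second of 2.8 (arid).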

Studies and datasets used
Prediction of runoff hydrographs in ungauged catchments is traditionally based on hydrologic model simulations. Almost all the studies reported in Table 1 used lumped conceptual models; a few studies used semi-distributed (Parajka et al., 2005), HRU-based (Viviroli et al., 2009) or distributed models (Allasia et al., 2006; Samaniego et al., 2010a, b). Most of the models predict the hydrographs at a daily time step. In the case of conceptual models, the model parameters cannot usually be measured or inferred from measurements. The parameters therefore need to be transferred (regionalised) from gauged catchments in the region, termed donor catchments (Blöschl, 2005). There is a plethora of different methods used for parameter regionalisation. In the Level 1 and Level 2 assessments we assigned them into five groups: spatial proximity, similarity, model averaging, parameter regression and regional calibration. While the spatial proximity, similarity and model averaging methods assume that the entire parameter set of a gauged basin is also valid in the ungauged basins, parameter regression and regional calibration methods relate individual model parameters to catchment characteristics. A more detailed description of each group of regionalisation methods is as follows.
-Spatial proximity: if one assumes that climate and catchment characteristics vary only smoothly in space then spatial proximity between the catchments may be a suitable similarity measure to select the donor catchment. Proximity is usually defined on the basis of distances between the catchment outlets or catchment centroids (Randrianasolo et al., 2011; Zvolenský et al., 2008; Li et al., 2009). It is also possible to use geostatistical distances, which account for the nestedness of the catchments (e.g. Skøien et al., 2006; Skøien and Blöschl, 2007).
-Similarity: an alternative is to choose the donor on the basis of the similarity of the climate and catchment characteristics in the two catchments. Similarity is usually measured by the root mean square difference of all the characteristics in a pair of catchments. The characteristics are usually standardised by their standard deviation or transformed in another way to make them comparable. Studies which choose a donor on the basis of this method use a wide range of climate and catchment characteristics. Kokkonen et al. (2003) transferred the entire parameter set from the catchment with the most similar elevation of the catchment outlet. McIntyre et al. (2004) defined the most similar catchment in terms of the catchment area, standardised annual average precipitation, and baseflow index. Other studies used a larger number of characteristics, such as Parajka et al. (2005) who defined the similarity by mean catchment elevation, areal proportion of porous aquifers, lake index, stream network density, soils, geology and land use, and Zhang and Chiew (2009) who identified the most similar catchments in terms of catchment area, mean elevation, slope, stream length, aridity, woody vegetation fraction and plant-available water-holding capacity.
As discussed by Oudin et al. (2010), more relevant catchment characteristics should be sought to better describe the geological and lithological conditions from a hydrological perspective.
-Model averaging: sometimes a weighted combination of the parameter sets from more than one donor catchment is used, where the catchments are selected either based on proximity, catchment characteristics or both (Goswami et al., 2007; Kim and Kaluarachchi, 2008; Seibert and Beven, 2009). One can either assume a fixed subdivision of the region into groups of catchments or, alternatively, allow each catchment to have its own group of donor catchments (Burn and Boorman, 1993).
-Parameter regression: alternatively, the calibrated model parameters can be related individually to catchment characteristics in the gauged catchments through empirical relationships, and these can be used to estimate the model parameters in the ungauged basin. The most common method of this type is the parameter regression method. For example, Kokkonen et al. (2003) found the drying parameter of the IHACRES (Identification of unit Hydrographs And Component flows from Rainfall, Evaporation and Streamflow data) model to be negatively related to mean overland-flow distance and the time constant governing the rate of recession in the slow store to be related to topographic slope in the Coweeta catchment, North Carolina. Merz and Blöschl (2004) found the very fast storage coefficient to be negatively correlated with topographic slope and elevation. This implies that runoff response may be particularly flashy in the high elevation catchments in Austria. Ideally, the relationship between the model parameters and the catchment characteristics should be hydrologically justifiable to give confidence for extrapolation to ungauged basins. However, this is not always the case (e.g. Sefton and Howarth, 1998; Peel et al., 2000; Fernandez et al., 2000) due to unrepresentative catchment characteristics and identifiability issues of the model parameters (Blöschl, 2005).
-Regional calibration: instead of first estimating model parameters at each (gauged) site and then relating them to catchment characteristics by an empirical relationship as in the above methods, these two steps are implemented concurrently in the regional calibration method by calibrating the coefficients of these relationships. The main motivation for doing this is to find more reliable parameters than is possible by calibrating the model parameters themselves and to make use of the spatial information contained in the catchment characteristics. The studies differ in the empirical relationships used, such as regressions (Fernandez et al., 2000), homogeneous groups (Hlavčová et al., 2000; Szolgay et al., 2003), and geostatistical methods (Hundecha et al., 2008).
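The three methods above that transfer entire parameter sets (spatial proximity, similarity and model averaging) can be sketched as follows. This is a hedged illustration under simple assumptions, not the procedure of any cited study: proximity is taken as the Euclidean distance between catchment centroids, similarity as the root mean square difference of attributes standardised by their standard deviation, and averaging as a weighted mean with user-supplied weights. All function names, attribute names and numbers are hypothetical.

```python
import math
from statistics import pstdev


def nearest_donor(target_xy, donors_xy):
    """Spatial proximity: the donor whose centroid is closest to the target."""
    return min(donors_xy, key=lambda name: math.dist(target_xy, donors_xy[name]))


def most_similar_donor(target_attrs, donor_attrs):
    """Similarity: smallest root mean square difference of standardised attributes."""
    keys = list(target_attrs)
    scale = {}
    for key in keys:
        values = [target_attrs[key]] + [attrs[key] for attrs in donor_attrs.values()]
        # Standardise by the standard deviation; fall back to 1 if it is zero.
        scale[key] = pstdev(values) or 1.0

    def rms_difference(attrs):
        squares = [((attrs[k] - target_attrs[k]) / scale[k]) ** 2 for k in keys]
        return math.sqrt(sum(squares) / len(keys))

    return min(donor_attrs, key=lambda name: rms_difference(donor_attrs[name]))


def average_parameters(donor_params, weights):
    """Model averaging: weighted mean of each parameter over the donors."""
    total = sum(weights.values())
    names = next(iter(donor_params.values())).keys()
    return {p: sum(weights[d] * donor_params[d][p] for d in donor_params) / total
            for p in names}


# Hypothetical centroids (km) and catchment attributes.
donors_xy = {"A": (1.0, 1.0), "B": (10.0, 0.0), "C": (3.0, 4.0)}
donor_attrs = {
    "A": {"area_km2": 1000.0, "elev_m": 1500.0},
    "B": {"area_km2": 120.0, "elev_m": 520.0},
    "C": {"area_km2": 90.0, "elev_m": 1400.0},
}
target_attrs = {"area_km2": 100.0, "elev_m": 500.0}

by_proximity = nearest_donor((0.0, 0.0), donors_xy)
by_similarity = most_similar_donor(target_attrs, donor_attrs)
averaged = average_parameters({"A": {"k": 10.0}, "B": {"k": 20.0}},
                              {"A": 3.0, "B": 1.0})
```

In this example the two selection rules pick different donors (A by proximity, B by similarity), which illustrates why the choice of regionalisation method can matter for an individual catchment.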

How good are the predictions in different climates?
The synthesis of the results of the existing studies (Level 1) indicates that most of the studies were performed in Europe and Australia, and more studies were performed in humid than in tropical and arid climates (Fig. 1 and Table 1). In total, there are 11, 5, 16 and 43 assessments in arid, tropical, cold and humid climates, respectively. Figure 2 shows that the performance of runoff predictions tends to be lower in arid than in cold and humid regions. The NSE ranges from less than 0.4 (Goswami et al., 2007; McIntyre et al., 2005) to 0.87 (Hundecha et al., 2008). The median NSE is 0.54, 0.64 and 0.66 in arid, cold and humid regions, respectively. There is only one study that compares the performance of the same method for different climatic conditions, the study of Petheram et al. (2012), indicated as a grey line in Fig. 2. Their results show that, in Australia, the NSE runoff efficiency is higher in tropical than in arid catchments. The main reason that the methods perform less well in arid regions appears to be that arid regions tend to be spatially more heterogeneous and the hydrological processes more non-linear.

Which method performs best?
The parameter-regionalisation methods used in the Level 1 assessment include spatial proximity, similarity, model averaging, parameter regression and regional calibration. The assessments in each group are not based on exactly the same regionalisation approach, but the methodology is similar. The spatial proximity group consists of 33 results that include the nearest neighbour, kriging and inverse distance weighting interpolation methods. The similarity group (9 results) uses parameters from those catchments that are most similar in terms of catchment and/or climate characteristics. The parameter regression group includes 17 results with different regression models used for transfer of model parameters and one study (Boughton and Chiew, 2007) in which a hydrologic model is calibrated to mean annual runoff estimated by a regression model. The model averaging group includes 12 results from either a regional pooling (averaging) of model parameters or ensemble runoff simulations for ungauged catchments. Finally, the regional calibration group includes 4 results from parameter estimation and model calibration simultaneously in a number of gauged catchments in a region.
Figure caption (fragment): ... Table 1. Boxes show 25-75 % quantiles.
The comparison of the methods (Fig. 3) indicates that the difference between the studies within each group is larger than between the groups. The NSE performance within each group is, for most of the assessments, within the range 0.5 to 0.75, while the median NSE for each group ranges from 0.58 (spatial proximity) to 0.66 (similarity). The results of studies that compare different approaches (shown as grey lines in the figure) indicate that the predictive performance of parameter regression is poorer than the other methods, with the exception of one study (Samuel et al., 2011) where the simple average of model parameters performed the worst. In this case, however, the predictive performance is generally lower than in other published studies. The reasons why one approach to regionalisation may work better than others are discussed within several intercomparison studies and other reviews (Merz and Blöschl, 2004; Oudin et al., 2008; Parajka et al., 2005; Vogel, 2005). Oudin et al. (2008), for example, reported that spatial proximity slightly outperformed the similarity method in regions with a dense stream gauge network. They reported that the predictive performance of these two approaches becomes similar when the density of stream gauges decreases to less than 60 gauges per 100 000 km². Parajka et al. (2005) reported that a significant similarity in catchment characteristics over relatively short distances in Austria may contribute to the relatively good performance of the spatial proximity and similarity regionalisation methods.

How does data availability impact performance?
Figure 4 shows the median Nash-Sutcliffe performance as a function of the number of catchments analysed in each study. As would be expected, the 21 studies with fewer than 20 catchments have the largest scatter in the performance because of the smallest sample size. As the number of catchments increases the performance tends to decrease. It is possible that, in some of the studies with few catchments, these catchments were hand-picked in terms of suitability for regionalisation and this happens less frequently in the studies with more catchments. For 12 studies with more than 250 catchments the performance, however, tends to increase. Again, some selection of catchments based on automated methods may have been performed at that scale. More detailed information on the dependence of performance on both method and number of catchments per study is shown in Fig. 5. Figure 5 summarizes 33, 9, 12, 17 and 4 results for spatial proximity, similarity, model averaging, parameter regression and regional calibration methods, respectively. The maximum performance exceeds 0.8 for the similarity, regression and model averaging methods, but this performance is documented only for the small datasets. Interestingly, the performance of similarity-based regionalisation is clearly lower for assessments with large datasets. There are only a few studies that compare runoff-hydrograph predictions obtained by different groups of methods over large datasets (e.g. three or more groups of methods and validation in more than 25 catchments). These studies suggest that for regions with dense networks of gauging stations (e.g. France and Austria) the spatial proximity approach performed best. Oudin et al. (2008) concluded that spatial proximity was the best regionalisation method in France while the regression approach was the least satisfactory. The results of Parajka et al. (2005) indicate that, for Austria, kriging and similarity-based approaches performed equally well, and significantly better than regressions or global or regional parameter averages. The results of Samuel et al. (2011) showed that also for the less dense stream-gauge network in Ontario (Canada) spatial proximity methods can perform more favourably than methods that use catchment characteristics, and coupling of spatial proximity and similarity methods provided better performance than regression and model-averaging approaches.
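Stream-gauge density, used above as an index of data availability, is a simple ratio and can be computed as in the following sketch. The value of 60 gauges per 100 000 km² is the density reported by Oudin et al. (2008) below which spatial proximity and similarity performed similarly; treating it as a cut-off between dense and sparse networks is an assumption made here for illustration, as are the function names and the example region.

```python
def gauge_density(n_gauges: int, region_area_km2: float) -> float:
    """Number of stream gauges per 100 000 km2 of region area."""
    if region_area_km2 <= 0:
        raise ValueError("region area must be positive")
    return 100_000 * n_gauges / region_area_km2


def is_dense_network(density: float, threshold: float = 60.0) -> bool:
    """Illustrative rule: dense if at least `threshold` gauges per 100 000 km2."""
    return density >= threshold


# Hypothetical region with 150 gauges in 200 000 km2.
density = gauge_density(150, 200_000.0)
dense = is_dense_network(density)
```

For the hypothetical region the density is 75 gauges per 100 000 km², which this rule classifies as a dense network.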

How does model complexity impact performance?
To assess the effect of model complexity, the studies were grouped in terms of the number of model parameters that were regionalised (Fig. 6). The results indicate that, overall, there is no strong dependency of the performance on model complexity. The median of the performance for each group of models is around 0.65, with the exception of the group with 9-10 model parameters, which is lower. The largest variability (between 0.5 and 0.88) is found for models with 11-12 parameters. Studies that explored regionalisation performance of models with different complexity (Petheram et al., 2012; Chiew, 2010; Viney et al., 2009) suggest that whilst an increasing number of free parameters may lead to increased calibration performance, the difference in runoff-prediction performance was small or negligible (Viney et al., 2009; Petheram et al., 2012). The results of Oudin et al. (2008) showed that simpler models may slightly outperform more complex models in the predictive mode. It is also interesting to compare what regionalisation methods have been used in the different studies. The spatial proximity approach tends to be used for more complex models (more than 9 transferred parameters). There is a tendency of applying simpler models in arid and mixed arid and humid catchments, while in humid and cold regions more complex models have been used.

To what extent does runoff prediction performance depend on climate and catchment characteristics?
The assessment of NSE predictive performance with respect to the four climate and catchment characteristics (Level 2 assessment) is presented in Fig. 7. There is a clear pattern of decreasing performance with aridity index for catchments with an aridity larger than 0.6. The performance in the humid catchments is generally above 0.6, while it decreases to 0.5 or less in more arid catchments. It appears that in humid catchments, the rainfall-runoff processes are more linear, the hydrologic states tend to be less variable and the controls on runoff are spatially less variable, so a better performance would be expected. For the regional calibration method there is little dependency of the performance on aridity, but these studies are from Germany and Austria, where the catchments are never very arid.
Fig. 8. Aridity index as a function of mean catchment elevation for the studies used in Level 2 assessment (Table 1). The aridity represents the median over all catchments in a particular elevation class.
The relationship between performance and elevation is more complex and depends on the region used for the assessment. There is a decrease of performance with increasing elevation in France (Oudin et al., 2008) and Australia, and an increase of performance with increasing elevation in Austria (Parajka et al., 2005). These differences are due to the different dependencies of aridity on elevation (Fig. 8). Figure 8 summarizes the aridity in 320, 912, 76 and 210 catchments in Austria, France, USA and Australia, respectively. While in Austria the aridity is less than 0.5 in catchments above 900 m a.s.l. and strongly decreases with increasing elevation, in France the aridity index exceeds 0.75 and actually increases with elevation. In Australia the aridity index is always larger than in the other regions. This pattern is consistent for all regionalisation approaches, except regional calibration, which was applied in Germany (9 catchments) and Austria (320 catchments) where catchments are never very arid. The pattern for air temperature (not shown here) is similar, with a clear tendency of decreasing performance with increasing temperature in Austria and the opposite in France. Interestingly, the model averaging method has a low median and large scatter of performance in colder catchments, which may be due to snow processes. As for the other characteristics, the regional calibration is less sensitive to air temperature than the other methods.
The results in Fig. 7 show a very clear increase of the performance with catchment scale for all approaches and essentially all regions. The median performance is around 0.60 in small catchments (0-300 km²) and increases to around 0.80 for larger catchments. Also, the variability in performance between the catchments decreases with catchment scale, i.e. the large catchments never give a very low performance. An exception is a slight increase of performance variability for the spatial-proximity method in the largest catchments in Australia and France, but this is only for a small group of catchments. Overall, this very clear pattern of an increase of the performance with catchment scale may be due to two reasons. The first is a trend for an increasing number of raingauges within a catchment as the catchment size increases. This trend likely reflects the raingauge density relative to the correlation length scale of the rainfall (Schaake, 1981). The second may be related to the aggregation effect of runoff. As the catchment size increases, some of the hydrological variability is averaged out due to an interplay of space-time scale processes, which will improve hydrological simulation. Both effects are consistent with the scale effects of performance in gauged catchments (see, e.g. Merz et al., 2009; and Nester et al., 2011).
Figure 9 summarizes the performance for different regionalisation approaches, stratified by the aridity index. The total number of catchments is 1570, 1466, 1507, 1241 and 329 for spatial proximity, similarity, model averaging, parameter regression and regional calibration methods, respectively. The top, middle and bottom panels show the performance for all catchments in Table 2, and catchments with an aridity index below and above 1, respectively. Overall, in all catchments the spatial proximity and similarity methods perform slightly better than the parameter regression and model averaging approaches. In arid catchments, however, similarity and parameter regression tend to perform slightly better than spatial proximity and model averaging. These results suggest that climate characteristics more strongly impact the runoff-prediction performance in ungauged basins than the regionalisation method.
Table 3. Methods with the highest and lowest cross-validation performance of runoff predictions in ungauged basins. Arid relates to catchments with an aridity index > 1, humid to those with an aridity index ≤ 1. Level 1 refers to an assessment of the average performance of studies, Level 2 to an assessment of the performance for individual catchments.
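The stratification described above (performance grouped by regionalisation method and by aridity class) amounts to a grouped median. A minimal sketch is given below; the records are invented for illustration and are not values from the Level 2 dataset, and the function name is hypothetical. Only the aridity threshold of 1 is taken from the text.

```python
from collections import defaultdict
from statistics import median


def median_nse_by_method_and_climate(records, arid_threshold=1.0):
    """Group (method, aridity index, NSE) records and return the median NSE
    for each (method, climate class) pair."""
    groups = defaultdict(list)
    for method, aridity, score in records:
        climate = "arid" if aridity > arid_threshold else "humid"
        groups[(method, climate)].append(score)
    return {key: median(values) for key, values in groups.items()}


# Invented example records: (method, aridity index, cross-validation NSE).
records = [
    ("spatial proximity", 0.5, 0.70),
    ("spatial proximity", 0.7, 0.74),
    ("spatial proximity", 1.4, 0.40),
    ("parameter regression", 0.5, 0.60),
    ("parameter regression", 1.4, 0.50),
    ("parameter regression", 1.8, 0.46),
]

medians = median_nse_by_method_and_climate(records)
```

With real data, comparing the spread of medians across climate classes with the spread across methods gives a direct check of whether climate or method has the stronger influence on performance.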

Discussion and conclusions
This paper has compared the performance of predicting daily runoff hydrographs in ungauged basins using conceptual runoff models with regionalised model parameters. Two kinds of assessments were performed: a Level 1 assessment, which constitutes a meta-analysis from the literature, and a Level 2 assessment, which analyses individual catchments in more detail. The results indicate that the Level 1 and Level 2 assessments are consistent while shedding light on different aspects of the prediction problem. The Level 1 assessment suggests that in humid and cold regions the performance of predicting daily runoff hydrographs in ungauged basins tends to be better than in arid regions. All regionalisation methods analysed (spatial proximity, similarity, model averaging, parameter regression and regional calibration) show a similar performance with considerable scatter within each method. There is a tendency towards a somewhat lower performance of regressions than other methods in those studies that apply different methods in the same region. Studies with few catchments and studies with a large number of catchments tend to exhibit better performance than studies with an intermediate number of catchments. For studies with a large number of catchments (dense stream-gauge network) there is a tendency for spatial proximity and geostatistics to perform better than regression or regionalisation based on simple averaging of the model parameters. There is no clear dependence of the model performance on the number of model parameters regionalised. The Level 2 assessment suggests that the performance of all methods decreases with increasing aridity. The dependence of performance on elevation and air temperature differs by region and depends on how aridity varies with elevation and air temperature. The performance of all methods increases with catchment area.
In humid conditions spatial proximity and similarity methods perform best, while in arid catchments similarity and parameter regression perform slightly better than the other methods (Table 3).
The predictive accuracy of the different regionalisation methods was quantified in terms of the Nash-Sutcliffe efficiency (NSE). Because NSE is a traditional performance measure in hydrology, it has the advantage that almost all reviewed studies use it to evaluate predictive accuracy (an exception is the study of Vogel (2005), which uses R²). On the other hand, NSE is a normalized skill score that measures runoff model performance relative to a baseline model, which in this case is the mean of the observed runoff values. This can lead to an overestimation of NSE in catchments with a strongly seasonal runoff regime (see e.g. discussion in Schaefli and Gupta, 2007). As pointed out by Gupta et al. (2009), a comparison of NSE across basins with different seasonality should therefore be interpreted with caution. For future comparative evaluations, we would hence suggest using additional information and performance measures that also enable the evaluation of different parts of runoff hydrographs, i.e. peaks, times to peak (Nester et al., 2011) or event recessions. This will help shed more light on the ability of different regionalisation methods to predict different hydrograph signatures across different runoff regimes.
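The sensitivity of NSE to the runoff regime can be illustrated with a minimal sketch. The function `nse` implements the standard definition (one minus the ratio of the squared model error to the squared deviation of the observations from their mean). The two runoff series are synthetic and purely hypothetical; they are not taken from any of the reviewed studies, and the amplitudes, noise level and random seed are arbitrary assumptions chosen for illustration.

```python
import numpy as np


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - SSE(model) / SSE(mean-of-observations baseline)."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    model_error = np.sum((observed - simulated) ** 2)
    baseline_error = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - model_error / baseline_error


# Hypothetical daily runoff for one year (arbitrary units, illustrative only).
rng = np.random.default_rng(0)
days = np.arange(365)
noise = rng.normal(0.0, 0.5, size=days.size)         # day-to-day variability
seasonal_cycle = 3.0 * np.sin(2.0 * np.pi * days / 365.0)

obs_seasonal = 5.0 + seasonal_cycle + noise           # strongly seasonal regime
obs_flat = 5.0 + noise                                # regime without seasonality

# Both "models" miss the day-to-day variability entirely, so their
# absolute errors are identical (equal to the noise term).
sim_seasonal = 5.0 + seasonal_cycle                   # reproduces only the seasonal cycle
sim_flat = np.full(days.shape, 5.0)                   # constant long-term level

nse_seasonal = nse(obs_seasonal, sim_seasonal)
nse_flat = nse(obs_flat, sim_flat)
print(f"NSE, seasonal regime: {nse_seasonal:.2f}")
print(f"NSE, flat regime:     {nse_flat:.2f}")
```

Under these assumptions the two simulations have exactly the same errors, yet the seasonal catchment receives a much higher NSE because the mean-flow baseline is a poor predictor there. This is the reason why NSE values from basins with different seasonality are not directly comparable.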
Most of the studies analysed in this assessment applied lumped hydrologic models for runoff-hydrograph predictions. There are only a few distributed-modelling studies available for the assessment. Distributed models are harder to compare because of the added complexity in parameter estimation (see Blöschl et al., 2008). As distributed models are increasingly used for a range of purposes, it will be ever more important to also compare and cross-validate the prediction accuracy of distributed models in the future. An example of such comparisons is presented in the results of the Distributed Model Intercomparison Project (Smith et al., 2004, 2012), which focuses mainly on operational flood and water resources forecasting. A cross-validation in terms of prediction accuracy in ungauged catchments will help to further improve the understanding of how to effectively parameterize the climate/landscape relationship with runoff generation at different scales. It may also be useful to compare distributed models not only on the basis of runoff data but also on the basis of other hydrological response variables such as snow patterns using snow models of different complexity (Blöschl and Kirnbauer, 1991, 1992; Nester et al., 2012).
The comparative assessment indicates two main implications for hydrological modelling. The first implication relates to the selection of model structure for runoff prediction. There are very few studies that have actually analysed what model structure would be appropriate for a particular catchment or landscape, yet it is likely that not all models will work equally well in all environments (e.g. Fenicia et al., 2011). Choice of model structure is usually guided by prior knowledge of the hydrologic system, the availability of data, and prior experience of the practitioner. This has led to a plurality of models being used. To avoid fragmentation and duplication, it might be valuable to group the world into classes of similar behaviour, based on some kind of classification scheme, and then to narrow down the number of models adopted. This will increase the experience with all such models and, through the sharing of this experience, it can lead to improvement of the models themselves and also improved predictive performance. van Werkhoven et al. (2009) found that an appropriate choice of the model structure simplified parameter estimation as the plausible parameter range is narrower if the model structure corresponds to the actual controls. The model structure of a catchment should hence be selected in the context of the particular hydro-climatic situation that controls the water balance through the soil-vegetation-atmosphere system. Depending on the setting, model structures should differ because the important hydrological processes may differ vastly between different landscapes.
The second implication stems from the fact that there is still great potential for learning from the synthesis of existing studies. Presently, however, it is not straightforward to compare the results of different studies. Many studies combine and aggregate results from different climate and physiographic settings and report only summary statistics of regionalisation performance and/or catchment characteristics. For future synthesis assessments, it would be useful to develop a universal protocol for reporting scientific results in the hydrological literature. In addition, the establishment of freely accessible data repositories to improve the synthesis and repeatability of studies would significantly contribute to making hydrology more coherent around the globe.