Looking beyond general metrics for model comparison: lessons from an international model intercomparison study

Abstract. International collaboration between research institutes and universities is a promising way to reach consensus on hydrological model development. Although model comparison studies are very valuable for international cooperation, they often do not lead to clear new insights regarding the relevance of the modelled processes. We hypothesise that this is partly caused by model complexity and the comparison methods used, which focus too much on a good overall performance instead of focusing on a variety of specific events. In this study, we use an approach that focuses on the evaluation of specific events and characteristics. Eight international research groups calibrated their hourly model on the Ourthe catchment in Belgium and carried out a validation in time for the Ourthe catchment and a validation in space for nested and neighbouring catchments. The same protocol was followed for each model and an ensemble of best-performing parameter sets was selected. Although the models showed similar performances based on general metrics (i.e. the Nash-Sutcliffe efficiency), clear differences could be observed for specific events. We analysed the hydrographs of these specific events and conducted three types of statistical analyses on the entire time series: cumulative discharges, empirical extreme value distribution of the peak flows and flow duration curves for low flows. The results illustrate the relevance of including a very quick flow reservoir preceding the root zone storage to model peaks during low flows and including a slow reservoir in parallel with the fast reservoir to model the recession for the studied catchments. This intercomparison enhanced the understanding of the hydrological functioning of the catchment, in particular for low flows, and made it possible to identify present knowledge gaps for other parts of the hydrograph. Above all, it helped to evaluate each model against a set of alternative models.

Introduction
Large efforts of the hydrological community go into the development of a large variety of hydrological models that are able to filter and reproduce relevant hydrological processes and are preferably applicable in a range of catchments (e.g. Kumar et al., 2013). The outflow from catchments is a combination of different runoff processes, occurring in different parts of the catchment and at different moments throughout the year (e.g. Berghuijs et al., 2014; Nippgen et al., 2015; Penna et al., 2015). Threshold behaviour (e.g. Spence, 2010; McMillan, 2012) and heterogeneity of moisture states (e.g. Detty and McGuire, 2010; Rinderer et al., 2014) create complex systems from which it is difficult to filter the relevant timescales and processes. Overall, hydrological models vary in process representation (conceptual vs. physically based), in the degree of spatial distribution (lumped, semi-distributed and fully distributed) and in the actual runoff process being modelled (e.g. Fenicia et al., 2016). The disadvantage of this abundance of models is that new insights and developments are very scattered and difficult to combine (e.g. Weiler and Beven, 2015). However, a large advantage of having all these different models is their possible use as multiple working hypotheses (e.g. Clark et al., 2011) in a model comparison study to investigate which processes, process representations and spatial distributions are suitable for a set of catchments.
Comparison studies are common in hydrological science, and each study has its own twist. While some studies may focus on simulations in a large variety of catchments with widely different characteristics (e.g. Gupta et al., 2014; Duan et al., 2006; Gudmundsson et al., 2012), others focus on a variety of model structures in a limited number of catchments (e.g. Breuer et al., 2009; Holländer et al., 2009; Nicolle et al., 2014; Vansteenkiste et al., 2014; Koch et al., 2016). Many of them rely on international collaboration between several institutes and universities to tackle important open hydrological research questions. Large sample studies enable rigorous testing of alternative model hypotheses and deriving ranges for which model structures are applicable in specific catchments (e.g. Gupta et al., 2014; Thirel et al., 2015a, b). A lesson learned from comparative hydrology in a small number of catchments is the importance of soft data (modeller's system understanding) as well as hard data (data and model), among others described by Winsemius et al. (2009) and Holländer et al. (2013). In the first and second distributed model intercomparison projects, Reed et al. (2004) and Smith et al. (2012) assessed the performance of lumped versus distributed models and calibrated versus uncalibrated models. They recommended looking in more detail at differences in model structures to increase our understanding of cause and effect. Ceola et al. (2015) pointed out that previous intercomparison studies have contributed little to deriving the causes of performance differences between various model structures. They state that this could be attributed to the complexity and the large differences of model structures, and to the difficulty of linking the presence of a model feature to a better or worse performance. Nevertheless, comparison experiments with different model structures should be encouraged to maintain the dialogue between different research groups and agree on adequate modelling concepts (Weiler and Beven, 2015).
During the last decade, model comparison studies have become much easier to carry out due to the large amount of freely available data and the increasing options for sharing data, tools and models. However, a solid model comparison study requires both a clear protocol and a fair comparison method for the model results (Ceola et al., 2015; Hutton et al., 2016). Protocols can, among other things, contain information regarding preprocessing of data, calibration techniques or guidelines for transferring parameter sets. Very strict protocols do not always line up with the experience of the modeller and the different requirements for each model. Therefore, protocols should be clear, but can never be all-embracing. On the other hand, the method for assessing the performance of the different model realisations should be identical. Standard performance measures (e.g. Nash-Sutcliffe efficiency, root mean squared error, mean absolute error) give a general overview, but are unable to point out small differences between model realisations (e.g. Schaefli and Gupta, 2007; Euser et al., 2015). The small differences can possibly be visualised by focusing on specific events and by using more specific performance indicators like hydrological signatures (e.g. Nijzink et al., 2016) or statistics of selected storm events (e.g. Reed et al., 2004; Smith et al., 2012). Using additional data sources for model comparison can further discriminate between model conceptualisations (e.g. Rakovec et al., 2016).
Thus, model comparison studies can be a powerful tool to maintain the scientific dialogue and may contribute to increasing catchment understanding. In this study, different universities and institutes working in and studying the transboundary Meuse basin, in western Europe, applied their rainfall-runoff model to a set of subcatchments of the Meuse basin using the same meteorological forcing. Modelled fluxes were analysed to gain insight into the behaviour of a set of hydrological models. Our objectives are as follows: (i) set a clear calibration protocol for the participating modellers; (ii) propose an evaluation protocol focused on the assessment of specific events instead of general metrics, and discuss the challenges associated with a general and objective approach to model evaluation; (iii) apply the evaluation protocol to various hydrological models proposed by various international institutions; and (iv) relate differences in the simulated hydrographs to model components, and to their associated process representations.

Study areas and data
This study focuses on three subcatchments of the Meuse basin in the Belgian Ardennes (the Ourthe, Lesse and Semois catchments) and on the two main subcatchments of the Ourthe: the Ourthe Orientale (eastern side) and Ourthe Occidentale (western side) catchments (Fig. 1 and Table 1). The Ourthe catchment at Tabreux was selected for calibration because of the limited influence of artificial reservoirs and its mesoscale, which makes it possible to focus mainly on hydrology instead of hydraulics. One large reservoir is located in the Ourthe catchment at the confluence of the Ourthe Orientale and Ourthe Occidentale; a short study showed that the influence on the downstream discharge is relatively small (see Sect. 10 of the Supplement for more explanation). The Ourthe is a typical rain-fed river with a fast response to rainfall due to shallow soils and steep slopes (Driessen et al., 2010) and has a strong seasonal behaviour (Euser et al., 2015).
Many studies have already been carried out in the Ourthe catchment (e.g. Driessen et al., 2010; Rakovec et al., 2012; Euser et al., 2015) because of its significant contribution of flow volumes in the Meuse during floods (de Wit et al., 2007). The catchment of the Ourthe at Tabreux has a total area of 1607 km² with an elevation ranging between 107 and 663 m. Mean annual precipitation and potential evaporation are 1000 and 730 mm yr⁻¹, respectively. The main land use is agriculture (28 % crops and 28 % pasture), followed by forestry (46 %) and only 6 % of the catchment has an urban cover (CORINE land use map; European Environment Agency, 2000).
T. de Boer-Euser et al.: Meuse model comparison
The neighbouring Lesse and Semois catchments and the nested Ourthe Occidentale and Ourthe Orientale catchments were selected for validation. The Lesse and the Semois catchments are about 25 % smaller than the Ourthe catchment, and their forest cover is slightly higher than in the Ourthe catchment (Table 1). Annual mean precipitation is similar in the Lesse catchment while it is 25 % higher in the Semois catchment. The upstream parts of the Semois catchment and the nested catchments within the Ourthe (Occidentale and Orientale) are relatively flat, while the Lesse catchment and downstream parts of the Ourthe and Semois catchments have steeper slopes. The hourly specific discharge of the Ourthe at Tabreux is most similar to that of the Lesse (on average 3 % lower than Ourthe, with R² of 0.91) and least similar to that of the Semois (on average 50 % higher than Ourthe, with R² of 0.78). The hourly specific discharges in the Ourthe Orientale and Occidentale are on average 7 and 5 % higher than in the Ourthe at Tabreux, with R² values of 0.92 and 0.88, respectively.
Data preparation involved interpolation of hourly precipitation station data based on Thiessen polygons. The station data were collected and made available for this study by the Service Public de Wallonie. Daily minimum and maximum temperatures from the freely available gridded E-OBS dataset (0.25° × 0.25° resolution; Haylock et al., 2008) were disaggregated to hourly values using radiation data at Maastricht (Royal Netherlands Meteorological Institute) and a sine function. Daily potential evaporation was derived with the Hargreaves formula (Hargreaves and Samani, 1985) and disaggregated to hourly values using a sine function during the day and no evaporation at night. Precipitation and temperature data were available for the period from 1 January 2000 to 31 December 2010. The data (distributed, semi-distributed or lumped) were made available to the researchers through an FTP server. Figures of hourly observed discharge, precipitation, potential evaporation and temperature for each catchment can be found in Sect. 1 of the Supplement.
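The evaporation preprocessing described above can be sketched as follows. This is a minimal illustration, not the study's actual preprocessing code: the function names, the fixed sunrise and sunset hours (06:00 and 18:00) and the example input values are assumptions, and the extraterrestrial radiation is assumed to be supplied as equivalent evaporation in mm per day.

```python
import math


def hargreaves_pet(t_min, t_max, ra_mm_day):
    """Daily potential evaporation (mm/day) with the Hargreaves-Samani formula.

    ra_mm_day is the extraterrestrial radiation expressed as equivalent
    evaporation (mm/day); temperatures are in degrees Celsius.
    """
    t_mean = 0.5 * (t_min + t_max)
    t_range = max(t_max - t_min, 0.0)
    pet = 0.0023 * ra_mm_day * (t_mean + 17.8) * math.sqrt(t_range)
    return max(pet, 0.0)


def disaggregate_daily_to_hourly(daily_total, sunrise=6, sunset=18):
    """Spread a daily total over 24 hours: half-sine by day, zero at night.

    The sunrise and sunset hours are illustrative assumptions.
    """
    weights = []
    for hour in range(24):
        if sunrise <= hour < sunset:
            # Evaluate the half-sine at the midpoint of the hour.
            phase = math.pi * (hour + 0.5 - sunrise) / (sunset - sunrise)
            weights.append(math.sin(phase))
        else:
            weights.append(0.0)
    total_weight = sum(weights)
    # Normalise so that the hourly values add up to the daily total.
    return [daily_total * w / total_weight for w in weights]


# Hypothetical example values, not taken from the study.
pet_day = hargreaves_pet(t_min=8.0, t_max=18.0, ra_mm_day=12.0)
pet_hourly = disaggregate_daily_to_hourly(pet_day)
```

Normalising the weights keeps the daily water balance intact, whatever day length is assumed.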

Methods
This comparison study roughly consists of three elements: the modelling protocol followed by each participant, the models used by each participant and the tools used for comparing the individual model results.

Modelling protocol
Eight international research groups participated in this model comparison study using one or several rainfall-runoff models. A total of 11 models were used, consisting of 7 independent models and 4 models from the SUPERFLEX framework. A modelling protocol was prescribed to enable a comparison of the results. The protocol for the modelling involved a split-sample calibration and validation for predefined periods using a common dataset (Klemeš, 1986) for the Ourthe catchment. The validation consisted of a blind validation in time (same catchment, but a different period) for the Ourthe catchment and a blind validation in space (same period, but different catchments) for the nested Ourthe Orientale and Ourthe Occidentale catchments and for the neighbouring Lesse and Semois catchments. Blind validation implies that only time series of forcing data (and no discharge observations) were given to the participants.
Calibration was carried out for the Ourthe at Tabreux for the period 1 January 2004 to 31 December 2007, using 2003 as a spin-up year. Nash-Sutcliffe efficiency (E_NSE; Nash and Sutcliffe, 1970) and E_NSE of the log of the flows (E_NSElog; Collischonn et al., 2008) were used as objective functions for calibration. E_NSE was chosen as an objective function for calibration because it is a common metric in hydrology to assess model performance with regard to high flows. E_NSElog was chosen as a second objective function to take low flow performance into account as well. Participants were free to use a calibration method of their choice to estimate parameter values as long as they used the prescribed objective functions. Although it makes the comparison of the model results less straightforward, a free calibration method does account for the experience a modeller has with a specific model. Using the Pareto front between E_NSE and E_NSElog, the best 20 realisations were selected for each model to account for a range in model realisations (Fig. 2a).
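The two objective functions and the idea of a Pareto front can be illustrated with a short sketch. The function names, the small offset `eps` added before taking logarithms and the example scores are assumptions made for illustration. The text does not specify how exactly 20 realisations were drawn from the front, so the sketch only identifies the non-dominated realisations.

```python
import math


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 minus error variance over observed variance."""
    mean_obs = sum(observed) / len(observed)
    num = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    den = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - num / den


def nse_log(observed, simulated, eps=1e-6):
    """NSE of log-transformed flows; eps (an assumption) guards against log(0)."""
    log_obs = [math.log(o + eps) for o in observed]
    log_sim = [math.log(s + eps) for s in simulated]
    return nse(log_obs, log_sim)


def pareto_front(scores):
    """Indices of non-dominated (E_NSE, E_NSElog) pairs, both to be maximised."""
    front = []
    for i, (a_i, b_i) in enumerate(scores):
        dominated = any(
            a_j >= a_i and b_j >= b_i and (a_j > a_i or b_j > b_i)
            for j, (a_j, b_j) in enumerate(scores)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical scores of four realisations: (E_NSE, E_NSElog).
scores = [(0.90, 0.80), (0.85, 0.90), (0.80, 0.75), (0.88, 0.88)]
front = pareto_front(scores)  # realisation 2 is dominated by realisation 0
```

A realisation stays on the front as long as no other realisation is at least as good on both metrics and strictly better on one, which is why a trade-off between high-flow and low-flow performance yields several candidate parameter sets rather than a single optimum.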
Validation in time was carried out for the Ourthe at Tabreux for the period from 1 January 2001 to 31 December 2003, using 2000 as a spin-up year. This period was selected for validation as it includes some relevant hydrological events such as the drought in the summer of 2003 and high flows during the winters. The validation period has relatively high flows compared to the calibration period. An additional validation in time was carried out for the period 2008-2010 for the Ourthe at Tabreux, using the calibration period as a spin-up. For the latter period, participants only received forcing data (Fig. 2b).
Validation in space was carried out for two nested catchments of the Ourthe: Ourthe Orientale (at Mabompré) and Ourthe Occidentale (at Ortho) for the period from 1 January 2001 to 31 December 2010, using 2000 as a spin-up year. Additionally, the derived parameter sets for the Ourthe at Tabreux were used to model the neighbouring Lesse (at Gendron) and Semois (at Membre) catchments for the same period. Only the forcing data were provided to the participants for this validation in space (Fig. 2b).

Descriptions of models
Each modelling group provided results as described above. A variety of models was used, including lumped, semi-distributed and fully distributed models. All models are conceptual, but their degree of complexity varies, and they are used by institutes or universities working in the Meuse basin. Figure 3 depicts the main fluxes and storages of the applied models. Table 2 shows for each model the forcing used, the calibration method and whether parameters were regionalised. Below, a short description is given for each model: the term "root zone storage" is used for the reservoir from which transpiration is modelled. Further, the term "very quick runoff" is used for a faster process than "fast runoff"; these terms can be compared with "overland flow" and "interflow", respectively. The response times for the very quick runoff, the fast runoff and the groundwater runoff are for most models in the order of 1, 8 and 80 days, respectively.
Table 2 footnotes: a Number of calibrated parameters; b Of the parameters, 11 were linked to other parameters based on parameter constraints (e.g. Gharari et al., 2014).
-GR4H-CemaNeige (Mathevet, 2005) is a combination of the CemaNeige snow module (Valéry et al., 2014) and an hourly version of GR4J (Perrin et al., 2003). GR4H is an empirical four-parameter hourly model with a root zone storage and two routing routines: one for very quick and one for fast runoff. The division of water between the two routines is fixed at a 0.1-0.9 ratio; both reservoirs interact with the groundwater. Interception is taken into account by subtracting potential evaporation from precipitation to obtain net precipitation. GR4H was developed for high flows rather than for low flows, as low flows are rarely studied at an hourly time step.
-PREvision et Simulations pour l'Annonce et la Gestion des Etiages Sévères (PRESAGES; Lang et al., 2006) is a daily tool for low flow forecasting and evaluation, and it was slightly modified to run on an hourly timescale. It is derived from GR4H with differences such as no incorporation of snow and a separated groundwater reservoir connected in series with the fast runoff reservoir. There is no longer interaction between the very quick runoff reservoir and the groundwater.
-Wageningen Lowland Runoff Simulator (WALRUS; Brauer et al., 2014a, b) is a lumped conceptual model for application in lowland areas with shallow groundwater tables. The model consists of three reservoirs: a combined root zone and groundwater reservoir, a combined very quick and fast runoff reservoir and a surface water reservoir. Snow accumulation and melt are simulated in a preprocessing step and interception was not taken into account. Note that the Ourthe catchment is not located in lowlands; we included WALRUS in the comparison to investigate where a model designed for lowlands would succeed and fail in a hilly catchment.
-M2-M5 models of the SUPERFLEX framework are four lumped conceptual models with an increasing degree of complexity. M2 consists of a root zone storage and a non-linear fast runoff reservoir. M3-M5 extend the M2 conceptualisation by adding a lag function (M3-M5), a snow routine (M4-M5) and a groundwater reservoir (M5). Interception is not taken into account in any of the four models.
-NAM is an adapted version of the hydrological model which is coupled to MIKE11 (Nielsen and Hansen, 1973). It consists of a snow module, interception reservoir, root zone storage and a groundwater reservoir; the latter is configured in parallel. Fast and very quick runoff are generated from the interception reservoir but depend on the saturation of the root zone storage.
-FLEX-Topo (Savenije, 2010; Euser et al., 2015) is a conceptual model with three parallel model elements.
-VHM involves a step-wise and data-based procedure to identify a parsimonious lumped conceptual model. For the Ourthe catchment, a model was identified which consists of a root zone storage and three runoff components: very quick runoff, fast runoff and groundwater runoff, which are configured in parallel; interception was not taken into account.
-The wflow_hbv model is a completely distributed version of the conceptual HBV model (Lindström et al., 1997) in the wflow framework (see footnote 3) with a kinematic wave as routing instead of the original triangular routing function. The model has an interception reservoir, snow module, root zone storage, fast runoff reservoir and a groundwater reservoir. The parameter values are uniform for the entire catchment, except for maximum interception capacity, which is related to land use.

Evaluation methods
The results of the 11 models and five catchments were compared in multiple ways. First, the scores obtained for the objective functions (E_NSE and E_NSElog) were compared. This step made it possible to determine the overall calibration performance of the models. We expected that this analysis would not reveal much difference; thus, two additional analyses were carried out: a statistical analysis and a hydrograph comparison for specific periods (Fig. 2c). These additional analyses focused on aspects that were not specifically taken into account during the calibration procedure, so as to investigate the full range of a model's capabilities. Three types of statistical analyses and comparisons of simulation results and observations were conducted: cumulative discharges, empirical extreme value distribution of the peak flows and flow duration curves for low flows. The cumulative discharges were plotted for the entire modelled period and used to investigate the overall water balance. The empirical extreme value distributions were constructed from independent peak discharges; the return period was calculated as the mean time interval between the exceedance of given runoff amounts. This analysis of peak flows was carried out to investigate if models were able to simulate the full range of peak discharges observed in the catchments. In addition, the empirical extreme value distribution can provide information on the usefulness of models for flood (frequency) studies and extrapolations to more extreme events, by assessing the shape of the distribution, as well as the tendency of the difference between higher modelled and observed peak flows. The flow duration curves were constructed for the lowest 20 % of the discharges. Low flows are important in the Meuse basin, especially from a user's perspective; comparing observed and simulated flow duration curves helped to assess how well models were able to reproduce low flows.
Footnote 3: http://wflow.readthedocs.io/en/latest/
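The three statistical analyses can be sketched in a few lines. The sketch below is a hedged illustration only: it assumes that independent peaks have already been extracted (the independence criteria are not given here), estimates the return period of the peak of rank i as the record length divided by i, and uses a Weibull plotting position for the flow duration curve. The function names and these plotting choices are assumptions, not the study's documented procedure.

```python
def cumulative_discharge(discharge):
    """Running sum of discharge, used to inspect the overall water balance."""
    total, out = 0.0, []
    for q in discharge:
        total += q
        out.append(total)
    return out


def empirical_return_periods(peaks, n_years):
    """Pair each independent peak with T = n_years / rank (rank 1 = largest).

    The peak of rank i is reached or exceeded i times in n_years, so the
    mean interval between such exceedances is n_years / i.
    """
    ranked = sorted(peaks, reverse=True)
    return [(n_years / rank, q) for rank, q in enumerate(ranked, start=1)]


def low_flow_duration_curve(discharge, fraction=0.2):
    """(exceedance probability, discharge) pairs for the lowest flows."""
    ranked = sorted(discharge, reverse=True)
    n = len(ranked)
    # Weibull plotting position (an assumption): p = rank / (n + 1).
    curve = [((i + 1) / (n + 1), q) for i, q in enumerate(ranked)]
    return [(p, q) for p, q in curve if p >= 1.0 - fraction]


# Hypothetical peaks (mm/h) extracted from a 10-year record.
distribution = empirical_return_periods([1.2, 3.4, 2.1, 0.9, 2.8], n_years=10)
```

Applying the same functions to observed and modelled series gives directly comparable curves, which is what the visual comparison relies on.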
Finally, specific periods were selected to compare modelled and observed hydrographs. By looking at specific events, more detailed differences can be observed between models. Four different periods outside the calibration period were selected for this analysis: a summer period, a transition from low to high flows and two winter periods. The summer of 2008 was selected, because many high-intensity precipitation events occurred during this period; during the summers in the calibration period, these events did not occur very frequently, making this a benchmark period. The autumn of 2003 was selected as a low-to-high flow transition period, as 2003 was a very dry summer, so problems with re-saturation were likely to be largest during this year. The two analysed winter periods were 2002-2003 and 2010. In the studied catchments, winter runoff can consist of rainfall (in 2002-2003) in the event of higher temperatures or of snowmelt (2010) in the event of lower temperatures. By comparing these two winter periods, the model's ability to reproduce both conditions was investigated. It should be noted that not all models contain a snow routine; thus, the winter of 2010 was also used to investigate how important a snow routine is for simulating discharges.
The statistical analyses and specific periods of the hydrographs were first compared visually; additionally, the relative error between a set of observed and modelled signatures was assessed. The modelled signature values were calculated based on the best-performing model realisation and are shown in the specific plots. The best-performing model realisation was selected for each signature to reflect the best achievable model performance and to minimise the effect of the different bandwidths in model realisations between the different models.
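A minimal sketch of the signature comparison is given below. It assumes that "best-performing" means the realisation with the smallest absolute relative error for the signature in question; this interpretation, the function names and the example numbers are illustrative assumptions.

```python
def relative_error(modelled, observed):
    """Signed relative error of a modelled signature against the observed one."""
    return (modelled - observed) / observed


def best_realisation(modelled_signatures, observed_signature):
    """Index and relative error of the realisation closest to the observation."""
    errors = [relative_error(m, observed_signature) for m in modelled_signatures]
    best = min(range(len(errors)), key=lambda i: abs(errors[i]))
    return best, errors[best]


# Hypothetical mean low flow (mm/h): one observed value, three realisations.
index, error = best_realisation([0.006, 0.011, 0.014], observed_signature=0.010)
```

Selecting the best realisation per signature means that different signatures may be represented by different parameter sets of the same model.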

Results
The analyses of metrics, statistics and hydrographs for the 11 model structures, run for the five catchments for the period 1 January 2001 until 31 December 2010, showed different model performances. All analysed figures can be found in the Supplement (Sects. 3-9); overall, they confirm that all models perform well (maximum E_NSE varying between 0.85 and 0.91 and maximum E_NSElog between 0.85 and 0.93 for the entire modelled period for Tabreux; Supplement, Sect. 2). In all figures, the results for the 20 selected realisations per model are shown and their bandwidth is closely related to the calibration method applied. It was found that even very simple lumped models (M2) could perform as well as very complex (semi-)distributed models (FLEX-Topo and wflow_hbv) under wet conditions. Most models had higher performances during the validation period than during calibration and blind validation periods in terms of E_NSE and E_NSElog, probably due to the wetter conditions during the validation period. The hydrographs and the cumulated discharges over the entire period showed that most models slightly underestimated observed flows, except for FLEX-Topo. A number of relevant differences between models and catchments are highlighted below. For each section, we explain our findings by showing the results for the most illustrative catchment.
Figure 4. Difference between observed and modelled yearly discharge for Ourthe (green bars), Ourthe Orientale (orange bars) and Ourthe Occidentale (purple bars). Note that, to make the graphs more readable, outliers were not plotted.

Internal averaging within the Ourthe catchment
Yearly simulated and observed flows in the Ourthe and its two nested catchments (Ourthe Orientale and Ourthe Occidentale) possibly show the effect of internal averaging, as depicted in Fig. 4. While discharged volumes are underestimated by all models in the Ourthe Occidentale, they are overestimated by most models in the Ourthe Orientale, and this seems to average out for the Ourthe at Tabreux. Topography, land cover and geology are comparable for the Ourthe Orientale and Ourthe Occidentale catchments, with the Ourthe Orientale catchment being a little steeper and having slightly more forest cover. However, the topography of both catchments differs from that of the Ourthe catchment at Tabreux, indicating that parameters may not be directly regionalised.
Another difference between the Ourthe catchment and its subcatchments is the volume of precipitation and runoff; the Ourthe Orientale catchment receives more precipitation and produces less runoff than the Ourthe Occidentale catchment. Previous studies (e.g. Kleidon and Heimann, 1998; Gao et al., 2014; de Boer-Euser et al., 2016) showed a link between climate (i.e. precipitation and evaporation volumes) and root zone storage capacity. Following their argument, the root zone storage capacity should indeed be larger for the Ourthe Orientale catchment and smaller for the Ourthe Occidentale catchment, compared to the Ourthe catchment, to meet the evaporative demand of the Ourthe Orientale catchment. Using the root zone storage capacity of the Ourthe catchment for the Ourthe Orientale catchment could lead to modelled discharges that are too high; using it for the Ourthe Occidentale catchment could lead to modelled discharges that are too low. On the other hand, it is also possible that precipitation is underestimated in the Ourthe catchment, as all models are underestimating the runoff volume for the Ourthe Occidentale.
Figure 5 (caption fragment). "Slope1" presents the relative error in the slope of the distribution with T_r < 1.5 years; "Slope2" presents the relative error in the slope of the distribution with T_r > 1.5 years.

Modelling flood peaks
Figure 5 shows the empirical extreme value distribution of the peak flows for the Lesse. The signatures used for the flood peaks are the slopes of the upper and lower part of the distribution; the break point between the upper and lower slope is set at a return period of 1.5 years. Out of all studied catchments, high flow extremes are the most difficult to capture by the models for the Lesse catchment: most models underestimate the flood peaks for this catchment, while they can capture them relatively well for the other catchments. For the Lesse catchment, all models are able to model the lower peaks, but they underestimate the higher peaks. Although these higher peaks are difficult to simulate, Fig. 5 shows that it is not impossible as at least one model (GR4H-CemaNeige) is able to reproduce the steeper increase in peak flow for high return periods and capture the highest peaks. The other models have a varying behaviour: some capture a part of the higher slope, while others have a poor performance for this signature. The fact that some models are able to capture the highest peaks reduces the probability that data errors and handling are the cause of underestimating the highest peaks in these catchments, as one might have concluded if all models had failed.
What is striking about the results for all catchments is that the simplest models, consisting of only two reservoirs like M2, perform as well as or sometimes even better than more complex models. This indicates that during these very wet events, fast flow paths were activated in the entire catchment and all water was drained towards the stream. With a parsimonious model structure, it is relatively easy to calibrate the limited number of parameters to fit the peak flows. When a model is more complex, including a splitting component between the fast runoff and another runoff reservoir, it might be more difficult to calibrate the model and peak flows might be overestimated or underestimated. Model M5 also contains such a splitter, with about 20 % going to the groundwater reservoir, but this does not seem to be high enough to influence the performance negatively. The importance of a groundwater and an interception reservoir during these events is limited, as they represent only a very limited fraction of the peak flows, as can be seen from the difference between M4 and M5.

Modelling low flows
Low flows were analysed by plotting the lowest 20 % of the observed and modelled flow duration curves. The slope and mean of this part of the flow duration curve were used as signatures. Discharges during the summer recession periods are generally low (ranging between 0.004 and 0.015 mm h⁻¹ for the lowest 20 %) compared to the average discharge (0.05 mm h⁻¹). The influence of a groundwater reservoir on the modelled discharge is significant, as the flow duration curves illustrate, for example, for the Ourthe at Tabreux (Fig. 6). Adding a groundwater reservoir improves the simulation of the low flows, as illustrated by the difference in performance between models M4 and M5. The only difference between the two models is the presence of a groundwater reservoir; as a consequence, M4 underestimates low flows and M5 simulates them properly. This indicates that water is stored during the high flow period in winter and released again during the low flow period in summer.
Figure 6. Lowest 20 % of the flow duration curves for the Ourthe at Tabreux for all models (red line indicates observed discharge; blue lines indicate modelled discharge).
The configuration of the groundwater reservoir is also important: model structures with a groundwater reservoir parallel to the fast runoff reservoir (M5, NAM, FLEX-Topo, VHM) generally give the best results. Model structures without a groundwater reservoir (M2-M4) underestimate the low flows, while models with a serial or interactive groundwater reservoir (GR4H, PRESAGES, WALRUS, wflow_hbv) overestimate the low flows or model the recessions too steeply. On the one hand, this indicates the importance of preferential recharge in the catchment; on the other hand, it indicates the existence of runoff processes with different timescales. With a parallel groundwater reservoir, the timescales for runoff generation are decoupled; with a serial or interactive groundwater reservoir, they are connected. These differences in results between models indicate that the processes related to fast and slow runoff generation occur relatively independently of each other in the studied catchments. The described results are clearly visible in the flow duration curves of the Ourthe at Tabreux and the Lesse at Gendron; for the other catchments, the same patterns can be found, but slightly shifted upwards or downwards.
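The difference between a parallel and a serial groundwater reservoir can be sketched with two linear reservoirs. This is a conceptual illustration only and does not reproduce any of the compared models: the function `simulate_reservoirs`, the recession coefficients, the split fraction and the explicit hourly update are hypothetical choices.

```python
def simulate_reservoirs(recharge, k_fast=0.05, k_slow=0.001, split=0.2, parallel=True):
    """Route recharge (mm per time step) through a fast and a slow linear reservoir.

    parallel=True : a fixed fraction `split` of the recharge enters the slow
                    store directly, so the two timescales are decoupled.
    parallel=False: all recharge enters the fast store and the slow store is
                    fed from the fast store's outflow, so the timescales are coupled.
    All parameter values are illustrative assumptions.
    """
    s_fast, s_slow, discharge = 0.0, 0.0, []
    for r in recharge:
        if parallel:
            s_fast += (1.0 - split) * r
            s_slow += split * r
        else:
            s_fast += r
        out_fast = k_fast * s_fast
        s_fast -= out_fast
        if parallel:
            q_fast = out_fast
        else:
            q_fast = (1.0 - split) * out_fast
            s_slow += split * out_fast     # percolation from the fast store
        q_slow = k_slow * s_slow
        s_slow -= q_slow
        discharge.append(q_fast + q_slow)
    # Return the remaining storage as well so the water balance can be checked.
    return discharge, s_fast + s_slow
```

Running both configurations with the same recharge series and comparing the recession limbs shows how the slow store is filled immediately in the parallel case and only after a delay in the serial case.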

The effect of a very quick runoff component
In the summer of 2008, precipitation intensities were higher than in other years, although total precipitation amounts were similar. This resulted in a flashy response of summer peaks, which is clearly shown for the catchment of the Ourthe at Tabreux in Fig. 7. The performance for this peaky response was assessed with signatures for the average slope of the declining limbs and the total discharged volume. The antecedent root zone storages before the events can be expected to be low due to high transpiration rates in summer. While most models are not able to capture the summer peaks, VHM and FLEX-Topo are able to simulate the dynamics well, although FLEX-Topo overestimates the summer peaks. As shown in Fig. 3, VHM and FLEX-Topo are the only models that contain a very quick runoff component that precedes, and is independent of, the root zone storage. Hence, it illustrates that this component is essential for simulating short, intense summer events which are likely to cause infiltration excess overland flow (i.e. precipitation intensity being higher than infiltration capacity of the soil). Under dry conditions, the infiltration capacity of the soil is assumed to be disconnected from the saturation of the soil. Therefore, the very quick flow component should be independent of the root zone storage and should precede it; otherwise, these short, intense summer rainfall events are stored in the soil instead of discharging directly to the river. Models with a very quick runoff component which is affected by the root zone storage (WALRUS and NAM) and models with a very quick runoff component following the root zone storage (GR4H, wflow_hbv, PRESAGES) do perform better than models where the very quick runoff component is entirely lacking (M2 to M5), but they miss the sharpness of the response due to damping of the generated peaks. These findings are consistent for all studied catchments.
Figure 7. Modelled (blue) and observed (red) discharges for summer 2008 for the Ourthe at Tabreux. The green line shows the cumulative actual evaporation for the plotted period. Note that the four graphs with precipitation and temperature on top are the same. The term "sfl" presents the relative error in the average slope of the falling limbs; "Qsum" presents the relative error in the total modelled discharge for the presented period.
Hydrol. Earth Syst. Sci., 21, 423-440, 2017 www.hydrol-earth-syst-sci.net/21/423/2017/
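The idea of a very quick runoff component that precedes the root zone storage and does not depend on it can be sketched as a simple intensity threshold. This is an assumed, simplified formulation for illustration; the actual formulations in VHM and FLEX-Topo may differ, and the function name `partition_rainfall` and the infiltration capacity of 5 mm per hour are hypothetical.

```python
def partition_rainfall(p, root_zone_storage, root_zone_capacity, infiltration_capacity=5.0):
    """Split hourly rainfall p (mm/h) before it reaches the root zone.

    The very quick runoff depends only on rainfall intensity (infiltration
    excess), not on how full the root zone storage is. Values are illustrative.
    """
    # Infiltration excess: rainfall above the infiltration capacity runs off directly.
    very_quick = max(p - infiltration_capacity, 0.0)
    infiltration = p - very_quick
    # The infiltrated part fills the root zone; anything that does not fit spills over.
    space = max(root_zone_capacity - root_zone_storage, 0.0)
    to_root_zone = min(infiltration, space)
    saturation_excess = infiltration - to_root_zone
    return very_quick, to_root_zone, saturation_excess
```

With a dry root zone, an intense shower such as `partition_rainfall(12.0, 10.0, 100.0)` still produces very quick runoff, whereas a formulation that routes all rainfall through the root zone first would store it in the soil.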

Transition from low to high flows
The largest differences in model results between the modelled catchments occur for the transition from low to high flows. The signatures used for this period are the ratio between the first and the highest peak of this specific period and the total discharged volume. For the transition period in 2003, runoff is overestimated for all models in the Ourthe Orientale (Fig. 8), while only to a minor extent in the other catchments. In the Lesse catchment, the performance during this transition period is the highest of the four selected periods for almost all models. In addition, the performance in the Lesse catchment is also higher than that for the calibrated Ourthe catchment for almost all models. The variability in performance between models and between subcatchments for this event prevented pinpointing model components that explain the differences in performance. Although all models overestimate the discharge of the Ourthe Orientale (Fig. 8), their response is different. They especially differ in simulating the two highest peaks: PRESAGES and WALRUS simulate the first one relatively well, but underestimate the second. The other models overestimate the first peak and vary in how they simulate the second one. As the transition period is controlled by the modelled rate of infiltration and its dependence on soil saturation state, one explanation could be differences in modelled evaporation in the antecedent period; however, the model with the lowest evaporation (PRESAGES) is not the model with the highest overestimation of discharge. FLEX-Topo strongly overestimates the discharges; this can partly be caused by the root zone storage capacity. This model has a climate-derived root zone storage capacity (de Boer-Euser et al., 2016), which is significantly higher for the Ourthe Orientale catchment than for the Ourthe catchment (see also section "Internal averaging within the Ourthe catchment").
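The two signatures for the transition period can be sketched as follows. The peak identification used here (simple local maxima above a threshold) is an assumption, as the text does not specify how peaks were selected; `transition_signatures` and `min_peak` are hypothetical names.

```python
import numpy as np

def transition_signatures(q, min_peak=0.0):
    """Ratio of the first to the highest peak, and total discharged volume.

    Peaks are taken as local maxima above `min_peak`; this selection rule
    is an illustrative assumption.
    """
    q = np.asarray(q, dtype=float)
    inner = q[1:-1]
    # Local maximum: higher than the previous value, at least as high as the next.
    is_peak = (inner > q[:-2]) & (inner >= q[2:]) & (inner > min_peak)
    peaks = inner[is_peak]
    if peaks.size == 0:
        raise ValueError("no peak found in the selected period")
    return peaks[0] / peaks.max(), q.sum()
```

Computing both values for the observed and the modelled hydrograph of the same period allows the relative errors labelled "Peak" and "Qsum" to be derived.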
The difficulty the models encounter to model this transition may be linked to the hysteretic behaviour in dry-to-wet transition periods (autumn) and in wet-to-dry transition periods (spring). This finding illustrates remaining knowledge gaps with regard to the rewetting of catchments after dry periods, which seems to work differently than what is currently assumed in our models.

The effect of a snow routine
The models that include a snow routine (GR4H, WALRUS, M4, M5, NAM, FLEX-Topo and wflow_hbv) did not perform significantly better than the others during snow events. Figure 9 shows the winter of 2010 for the Semois catchment: this is the catchment and period with the largest differences between models with and without snow. For this period, the timing and the discharged volume of the snowmelt peak at the beginning of March 2010 were used as signatures. The models with a snow routine can reproduce the snowmelt peak slightly better. The differences are, however, rather small. Although it could be expected that having a snow module improves the performance during a snow event, this was not clearly found in this study. Possible causes for the limited effect of the snow module are that, although some snow occurs every winter, snow cover mainly persists for short periods and the influence of snowmelt on the discharge is limited. In addition, the discharges corresponding to snowmelt periods are similar in magnitude to those originating from liquid precipitation. These aspects, in combination with the use of general metrics for calibration, lead to the possibility that (small) influences of snow on the discharge are compensated by other parameters when a model does not have a snow module.

Results for all catchments
Figures 4 to 9 show plots for specific catchments; Fig. 10 additionally shows the relative error for eight signatures, calculated for the periods shown in the plots for all models and all catchments. A red symbol indicates that the modelled value is too low, a blue symbol that the modelled value is too high; darker colours indicate larger errors, and light or white colours indicate that the modelled signature is very close to the observed signature. The figure shows that the cumulative flows can be reproduced well by all models for all catchments. The higher slope of the peak distribution is difficult to reproduce by most models for all catchments. This contrast between average performance and modelling peak flows was found by Donnelly et al. (2016) as well. Furthermore, it can be seen that, in the event of larger errors, the signatures are generally underestimated and not overestimated, except for the slope of the lower flow duration curve and the slope of the falling limbs in the case of FLEX-Topo.
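The sign convention of Fig. 10 can be sketched with a relative error and a simple classification. The normalisation by the observed value is the common definition and is assumed here; the tolerance of 5 % used to decide when a signature counts as "very close" is a hypothetical value that is not taken from the study.

```python
def relative_error(modelled, observed):
    """Relative error of a modelled signature with respect to the observed one."""
    if observed == 0:
        raise ValueError("observed signature must be non-zero")
    return (modelled - observed) / observed

def classify_signature(rel_err, tolerance=0.05):
    """Map a relative error to the colour convention described in the text.

    The tolerance is an illustrative assumption.
    """
    if rel_err < -tolerance:
        return "red"      # modelled value too low
    if rel_err > tolerance:
        return "blue"     # modelled value too high
    return "white"        # modelled value close to the observed value
```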

Findings about the Meuse basin
The results of this study first of all show differences and similarities between catchments and models. In addition, the analysis of model behaviour under relatively dry conditions (Figs. 6 and 7) shows which model configurations are more suitable for these catchments than others: the conceptualisations of the very quick runoff component and the groundwater reservoir. The very quick runoff component is necessary and should precede infiltration into the root zone storage and not be affected by it. The groundwater reservoir is necessary as well and can best be implemented in parallel to fast runoff generation. The effect of a very fast runoff component is directly visible in the hydrographs and consistent for all catchments. The effect of a (different conceptualisation of the) groundwater reservoir is best visible in the lower parts of the flow duration curve and the strength of the effect varies per catchment. The results thus indicate that both components are important for a conceptual model of these catchments, especially when the model is aimed to be applicable for analysis of peak and low flows. High flows are best predicted when the root zone storage directly flows to the fast runoff store, with only limited splitting towards other reservoirs.
These findings show that in summer, during intense rainfall events, the infiltration capacity of the soil is exceeded by the rainfall intensity and leads to a very rapid response for at least part of the catchment. The results regarding the recession illustrate the preferential pathways that exist between the unsaturated soil and the groundwater and their importance during low flow periods. Regarding peak flows, we found that during these events, fast flowpaths in the entire catchment are activated and that all water is rapidly drained to the river. Very simple model structures (which may contain only two reservoirs) are then sufficient to model peak flows well in the Meuse. However, these simple model structures are not able to capture the full range of regimes, especially during low flows. These results highlight how difficult it is to develop a model structure which is able to capture all different regimes that occur in the studied basins.
We were able to generalise the importance of two components to improve low flow predictions, and we pointed out an important component to model peak flows better; other results were too variable in space and between model structures and could therefore not be linked to specific model structural components. The comparison consists of 11 model structures, each with specific details. Therefore, other differences and similarities in the modelled discharge could not be easily related to differences and similarities in model conceptualisations. This highlights remaining knowledge gaps with regard to important processes that occur during the transition from low to high flows (Fig. 10), which are not well understood and therefore not well implemented in our models. In these periods, we believe that vegetation plays a crucial role, influencing infiltration and evaporation, but these dynamics seem to be lacking in our models (cf. Seibert et al., 2016). Another general result of this comparison study is the higher performance of the normal (non-blind) validation period compared to the calibration and blind validation periods. Although performance generally decreases during validation, some studies show an increase in performance (e.g. Hrachowitz et al., 2014) for the validation period. Often, this indicates that hydro-meteorological conditions in the validation period were easier to model. The same holds here: the validation period is the wettest period, and most conceptual models yield the best performance under wet conditions. The higher performance during validation and the hydrometeorological differences between the calibration and validation periods show that the models are transferable in time and space within our testing periods.
[Figure 8 caption, partial] Note the four graphs with precipitation and potential evaporation on top are the same. "Peak" presents the relative error in the ratio between the first and the highest peak; "Qsum" presents the relative error in the total modelled discharge for the presented period.

Benefits of an intercomparison study
Intercomparison studies can provide a more detailed overview of a model's potential than single model studies.
In that sense, they enable individual modellers to reflect on familiar model structures through comparative identification of lacking or relevant components. In a single model study, a poor performance may be easily blamed on data shortcomings or model structural errors. In an intercomparison study, it is less likely that the poor performance of a certain model is due to data errors if there is at least one model that performs well when forced with the same input data.
Comparing model performance following a fixed protocol and linking results to model components provides a strong basis for improved model development for all modellers. Dominant runoff processes and their model representations can be derived and added to the various model structures. New experiments and hypotheses on catchment understanding can be formulated and tested by all modellers in their specific model. This may ultimately lead to the development of model structures that are applicable for different hydrological regimes. As such, even though it is a time-consuming process, it is worthwhile to increase our understanding, to get to know other model structures and to stimulate the dialogue between different institutes and universities.
Preliminary results of the model comparison study were sent to the modellers with the instruction to evaluate the model results and speculate on how their model structure could be improved. One of the responses was that processes were not (or only recently) specifically included in the model (e.g. fast runoff caused by infiltration excess overland flow or snow) because they were not necessary for earlier applications. In addition, the prescribed calibration objectives and lumped precipitation forcing used for most models were brought up as reasons for inferior model performance: the method used to calibrate VHM for this experiment differed from the normal calibration method. This may have played an important role in the underestimations of peak flows. As others have explained, the E_NSE objective function applied to the flows at all time steps gives weight to the high flows but less to extremely high flows because they are typically of shorter duration. When comparing the automatic calibration applied for this study to the manual calibration normally applied, focusing on the hydrological extremes, improved results for peak flows are obtained for the manual calibration, as illustrated in Fig. 11. This intercomparison study shows that the assessed models have different strengths in capturing specific characteristics of the runoff response. Single models may have been developed to perform better on a specific aspect at the expense of another one, as explained by Duan et al. (2007). Applying a multimodel ensemble instead of relying on a single model outcome provides more information on model structure uncertainty. This helps hydrologists to better understand the catchment functioning and improve uncertainty estimations. In an operational context, multimodel ensembles are useful to make more informed decisions.
Figure 9. Modelled (blue) and observed (red) discharge for the Semois at Membre for the 2010 winter period. The green line shows the cumulative actual evaporation for the plotted period. Note that the plots with precipitation and temperature on top are the same. "Time" presents the offset in hours of the timing of the highest peak; "QsumE" presents the relative error in the modelled discharge of the snowmelt peak in March.

Comparison of models
The choice of calibration method was left to the individual modellers, with the only constraint that E_NSE and E_NSElog had to be used as objective functions. This resulted in some modellers using a search algorithm, while others applied uniform sampling of the parameter space. In addition, the (width of the) parameter space before sampling varied per model. This freedom in calibration has probably affected the results; on the other hand, we considered that the calibration algorithm chosen is strongly linked to the model and the modeller's experience. Because of these differences in calibration method, the range of the model realisations varied considerably between models: for some models, the 20 realisations were almost identical, while for others, there were large differences. This introduced an additional source of variability to the comparison, but this variability did not alter the conclusions.
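The calibration set-up described above can be sketched for the uniform-sampling case. This is a minimal sketch under several assumptions: `toy_model` is a single linear reservoir standing in for any of the compared models, the "observations" are synthetic, the small constant `eps` in the log-transform is added for numerical safety, and combining the two objectives by taking their minimum is a hypothetical choice, since the text does not state how E_NSE and E_NSElog were combined.

```python
import numpy as np

def nse(obs, mod):
    """Nash-Sutcliffe efficiency of modelled against observed discharge."""
    obs, mod = np.asarray(obs, dtype=float), np.asarray(mod, dtype=float)
    return 1.0 - np.sum((obs - mod) ** 2) / np.sum((obs - obs.mean()) ** 2)

def nse_log(obs, mod, eps=1e-6):
    """Nash-Sutcliffe efficiency of log-transformed flows (emphasises low flows)."""
    obs, mod = np.asarray(obs, dtype=float), np.asarray(mod, dtype=float)
    return nse(np.log(obs + eps), np.log(mod + eps))

def toy_model(k, rain):
    """Single linear reservoir; a hypothetical stand-in for a real model."""
    storage, discharge = 0.0, []
    for r in rain:
        storage += r
        out = k * storage
        storage -= out
        discharge.append(out)
    return np.array(discharge)

rng = np.random.default_rng(42)
rain = rng.exponential(0.1, size=500)      # synthetic forcing (assumption)
obs = toy_model(0.05, rain)                # synthetic "observations"

# Uniform sampling of the (one-dimensional) parameter space.
candidates = rng.uniform(0.01, 0.2, size=1000)
scores = []
for k in candidates:
    sim = toy_model(k, rain)
    # Combine both objectives; using the minimum is an illustrative choice.
    scores.append(min(nse(obs, sim), nse_log(obs, sim)))
scores = np.array(scores)

# Keep the 20 best-performing realisations.
best = candidates[np.argsort(scores)[::-1][:20]]
```

With a search algorithm instead of uniform sampling, the retained realisations would typically cluster more tightly around the optimum, which is consistent with the differences in spread between models noted above.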
After the calibration on E_NSE and E_NSElog, the models were compared, focusing on specific periods and statistics. Although the general metrics showed a high performance for all models, (large) differences are observed when focusing on the specific periods or statistics. This is especially true when modelling events under drier conditions, which are the conditions under which different model behaviours were most visible. A model evaluation based on visual inspection of the hydrographs during specific events may sound subjective, but because it focuses on very specific events, the human eye easily detects patterns that reflect model performance. Combining the visual inspection with the relative error for specific signatures enabled us to further identify similarities between models and catchments, as shown in Fig. 10. This emphasises again the importance of a broad but specific model evaluation, especially for a model comparison study.
Figure 10. Summary of the performance of a set of signatures for all models and all catchments. The five dots in each square represent the different catchments: left is Ourthe, top middle is Ourthe Orientale, top right is Ourthe Occidentale, bottom middle is Lesse and bottom right is Semois. A red symbol indicates that the modelled value is below the observed value, a blue symbol that the modelled value is higher than the observed value; darker colours indicate larger differences, and light or white colours indicate that the modelled signature is very close to the observed signature.
The majority of the models considered in this study are lumped and used lumped forcing. Only two models, FLEX-Topo and wflow_hbv, used semi-distributed and fully distributed forcing, respectively. The distribution of the forcing and the model did not seem to have a significant impact on model performance compared to the other models. The differences in model structure affected model performance more than the differences in distribution of forcing. This is in line with earlier studies (e.g. Euser et al., 2015; Vansteenkiste et al., 2014), which showed that distribution of forcing data has a smaller effect on performance than the selection of model structure.
The varying degree of experience of the modellers with both their model and calibration technique and with the studied areas is likely to influence the reproducibility of this experiment. However, the similar forcing data used and the defined protocol reduced the degrees of freedom of the modellers and enabled the comparison of the results.
Figure 11. Difference between observed (red) and modelled (blue) empirical frequency distribution of peak flows for the Lesse at Gendron for the entire modelled period (2001-2010) for the automatic (left panel) and the manual (right panel) VHM calibrations. "Slope1" presents the relative error in the slope of the distribution with T_r < 1.5 years; "Slope2" presents the relative error in the slope of the distribution with T_r > 1.5 years.
This study is a large step forward in the international cooperation between universities and institutes working in the Meuse basin. Sharing data and model results in this set-up has never been realised before, but it is fundamental to open up the dialogue and advance hydrological understanding of the studied catchment in a more coherent way.

Future intercomparison studies
We think that international model intercomparison studies are very important and are definitely valuable in future research programs. First of all, they are a good opportunity to increase cooperation and discussion between different institutes. In addition, they are a good means for young scientists to get to know the models used in neighbouring universities and institutes.
To make it easier to draw strong conclusions about the hydrological functioning of a catchment, a different set-up may be useful. If each modeller selected a particularly strong element of their model, this element could be incorporated in all the other models. Doing this in a controlled sequence, and thereby creating a virtual laboratory, would probably yield more insight into hydrological functioning and suitable model conceptualisations. In addition, more independent data sources, besides discharge, would probably strongly increase the possibility to obtain insight about the hydrological functioning of the studied catchments.

Conclusions
For this study, we compared 11 models for five subcatchments of the Meuse basin. All models were calibrated on the Ourthe at Tabreux; they were then evaluated for two different periods and five different catchments. E_NSE values for all models and all catchments were comparable, in some cases with even higher performances during the validation period. Although E_NSE values were comparable, a more detailed analysis, focusing on specific events through hydrograph inspection and statistics, revealed clear differences between the models, especially for drier conditions. We found that a very quick runoff component preceding and not affected by the unsaturated store was relevant for modelling the hydrological response after short and intense summer precipitation events. This conceptualisation ensures that water is not stored in the soil but quickly flows to the river. Also, a groundwater reservoir implemented in parallel to the fast runoff generation, representing preferential pathways for groundwater infiltration, seemed necessary to model the recession best. For high flows, we found that very simple and lumped model structures with only an unsaturated store and a fast runoff component performed better than complex models. This highlights the difficulty of developing model structures which are able to cope with different hydrological regimes (high and low flows). The presence of knowledge gaps was further revealed by the inability of our models to predict the transition from a low flow period to high flows well, probably related to the lack of vegetation dynamics included in our models. Thus, from this study, we can conclude that often more detailed analyses are required to relate differences in the hydrograph to model structure components. A model intercomparison study is a valuable approach to draw conclusions about hydrological functioning of a system, and most of all, it is a great opportunity to reflect on one's model structure by comparing it with other models.
This leads to the following question: "What is my model doing well in comparison to other models and why?" This points out the model structure components to keep and, in the end, focusing on this question will improve our hydrological understanding.
The Supplement related to this article is available online at doi:10.5194/hess-21-423-2017-supplement.