Moving beyond traditional model calibration, or how to better identify realistic model parameters: sub-period calibration



Introduction
Conceptual hydrological models represent an abstraction of real-world processes and are typically composed of a number of interconnected reservoirs which are supposed to represent the main catchment compartments and dominant processes. Several model parameters are typically not measurable, even when they are supposed to represent physical catchment characteristics, and therefore have to be determined by calibration (Wheater et al., 1993). Different approaches to infer parameter values and their likelihood distributions have been developed, for example single- or multi-objective calibration (Gupta et al., 1998), generalized likelihood uncertainty estimation (GLUE; Beven and Binley, 1992) and dynamic identifiability analysis (DYNIA; Wagener et al., 2003). In the traditional split-sample approach, a model is calibrated for one period of time and the parameter sets selected as behavioral in the calibration period are evaluated for a different validation period. Different combinations of calibration and validation were suggested by Mroczkowski et al. (1997); however, the proposed combinations remained constrained to a calibration-validation framework over different time periods. Calibration results are thus affected by the calibration period, and previous research has shown that optimal parameter sets for different periods can change substantially. Wagener et al. (2003) developed DYNIA, a method to screen across the time series of model predictions in order to investigate the identifiability of model parameters. They showed that the uncertainties associated with model parameters can vary substantially in different time periods.
Previously, Freer et al. (2003) assessed Dynamic TOPMODEL using GLUE based on different objective functions and on rising or falling limbs of the hydrograph. They showed that it may be difficult to propose a consistently parameterized model structure due to the significant variability of the observed responses, and concluded that the model fails to meet even relaxed acceptability thresholds. Hartmann and Bárdossy (2005) investigated parameter transferability between different climatic conditions ("warm", "cold", "wet" and "dry") and different time scales (days up to years), and designed a calibration method that allows a good performance on different time scales simultaneously. Li et al. (2011) investigated the transferability of model parameters between dry and wet conditions. Furthermore, incorporating additional data, for example tracer data or remotely sensed evaporation, into model calibration helps to identify more realistic model structures and parameter sets (Dunn and Colohan, 1999; Seibert and McDonnell, 2002; Weiler et al., 2003; Freer et al., 2004; Uhlenbrook and Sieber, 2005; Vaché and McDonnell, 2006; Son and Sivapalan, 2007; Winsemius et al., 2008; Dunn et al., 2008; Birkel et al., 2010).
Both the multi-objective and the multi-criteria calibration strategies constrain the feasible parameter space and facilitate parameter selection on the basis of performance trade-offs, i.e. Pareto fronts. However, as argued by Beven (2006), the mere mapping of optimum parameter sets after calibration is "too simplistic, since they arbitrarily exclude many models that are very nearly as good as the 'optima'". This simply means that the parameter selection should include "sub-optimal" parameter sets as well.
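The notion of a Pareto front used throughout this paper can be made concrete with a short sketch (a hypothetical illustration with made-up objective values, not code from the paper):

```python
# Non-dominated (Pareto) filtering for two objectives that are both
# minimized, e.g. an error on high flows and an error on low flows.
# The candidate values below are made up for illustration.

def pareto_front(points):
    """Return the non-dominated subset of (f1, f2) pairs (both minimized)."""
    front = []
    for p in points:
        # p is dominated if some other q is <= in both objectives
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return front

candidates = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0), (5.0, 5.0)]
print(pareto_front(candidates))  # → [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]
```

Note that (3.0, 3.0), although dominated by (2.0, 2.0), is only marginally worse; this is exactly the kind of "sub-optimal" parameter set that, following Beven (2006), should not be discarded out of hand.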
This paper introduces a new framework for parameter identification that includes both optimal and sub-optimal parameter sets which are more time consistent. The method is based on calibration over different periods and determines the parameter sets which perform best across all sub-periods. As the selected parameter sets are evaluated in different periods, only the time-consistent parameter sets are retained. The new method is applied to the Wark catchment, located in the Grand Duchy of Luxembourg, using the lumped conceptual model HyMod, and is compared with a traditional calibration-validation framework with respect to parameter identifiability and performance.
2 Sub-period calibration framework

The sub-period calibration framework involves two crucial steps in extracting the most realistic parameterizations for a given model structure. Firstly, the available input and output data sets are split into (ideally equal-length) k sub-periods. These sub-periods and their lengths can be arbitrarily chosen (e.g. month, season, etc.); it can, however, be convenient to base the analysis on full years, and the full observation period can additionally be considered. Secondly, the parameter sets that perform consistently well with respect to all Pareto fronts are identified. In order to achieve this, the k-dimensional Pareto front of distances is determined. The sub-period calibration concept is illustrated in Fig. 1. Those parameters are acceptable which show the most consistent performance relative to the optimum performance of each sub-period.
The concept is further illustrated with an abstract 2-objective, 2-sub-period example in Fig. 2. The conventional CPFs for the two sub-periods are shown in Fig. 2a. Symbol 1 (circle) represents a parameter set that is a Pareto member of the first sub-period; however, it does not perform well compared to the best possible outcome, i.e. CPF_2, when applied in the second sub-period. The parameter set represented by symbol 2 (star), on the other hand, although not a member of CPF_1 in the first sub-period, performs rather well in the second sub-period, as can be seen from the short distance to CPF_2. In other words, parameter sets which are slightly sub-optimal in one sub-period may perform significantly better than "optimal" parameter sets, i.e. CPF members, in other sub-periods.

Figure 2b and c illustrates the set-up of the sub-period calibration framework. For each parameter set in each sub-period the distance to the two CPFs is calculated. The distances of each parameter set to the Pareto fronts span a bi-dimensional space (Fig. 2c). Typically, model parameters on CPF_1 will not be part of CPF_2. Hence, there will be a trade-off when the objective is to minimize the distance to both Pareto fronts. Hereafter this trade-off will be referred to as the Minimum Distance Pareto Front (MDPF).

The sub-period calibration framework can be expressed in formal notation as follows:

Y_j(θ) = γ(ξ_j, θ)

where Y, γ, ξ and θ are the model output, the hydrological model, the forcing and the parameter set, respectively. The objective function (O) can be described as an error function (E) which returns the difference between observed and modeled values:

O_{i,j}(θ) = E_i(Y_{o,j}, Y_j(θ)),   i = 1, ..., n;   j = 1, ..., k

where n is the number of objective functions used for the evaluation of model performance, k is the number of sub-periods, and Y_{o,j} indicates the observed time series for the j-th sub-period.
The calibration Pareto front (CPF) for each of the k sub-periods can be obtained by optimizing the objective functions related to the same sub-period:

CPF_j = Pareto-optimal set of (O_{1,j}(θ), ..., O_{n,j}(θ)),   j = 1, ..., k

This results in k CPFs, each of which has the dimension n of the original objective space. The optimization space is then transformed into another multi-objective space with k (number of sub-periods) objective functions, in which the difference in model performance for each sub-period relative to its Pareto front is evaluated:

L_j(θ) = G(O_j(θ), CPF_j),   j = 1, ..., k

where O_j(θ) = (O_{1,j}(θ), ..., O_{n,j}(θ)) and G(·) quantifies the error between the model performance for the j-th sub-period and the calibration Pareto front (CPF_j) of the same sub-period. The final solution is obtained by minimizing L = (L_1, ..., L_k) in the Pareto sense. The method will in the following be referred to as SuPer calibration.

3 Case study
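Under the assumption that G is the Euclidean distance to the nearest CPF member (one possible choice; the function names and toy values below are illustrative, not the authors' code), the transformation and the resulting minimum distance Pareto front can be sketched as:

```python
import math

def distance_to_front(obj, cpf):
    """G: Euclidean distance from an objective vector to the closest CPF member."""
    return min(math.dist(obj, member) for member in cpf)

def super_transform(performance, cpfs):
    """Map each parameter set's per-sub-period objective vectors to L = (L_1, ..., L_k)."""
    return [tuple(distance_to_front(objs[j], cpfs[j]) for j in range(len(cpfs)))
            for objs in performance]

def mdpf(distance_vectors):
    """Non-dominated subset of the k-dimensional distance vectors (all minimized)."""
    keep = []
    for i, p in enumerate(distance_vectors):
        dominated = any(all(q[d] <= p[d] for d in range(len(p))) and q != p
                        for j, q in enumerate(distance_vectors) if j != i)
        if not dominated:
            keep.append(p)
    return keep

# Toy example: k = 2 sub-periods, n = 2 objectives, 3 parameter sets.
cpfs = [[(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)],   # CPF of sub-period 1
        [(1.0, 3.0), (3.0, 1.0)]]               # CPF of sub-period 2
performance = [[(2.0, 2.0), (4.0, 4.0)],        # parameter set A
               [(1.0, 4.0), (1.0, 3.0)],        # parameter set B
               [(5.0, 5.0), (3.0, 1.0)]]        # parameter set C
print(mdpf(super_transform(performance, cpfs)))  # → [(0.0, 0.0)]  (set B)
```

Set B lies on both CPFs and therefore dominates the distance vectors of A and C; in realistic settings no parameter set reaches all fronts simultaneously, and the MDPF then contains a genuine trade-off.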

Study area and data
The outlined methodology will in the following be illustrated with a case study using data from the Wark catchment in the Grand Duchy of Luxembourg. The catchment has an area of 82 km² with the catchment outlet located downstream of the town. The hydrological data include discharge at the outlet of the Wark catchment and evaporation estimated by the Hamon equation (Hamon, 1961).

Hydrological model

The rainfall-runoff model applied in the Wark catchment to illustrate the effects of the sub-period calibration framework was the lumped conceptual model HyMod (Wagener et al., 2001). HyMod was chosen for its limited number of parameters while still maintaining an adequate process representation, including slow and fast responses. HyMod is characterized by five states: the soil moisture reservoir (S_M (mm)), three linear reservoirs in series (S_F1, S_F2, S_F3 (mm)) mimicking the fast runoff component, and one slow reservoir (S_S1 (mm)). The five parameters represent the maximum soil moisture storage (S_M,max (mm)), the spatial variability of soil moisture (β (-)), the partitioning between the fast reservoirs and the slow reservoir (α (-)), as well as the timescales of the fast and slow reservoirs (R_Q ((12 h)⁻¹), R_S ((12 h)⁻¹)).

P (mm (12 h)⁻¹), E (mm (12 h)⁻¹), E_p (mm (12 h)⁻¹) and Q_m (mm (12 h)⁻¹) represent precipitation, actual evaporation, potential evaporation and modeled runoff, respectively. The simulated runoff (Q_m) is the sum of the slow and fast components (Q_m = Q_S1 + Q_F3). The water balance equations and constitutive relations are listed in Table 2, and the HyMod schematic is depicted in Fig. 3.
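A single time step of a HyMod-like model can be sketched as follows. This is an illustration only: for brevity the probability-distributed soil moisture store is reduced to a saturation-excess bucket with a β-shaped runoff fraction, whereas the exact water balance equations of the model used here are those listed in Table 2.

```python
def hymod_step(P, Ep, state, SMmax, beta, alpha, Rq, Rs):
    """One 12-h step of a simplified HyMod-like model (illustrative sketch)."""
    SM, SF1, SF2, SF3, SS1 = state

    # Runoff-generating fraction grows with relative soil moisture (exponent beta).
    sat_frac = 1.0 - (1.0 - min(SM / SMmax, 1.0)) ** beta
    effective = P * sat_frac + max(SM + P * (1.0 - sat_frac) - SMmax, 0.0)
    SM = min(SM + P * (1.0 - sat_frac), SMmax)

    # Actual evaporation limited by available soil moisture.
    E = min(Ep, SM)
    SM -= E

    # Split effective rainfall between the fast cascade and the slow reservoir.
    fast_in, slow_in = alpha * effective, (1.0 - alpha) * effective

    # Three fast linear reservoirs in series (outflow = Rq * storage).
    SF1 += fast_in; QF1 = Rq * SF1; SF1 -= QF1
    SF2 += QF1;     QF2 = Rq * SF2; SF2 -= QF2
    SF3 += QF2;     QF3 = Rq * SF3; SF3 -= QF3

    # One slow linear reservoir.
    SS1 += slow_in; QS1 = Rs * SS1; SS1 -= QS1

    Qm = QS1 + QF3  # total modeled runoff: Q_m = Q_S1 + Q_F3
    return Qm, [SM, SF1, SF2, SF3, SS1]

# Example step: 10 mm of rain, 1 mm potential evaporation, made-up parameters.
Qm, new_state = hymod_step(10.0, 1.0, [5.0, 0.0, 0.0, 0.0, 0.0],
                           SMmax=100.0, beta=2.0, alpha=0.5, Rq=0.5, Rs=0.05)
```

Iterating such a step over the forcing time series produces the modeled runoff Q_m that is compared against observations by the objective functions.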

Implementation of sub-period calibration
Using 2001 as model warm-up period, the remaining 2002-2004 observation period was decomposed into three 1-yr sub-periods (2002, 2003 and 2004). The three sub-periods were calibrated individually to obtain independent calibration Pareto fronts for each sub-period (CPF_2002, CPF_2003 and CPF_2004) as well as for the entire period (CPF_2002-2004). Based on these premises, two example implementations of SuPer calibration are given below. In the first implementation, the parameter sets minimizing the Euclidean distance of performance to CPF_2002 and CPF_2003 generate the minimum distance Pareto front (MDPF, Fig. 2c), with 2004 retained as an independent target period. Sub-optimal parameter sets should thus not be excluded, in order to ensure efficient exploitation of the information content of the available data. A full operational application of SuPer calibration, including CPF_2004 in a three-dimensional (i.e. three sub-periods) multi-objective setting, is shown in the second implementation.
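The decomposition into yearly sub-periods with a warm-up year can be sketched as follows (the record below is made up; the study itself uses 12-h data for 2001-2004):

```python
from datetime import datetime

def split_by_year(series, warmup_year):
    """Group (timestamp, value) records into {year: [...]}, dropping the warm-up year."""
    periods = {}
    for t, v in series:
        if t.year == warmup_year:
            continue  # warm-up data initialize model states but are not evaluated
        periods.setdefault(t.year, []).append((t, v))
    return periods

series = [(datetime(2001, 6, 1), 0.8), (datetime(2002, 6, 1), 1.2),
          (datetime(2003, 6, 1), 0.4), (datetime(2004, 6, 1), 2.1)]
print(sorted(split_by_year(series, 2001)))  # → [2002, 2003, 2004]
```

Each resulting sub-period then gets its own calibration and its own CPF.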
In this case study, HyMod was evaluated for high and low flows in a multi-objective optimization approach. The respective objective functions used are the Root Mean Square Error of the flows (RMSE) and the Root Mean Square Error of the logarithm of the flows (LRMSE):

RMSE = sqrt( (1/N) * Σ_{t=1..N} (Q_m,t - Q_o,t)² )

LRMSE = sqrt( (1/N) * Σ_{t=1..N} (ln(Q_m,t) - ln(Q_o,t))² )

where Q_m is the modeled flow, Q_o is the observed flow and N is the number of time steps. RMSE was used rather than the Nash-Sutcliffe efficiency because RMSE does not need a benchmark, which may differ between years (or sub-periods), for evaluating the performance (Schaefli and Gupta, 2007).
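The two objective functions can be written in a few lines (illustrative sketch; note that the log transform requires strictly positive flows):

```python
import math

def rmse(q_mod, q_obs):
    """Root Mean Square Error: emphasizes errors on high flows."""
    n = len(q_obs)
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(q_mod, q_obs)) / n)

def lrmse(q_mod, q_obs):
    """RMSE of log-transformed flows: emphasizes errors on low flows."""
    return rmse([math.log(m) for m in q_mod], [math.log(o) for o in q_obs])

q_obs = [1.0, 2.0, 4.0]   # made-up observed flows
q_mod = [1.1, 1.8, 4.4]   # made-up modeled flows
objectives = (rmse(q_mod, q_obs), lrmse(q_mod, q_obs))
```

Evaluating this pair for every candidate parameter set in every sub-period yields the objective vectors from which each sub-period's CPF is extracted.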
Here, the calibration to find the best parameter sets and the related CPFs was based on the MOSCEM-UA algorithm (Vrugt et al., 2003). This algorithm was chosen because SuPer calibration identifies parameter sets with the best performance relative to the CPFs, and MOSCEM-UA uses the Zitzler strength Pareto ranking (Zitzler and Thiele, 1999), which gives a better and more uniform estimation of the CPF. Note, however, that the choice of sub-periods, calibration objectives and criteria, as well as of the calibration algorithm used for SuPer calibration, can in principle be arbitrarily adapted to the available data and modeling requirements.

The parameter sets selected by SuPer calibration are shown as blue crosses and stars, respectively. They exhibit significantly skewed behavior towards similar parts of the CPFs for the two sub-periods. For both sub-periods, SuPer calibration chooses parameter sets which perform close to the low-flow (LRMSE) end of CPF_2002 and CPF_2003, which shows that the chosen model structure can identify low flows better than high flows in both sub-periods.

Parameter identifiability
The parameter behavior of HyMod's fast and slow reservoir coefficients was evaluated. These parameters were selected because the slow reservoir coefficient has overlapping optimum values in the three sub-periods, whereas the fast reservoir coefficient does not. The optimum parameter behavior of the slow reservoir coefficient (R_S) of HyMod is depicted in Fig. 8. It becomes evident in Fig. 8a that the actual parameter range as well as the distributions obtained from SuPer calibration based on MDPF_2002-2004 are more consistent with CPF_2004, the best available parameter set of the target period 2004, than the parameter sets obtained from CPF_2002, CPF_2003 and CPF_2002-2004.
The optimum parameter behavior of the fast reservoir coefficient (R_Q) of HyMod is depicted in Fig. 9. The best parameter ranges differ between the individual sub-periods (Fig. 9b-d and i-k). This is also the case when the entire time series 2002-2004 is used for calibration and parameters are obtained according to CPF_2002-2004 (Fig. 9e, l). However, SuPer calibration can detect the inconsistencies between the best parameter ranges of the individual sub-periods and thus widens the feasible ranges for these parameters (Fig. 9f, g, m, n).

Discussion
The fact that SuPer calibration focuses on different parts of the sub-period calibration Pareto fronts (CPFs) helps to indicate which Pareto members should be retained as "realistic" (Figs. 4 and 7b). Pareto fronts of a calibrated model may show skewed behavior with respect to one or more objective functions (e.g. CPF_2002 and CPF_2003 in Fig. 4). For traditional calibration strategies this introduces the requirement for a subjective decision on the parameter acceptance threshold (Fig. 10), as highlighted by Efstratiadis and Koutsoyiannis (2010). Khu and Madsen (2005) suggested a methodology to choose appropriate CPF members based on investigating the performance of the CPF in its different sub-dimensional spaces. Birkel et al. (2010) selected "realistic" parameter sets by confronting the "best fit" parameter sets with tracer data. In contrast to these methodologies, SuPer calibration does not require a subjective threshold for identifying parameter sets, as this threshold is implicitly given by the MDPF. This threshold is not subjective; rather, it is the best compromise between the CPFs of the sub-periods that can be achieved by a given model structure. Furthermore, SuPer calibration does not need additional data (although additional data can be incorporated); it uses no more information than the data needed to calibrate a rainfall-runoff model traditionally.
The behavior of the optimal parameter sets identified by SuPer calibration can be used as a criterion for parameter time consistency across sub-periods. For time-consistent parameters it is expected that the parameter ranges obtained by SuPer calibration are narrower than or equal to those obtained from long-term calibration, e.g. CPF_2002-2004. By identifying non-time-consistent parameters, SuPer calibration can be used as a diagnostic tool for identifying model structure deficiencies (cf. Clark et al., 2008). This design also allows the reduction of both type I and type II errors in model selection (false positives and false negatives; Beven, 2010). Furthermore, SuPer calibration can provide information about the behavior of each parameter with respect to the hydrological conditions of each period. As an example, the fast reservoir coefficient R_Q shows higher values for the sub-period 2003 than for 2002; 2003 is hydrologically distinct from the other two years 2002 and 2004 (Fig. 9). Analyses like this, similar to the DYNIA approach, can help the modeler to evaluate which parameter or function in the model structure should be changed or amended, and how.
The proposed SuPer calibration framework is thus a method that allows identifying realistic model parameterizations, based on the premise that acceptable parameterizations have to perform consistently well when predicting the response variable in independent model validation, which is implicitly enforced in SuPer calibration. To some extent it also has the potential to reduce epistemic error in models, i.e. the error due to disinformation or inaccurate input data (Kavetski et al., 2002, 2006). As a thought experiment, consider a catchment with an adequate long-term average representation of precipitation. In the case of a significant storm event with small spatial extent, which is not picked up by the gauges, a peak in runoff will be observed. A model will, through traditional calibration, be forced to mimic this peak even though there was no observed precipitation. This implies that the model will have to reproduce the "correct" output with "incorrect" input; hence the best-fit parameter set will be the one that does exactly that: reproduce the "real" output with the "incorrect" input. As a consequence, the chosen parameters will misrepresent reality and result in low predictive power of the model. As it is unlikely that an identical storm configuration and timing will occur in any of the other sub-periods, SuPer calibration will most likely discard this parameterization if it performs far from the calibration of the other sub-periods (cf. Fig. 8). Furthermore, SuPer calibration can be applied to storm events of different magnitudes and return periods separately in order to retain their characteristics during the calibration process; as an example, sub-periods can be defined as different parts of the flow duration curve.
Although the SuPer calibration framework can in principle be implemented with different calibration methods, its dependency on Pareto fronts requires calibration methods which represent the position of the Pareto front in the objective space adequately well. Uncertainty in Pareto front identification may introduce uncertainty in the final parameter sets chosen by SuPer calibration. In this study MOSCEM-UA (Vrugt et al., 2003) was used to generate the Pareto fronts in both steps of the procedure (creating the CPFs and the MDPFs). However, future research should investigate the effectiveness of MOSCEM-UA for the generation of the MDPF in the second step of SuPer calibration, as the distance to Pareto fronts (e.g. a line or surface) needs to be minimized instead of the vector toward a point (the origin of the objective space), which MOSCEM-UA was originally designed for. To ensure that using MOSCEM-UA in the second step of SuPer calibration performs well in parameter identification, SuPer calibration was additionally implemented in both steps with Monte-Carlo sampling, using the same parameter ranges for a million random samples. The results were consistent with those obtained by MOSCEM-UA; however, this may be case specific and not valid for other case studies or models with higher complexity. Investigating the performance of the optimization algorithm, especially in the second step of SuPer calibration, is therefore highly recommended.

Conclusions
In this paper a calibration framework based on splitting the available data sets into sub-periods was proposed. The SuPer calibration framework is an extension of traditional split-sample tests and can be seen as an additional layer of model testing, independent of modeling objectives and criteria as well as calibration algorithms. By extracting more information from the available data and by avoiding the "loss" of data otherwise reserved for validation, it allows the identification of more realistic model parameterizations. Although this comes at the cost of potentially reduced performance during calibration, model parameterizations obtained by SuPer calibration give consistently better prediction performance, which is what modelers should actually look for. The design of SuPer calibration is such that acceptable parameterizations have to perform consistently well when predicting any of the defined sub-periods, which is implicitly enforced in SuPer calibration, thus avoiding the need for explicit model validation. Furthermore, through the transformation of the traditional objective space into a minimum Euclidean distance space, the need for subjective choices of parameter acceptance thresholds is avoided.

It should again be emphasized that SuPer calibration is not a calibration algorithm, nor does it explicitly address parameter uncertainty. It is rather a more advanced method of model testing, building on traditional split-sample tests and making more efficient use of the available data. SuPer calibration can in principle be applied with any number and type of objective functions (e.g. NSE or RMSE) and with any number and type of calibration criteria (e.g. only using runoff, or using runoff and tracer dynamics).
A Matlab function of the SuPer calibration framework, based on a Monte-Carlo calibration strategy for the case study presented in this paper, is available at http://supercalibration.weblog.tudelft.nl/ or can be obtained by personal communication with the lead author.

Fig. 2. (a) Calibration-validation of a two-dimensional abstract optimization problem; the lines represent the best available performance during the calibration and validation periods (Pareto fronts CPF_1 and CPF_2 for sub-period calibration, respectively). The blue circle shows a CPF_1 member which performs poorly when validated in the second sub-period, i.e. it plots far from the best available results as given by CPF_2; the reverse situation is illustrated by the green triangle, which is a member of the second sub-period front (CPF_2) but performs far from the first sub-period Pareto front (CPF_1); the star shows the performance of a non-CPF parameter set which performs relatively well in both sub-periods, i.e. for calibration and validation (CPF_1 and CPF_2). (b) Proposed method of calibration by reducing the distance to the optimal solution, i.e. to the calibration Pareto fronts (CPF_1 and CPF_2), of each sub-period. (c) Minimum Distance Pareto Front (MDPF) as generated by sub-period calibration; the star shows the trade-off between the performances related to each sub-period (CPF_1, CPF_2).

Figs. 8 and 9. Parameter ranges of the slow (R_S, Fig. 8) and fast (R_Q, Fig. 9) reservoir coefficients. Dark blue, yellow, black and red lines/dots refer to CPF_2002, CPF_2003, CPF_2004 and CPF_2002-2004, respectively. Light blue and green dots show the ranges obtained from SuPer calibration based on MDPF_2002-2003 and MDPF_2002-2004 (panels f, g for RMSE, high flow; panels m, n for LRMSE, low flow; panel o: range of R_Q for CPF_2004 members with respect to LRMSE). The MDPF indicates the trade-off between the performance of a parameter set with respect to the sub-period calibration Pareto fronts (CPFs).