An approach to identify time consistent model parameters: sub-period calibration

Conceptual hydrological models rely on calibration for the identification of their parameters. As these models are typically designed to reflect real catchment processes, a key objective of an appropriate calibration strategy is the determination of parameter sets that reflect a “realistic” model behavior. Previous studies have shown that parameter estimates for different calibration periods can be significantly different. This questions model transposability in time, which is one of the key conditions for the set-up of a “realistic” model. This paper presents a new approach that selects parameter sets that provide a consistent model performance in time. The approach consists of testing model performance in different periods, and selecting parameter sets that are as close as possible to the optimum of each individual sub-period. While aiding model calibration, the approach is also useful as a diagnostic tool, illustrating tradeoffs in the identification of time-consistent parameter sets. The approach is applied to a case study in Luxembourg using the HyMod hydrological model as an example.


Introduction
Conceptual hydrological models represent an abstraction of real world processes, and are typically constituted of a number of interconnected reservoirs which are supposed to represent the main catchment compartments and dominant processes . Typically, several of these model parameters are not measurable, even if they are supposed to represent physical catchment characteristics, and as a result they have to be determined by calibration (Wheater et al., 1993). Different approaches to infer parameter values and their distributions have been developed, for example single or multi-objective calibration (Gupta et al., 1998), generalized likelihood uncertainty estimation (GLUE, Beven and Binley, 1992), dynamic identifiability analysis (DYNIA,  and Bayesian inference (Wood and Rodríguez-Iturbe, 1975).
A key objective for hydrological modeling is the development of "realistic" models, that is, models which are able to reflect real catchment processes . The set-up of a realistic model requires the determination of a realistic model structure and a suitable parameterization. While the determination of a suitable model structure is a theoretical development in its own right (e.g. Wagener et al., 2002;Fenicia et al., 2007Fenicia et al., , 2011Clark et al., 2008;Savenije, 2009), we focus here on the determination of realistic parameter sets, and in particular, on parameter sets that reflect a consistent model behavior in time.
Model transposability in time is in fact recognized as one of the main requirements to a successful "validation" of model performance (Klemeš, 1986). Hartmann and Bárdossy (2005) advocate that "if a model is to be used under nonstationary conditions, its parameters and process descriptions should be transferable".
The calibration-validation approach (or the split-sample test proposed by Klemeš, 1986) has become standard in hydrological practice (Andréassian et al., 2009). A model is calibrated for a period of time and the parameter sets which are selected as behavioral in the calibration period Published by Copernicus Publications on behalf of the European Geosciences Union.

S. Gharari et al.: Sub-period calibration
are subsequently evaluated for a different validation period. Several combinations of calibration and validation for multiple-response data were suggested by Mroczkowski et al. (1997). Calibration and validation is proposed as a crucial step in the comprehensive model developing scheme proposed by Refsgaard et al. (2005). Seibert (2003) pointed out that the success of identifying the best parameter set (or model structure) relies on the selection of time periods with similar characteristics. He argued that the reason for the scarce literature on models which perform well in time periods with characteristics different from the calibration period is due to the fact that they most probably fail this test (i.e. the differential split sample test proposed by Klemeš, 1986). Kirchner (2006) criticized commonly used model evaluation methods. He argued that "such models are often good mathematical marionettes; they often can dance to the tune of the calibration data. However, their predictive validity is often in doubt". This shortcoming was repeatedly addressed in the literature (Anderson and Woessner, 1992;Hassan, 2004;Gupta et al., 2008;Refsgaard and Hansen, 2010).
The failure of validation tests has its counterpart in the fact that calibrated model parameters are inherently linked to the calibration time period, and may be inadequate to represent other periods.  developed a method to screen across the time series of model prediction in order to investigate the identifiability of model parameters. They show that uncertainties associated to model parameters can vary substantially in different time periods. Coron et al. (2012) used a similar concept to investigate the performance of the three models in contrasting climate conditions. They questioned the validity of parameter transferability in time due to varying climate conditions. Previously, Freer et al. (2003) evaluated the dynamic TOP-MODEL using GLUE with different objective functions based on the rising or falling limbs of the hydrograph. They showed that it may be difficult to propose a consistently parameterized model structure due to the significant variability of the observed responses. They concluded that the model fails to meet even relaxed acceptability thresholds. Hartmann and Bárdossy (2005) investigated parameter transferability in different climatic conditions ("warm", "cold", "wet" and "dry") and for different time scales (from days to years). They designed a calibration method that allows a good performance on different time scales simultaneously. Li et al. (2012) investigated the transferability of model parameters for dry and wet conditions. They showed that the dry period contains more information for model calibration than the wet period. Bárdossy and Singh (2008), using the depth function (Tukey, 1975), concluded "that equally performing parameters are not necessarily equally transferable or equally sensitive".
While the decrease of model performance in the validation period can have many causes, we focus here on how it is affected by the parameter selection approach. Various approaches have been proposed to extract meaningful hydrological information from the observed time series. Boyle et al. (2000Boyle et al. ( , 2001 used the multi-objective calibration approach proposed by Gupta et al. (1998) to calibrate a model for different flow segments of the hydrograph. The multiobjective approach makes it possible to identify optimal parameter sets for a set of objective functions. This approach was extensively used in several applications (for a review see Efstratiadis and Koutsoyiannis, 2010). Incorporating multiple calibration-criteria, for instance tracer data or remotely sensed evaporation, into model calibration helps in identifying a more realistic model structure and parameter sets (e.g. Weiler et al., 2003;Freer et al., 2004;Uhlenbrook and Sieber, 2005;Vaché and McDonnell, 2006;Son and Sivapalan, 2007;Winsemius et al., 2008;Dunn et al., 2008;Birkel et al., 2010;Fenicia et al., 2010;Hrachowitz et al., 2012).
Both multi-objective and multi-criteria optimizations constrain the feasible parameter space and facilitate parameter selection on the basis of performance trade-offs, i.e. Pareto fronts. However, as argued by Beven (2006), the mere mappings of optimum parameter sets after calibration are: "too simplistic, since they arbitrarily exclude many models that are very nearly as good as the optima". As argued by Andréassian et al. (2012), mathematically optimum parameter sets may be far different from hydrologically optimum parameter sets. These arguments simply imply that the parameter realization should include "sub-optimal" parameter sets as well.
Hence the question of how to retain model parameters that have a consistent model behavior in time deserves further investigations. A related challenge is how to establish the tradeoff between behavioral and non-behavioral parameters in a meaningful way.
With the attempt to address this question, we introduce a new approach for parameter identification including optimal and sub-optimal parameter sets which are more time consistent. The method is based on the calibration on different periods, and determines the parameter sets which perform best for all these sub-periods. As the selected parameter sets are evaluated in different periods, only the time consistent parameter sets are selected. The new method is applied to a case study in the Wark catchment in Luxembourg, using the lumped conceptual model HyMod, and compared with a calibration-validation approach with respect to parameter identifiability and performance.

Sub-period calibration
The aim of the sub-period calibration is to identify a time consistent parameterization for a certain model structure and data set. The approach involves two steps. First, the available input and output data sets are split into (ideally equallength) k sub-periods. These sub-periods and their lengths can be arbitrarily chosen. They can, for example, be months, seasons, years, or wetness conditions (e.g. Hartmann and Bárdossy, 2005;Seiller et al., 2012). Additionally, a number n of objective functions is defined. Each sub-period is then calibrated individually by sampling the parameter space and identifying the n-dimensional Pareto front for each sub-period. Therefore k n-dimensional calibration Pareto fronts (CPF) are obtained.
Subsequently, the parameter space is sampled to find parameter sets which minimize the distance to the k Pareto fronts. Distance measures can, for example, be the Euclidian distance to the Pareto front or any other measure which evaluates the performance of a parameter set relative to the Pareto front. This leads, for each parameter set, to k distances for each of the k sub-periods.
The goal is to find parameter sets that minimize the distances to all Pareto fronts. In order to achieve this, in a kdimensional space, we represent each parameter set by its distance to each of the k Pareto fronts. The Pareto front of this cloud of points represents the parameter sets with minimum distance to all Pareto fronts. We call it the minimum distance Pareto front (MDPF). It contains the parameter sets that have the most consistent performance in each sub-period.
The concept is illustrated in Fig. 1 with a schematic 2objective function, 2-sub-period example. The CPFs for the two sub-periods are shown in Fig. 1a. The circle represents a parameter set that is a Pareto member of the first subperiod (zero distance to the CPF 1 ); however, it does not perform well compared to the optimum in the second sub-period (large distance to the CPF 2 ). The parameter set represented by the triangle, although sub-optimal in the first sub-period, is a Pareto member in the second sub-period. The parameter set represented by the star, on the other hand, although not a Pareto member in both sub-periods, performs rather well overall (small distance to both the CPF 1 and CPF 2 ). Figure 1c plots the distance of each parameter set to the Pareto fronts. The circle has zero distance to CPF 1 , and large distance to CPF 2 . It does not belong to the MDPF. The triangle has zero distance to CPF 2 , and small distance to CPF 1 , indicating the edge of MDPF. The star has small distances to both CPF 1 and CPF 2 , and it belongs to the MDPF at some intermediate position.

Study area and data
The outlined methodology will, in the following, be illustrated with a case study using data from the Wark catchment in the Grand Duchy of Luxembourg. The catchment has an area of 82 km 2 with the catchment outlet located downstream of the town of Ettelbrück at the confluence with the Alzette River (49.85 • N, 6.10 • E). With an average precipitation of 850 mm yr −1 and an average potential evaporation of 650 mm yr −1 the average runoff is approximately 250 mm yr −1 . The geology in the northern part is dominated by schist while the southern part of the catchment is mostly underlain by sandstone and conglomerate. The dominant land uses are forest on hillslopes, agricultural land on plateaus and pastures in the valley bottoms. The elevation varies between 195 to 532 m a.s.l. with an average of 380 m a.s.l. The slope of the catchment varies between 0-200 %, with an average of 17 % (Gharari et al., 2011). The hydrological data include: discharge at the outlet of the Wark catchment, potential evaporation estimated by the Hamon equation (Hamon, 1961) with data measured at Findel (Luxembourg airport; Fenicia et al., 2008), and precipitation by three tipping bucket rain gauges. The data series has been discretized at 12-h resolution. For model evaluation, the period 1998-2009 was used. The meteorological conditions of each year are summarized in Table 1.

Hydrological model
The rainfall-runoff model applied to the Wark catchment is the lumped conceptual HyMod model (Wagener et al., 2001). HyMod was chosen for its low number of parameters while still maintaining adequate process representation including slow and fast responses together with a non-linear soil moisture component.
HyMod is characterized by five reservoirs, including the soil moisture reservoir (S M [L]), three linear reservoirs in series (S F 1 [L], S F 2 [L], S F 3 [L]) mimicking the fast runoff component, and one slow reservoir (S S 1 [L]). It has five parameters representing the maximum soil moisture storage capacity (S M,max (L)), the spatial variability of soil moisture (β [-]), the partitioning between fast reservoirs and slow reservoir (α[-]), as well as the timescales of the fast and slow reser- . Model equations were solved using the forward explicit Euler method using 12-h resolution time series.
represent precipitation, actual evaporation, potential evaporation and modeled runoff, respectively. The simulated runoff by the model is the summation of slow and fast components (Q m = Q S 1 + Q F 3 ). The water balance equations and constitutive relations are listed in Table 2 and the HyMod schematic illustration is depicted in Fig. 2.

Implementation of sub-period calibration
In the following, two case studies are presented where we compare performance and selected parameter sets by two approaches: (1) calibration over the entire length of a (sub-)period, which for sake of simplicity thereafter is referred to as standard calibration; and (2) calibration over decomposed sub-periods which is referred to as SuPer (subperiod) calibration. The case studies are designed to show the performance of SuPer calibration for parameter identification extracting information from sub-periods. The first case study intends to make the best use of limited available data by decomposing it into different sub-periods. The second case study intends to investigate how standard calibration might average out the characteristics of the sub-periods over the long time series.

Case study 1 -Short data series
The 3 Table 1 for the two years. Year 2002 is wet compared to 2003.

Case study 2 -Long data series
The available time series of the Wark catchment are divided into three parts.  tion (1998-2005) is assessed with 3 different approaches (graphically illustrated in Fig. 3): 1. Pareto optimal parameter sets (CPF 1998(CPF −2005 ).
2. Parameter sets within a pre-defined distance to the origin. In this case study, the parameter sets with a distance smaller than 1.05 times of the closest Pareto member to the origin.
3. Parameter sets contained within the quadrant determined by the single objective optima. parameter sets of each individual year of the entire calibration and validation periods (CPFs).
In the two case studies presented, HyMod was evaluated by two objective functions. These are the root mean square error of flows (I RMSE ) and the root mean square error of the logarithm of flows (I LRMSE ), which emphasize high flow and low flow respectively: where Q m,i and Q o,i are the modeled and observed flow for time step i, and N is the number of time steps. I RMSE was used instead of the Nash-Sutcliffe efficiency (I NSE ), as I NSE depends on the average of the observations, which may be different in different sub-periods (Schaefli and Gupta, 2007).
The relative performance of a parameter set is presented by calculating the Euclidian distance to the calibration Pareto front (CPF) for every individual sub-period. We assume that the two objective functions in this case study are in the same order of magnitude and therefore do not need normalization.
Parameter search was performed using the MOSCEM-UA algorithm (Vrugt et al., 2003) for both calibration Pareto fronts (CPFs) and minimum distance Pareto front (MDPF). SuPer calibration selects parameter sets with the best performance relative to CPFs; therefore MOSCEM-UA was chosen as it uses Zitzler strength Pareto ranking (Zitzler and Thiele, 1999), which allows robust estimation of CPF.

Case study 1 -Short data series
The calibration Pareto fronts, CPF 2002  It can be observed that model performance in periods outside the calibration period may differ significantly from the optimal performance. Even the standard calibration based on the entire time period (2002)(2003) deviates significantly from the optimal performance in each sub-period.
The parameter sets as identified with the SuPer calibration approach are shown by dots in Fig. 4 for the sub-periods 2002 and 2003 with the same color as the sub-period CPFs. As shown in Fig. 4, SuPer calibration picks parameter sets with relatively good performance in both sub-periods, excluding parameter sets that work well in one period, but very poorly in another. Moreover, Fig. 4 shows SuPer calibration emphasizes on the parameter sets with better performance regarding I LRMSE , indicating low flow can be modeled more consistent over time. The relative performance of parameter sets selected with SuPer calibration to CPFs of every sub-period (2002 and 2003) are illustrated in Fig. 5 (MDPF 2002(MDPF −2003. Figure 6 illustrates the distribution of the parameters of the fast and slow reservoirs (R F , R S ) selected by different

Case study 2 -Long data series
The comparison between standard calibration and SuPer calibration using each year as an individual sub-period over the period of 1998-2005 is illustrated in Fig. 7. The parameter sets obtained by SuPer calibration are different from those identified by the different selection rules (Pareto optimal, radial and quadrant see Sect. 3.3), but similarly to the previous case study, SuPer calibration tends to select parameter sets towards the I LRMSE objective function, indicating that low flow parameters are more consistent in time.
The distance to Pareto front (relative performance) of parameter sets obtained by different selection methods for behavioral parameters using standard calibration (Pareto, radial and quadrant rules) and those obtained by SuPer calibration are illustrated in Fig. 8 for the entire validation period (2006)(2007)(2008)(2009) as well as for every individual year (2006, 2007, 2008, and 2009). The Pareto front members (CPF 1998(CPF −2005  by quadrant or radial rules, similar to Pareto front members (CPF 1998(CPF −2005, the distance to Pareto fronts during the validation period varies significantly for sub-periods. However, for every individual year as well as for the entire validation periods, the 25/75th interquartile ranges of the parameter sets retained by SuPer calibration show significant overlap. Overall, the parameter sets selected by SuPer calibration tend to show more consistency over individual years (CPF 2006,..,CPF 2009 ), as well as over the entire validation period (CPF 2006(CPF −2009 ) compared to parameter sets retained by calibration over the entire period. The distributions of two characteristic parameters for the calibration (1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005) and validation (2006)(2007)(2008)(2009) periods (over the entire time series and every individual year) are shown for different parameter identification approaches in Fig. 9. The comparison between parameter distributions of sub-period Pareto members for the slow reservoir coefficients (R S ) shows that SuPer calibration is less affected by an anomaly of one sub-period (2001). As can be seen in Fig. 9a, the parameter distribution of standard calibration retained by the quadrant rule, emphasizes also on the values which are not optimal in sub-periods (1998, ..., 2005). Comparing in Fig. 9b, the distribution of the fast reservoir coefficient (R F ) obtained by standard calibration and retained by the quadrant rule with SuPer calibration, indicates that SuPer calibration selects parameter sets which overlap for every sub-period, while standard calibration over the entire calibration period (1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005) may cover values which do not have any over- lap with sub-periods (1998, ..., 2005). Figure 9a, b, indicating Pareto optimal members identified by standard calibration over the entire calibration period (1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005), shows a narrower range compared to parameter sets retained by Su-Per calibration. However the Pareto optimal distribution does not have any intersection with the parameter distribution of Pareto members of sub-period 2003 for the slow reservoir coefficient (R S ), meaning that they cannot perform optimally in that specific sub-period, while the distribution of the parameter set selected by SuPer calibration covers the distribution range of every sub-period. As was also illustrated in Fig. 8, the performance of Pareto optimal members, although confined to narrow ranges, may not perform optimally in every sub-period.

Discussion
SuPer calibration focuses on different parts of sub-period calibration Pareto fronts (CPFs), and helps to identify parameter sets with a time consistent behavior. These parameter sets may therefore be regarded as more "realistic" (Figs. 6 and 9). We attribute this to the fact that the processes identified by some objective functions (in the present case low flows) may have a more time consistent behavior than the processes represented by other objective functions (e.g. high flows).
SuPer calibration identifies parameter sets which perform optimally in sub-periods. The corresponding parameter ranges, although maybe not optimal over the entire time series, are the narrowest ranges considering optimal behavior in every sub-period (Figs. 8 and 9); therefore making it possible to obtain parameter distributions that are just dependent on data quality, sub-period characteristics and the selected hydrological model. Moreover, unlike common selection methods of behavioral parameter sets, which as highlighted by Efstratiadis and Koutsoyiannis (2010) require the specification of a subjective threshold for identifying behavioral parameter sets, SuPer calibration does not require this. The difference between parameter sets selected by calibration and SuPer calibration is illustrated graphically in Fig. 10. Our results have indicated that parameter sets selected with this approach may be grouped towards one or more objective function at the expense of others (Figs. 4 and 7). One might argue that SuPer calibration can be achieved by applying multi-objective calibration to different objective functions of different sub-periods in a single step. As an example, the first case study can be presented by introducing the 2 objective functions (I RMSE and I LRMSE ) for different sub-periods (2002 and 2003), therefore parameter identification will be formulated as a four-dimensional optimization practice with I RMSE 2002, I RMSE 2003, I LRMSE 2002and I LRMSE 2003 as objective functions. This approach would determine the trade-off of the model performance in different sub-periods. However, parameter identification is still based on the selection of Pareto front members and therefore the challenge of selecting behavioral parameter sets, or in this case time consistent parameter sets, from the Pareto front members remains the same as mentioned by Efstratiadis and Koutsoyiannis (2010) (Fig. 10).
SuPer calibration can also be used as a tool to analyze parameter time consistency in different sub-periods. By identifying non-time consistent parameters, SuPer calibration can be used as a diagnostic tool for identifying model structural deficiencies (see Clark et al., 2008). This approach can also provide information about the behavior of each parameter with respect to the hydrological condition of that period. As an example, the fast reservoir coefficient (R F ) shows higher values for the sub-period 2003 than for 2002. The years 2002 and 2003 are hydrologically distinct years (Table 1). This analysis, similar to the DYNIA , can help the modeler to identify a model deficiency and guide towards model improvements.
Although in this work we used hydrological years as the basis for sub-period analysis, periods can be selected in different ways. For example, they can be applied to storm events with different magnitude and return period to retain their characteristics during the calibration process. Sub-periods can also be defined as different parts of the flow duration curve (Westerberg et al., 2011) or can be used for calibration based on unusual events (Singh and Bárdossy, 2012;Krauße and Cullmann, 2012). Building on previous studies (e.g. Seiller et al., 2012), we support the conclusion that looking individually at different periods is an approach to extract more information from the data, rather than considering the data series as a whole.
Sampling strategies for the parameter space were not discussed and in principle different approaches can be used. As the method requires the identification of Pareto fronts, methods that sample the vicinity of the optimal parameter sets are preferable. The uncertainty in Pareto front identification may introduce uncertainty in the final selected parameter set selected by SuPer calibration. In this study MOSCEM-UA (Vrugt et al., 2003) was used to generate Pareto fronts in both steps of the procedure (creating CPFs and MDPFs).
Limitations of the presented SuPer calibration approach include, at least in its current implementation, that it cannot be applied to represent meaningful uncertainty estimates; the potential application of this approach in a Bayesian framework remains to be investigated.

Conclusions
In this paper a calibration approach based on splitting the available data sets into sub-periods has been proposed. The sub-period calibration approach makes use of calibration in individual sub-periods, and extracts parameter sets with a time consistent performance. Although this comes at the cost of potentially reduced performance during the calibration of each individual period, model parameterizations obtained by SuPer calibration perform consistently better in the validation period, which is what modelers actually should look for. The design of SuPer calibration is such that acceptable parameterizations have to perform consistently well when predicting any of the defined sub-periods, which is implicitly enforced in SuPer calibration, thus avoiding the need for explicit model validation. Furthermore, by the transformation of the traditional objective-space into a minimum Euclidean distance space, the need for subjective choices of parameter acceptance thresholds is avoided.
It should be again emphasized here that SuPer calibration is not a calibration algorithm, nor is it explicitly addressing parameter uncertainty. It is rather a more advanced method of model testing, building on traditional split sample tests and making more efficient use of available data. SuPer calibration can in principle be done with any number and type of objective functions (e.g. I NSE or I RMSE ) but also with any number and type of calibration criteria (e.g. only using runoff or using runoff and tracer dynamics). A Matlab function of the SuPer calibration approach can be obtained by personal communication with the lead author.