Is bias correction of regional climate model (RCM) simulations possible for non-stationary conditions?

Abstract. In hydrological climate-change impact studies, regional climate models (RCMs) are commonly used to transfer large-scale global climate model (GCM) data to smaller scales and to provide more detailed regional information. Due to systematic and random model errors, however, RCM simulations often show considerable deviations from observations. This has led to the development of a number of correction approaches that rely on the assumption that RCM errors do not change over time. It is in principle not possible to test whether this underlying assumption of error stationarity is actually fulfilled for future climate conditions. In this study, however, we demonstrate that it is possible to evaluate how well correction methods perform for conditions different from those used for calibration with the relatively simple differential split-sample test. For five Swedish catchments, precipitation and temperature simulations from 15 different RCMs driven by ERA40 (the 40 yr reanalysis product of the European Centre for Medium-Range Weather Forecasts (ECMWF)) were corrected with different commonly used bias correction methods. We then performed differential split-sample tests by dividing the data series into cold and warm as well as dry and wet years. This enabled us to cross-evaluate the performance of different correction procedures under systematically varying climate conditions. The differential split-sample test identified major differences in the ability of the applied correction methods to reduce model errors and to cope with non-stationary biases. More advanced correction methods performed better, whereas large deviations remained for climate model simulations corrected with simpler approaches. Therefore, we question the use of simple correction methods such as the widely used delta-change approach and linear transformation for RCM-based climate-change impact studies. Instead, we recommend using higher-skill correction methods such as distribution mapping.


Introduction
In hydrological climate-change impact studies, large-scale climate variables for current and future conditions are generally provided by global climate models (GCMs). To resolve processes and features relevant to hydrology at the catchment scale, regional climate models (RCMs) are commonly used to transfer coarse-resolution GCM data to a higher resolution. Although this provides more detailed regional information (Fowler et al., 2007; Grotch and MacCracken, 1991; IPCC, 2007; Maraun et al., 2010; Salathé Jr., 2003) for hydrological simulations, there is still a mismatch of scales especially for meso- and small-scale watersheds that are often captured by only one RCM grid cell. In addition, impact modelers are also facing a risk of improper RCM simulations (Christensen et al., 2008; Teutschbein and Seibert, 2010; Varis et al., 2004) due to systematic (i.e., biases) and random model errors. Mismatching scales in combination with such errors have led to many recently developed correction approaches (Chen et al., 2013; Johnson and Sharma, 2011; Maraun et al., 2010; Teutschbein and Seibert, 2012; Themeßl et al., 2011) that help impact modelers to cope with the various problems linked to biased RCM output.
These correction approaches can be classified according to their degree of complexity and include simple-to-apply methods such as linear transformations but also more advanced methods such as distribution mapping. The correction procedures usually identify possible differences between observed and simulated climate variables, which provide the basis for correcting both control and scenario RCM runs with a transformation algorithm. Although the correction of RCM climate variables can considerably improve hydrological simulations under current climate conditions (Chen et al., 2013; Teutschbein and Seibert, 2012), there is a major drawback: most methods follow the assumption of stationarity of model errors, which means that the correction algorithm and its parameterization for current climate conditions are assumed to also be valid for a time series of changed future climate conditions. Whether or not this condition is actually fulfilled for our future climate cannot be evaluated directly. This motivated us to address this issue and to test how well different correction methods perform for conditions different from those used for calibration. We applied the idea of a differential split-sample test, originally proposed by Klemeš (1986) for hydrological models, to analyze the performance of different correction methods for use with simulations under changed conditions. The testing presented here was done for different commonly used and rather simple correction procedures (Johnson and Sharma, 2011; Maraun et al., 2010; Teutschbein and Seibert, 2012) based on 15 RCM-simulated temperature and precipitation series for five mesoscale catchments in Sweden.
We would like to emphasize that this paper was written by local-impact modelers for impact modelers. Thus, our intention was not to compare all available methods to deal with biased RCM simulations and bias non-stationarities (this study is by no means exhaustive). We simply present one possible approach to analyze correction methods that are frequently used by impact modelers, especially on smaller scales. In addition, we outline the most common terminology related to climate models and biases, as some of these terms are not used consistently by impact modelers.

Terminology
The terms climate model bias and bias correction are frequently used in climate change and impact research. However, these terms are not always used consistently in the literature and in many studies it is not clear whether they are actually dealing with model biases or rather model errors, model shortcomings or other uncertainties. For clarification we, therefore, briefly summarize the most commonly used terminology. Allen et al. (2006) suggested differentiating between the terms model shortcomings, model errors and model biases.

Distinction between model shortcomings, model errors and model biases
Model shortcomings are based on the fact that some models do not represent some parts of the climate system or are not able to resolve certain processes. Model shortcomings might also originate from numerical issues causing problems such as the violation of mass conservation observed in several climate models (Liepert and Previdi, 2012). These deficiencies can generally be resolved by improving the model, for instance, through the introduction of new physical descriptions or through increased spatial and temporal resolution (Allen et al., 2006). Model shortcomings can lead to model errors.
Model errors can be caused by initial and boundary conditions, parameterizations, physical and numerical formulations, lacking knowledge of external factors or general model shortcomings (Deser et al., 2012; Eden et al., 2012; Jung, 2005; Ménard, 2010; Palmer et al., 2005). Model errors can appear as unsystematic (random) and systematic errors (Ménard, 2010). Mathematically, the time-dependent model error (e_t) is the difference between the model simulation (s_t) and the best estimate of the truth (o_t), i.e., observations (Eq. 1, modified from Jung, 2005):

e_t = s_t − o_t. (1)
Unsystematic (random) model errors cause random variations in model simulations. They have their origin in the internal variability of climate models, i.e., in hidden non-linearities and complex (random) dynamical processes (Allen et al., 2006; Deser et al., 2012; Majda and Gershgorin, 2010; Ménard, 2010). This internal variability is associated with a model's degree of freedom to develop its own dynamic feedback mechanisms (Christensen et al., 2001). For shorter (decadal) timescales, these random errors (internal variability) are the dominant sources of uncertainty in model simulations (Hawkins and Sutton, 2011). Systematic model errors, also commonly termed model biases, produce predictably inaccurate (i.e., biased) model simulations. They are defined as systematic differences between model simulations (s_t) and observations (o_t), which is for a certain diagnostic d given by Eq. (2) (Jung, 2005):

d̂_SE = d(s_t) − d(o_t), (2)

where d̂_SE stands for the estimated systematic error of the diagnostic (d), with the hat in d̂ indicating that this is an estimate of the true value. The diagnostic d can be the mean value, but can also address other aspects of the model error.
Systematic model errors can originate either from inadequately constrained parameters or from model structures that are unable to describe the physical process of interest (Allen et al., 2006). These systematic model errors, or model biases, are generally the most dominant source of uncertainty for longer (centennial) timescales (Hawkins and Sutton, 2011).

Climate model bias: definition and detection
Model bias is defined as a systematic distortion of statistical findings from the expected value. According to this definition, climate model biases describe systematic climate model errors (see definition above) only. It should, however, be noted that the term bias in the context of climate change impact studies is often misleadingly used to describe model errors in general (i.e., a combination of both systematic and random errors). Biases in climate model simulations are commonly detected by validation (i.e., comparison) with observations (Eq. 2), where the observations are considered to be "true" and unbiased (Jung, 2005; Ménard, 2010). Jung (2005) highlights the mean (µ) as one of the simplest and most widely used diagnostics to detect climate model biases, so that Eq. (2) can be modified as follows:

µ̂_SE = µ_s − µ_o = (1/n) Σ_{t=1}^{n} (s_t − o_t), (3)

where µ̂_SE is the estimated mean systematic error over the time period. One should be aware that µ̂_SE can be zero (i.e., detecting no systematic error) due to error cancelation, although simulations (s_t) and observations (o_t) might be characterized by different variability or distributions. This emphasizes the need for considering systematic errors for other diagnostics, i.e., replacing µ in Eq. (3) by other statistics, amongst others the standard deviation (σ), 10th/90th percentiles (X_10/X_90) or probabilities (P). The detection and estimation of climate model biases by comparing model simulations to observations is, however, not solely limited to Eqs. (2) and (3). For example, Hanna (1993) and Chang and Hanna (2004) recommended using the fractional bias (FB), the geometric mean bias (MG), the normalized mean square error (NMSE) and the geometric variance (VG). These additional performance measures are mentioned here to show further possibilities of analyzing climate model biases. Depending on the focus of a climate change impact study, other measures can be defined as well (Chang and Hanna, 2004). As each of these measures has advantages and disadvantages (for more information see Chang and Hanna, 2004), any bias analysis should always be based on multiple diagnostics.
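As a minimal illustration of Eq. (3) and the error-cancelation caveat, the following sketch (with purely illustrative data, not values from this study) computes the mean systematic error and the fractional bias; identical means hide an obvious mismatch in variability:

```python
import numpy as np

# Illustrative data only: "observed" and "simulated" daily values
obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sim = np.array([3.0, 3.0, 3.0, 3.0, 3.0])

def mean_systematic_error(sim, obs):
    """Estimated mean systematic error (Eq. 3): mu_hat_SE = mu_s - mu_o."""
    return np.mean(sim) - np.mean(obs)

def fractional_bias(sim, obs):
    """Fractional bias FB (Chang and Hanna, 2004):
    FB = 2 * (mean(obs) - mean(sim)) / (mean(obs) + mean(sim))."""
    return 2.0 * (np.mean(obs) - np.mean(sim)) / (np.mean(obs) + np.mean(sim))

# Error cancelation: identical means suggest "no bias" ...
print(mean_systematic_error(sim, obs))   # 0.0
# ... while a second diagnostic (sigma) reveals a clear mismatch
print(np.std(sim) - np.std(obs))         # non-zero
```

This is exactly why the text recommends basing any bias analysis on multiple diagnostics rather than the mean alone.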

Bias correction methods
RCM simulations are typically affected by systematic and random model errors. Misestimated climate variables in general, incorrect seasonal variations of precipitation (Christensen et al., 2008; Terink et al., 2009; Teutschbein and Seibert, 2010) and the simulation of too many drizzle (i.e., low-intensity rain) days (Ines and Hansen, 2006) are just a few examples of common systematic errors (biases). In other words, climate variables simulated by individual RCMs often do not agree with observed time series (Fig. 1). This poses a problem for using these simulations as input data for hydrological impact studies. One possible solution is to use an ensemble of RCM simulations (Giorgi, 2006), as ensembles have two advantages: (1) the spread of individual ensemble members covers a more realistic range of uncertainty and (2) the ensemble median may fit observations better, which is especially true for temperature simulations (Fig. 1, top). However, for precipitation simulations even the ensemble median often deviates considerably from observations and is not able to capture the variability in the observations (Fig. 1, bottom). This shows that it is not enough to only employ an RCM ensemble; additional correction procedures are needed. Several bias correction methods were already applied in weather forecasting under the name model output statistics (MOS) about four decades ago (Glahn and Lowry, 1972; Klein and Glahn, 1974). In the context of correcting RCM output, however, bias correction is today a controversial subject (Ehret et al., 2012; Muerth et al., 2013): despite their advantageous ability to reduce errors in climate model output, most correction methods are criticized for diminishing the advantages of climate models (Ehret et al., 2012) and for adding little value in a complex modeling chain when other sources of uncertainty are considered (Muerth et al., 2013).
Typical correction approaches aim at correcting the systematic error (bias) in RCM-simulated climate variables by employing a transformation algorithm and are therefore called bias correction methods. The concept is based on the identification of possible biases between observed and simulated climate variables, which is the starting point for correcting both control and scenario RCM runs. It should be noted that there is a risk of not only correcting systematic errors (biases) but also unintentionally modifying simulations due to unsystematic (random) model errors (Maraun et al., 2010).
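The transformation idea can be sketched as follows. This is a minimal illustration of the simplest such algorithm, a linear transformation (multiplicative for precipitation, additive for temperature), not the exact code used in this study:

```python
import numpy as np

def linear_scaling(obs_ctrl, sim_ctrl, sim_scen, variable="precipitation"):
    """Linear transformation sketch: derive a correction from the control
    period (observations vs. RCM control run) and apply the same
    transformation to the scenario run.

    Multiplicative scaling is conventional for precipitation,
    additive scaling for temperature."""
    if variable == "precipitation":
        factor = np.mean(obs_ctrl) / np.mean(sim_ctrl)
        return sim_scen * factor
    offset = np.mean(obs_ctrl) - np.mean(sim_ctrl)
    return sim_scen + offset
```

By construction, correcting the control run itself reproduces the observed mean exactly; the stationarity assumption discussed below is that the same factor or offset remains valid for the scenario run.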

Stationarity assumption of model bias
A common assumption of most bias correction methods is stationarity, or time invariance, of the model errors. This implies that the empirical relationships in the correction algorithm and its parameterization for current climate conditions do not change over time and are also valid for future conditions. This assumption is, however, likely not met under changing climate conditions (Ehret et al., 2012; Maraun, 2012; Maraun et al., 2010; Vannitsem, 2008). In fact, Maraun (2012) was able to identify and distinguish between different types of bias changes, which are briefly described in Table 1. This highlights that there are potential issues when correction methods are applied to adjust RCM simulations.

Study catchments
The analysis in this study was performed for five mesoscale catchments (Fig. 2) with areas ranging from 147 to 293 km², as this scale is relevant for local climate change impacts (e.g., local heavy rainfall events, flooding, permafrost melt or droughts). These catchments all fall below the standard RCM grid cell size of approximately 25 km × 25 km and are, therefore, potentially affected by the scaling issue. The chosen catchments represent different typical Swedish climatic conditions and land-use types (Table 2). Continuous temperature and precipitation measurements for all five catchments were available for the period 1961-2000.

Data
Daily temperature and precipitation measurements for the period 1961-2000 were taken from the spatially interpolated 4 km × 4 km national grid PTHBV (Johansson, 2002) provided by the Swedish Meteorological and Hydrological Institute (SMHI). Climate simulations were obtained from the ENSEMBLES project (Van der Linden and Mitchell, 2009): we used daily precipitation and temperature series for the period 1961-2000 simulated by 15 RCMs (Table 3), which were all driven by ERA40 data (the 40 yr reanalysis product of the European Centre for Medium-Range Weather Forecasts (ECMWF)). The chosen RCMs have a resolution of 25 km and, thus, the area of a single grid cell clearly exceeds the size of the study catchments. We chose to average precipitation and temperature values from the RCM grid cell with center coordinates closest to the center of the catchment and its eight neighboring grid cells.
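The 3 × 3 grid-cell averaging can be sketched as follows; the array layout and index convention are assumptions for illustration, not taken from the study's actual processing chain:

```python
import numpy as np

def catchment_average(rcm_field, row, col):
    """Average an RCM variable over the grid cell whose center is closest
    to the catchment center plus its eight neighbors (a 3 x 3 window).

    rcm_field: array of shape (time, ny, nx); (row, col) is the center
    cell, assumed not to lie on the domain edge."""
    window = rcm_field[:, row - 1:row + 2, col - 1:col + 2]
    return window.mean(axis=(1, 2))
```

Averaging over nine cells smooths single-cell noise at the cost of some spatial detail, a common compromise when a catchment is smaller than one grid cell.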

Bias correction methods
In addition to the original (i.e., uncorrected) RCM output data, we applied and analyzed the following six bias correction methods (Table 4) to adjust RCM simulations: (1) linear transformation, (2) local intensity scaling (LOCI), (3) power transformation, (4) variance scaling, (5) distribution mapping and (6) the delta-change approach. Furthermore, a precipitation threshold was used in combination with other bias correction procedures (namely LOCI, power transformation and distribution mapping), but was not considered an appropriate "stand-alone" method. More detailed descriptions of these methods can be found in Teutschbein and Seibert (2012), Gudmundsson et al. (2012), Johnson and Sharma (2011) and the original method publications provided in Table 4. All bias correction methods were applied to daily values on a monthly basis as described by Teutschbein and Seibert (2012).
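As an illustration of the most advanced of these methods, the sketch below shows distribution mapping with empirical quantiles. It is a minimal illustration only; practical implementations (e.g., Teutschbein and Seibert, 2012) typically fit theoretical distributions rather than using the raw empirical CDFs:

```python
import numpy as np

def distribution_map(obs_ctrl, sim_ctrl, sim_scen):
    """Empirical distribution (quantile) mapping sketch: each scenario
    value is assigned its non-exceedance probability under the
    control-period simulation, and that probability is mapped back
    through the observed distribution."""
    sim_sorted = np.sort(sim_ctrl)
    # non-exceedance probability of each scenario value w.r.t. the
    # control-period simulated distribution
    probs = np.searchsorted(sim_sorted, sim_scen, side="right") / len(sim_sorted)
    probs = np.clip(probs, 0.0, 1.0)
    # invert the observed empirical CDF at those probabilities
    return np.quantile(np.sort(obs_ctrl), probs)
```

Because the full distribution is matched, this approach adjusts not only the mean but also the variance, percentiles and (for precipitation) wet-day intensities, which is consistent with its strong performance reported below.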

The overall bias might be a mixture of several underlying biases that depend on different weather types. As the relative occurrence of these weather types might change in the future, the associated bias changes are defined as mixture-related apparent bias changes (MABC; see Table 1).

The above-mentioned methods were chosen based on their frequent application in climate change impact studies. Although some of these methods might seem outdated from a climate modeler's perspective, they are all still commonly used by impact modelers, especially at smaller scales, partly because they are relatively simple to apply.

Testing the stationarity assumption
To test how well bias correction methods work for conditions different from those that they were calibrated to, we employed an approach that is based on one of the operational testing methods presented by Klemeš (1986). Klemeš (1986) presented two methods of interest for systematic testing of hydrological model transposability: split-sample testing (SST) for stationary conditions and differential split-sample testing (DSST) for non-stationary conditions. SST implies the splitting of an available data record into two (preferably equally sized) segments in order to use one as calibration and one as validation period. DSST, however, should, according to Klemeš (1986), be used under changing conditions. The first step of this test is the identification of two periods in which the climate variable of interest takes different values, for instance a warm versus a cold or a wet versus a dry period. The model is then calibrated on the period with one condition and validated on the period with the other condition, which allows analyzing the model's ability to perform under shifting conditions. SST can automatically transform into DSST if the two segments by nature show substantial differences in their conditions (Klemeš, 1986).
To test the ability of different correction procedures to work reliably for changed climate conditions, we applied the DSST proposed by Klemeš (1986), which was originally intended for hydrological models. Both SST and DSST are seldom used to evaluate bias correction methods. We are aware of only a few other studies using such a test: Bennett et al. (2010) and Terink et al. (2010) evaluated bias correction methods using SST with two different time periods for which observations were available. A major limitation of this approach is that the periods should be long enough to represent natural climate variability satisfactorily (Bennett et al., 2010). Furthermore, unless the two periods differ in their conditions, such a test cannot reveal how well a method transfers to a changed climate; DSST itself has so far mainly been applied to hydrological models (Li et al., 2012; Seiller et al., 2012; Tramblay et al., 2013). The available 40 yr period 1961-2000 was separated into two 20 yr subsets with different climate conditions, one representing current climate and the other one future climate. Our available 40 yr period was not long enough to show a considerable trend in precipitation or temperature data (Fig. 3a), so we instead constructed the two subsets for each catchment as follows.
1. All years were sorted ascending according to their annual amount of observed precipitation (Fig. 3b).
2. The first 20 yr of the sorted data (i.e., the driest years) were included in the first subset and the last 20 yr (i.e., wettest years) in the second subset for the precipitation-bias correction assessment.
3. Each RCM-simulated precipitation time series was rearranged to match the annual order of sorted observed precipitation data and thereafter split into two subsets as above.
The same procedure was used for constructing two subsets for the evaluation of temperature-bias correction methods. Ranking all years according to their observed annual mean temperature resulted in two series, where the first consisted of the 20 coldest years and the second of the 20 warmest years (Fig. 3c). Again, each RCM-simulated temperature time series was rearranged in the same annual order as the sorted observed temperature data and thereafter split into two subsets. This procedure resulted in series in which the years were not consecutive and the two subsets consisted of different years for the evaluation of precipitation (Fig. 3b) and temperature-bias correction methods (Fig. 3c). Note that the procedure ensured that the two subsets included the same years for each RCM simulation and the observations. To fully apply DSST, we performed a twofold cross-validation (Fig. 3b and c). First, all correction methods were calibrated based on the first subset of years and then evaluated for the second subset of years (case 1). In addition, the two periods were switched and the correction methods were calibrated based on the second subset and validated using the first subset (case 2). This way, DSST allowed the evaluation of bias correction methods under challenging conditions, namely considerably varying climate conditions for calibration and validation (Coron et al., 2012).
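The subset construction described above (sorting years on the observed annual values and rearranging the simulated series in the same order) can be sketched as follows; variable names and the even split are illustrative:

```python
import numpy as np

def dsst_subsets(obs_annual, sim_annual):
    """Split years into two equally sized subsets (e.g., driest vs.
    wettest, or coldest vs. warmest) by sorting on the observed annual
    values. The simulated series is rearranged in the same annual order,
    so both subsets contain the same calendar years for observations
    and simulations."""
    order = np.argsort(obs_annual)          # ascending: driest/coldest first
    n_half = len(order) // 2
    low_years, high_years = order[:n_half], order[n_half:]
    return (sim_annual[low_years], sim_annual[high_years],
            obs_annual[low_years], obs_annual[high_years])
```

Calibrating a correction method on one subset and validating it on the other (and then switching the two) yields the twofold cross-validation used in the DSST.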

Evaluation of bias correction methods
Different diagnostics (Table 5) were used to detect and estimate model errors in uncorrected RCM simulations according to Eq. (2) for both the calibration and validation period of temperature as well as precipitation. Then, the same calculations were done to analyze the performance of each bias correction method. This implies that we (1) studied whether model errors were still present for the calibration data, (2) estimated the amount of model errors present for the validation data and (3) assessed the model error growth, i.e., the absolute difference between model errors in validation and calibration data. The model error growth measure allowed studying the transferability of a bias correction method to different climatic conditions.
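Steps (1)-(3) can be sketched as follows; the chosen diagnostics are a small illustrative subset of those in Table 5, not the full set used in the study:

```python
import numpy as np

def errors(sim, obs):
    """Model errors for a few example diagnostics (Table 5 style):
    difference between simulated and observed statistics."""
    return {"mean": np.mean(sim) - np.mean(obs),
            "std": np.std(sim) - np.std(obs),
            "p90": np.percentile(sim, 90) - np.percentile(obs, 90)}

def error_growth(sim_cal, obs_cal, sim_val, obs_val):
    """Model error growth: absolute difference between the model error
    in the validation subset and in the calibration subset, per
    diagnostic. Large growth indicates poor transferability."""
    e_cal, e_val = errors(sim_cal, obs_cal), errors(sim_val, obs_val)
    return {k: abs(e_val[k] - e_cal[k]) for k in e_cal}
```

A method that fits the calibration subset perfectly but degrades on the validation subset (as the delta-change approach does below) shows up here as zero calibration error combined with large error growth.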
As the above diagnostics were applied to the entire data series of a subset, they give no information about seasonal differences. Thus, we additionally included an analysis of the four seasons: winter (DJF), spring (MAM), summer (JJA) and autumn (SON).

DSST-induced climate change signal
The two designed subsets used in the conducted DSST featured different climate conditions and were clearly non-stationary. In this study, the differences between the two subsets were within a range of 6-30 % for precipitation (Fig. 4, left) and 0.9-1.7 °C for temperature (Fig. 4, right). These values are in the same order of magnitude as the climate change signals for Sweden that are projected by the ENSEMBLES project (Van der Linden and Mitchell, 2009) and the ClimateCost project (Christensen et al., 2011).
In direct comparison to observations, the RCMs tended to underestimate the mean climate change signals for both precipitation (Fig. 4, left) and temperature (Fig. 4, right), which was most likely directly related to an underestimation of interannual variability by the RCMs.

RCM precipitation: model errors
The calculated precipitation model errors were displayed in gridded plots as a function of bias correction method (x axis) and catchment location (y axis), separately for each statistical diagnostic and separately for the calibration and validation period. Considering the case-1 evaluation procedure from dry to wet years (Fig. 5), all precipitation-bias correction methods resulted in good estimates of the mean (µ), showing only small model errors during the designed calibration period (Fig. 5, upper left panel). Analyzing other statistical diagnostics, however, showed considerable differences between the methods already during the calibration period (Fig. 5, left panel column). Raw RCM simulations generally had large model errors. Linear transformation was not able to considerably improve statistical properties other than µ (Fig. 5, left panel column). For the standard deviation (σ), 90th percentiles (X_90) and maximum 5 day precipitation (Precip_5max), power transformation and distribution mapping seemed to work best. The same could be observed for the probability of wet days (P_wet) and the intensity of wet days (i_wet), which were in addition also most correct after applying LOCI. The delta-change approach always performed perfectly during calibration by definition.
The overall model error pattern was fairly similar for the calibration and validation period. The major difference was that the model error during the validation period increased considerably (Fig. 5, central panel column, shown as darker blue shading). This was also supported by the calculated model error growth (Fig. 5, right panel column). Linear transformation tended to have a slightly larger model error growth, whereas distribution mapping had the least. Interestingly, the delta-change approach, despite its illusory perfect fit during calibration, was outperformed by other methods during validation: delta-change corrected precipitation showed large deviations in µ as well as P_wet and i_wet. Consequently, the delta-change method showed the strongest model error growth. The case-2 evaluation procedure from wet to dry years (Fig. 6) mostly confirmed the results of the case-1 evaluation.

RCM precipitation: seasonal analysis
The analysis of seasonally averaged raw and corrected RCM-simulated precipitation for the validation period revealed only a weak pattern in terms of the influence of different correction methods, seasons and catchments on model errors (Table 6). The mean absolute error (MAE) was generally large for raw RCM-simulated precipitation, except for autumn. During autumn, which is characterized by medium to high monthly precipitation, raw RCM simulations were relatively close to observations and the correction methods were not able to provide further enhancement (except for catchment 2, Storbäcken). During all other seasons the correction methods were generally able to improve raw RCM simulations (except for catchment 5, Rönne Å, in winter and catchment 2, Storbäcken, in spring). Power transformation and distribution mapping performed better than other methods in winter, summer and autumn, which are seasons characterized by somewhat higher monthly precipitation. On the other hand, linear scaling and LOCI generally performed better in spring, a season with lower monthly precipitation. Furthermore, bias correction methods worked better for catchments in south central Sweden (i.e., catchments 3, Vattholmaån, and 4, Brusaån), which are generally drier than the other three catchments.

RCM temperature: model errors
The same type of gridded plots was created to demonstrate the calculated temperature model errors: in terms of the case-1 evaluation procedure from cold to warm years (Fig. 7), all temperature-bias correction methods resulted in very good µ estimates during the designed calibration period (Fig. 7, upper left panel). Substantial differences between the correction methods became apparent with the help of other statistical diagnostics (Fig. 7, left panel column): linear transformation was the only method not able to sufficiently correct σ. Variance-scaled and distribution-mapped RCM temperatures both had the most correct X_10 and X_90 during the control period. Again, the delta-change approach is perfect (i.e., model-error-free) by definition.
During validation with warmer years (Fig. 7, central panel column), the bias correction methods performed somewhat differently. Especially variance scaling showed larger model errors in σ and X_10 compared to the other methods. Distribution mapping, however, had relatively low µ, σ, X_10 and X_90 model errors. The model error growth (Fig. 7, right panel column) identified variance scaling and the delta-change method as the two approaches with the largest model error increase. The delta-change approach again had one of the largest model errors during validation and, thus, the strongest model error growth. Overall, a north-south gradient became apparent: raw and adjusted RCM temperatures were characterized by a larger model error and a stronger model error growth for the northern catchments with cold climate conditions compared to the catchments with a warmer climate.
These findings were confirmed by the case-2 evaluation procedure from warm to cold years (Fig. 8). The results were essentially the same as for the case-1 evaluation, except that linear transformation performed worse and showed larger model errors during validation (Fig. 8, central panel column). Moreover, the north-south gradient was even more pronounced in all panels for validation and model error growth (Fig. 8, central and right panel column).

RCM temperature: seasonal analysis
The evaluation of seasonally averaged raw and corrected RCM-simulated temperature showed clear differences between correction methods, seasons and catchments for the validation period (Table 7). The mean absolute error (MAE) was generally large for raw RCM-simulated temperature. A north-south gradient was visible, with northern catchments showing larger model errors in raw and corrected temperature. Furthermore, there was a clear seasonal difference: winter (cold season) temperatures were much more flawed than temperatures in all other (warmer) seasons. Distribution mapping consistently showed the lowest MAE values.
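The seasonal diagnostic used above can be made concrete with a minimal sketch: the MAE for one season is obtained by masking a daily series to that season's months. This is an illustrative toy example, not the study's actual evaluation code:

```python
import numpy as np

def seasonal_mae(simulated, observed, months, season_months):
    """Mean absolute error of simulated vs. observed values,
    restricted to the calendar months belonging to one season."""
    simulated = np.asarray(simulated, dtype=float)
    observed = np.asarray(observed, dtype=float)
    mask = np.isin(np.asarray(months), season_months)
    return np.mean(np.abs(simulated[mask] - observed[mask]))

# Toy daily temperature values with their calendar months;
# real series would come from RCM output and station records.
sim = np.array([-12.0, -8.0, 15.0, 18.0])
obs = np.array([-15.0, -10.0, 14.0, 18.5])
mon = np.array([1, 2, 7, 8])

# Winter (DJF) MAE considers only the January and February values here.
winter_mae = seasonal_mae(sim, obs, mon, season_months=(12, 1, 2))
```

Computing the MAE per season (rather than over the whole year) is what exposes the large winter errors reported in Table 7.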

Overall performance of bias correction methods
To obtain information on the overall performance of each bias correction method and its transferability to different climate conditions, we first normalized the model errors of raw and corrected RCM simulations to bring them to a common, comparable scale. These normalized errors were then averaged over the different diagnostics, over both the case-1 and case-2 evaluations and over both subsets (calibration and validation). Thus, we obtained a combination of the results in Figs. 5 and 6 for precipitation and in Figs. 7 and 8 for temperature.
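The normalization step can be sketched as follows. The exact scaling used in the study is not reproduced here; this sketch assumes a simple division by the largest error per diagnostic, one common way to make diagnostics with different units (e.g. µ in °C, percentiles in mm) comparable before averaging:

```python
import numpy as np

def normalize_errors(errors):
    """Scale the absolute model errors of several correction methods
    for one diagnostic to [0, 1] by dividing by the largest error,
    so that diagnostics with different units become comparable."""
    errors = np.abs(np.asarray(errors, dtype=float))
    max_err = errors.max()
    return errors / max_err if max_err > 0 else errors

# Errors of four hypothetical correction methods for two diagnostics
# (toy numbers in different units).
mu_errors = np.array([0.1, 0.4, 0.2, 0.8])       # e.g. mean bias in °C
sigma_errors = np.array([5.0, 1.0, 2.0, 4.0])    # e.g. std. dev. bias

# Average the normalized errors across diagnostics to rank the methods.
combined = np.mean(
    [normalize_errors(mu_errors), normalize_errors(sigma_errors)], axis=0
)
```

The same averaging would then be extended over the case-1/case-2 evaluations and the calibration/validation subsets to yield one overall score per method, as in Fig. 9.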
The obtained signal was reasonably clear for both precipitation (Fig. 9, left) and temperature (Fig. 9, right): raw RCM simulations had the largest model errors. In general, the more advanced the algorithm of a bias correction method, the smaller the model error remaining after correction (Fig. 9, from left to right on the x axis). This means that linear transformation had the largest and distribution mapping the smallest model errors. Furthermore, the simple delta-change approach resulted in relatively large model errors.

Discussion
Based on all findings in this study, distribution mapping showed the best overall performance and transferability to potentially changed climate conditions, as it was able to correct statistical moments beyond the mean and standard deviation. LOCI and power transformation (both for precipitation) as well as variance scaling (for temperature) performed moderately well. It should be noted that variance scaling is not advisable in general, as it is based on the invalid assumption that all local variability is related to larger-scale variability and, furthermore, tends to increase the mean square errors of uncorrected data (von Storch, 1999). Linear transformation and the delta-change method were the least able to correct for overall model errors in the validation period.
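To make the distribution-mapping idea concrete, the following is a minimal sketch of empirical quantile mapping, one common implementation of distribution mapping (operational versions often fit parametric distributions instead, e.g. a gamma distribution for precipitation). Each raw model value is replaced by the observed value at the same empirical quantile of the calibration period, so that higher moments and percentiles are adjusted, not just the mean:

```python
import numpy as np

def distribution_map(raw_cal, obs_cal, raw_new):
    """Empirical quantile mapping: map each new raw model value to the
    observed value at the same empirical non-exceedance probability,
    both estimated from the calibration period. Minimal sketch only."""
    raw_cal = np.sort(np.asarray(raw_cal, dtype=float))
    obs_cal = np.sort(np.asarray(obs_cal, dtype=float))
    # Empirical non-exceedance probability of each new raw value
    # within the calibration-period model distribution.
    probs = np.searchsorted(raw_cal, raw_new, side="right") / len(raw_cal)
    probs = np.clip(probs, 0.0, 1.0)
    # Read off the observed distribution at those probabilities.
    quantile_axis = np.linspace(0.0, 1.0, len(obs_cal))
    return np.interp(probs, quantile_axis, obs_cal)

# Toy example: the model is systematically too low by a factor of two,
# so a raw value of 2.0 (the model median) maps to the observed median.
corrected = distribution_map([1, 2, 3, 4], [2, 4, 6, 8], [2.0])
```

Because the whole distribution is matched, percentile diagnostics such as X₁₀ and X₉₀ are corrected along with µ and σ, which is consistent with the method's strong performance reported above.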
In this study, we did not try to answer the "main question [. . . ], whether and when the application of bias correction methods [. . . ] is justified or not" (Ehret et al., 2012). Bias correction methods are often criticized for diminishing the advantages of climate models, but even with today's much more advanced climate models, bias correction is often unavoidable for climate-change impact studies, as uncorrected RCM simulations are a source of large uncertainties and would consequently hamper subsequent impact simulations. However, one needs to be aware of several problematic aspects of bias correction methods (Ehret et al., 2012): physical causes of model errors are not taken into account and, thus, a proper physical foundation is missing; spatiotemporal field consistency and relations between climate variables are modified (Ehret et al., 2012); conservation principles are not met (Ehret et al., 2012); feedback mechanisms are neglected (Ehret et al., 2012); the stationarity (time invariance) assumption is likely not met under changing climate conditions (Ehret et al., 2012; Maraun, 2012; Maraun et al., 2010; Vannitsem, 2008); variability ranges might be reduced without physical justification (Ehret et al., 2012); the climate-change signal might be altered (Dosio et al., 2012; Hagemann et al., 2011); the choice of a correction technique is an additional source of uncertainty (Teutschbein and Seibert, 2012; Teutschbein et al., 2011); the added value of bias correction methods is questionable in a complex modeling chain with other major sources of uncertainty (Muerth et al., 2013); impacts of bias correction methods and related uncertainties are not communicated to end-users (Ehret et al., 2012); and effects of unsystematic (random) model errors could by mistake be attributed to systematic errors and, therefore, accidentally be modified by correction methods (Maraun et al., 2010).
For current climate conditions, Teutschbein and Seibert (2012) demonstrated that most of the applied correction approaches are able to improve raw RCM data to some extent, but that there are considerable differences in the quality of adjusted RCM temperature and precipitation. In this paper, we showed how the DSST can be used to analyze the transferability of correction approaches to different climate conditions. Using the DSST allowed us to identify clear differences between reproducing conditions similar to and conditions different from those the correction approaches were calibrated to. These differences are an indicator of improper algorithm and parameter transfers. By using the coldest/driest and warmest/wettest years to separate the periods, we deliberately pushed the correction methods to their limits, because we believe that reliable simulations of the more extreme years are essential for certain impact assessments, such as drought and flood modeling under future climate conditions. To test the transferability of correction approaches under a less extreme climate-change signal, it is also possible to use more moderate extrapolations, for instance by applying the generalized split-sample test (GSST) proposed by Coron et al. (2012).
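The splitting step of the DSST can be sketched as follows; `differential_split` is a hypothetical helper that sorts years by observed annual mean temperature and assigns the coldest half to one subset and the warmest half to the other (an analogous split by annual precipitation totals yields the dry/wet subsets):

```python
import numpy as np

def differential_split(years, annual_means, fraction=0.5):
    """Differential split-sample test, splitting step only: divide the
    available years into the coldest and warmest subsets based on
    observed annual mean temperature. The correction is then calibrated
    on one subset and validated on the other (and vice versa)."""
    order = np.argsort(annual_means)          # indices, coldest first
    n_cold = int(len(years) * fraction)
    years = np.asarray(years)
    cold_years = years[order[:n_cold]]
    warm_years = years[order[n_cold:]]
    return cold_years, warm_years

# Toy example: ten years with made-up annual mean temperatures in °C.
years = np.arange(1961, 1971)
means = np.array([4.1, 5.3, 3.8, 6.0, 4.9, 5.7, 4.4, 6.2, 5.1, 4.7])
cold, warm = differential_split(years, means)
```

Calibrating on `cold` and validating on `warm` corresponds to the case-1 evaluation; swapping the subsets gives case 2, so both transfer directions are cross-evaluated from observations alone.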
We would like to emphasize that the DSST is a rather simple approach to demonstrate algorithm transferability to different conditions. Another feasible approach to identify bias non-stationarities is to use a pseudo-reality, which is based on using one RCM simulation as reference and performing an inter-model cross-validation with the other RCM simulations (Maraun, 2012; Räisänen and Räty, 2012). Such a pseudo-reality can identify potential issues with bias correction methods, but it is a rather complex exercise, is expensive in terms of computing power, and does not necessarily identify where correction approaches might be successful when compared to real observations (Maraun, 2012).

Conclusions
The choice of bias correction algorithm plays a large role in assessing hydrological climate-change impacts. For current conditions, we could easily limit this choice to the method that performed best. For simulations of future climate this is more difficult, and the fundamental question is how transferable the different methods are. The differential split-sample test suggested here is a simple yet powerful tool to evaluate this. Based on time series of observations and RCM simulations of the current climate (no future simulations are necessary), it is possible to create two subsets of data with considerably different climate conditions and non-stationary model errors. Thus, the transferability of different bias correction methods can be tested under non-stationary conditions.
The delta-change approach and the linear transformation are the two most common transfer methods and have been widely used (Gellens and Roulin, 1998; Graham et al., 2007a, b; Lettenmaier et al., 1999; Middelkoop et al., 2001; Moore et al., 2008; Shabalova et al., 2003) because they are straightforward and easy to implement. Yet, our validation of these correction approaches with the differential split-sample test showed that these two methods result in large deviations and are the least reliable under changed conditions. These findings remain to be confirmed for other catchments and other geographic regions, but based on the findings in this study we question the use of the delta-change method or the linear transformation to correct RCM scenarios of future conditions for climate-change impact studies. Instead, we recommend distribution mapping as the best-performing correction method, because it was best able to cope with non-stationary conditions. However, regardless of the method used, our results demonstrate that the use of bias correction approaches for conditions different from those used for their parameterization, which is unavoidable in most climate impact studies, might result in significant uncertainties. In this study, RCMs driven by ERA40 reanalysis data were evaluated; uncertainties can be expected to be even larger for GCM-driven RCMs.