Interactive comment on “Are we using the right fuel to drive hydrological models? A climate impact study in the Upper Blue Nile”

This study evaluates potential impact of climate change on streamflow in the Upper Blue Nile basin (Ethiopia). The projections of climate change are defined by two model ensembles. The impact is assessed by using an eco-hydrological model. As a result, the authors evaluate and discuss the performance and effect of bias correction on rainfall characteristics in the reference period and future streamflow projections. They conclude that multi-model means indicate an increase in mean annual streamflow and a seasonal shift in the Upper Blue Nile.


Introduction
Ethiopia is a country where about 80 % of the population is engaged in the agricultural sector (Dile et al., 2013; Deressa et al., 2011), the main source of income for rural communities (Bryan et al., 2009). Around 90 % of the country's grain is produced by smallholder farms. Subsistence and rain-fed farming systems dominate and, with few exceptions, irrigation is not practised. Consequently, agricultural and livestock production, people's livelihoods, and food security depend strongly on weather conditions, mainly on rainfall patterns such as amounts and timing. Hence, a large share of Ethiopia's population is very vulnerable to weather conditions and in particular to their inter-annual variability (Busby et al., 2014; Megersa et al., 2014; Headey et al., 2014; Zaitchik et al., 2012; Simane et al., 2012).
The Ethiopian highlands, where the Blue Nile rises, are considered to be the "water tower" of East Africa. The Blue Nile, for instance, contributes about 55-65 % of the flow of the Nile at the confluence with the White Nile (King, 2013; Sutcliffe and Parks, 1999). The river is therefore the most important water resource, not only for Ethiopia but also for the downstream riparian countries of Sudan and Egypt. Water politics in the Nile basin have a long history and are a central geopolitical feature in this region (Gebreluel, 2014; Ibrahim, 2012). With growing populations, industrialization, and climate change and its variability, the situation is becoming more and more tense (Gebreluel, 2014). Knowledge about the availability of future water resources in this region, and therefore studies providing insights into climate change and variability and their impacts on hydrology, are of utmost importance.
A review of future hydrological and climate studies in the River Nile basin is provided by Di Baldassarre et al. (2011) and a review on hydrological extremes in the Upper Blue Nile catchment (UBN) by Taye et al. (2015). Recent studies on climate change and variability in the UBN or its tributaries served different purposes. The studies by Mengistu et al. (2014), Taye and Willems (2012), Conway and Schipper (2011), and Conway and Hulme (1993) investigated, for instance, trends of past climate change using observed and/or generated climate data. Diro et al. (2009) analysed the quality of rainfall data using two numerical weather prediction models. Another category of studies investigates the performance and projected trends of climate models (e.g. Conway and Schipper, 2011; Diro et al., 2011).
As a matter of fact, climatic variables such as air temperature, precipitation, and radiation simulated by global and regional climate models usually have a bias in the historical (reference) period (e.g. Addor and Seibert, 2014; Berg et al., 2012; Gudmundsson et al., 2012; Hagemann et al., 2011). Moreover, they often fail to adequately represent spatiotemporal dynamics at the regional scale. In climate studies, the absolute or relative changes between historical and projection periods are analysed and reported in the following manner: model X projects a temperature increase of 2.5 K in 2021-2050 and an increase of 8 % of rainfall relative to its reference period. Here, it does not matter whether model X was too cold/warm or too dry/wet during the reference period. Only the rate of change matters, which might be reasonable in this context. Moreover, in climate change studies it is common practice nowadays to analyse the entire available model ensemble and to calculate the multi-model mean, which is superior to any one individual climate model (Pierce et al., 2009). Unfortunately, a daily multi-model mean climate time series does not serve as reasonable input for impact models operating at the daily time step. Therefore, the application of climate model ensembles is always recommended for hydrological studies (Teutschbein and Seibert, 2010) and is considered nowadays as state of the art.
Quantitative and application-oriented impact studies require a certain accuracy of input data as well as adequate representation of the relevant processes by the models used. Small biases already present in temperature or precipitation may lead to considerable biases in impact models (Maraun et al., 2010). Therefore, various bias correction approaches were developed, particularly for hydrological applications (Piani et al., 2010; Dosio and Paruolo, 2011). The expectation of using bias-corrected input data is that they are quantitatively more precise than their uncorrected counterparts.
The authors of studies using complex model ensembles in the UBN, cited above, applied different approaches to generate climate input time series for hydrological modelling. Elshamy et al. (2009) used a distribution mapping approach to simultaneously downscale and bias-correct 17 CMIP3 GCMs (SRES A1B) and applied the corrected climate data to run the Nile Forecasting System in the UBN. The delta-change method was used by Mengistu and Sorteberg (2012) and Kim et al. (2008) to generate time series of temperature and precipitation used as input for hydrological modelling. Mengistu and Sorteberg (2012) used 19 GCMs of the CMIP3 model ensemble (SRES scenarios A2, A1B, and B1) to generate climate inputs for the SWAT model and Kim et al. (2008) used six GCMs (SRES A2) to run a monthly water balance model. Setegn et al. (2011) applied a downscaling approach for daily temperature and precipitation data to 15 CMIP3 GCMs (SRES scenarios A2, A1B, and B1) using a cumulative frequency distribution approach. They used the climate data to run the SWAT model in the Lake Tana basin. Beyene et al. (2010) performed a quantile mapping approach to bias-correct 11 CMIP3 GCMs (SRES A2 and B1) to run the VIC hydrological model for the entire Nile basin. Recently, Teklesadik et al. (2017) published a study comparing climate change impacts, particularly on actual evapotranspiration, using six hydrological models driven by the same four CMIP5 GCMs used in the study at hand. Liersch et al. (2017) used a climate model ensemble to analyse the impacts of the Grand Ethiopian Renaissance Dam on downstream discharges under current and future climate conditions based on the 10 "best" global and regional climate models identified in this study.
The study at hand falls into the same category using the most recent global and regional climate projections released for the IPCC 5th Assessment Report (IPCC, 2013). Uncorrected and bias-corrected climate simulations of five CMIP5 Earth system models (ESMs) and 10 uncorrected and bias-corrected regional climate models (RCMs) from CORDEX Africa were used to run the Soil and Water Integrated Model (SWIM). The climate scenarios used by both model ensembles are the Representative Concentration Pathways (RCPs) RCP 4.5 and RCP 8.5 (van Vuuren et al., 2011; Meinshausen et al., 2011). Hence, we analyse 60 discharge simulations (two RCPs and 15 uncorrected and 15 bias-corrected climate model runs) for the reference period 1970-1999 and two future periods 2030-2059 and 2070-2099. The first objective of this study is to assess climate change and its impacts on the availability of future water resources in the UBN defined at gauge El Diem (Sudan border). The second objective is to discuss the implications of using different model ensembles to project future discharges by comparing the results of the whole range of uncorrected and bias-corrected ESM and RCM ensembles. Eventually an ensemble is assembled including only those members fulfilling certain performance criteria. These criteria are used to characterize the suitability of simulations for different purposes, such as for qualitative or quantitative studies. A qualitative impact study may have lower demands on the quality of climate simulations than a study investigating hydrological extremes or water management strategies. In the latter case, the requirements in terms of quantitative accuracy are much higher. The following questions were central to our investigations.
d. To what extent can we trust simulations that require a strong correction?

Study area
The entire Blue Nile River basin covers an area of about 296 000 km². The study area considered here is the Upper Blue Nile catchment (UBN) defined at gauge El Diem at the border between Ethiopia and Sudan, which covers an area of 172 000 km². Elshamy et al. (2009) estimate a catchment area of 185 000 km² and Mengistu and Sorteberg (2012) an area of 174 000 km² for the UBN. These discrepancies are certainly based on different digital elevation models and GIS algorithms used to delineate the catchment area and thus may add to the uncertainties of such studies, which are not easily quantifiable. In Fig. 1, the UBN is encircled by a red line.
In addition, Fig. 1 shows the 576 subbasins that were delineated for the hydrological modelling exercise, the three gauging stations used to calibrate the hydrological model, and the coordinates of the climate data grid. The source of the Blue Nile River is Lake Tana in the Ethiopian highlands and the catchment is located in the north-western part of Ethiopia (Taye and Willems, 2012). It drains a major part of the western highlands (Sutcliffe and Parks, 1999) that is predominantly governed by a unimodal rainfall regime depending on the movement of the intertropical convergence zone (ITCZ).
The inter-annual variability of annual rainfall amounts in the Ethiopian highlands is high (Zaitchik et al., 2012) and ranges between 800 and 2200 mm, and the elevation of the UBN varies from 4000 to 500 m a.s.l. (Taye and Willems, 2012).
The river has a length of almost 1000 km from the Lake Tana outlet to the Sudan border.

Data
Freely available WATCH Forcing Data (WFD) (Weedon et al., 2011) Krysanova et al. (2005). Although the quality of WFD varies in space (Rust et al., 2015), this gridded product with a spatial resolution of 0.5° was used as input because observed climate data were not available for this study. The SRTM digital elevation model (Jarvis et al., 2008) was used to delineate the 576 subbasins and to derive some terrain-specific parameters. Required soil parameters were derived from the Digital Soil Map of the World (FAO et al., 2009) and land use cover data were reclassified from Global Land Cover (GLC2000) (Bartholomé and Belward, 2005). Observed monthly discharge data for model calibration and validation were provided by the Global Runoff Data Centre (GRDC).

Hydrological model
The Soil and Water Integrated Model (SWIM), developed by Krysanova et al. (2005), is a semi-distributed, process-based eco-hydrological model that operates at the daily time step. It was developed on the basis of the MATSALU (Krysanova et al., 1989) and SWAT (Arnold et al., 1993) models and is continuously being further developed and adapted to new or specific requirements (Krysanova et al., 2015). Hydrological response units (HRUs), considered as areas with similar hydrological characteristics, are the smallest model units where all hydrological, nutrient, and vegetation processes are calculated. There is no lateral interaction between HRUs, but area-weighted daily fluxes are calculated and aggregated at the subbasin scale and routed through the river network. SWIM distinguishes three flow components: surface runoff, subsurface runoff, and contributions of the shallow groundwater aquifer. Actual evapotranspiration is determined by simulated soil evaporation and transpiration from the vegetation cover. Water percolating from the shallow groundwater aquifer into the deep groundwater aquifer is lost from the system but is considered in the water balance.
A reservoir module, developed by Koch et al. (2013), was incorporated in SWIM and parameterized to better account for Lake Tana's storage effects and to consider, in future simulations, the impact of the weir at the lake's outlet that was constructed in 1996.
Radiation data required by SWIM as essential climate input were not available in all RCM runs. To maintain consistency and comparability in hydrological simulations, daily radiation data were computed after Hargreaves and Samani (1985) from daily minimum and maximum air temperature and the latitude of the respective subbasin. The simulated radiation data were calibrated to fit average annual observed radiation data of about 1800 kWh m⁻².
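The temperature-based radiation estimate can be illustrated with a minimal sketch. It assumes the common formulation in which solar radiation is proportional to extraterrestrial radiation times the square root of the daily temperature range, with extraterrestrial radiation computed from latitude and day of year as in FAO-56. The function names and the default coefficient `k_rs = 0.16` are assumptions for illustration; they are not the calibrated values used in this study.

```python
import math

GSC = 0.0820  # solar constant in MJ m^-2 min^-1 (FAO-56)

def extraterrestrial_radiation(lat_deg, doy):
    """Daily extraterrestrial radiation (MJ m^-2 day^-1), non-polar latitudes."""
    phi = math.radians(lat_deg)
    # inverse relative Earth-Sun distance and solar declination
    dr = 1.0 + 0.033 * math.cos(2.0 * math.pi * doy / 365.0)
    delta = 0.409 * math.sin(2.0 * math.pi * doy / 365.0 - 1.39)
    # sunset hour angle
    ws = math.acos(-math.tan(phi) * math.tan(delta))
    return (24.0 * 60.0 / math.pi) * GSC * dr * (
        ws * math.sin(phi) * math.sin(delta)
        + math.cos(phi) * math.cos(delta) * math.sin(ws)
    )

def hargreaves_samani_radiation(tmin, tmax, lat_deg, doy, k_rs=0.16):
    """Solar radiation (MJ m^-2 day^-1) from the daily temperature range.

    k_rs is an empirical coefficient (hypothetical default); in practice it
    would be calibrated, e.g. against an observed annual radiation sum.
    """
    ra = extraterrestrial_radiation(lat_deg, doy)
    return k_rs * math.sqrt(max(tmax - tmin, 0.0)) * ra

# Example: a subbasin at 11 degrees north on day 180 of the year
rs = hargreaves_samani_radiation(tmin=12.0, tmax=26.0, lat_deg=11.0, doy=180)
rs_kwh = rs / 3.6  # 1 kWh = 3.6 MJ
```

Calibration to an annual target such as 1800 kWh m⁻² would then amount to adjusting `k_rs` until the annual sum of the daily values matches the target.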

Climate models
The ESM ensemble used in this study consists of the following five CMIP5 models: GFDL-ESM2M, HadGEM2-ES, IPSL-CM5A-LR, MIROC-ESM-CHEM, and NorESM1-M. Projections of these five ESMs were linearly downscaled and bias-corrected by Hempel et al. (2013) in the frame of the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP) (Warszawski et al., 2014). The uncorrected ESM simulations were interpolated to the WFD 0.5° grid.
Table S1 in the Supplement provides an overview of the RCM runs organized by the CORDEX Africa initiative. The ensemble consists of four RCMs driven by different ESMs. The RCM SMHI-RCA4 was driven by seven ESMs, Can-RCM4 by CanESM2, and the RCMs KNMI-RACMO22T and DMI-HIRHAM4 by EC-EARTH. The 10 RCM runs were bias-corrected by the authors of this paper. Table S2 shows the model IDs of all 15 climate models used in some figures and tables.

Climate scenarios
For both the global and regional climate model ensembles, the two scenarios RCP 4.5 and RCP 8.5 were used because they represent a broad range of uncertainties with regard to possible future pathways and related climate projections. According to van Vuuren et al. (2011) and Meinshausen et al. (2011), RCP 4.5 represents the medium stabilization scenario (stabilization without overshoot pathway leading to +4.5 W m⁻² radiative forcing relative to pre-industrial forcing and ∼650 ppm CO₂ eq. by 2100) and RCP 8.5 the highest emission scenario (rising radiative forcing pathway leading to +8.5 W m⁻² and ∼1370 ppm CO₂ eq. by 2100), assuming no stabilization in global greenhouse gas emissions.

Bias correction
Despite regional downscaling to finer resolution, RCM simulations often show considerable biases when compared to observed data (Addor and Seibert, 2014; Christensen et al., 2008). A review of bias correction methods (linear scaling, local intensity scaling, power transformation, and distribution or quantile mapping) is provided by Teutschbein and Seibert (2012). The authors conclude that the distribution or quantile mapping method achieves the best performance for most of the selected criteria. Although quantile mapping is a successful method to improve the representation of daily rainfall characteristics, it fails to correct multi-day and inter-annual variables, such as mean maximum 4-day precipitation, mean minimum 14-day precipitation, and inter-annual variability (Addor and Seibert, 2014). The drawback that all approaches have in common is that they are based on the stationarity assumption, which presumes that future physical processes in the atmosphere are comparable to the period used to correct the simulations. Bias correction of climate simulation data is nowadays a widely used practice in hydrological impact modelling, but it should be treated with caution. As Maraun et al. (2010) point out, the post-processing does not resolve the origins of the bias in climate simulations (mathematical formulations in climate models) and may disrupt the internal physical coherence between weather variables. Hence, the correction is usually based on wrong reasons (Addor and Seibert, 2014). Alternatives to bias correction are so-called delta-change methods. Sophisticated approaches of this method are described by Anandhi et al. (2011), Bosshard et al. (2011), and Chiew et al. (2009).

Bias correction of ESMs
Bias-corrected data of five CMIP5 ESMs were available and provided by ISIMIP. In a first step, ESM data were linearly interpolated to the WFD 0.5° grid, implementing the standard Gregorian calendar. Temperature data were corrected using a trend-preserving additive approach where monthly mean values were adjusted for a systematic bias by adding a grid-point-specific and month-specific constant offset. Therefore, the absolute projected temperature changes of the ESMs are not changed. The daily variability of ESM temperatures was adjusted to reproduce WFD variability by applying a monthly correction factor to temperature anomalies.
Precipitation data were corrected using a multiplicative approach where monthly mean precipitation was multiplied by a grid-point-specific and month-specific constant correction factor. Relative changes projected by the ESMs are thereby preserved. A known problem of this method is that extraordinarily high values of daily precipitation can occur in the bias-corrected simulation if very high simulated daily precipitation data are multiplied by high correction factors. Therefore, the correction factor was limited to a value of 10. Remaining extremely high daily precipitation values were truncated to 400 mm. Following the method introduced by Piani et al. (2010), daily precipitation variability and the frequency of dry days were corrected by applying a transfer function to fit the normalized simulated time series of wet months to the normalized WFD time series. A more detailed description of the bias correction procedure applied to the five CMIP5 ESMs used in this study is provided by Hempel et al. (2013).
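The mean-correction step described above can be sketched for a single grid point as follows. This is a simplified illustration under stated assumptions, not the ISIMIP implementation of Hempel et al. (2013): it covers only the month-specific additive offset for temperature and the month-specific multiplicative factor for precipitation (capped at 10, with daily values truncated at 400 mm), and omits the adjustment of daily variability and dry-day frequency. All function names are hypothetical.

```python
import numpy as np

def additive_offsets(sim_ref, obs_ref, months):
    """Month-specific offsets: mean(obs) - mean(sim) over the reference period."""
    return np.array([obs_ref[months == m].mean() - sim_ref[months == m].mean()
                     for m in range(1, 13)])

def correct_temperature(sim, months, offsets):
    """Add the constant monthly offset; absolute changes are preserved."""
    return sim + offsets[months - 1]

def multiplicative_factors(sim_ref, obs_ref, months, max_factor=10.0):
    """Month-specific factors mean(obs) / mean(sim), limited to max_factor."""
    obs_mean = np.array([obs_ref[months == m].mean() for m in range(1, 13)])
    sim_mean = np.array([sim_ref[months == m].mean() for m in range(1, 13)])
    # months without simulated precipitation keep a neutral factor of 1
    factors = np.divide(obs_mean, sim_mean, out=np.ones(12), where=sim_mean > 0)
    return np.minimum(factors, max_factor)

def correct_precipitation(sim, months, factors, max_daily=400.0):
    """Scale by the monthly factor and truncate extreme daily values."""
    return np.minimum(sim * factors[months - 1], max_daily)

# Example with three years of monthly values (months coded 1..12)
months = np.tile(np.arange(1, 13), 3)
t_sim_ref, t_obs_ref = np.full(36, 20.0), np.full(36, 22.5)
offsets = additive_offsets(t_sim_ref, t_obs_ref, months)
t_future_corrected = correct_temperature(t_sim_ref + 3.0, months, offsets)
```

Because the offset is constant per month, the corrected future minus corrected reference temperature equals the uncorrected change (3.0 K in the example), which is what "trend-preserving" means here.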

Bias correction of RCMs
Precipitation biases in most CORDEX RCMs show a high seasonality for grid boxes within the evaluation domain of the UBN. This limits a bias correction based on seasonal or annual means. However, as some of these grid boxes show almost no precipitation events for single months, a harmonic-based bias correction method analogous to the one applied to temperature is not feasible for precipitation. Furthermore, this results in a large uncertainty in the estimation of the corresponding monthly biases. Thus, based on the recommendation from Dobler and Ahrens (2008), a bias correction is only applied to months and grid boxes with more than 100 rainy days (rainfall above 1 mm day⁻¹) within the calibration period.
The method applied is based on a local rainy day intensity scaling, correcting the frequency of rainy days and the mean precipitation on rainy days to fit the observed values in a specific calibration period (Schmidli et al., 2006). Details on the implementation and an evaluation are given in Dobler and Ahrens (2008). The method has been successfully applied before as a downscaling and bias-correcting method for precipitation in alpine regions (Dobler and Ahrens, 2008; Dobler et al., 2011).
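A local intensity scaling of this kind can be sketched in two steps: first a simulated wet-day threshold is chosen so that the simulated rainy-day frequency matches the observed one, then the wet-day intensities are scaled to match the observed mean. The sketch below is one common variant of the approach, written under the assumption of a 1 mm day⁻¹ observed wet-day threshold; function names are hypothetical and the details may differ from the implementation of Dobler and Ahrens (2008).

```python
import numpy as np

def fit_intensity_scaling(sim, obs, wet_threshold=1.0):
    """Return (sim_threshold, scale) fitted on a calibration period."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    # number of simulated days that should be wet to match the observed frequency
    n_wet = int(round(np.mean(obs >= wet_threshold) * sim.size))
    if n_wet == 0:
        raise ValueError("no rainy days in the observations")
    # the n_wet-th largest simulated value becomes the simulated wet-day threshold
    sim_threshold = np.sort(sim)[::-1][n_wet - 1]
    obs_intensity = obs[obs >= wet_threshold].mean() - wet_threshold
    sim_intensity = sim[sim >= sim_threshold].mean() - sim_threshold
    if sim_intensity <= 0:
        raise ValueError("simulated wet-day intensity is not positive")
    return sim_threshold, obs_intensity / sim_intensity

def apply_intensity_scaling(sim, sim_threshold, scale, wet_threshold=1.0):
    """Set days below the simulated threshold to zero and rescale wet days."""
    sim = np.asarray(sim, dtype=float)
    return np.where(sim >= sim_threshold,
                    wet_threshold + scale * (sim - sim_threshold), 0.0)

# Example: a "drizzling" simulation with too many weak rain days
obs = np.array([0, 0, 2, 4, 6, 0, 0, 3, 5, 0], dtype=float)
sim = np.array([0.2, 0.3, 1, 2, 3, 0.1, 0.4, 1.5, 2.5, 0.5])
thr, scale = fit_intensity_scaling(sim, obs)
corrected = apply_intensity_scaling(sim, thr, scale)
```

In the example the corrected series has the same number of rainy days and the same mean rainy-day precipitation as the observations; for a projection period the fitted threshold and scale would be applied unchanged.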
The underlying idea is the assumption of a smooth seasonal cycle for the variables simulated by the RCM and the observational reference (WFD).These cycles are modelled with a series of harmonic functions using vector generalized linear models (Yee, 2015), and the difference in cycles between an RCM reference simulation and the observational product is used for bias correction of the RCM projection.

S. Liersch et al.: Impact of regionalization and bias correction in the Upper Blue Nile
The seasonality in the location parameter of a quantity (i.e. the expectation value in the case of a Gaussian distribution) can be modelled as a harmonic expansion of the form

$$\mu(t) = \mu_0 + \sum_{k=1}^{K} a_k \sin(k \omega t) + \sum_{l=1}^{L} b_l \cos(l \omega t),$$

with $\omega = 2\pi/365.25$ and $t = 1, \ldots, 366$ being the time variable running over all possible days of the year; $\mu_0$, $a_k$, and $b_l$ denote the coefficients to be estimated, and K and L are the orders of the harmonic function expansion for µ. A scale parameter σ can be modelled analogously in this framework. The result is a climatological distribution, i.e. a description of the probability distribution throughout the year.
Selection of orders K and L is based on a 10-fold cross validation using the Continuous Ranked Probability Score (CRPS, Wilks, 2011) as the cost function. The difference in parameters between the RCM reference and the observational product (WFD) is subtracted from the parameters of the RCM projections for bias correction. Quantile mapping (e.g. Vrac and Friederichs, 2015) then maps the values from the uncorrected to the corrected climatological distribution.
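The harmonic description of the seasonal cycle and the parameter-difference correction can be illustrated with a short sketch. Note the assumptions: the coefficients are fitted here by ordinary least squares rather than with the vector generalized linear models used in the study, only the location parameter is treated, the orders K and L are fixed instead of being selected by cross validation, and all function names are hypothetical.

```python
import numpy as np

OMEGA = 2.0 * np.pi / 365.25

def design_matrix(t, K, L):
    """Columns: intercept, sin(k*omega*t) for k=1..K, cos(l*omega*t) for l=1..L."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    cols += [np.sin(k * OMEGA * t) for k in range(1, K + 1)]
    cols += [np.cos(l * OMEGA * t) for l in range(1, L + 1)]
    return np.column_stack(cols)

def fit_seasonal_mean(t, y, K=2, L=2):
    """Least-squares estimate of the harmonic coefficients of the seasonal mean."""
    coef, *_ = np.linalg.lstsq(design_matrix(t, K, L), y, rcond=None)
    return coef

def seasonal_mean(t, coef, K=2, L=2):
    """Evaluate the fitted seasonal cycle at days of year t."""
    return design_matrix(t, K, L) @ coef

def corrected_location(t, coef_proj, coef_ref, coef_obs, K=2, L=2):
    """Subtract the reference-period difference (RCM minus observations)."""
    bias = seasonal_mean(t, coef_ref, K, L) - seasonal_mean(t, coef_obs, K, L)
    return seasonal_mean(t, coef_proj, K, L) - bias

# Example: synthetic seasonal cycles for observations, RCM reference, projection
t = np.arange(1, 367, dtype=float)
obs = 20.0 + 5.0 * np.sin(OMEGA * t) + 2.0 * np.cos(OMEGA * t)
rcm_ref = obs + 2.0    # RCM is 2 K too warm in the reference period
rcm_proj = rcm_ref + 1.0  # projected warming of 1 K
c_obs, c_ref, c_proj = (fit_seasonal_mean(t, y, 1, 1) for y in (obs, rcm_ref, rcm_proj))
mu_corrected = corrected_location(t, c_proj, c_ref, c_obs, 1, 1)
```

In this synthetic case the corrected location parameter equals the observed cycle plus the projected change, i.e. the reference bias is removed while the climate change signal is retained.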
Particular care needs to be taken when correcting minimum and maximum temperature to avoid inconsistencies such as Tmax < Tmin. Here, a variable transformation ensures physical consistency: After bias-correcting T1 and T2, corrected values for Tmax and Tmin can be obtained by back-transforming the variables.

Evaluating the suitability of climate simulations
Evaluating the suitability of climate simulations for regional impact studies is a process that includes seemingly objective components (e.g. analysing performance criteria) and subjective components (choosing criteria and setting their thresholds). Data visualization and interpretation by the user might be considered as a mixture of both objectivity and subjectivity. The choice of periods used as reference and future projection also influences the results. The former is often predetermined by data availability or conventions and the latter usually by the client. Moreover, there are uncertainties with regard to the quality of the dataset used as the comparison baseline, mostly observed and/or generated climate data. Evaluation of climate model performance is complicated by the fact that climate simulations cannot be compared to the reference dataset on a real-time daily, monthly, or annual basis, as is common practice with discharge simulations in hydrological modelling. Climate simulations are not supposed to reproduce or predict the weather for a certain day, month, or year. Hence, only statistical parameters, summarized over a period of usually 30 years (e.g. the annual cycle represented by average daily or monthly time series), or the mean, quantile values, and standard deviation of the entire daily time series can be used as a basis for comparison.
In the first step of climate model evaluation, daily and monthly precipitation characteristics of uncorrected (UC) and bias-corrected (BC) climate simulations were compared to WFD characteristics (reference climate). In a second step, SWIM was employed to simulate daily discharge using all climate simulations for reference and future periods. Since the main purpose of this study is to assess climate change impacts on hydrology, using hydrological performance indicators to evaluate climate simulations is a straightforward method. A similar approach was used by Elshamy et al. (2013), who used a GLUE-like methodology to exclude and weigh climate model performance. Another benefit of this approach is that a spatially semi-distributed hydrological model accounts not only for temporal but also for spatial patterns of climate inputs. Therefore, the annual cycle represented by daily (n = 365) discharge simulations (sim), averaged over the 30-year reference period, was compared against the baseline simulation using WFD (ref). The performance criteria applied to these time series are the coefficient of determination (R²), PBIAS, standard deviation (SD), and the normalized SD of discrepancies (SD_D) or the centred root mean square errors.
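A minimal sketch of these criteria for two annual-cycle series is given below. Several choices are assumptions made for illustration: R² is taken as the squared Pearson correlation, PBIAS is defined so that negative values indicate underestimation relative to the reference (consistent with how PBIAS is reported in this study), and SD and the centred root mean square error are normalized by the SD of the reference. Function names are hypothetical.

```python
import numpy as np

def r_squared(sim, ref):
    """Squared Pearson correlation between simulation and reference."""
    return float(np.corrcoef(sim, ref)[0, 1] ** 2)

def pbias(sim, ref):
    """Percent bias; negative values mean the simulation is too low."""
    return float(100.0 * (np.sum(sim) - np.sum(ref)) / np.sum(ref))

def normalized_sd(sim, ref):
    """Standard deviation of the simulation relative to the reference."""
    return float(np.std(sim) / np.std(ref))

def normalized_centred_rmse(sim, ref):
    """Centred RMSE (means removed), normalized by the reference SD."""
    diff = (sim - np.mean(sim)) - (ref - np.mean(ref))
    return float(np.sqrt(np.mean(diff ** 2)) / np.std(ref))

# Example: a simulation with perfect timing but only half the volume
ref = np.array([1.0, 2.0, 4.0, 8.0, 4.0, 2.0])
sim = 0.5 * ref
scores = {
    "R2": r_squared(sim, ref),
    "PBIAS": pbias(sim, ref),
    "SD": normalized_sd(sim, ref),
    "SD_D": normalized_centred_rmse(sim, ref),
}
```

The example yields an R² of 1 together with a PBIAS of −50 %, which illustrates why a correlation-based criterion needs to be paired with a volumetric one.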
The characteristics of daily discharges were analysed using flow duration curves (FDCs), where every single discharge value is related to the percentage of time it is equalled or exceeded (Smakhtin, 2000). FDCs summarize discharge variability of a time series and display the complete range from low flows to flood events. In order to analyse and visualize average, low, and high flow characteristics, 17 percentile values (Q0.01-Q99.99) were used to compute FDCs based on the entire daily discharge time series of the 30-year reference period. This method was applied to assess whether model performance is suitable to study non-extreme discharge conditions (NED) and/or high and low flow situations as well as their extremes.
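The FDC percentiles can be sketched as follows, where Qp denotes the discharge equalled or exceeded p % of the time, i.e. the (100 − p)th percentile of the daily series. The sketch uses only the ten high-flow and low-flow exceedance levels named in this paper; the remaining levels of the 17 are not listed here and are therefore left out. Function names and the synthetic data are assumptions for illustration.

```python
import numpy as np

HIGH_FLOW = (0.01, 0.1, 1.0, 5.0, 10.0)    # exceedance percentages, high flows
LOW_FLOW = (90.0, 95.0, 99.0, 99.9, 99.99)  # exceedance percentages, low flows

def flow_duration_curve(q, exceedance=HIGH_FLOW + LOW_FLOW):
    """Map each exceedance percentage p to the discharge Qp."""
    q = np.asarray(q, dtype=float)
    return {p: float(np.percentile(q, 100.0 - p)) for p in exceedance}

def fdc_deviation(fdc_sim, fdc_ref):
    """Relative deviation (%) of simulated from reference FDC values."""
    return {p: 100.0 * (fdc_sim[p] - fdc_ref[p]) / fdc_ref[p] for p in fdc_ref}

# Example with a synthetic 30-year daily discharge series
rng = np.random.default_rng(0)
q_ref = rng.lognormal(mean=6.0, sigma=1.0, size=30 * 365)
q_sim = 1.2 * q_ref  # a simulation that is uniformly 20 % too high
dev = fdc_deviation(flow_duration_curve(q_sim), flow_duration_curve(q_ref))
within_30 = all(abs(d) <= 30.0 for d in dev.values())
```

A check such as `within_30` can then be applied separately to the high-flow and low-flow levels to decide whether a simulation is suitable for studying the respective flow range.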
In addition to the criteria used to evaluate model performance in the reference period, it is also important to consider model behaviour in future periods. In fact, unexpected behaviour in projection periods was observed in several simulations, particularly in some BC simulations. The hypothesis is that the stronger the necessity of bias correction, the higher the risk that the BC simulation will show unexpected behaviour in future periods. Therefore, another criterion was introduced that indicates the rate of change of PBIAS between the future and the reference period. Note that the definition of threshold values is somewhat subjective and was influenced by the simulation results of the model ensemble. However, if the thresholds had been set more critically, almost no climate model would have passed the evaluation process successfully. The model selection process and the definition of criteria thresholds are described in the following section.
Hydrol. Earth Syst. Sci., 22, 2163–2185, 2018, www.hydrol-earth-syst-sci.net/22/2163/2018/

Model selection
Besides analysing the impact of climate projections on future discharges using the whole UC and BC ESM and RCM ensembles, a climate model ensemble was assembled containing only those models that fulfil the criteria and thresholds defined below. In order to become a member of the selected ensemble, a model must meet all of the following three criteria.
- Seasonality. The annual cycle based on average daily discharge simulations must achieve R² ≥ 0.85. Models with R² < 0.85 are assumed to represent discharge seasonality only poorly.
Models meeting these three criteria are assumed to be suitable for a qualitative impact assessment and are indicated in the column "pre" (preselection) in Table 1. In addition, the columns HF (high flows, FDC percentiles Q10, Q5, Q1, Q0.1, Q0.01) and LF (low flows, FDC percentiles Q90, Q95, Q99, Q99.9, Q99.99) indicate further whether a particular model adequately represents extreme discharge conditions and might be used for specific investigations. Again, the FDC values in the respective range should not exceed the threshold of ±30 %.
After simulating discharges using all climate scenarios, it was found that several simulations project an enormous increase in annual river discharge already in the period 2030-2059. This was particularly the case in simulations where bias correction resulted in a very large increase of extreme daily rainfall and therefore extraordinarily high peak discharges. Hence, another criterion was defined representing the rate of change. Simulations where average annual discharges changed by more than ±30 % in the period 2030-2059 (RCP 8.5) relative to the reference period were omitted from the selected ensemble, even if the first three criteria were achieved. This criterion is represented in Table 1 in the column "Change", which reveals that both UC and BC models either always achieve or do not achieve this criterion.
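The selection logic can be sketched as a simple filter over per-simulation scores. The sketch is restricted to the thresholds stated in the text (R² ≥ 0.85, FDC deviations within ±30 %, and a change in average annual discharge within ±30 % for 2030-2059 under RCP 8.5). It does not reproduce the full set of three preselection criteria, and the class, field names, and example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SimulationScores:
    name: str
    r2: float                 # R² of the average daily annual cycle
    max_abs_fdc_dev: float    # largest absolute FDC deviation in percent
    change_2030_2059: float   # change in mean annual discharge in percent

def evaluate(run, r2_min=0.85, fdc_tol=30.0, change_tol=30.0):
    """Return a flag for each criterion and the overall selection decision."""
    flags = {
        "seasonality": run.r2 >= r2_min,
        "fdc": run.max_abs_fdc_dev <= fdc_tol,
        "change": abs(run.change_2030_2059) <= change_tol,
    }
    flags["selected"] = all(flags.values())
    return flags

# Example with invented scores for three hypothetical simulations
runs = [
    SimulationScores("model_A_BC", r2=0.97, max_abs_fdc_dev=12.0, change_2030_2059=8.0),
    SimulationScores("model_B_UC", r2=0.49, max_abs_fdc_dev=25.0, change_2030_2059=5.0),
    SimulationScores("model_C_BC", r2=0.95, max_abs_fdc_dev=18.0, change_2030_2059=64.0),
]
selected = [run.name for run in runs if evaluate(run)["selected"]]
```

Keeping the individual flags, rather than only the final decision, mirrors the column-wise presentation of criteria in a selection table.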

Model calibration and validation
The eco-hydrological model SWIM was calibrated to three discharge gauges in the UBN: (1) downstream Lake Tana, (2) Kessie, and (3) El Diem. Due to limited data availability, the model was calibrated to the monthly time step using a semi-automated approach. The calibration (1981-1986) and validation (1987-1992) periods for gauge El Diem were chosen on the one hand according to data availability and on the other hand to cover periods of wet and dry years. Data availability for the gauges Lake Tana and Kessie was limited to the years 1969-1975 and 1976-1979, respectively. The gauges were calibrated successively; in a first step, a parameter sensitivity analysis was performed to assess reasonable parameter ranges as boundary conditions for the automatic calibration algorithm PEST (Model-Independent Parameter Estimation & Uncertainty Analysis software). The objective functions to measure model performance are the Nash-Sutcliffe efficiency (NSE) (Nash and Sutcliffe, 1970) and PBIAS, where NSE was the primary criterion.
Figure 2 shows the results of monthly and average monthly discharges at gauge El Diem for calibration (left panel) and validation (right panel). According to Moriasi et al. (2007), NSE values of 0.92 (calibration) and 0.90 (validation) are considered to be very good for the monthly time step. The same classification is achieved for the volumetric errors in both periods. The percent bias (PBIAS) between simulated and observed data is −6.7 % (calibration) and −14.4 % (validation). SWIM simulates peak discharges adequately in most years with few exceptions of rather large underestimation in the years 1983, 1987, and 1988. One explanation for this is the lack of accuracy of WFD inputs and/or observed discharge in some years. The simulated amount of water percolating into the deep aquifer is about 7 % on average. Without this recharge component, it was not possible to achieve good simulations during the dry period.
Figure S1a and b in the Supplement show the calibration results for the gauges downstream Lake Tana and Kessie. The available GRDC discharge time series for both gauges are rather short and in the case of Tana, the data of the years 1973-1975 are not reliable. According to the discharge data given in Dile et al. (2013) and Setegn et al. (2011), maximum discharges are usually around 200-250 m³ s⁻¹, as is the case in the years 1969-1972 (Fig. S1a). Monthly WFD precipitation volumes do not explain the high discharges observed in the last 3 years. Hence, only the first 4 years were used for calibration, where an NSE of 0.67 and a PBIAS of 23.1 % were achieved. Monthly discharges at gauge Kessie in the four years where GRDC data were available are underestimated by 18.8 % and achieved an NSE of 0.92. According to Moriasi et al. (2007) the results for the two gauges can be classified to be between good and very good.

Table 1. Selection of uncorrected (UC) and bias-corrected (BC) Earth system models (ESMs) and regional climate models (RCMs). "×" is the criterion achieved; "∼" is the criterion almost achieved; "-" is the criterion not achieved. "HF" refers to the high flows (≤ Q10); "LF" refers to the low flows (≥ Q90). "Change ±30" is the volumetric change between the reference period and RCP 8.5 in 2030-2059. The abbreviation "pre" refers to the preselection; "final" refers to the models selected in the final ensemble.

Despite the improvement of monthly medians and average annual precipitation sums, bias correction increased the range of monthly precipitation sums critically in several models in both ensembles. This phenomenon can be observed particularly if the deviation of monthly medians between UC simulation and WFD is rather large (e.g. IPSL from May to October, MIROC in July, NorESM in July and August). The effect of increasing variability of monthly precipitation sums is even higher with the method used to bias-correct RCMs and is true for all RCMs (Figs. S3 and S4). The extreme outliers in many models generated by both correction methods are also noticeable.
Not all UC models adequately represent the unimodal rainfall regime in the UBN. UC NorESM shows, for instance, a distinct bimodal regime, which is also visible but less pronounced in GFDL and MIROC (Fig. S2) and only weakly visible in MIROC/RCA4 (Fig. S4). Although bias correction eliminated this deficiency, it is questionable at what cost.
The physical basis was certainly disrupted by the correction method applied. Tables S3 and S4 in the Supplement show the following statistical parameters of daily precipitation averaged over the catchment: average number of days with precipitation > 1 mm per annum (nDays > 1 mm), average daily precipitation (ave), maximum daily precipitation (max), standard deviation (SD), average precipitation in July, August, and September (ave JAS), and the standard deviation of daily precipitation in July, August, and September (SD JAS). Whereas Table S3 shows absolute values, Table S4 shows the differences to WFD precipitation (sim − WFD). The two SD parameters were computed by division, SD_sim/SD_WFD. The tables show, for instance, that maximum daily precipitation is underestimated by all UC models except MIROC. Bias correction resulted in overestimation in 13 out of 15 models. All BC RCMs overestimate maximum daily precipitation, many of them significantly; yet the differences in average daily precipitation of BC simulations are, with exceptions, usually rather small. Large deviations in maximum daily precipitation and in the number of rainy days at the same time, while achieving only small differences in average daily precipitation, indicate that the distribution of daily rainfall can sometimes differ strongly among simulations. It is also noticeable that the SD of daily precipitation of all UC models is lower than the WFD SD. Almost all BC simulations show higher SD than the UC simulations, where all ESM SD values are still lower than WFD SD and all RCM SD values are greater than or equal to WFD SD.

Performance of average daily discharge using UC and BC climate input
Bias correction considerably improved the performance of averaged daily discharge simulations (n = 365) for all members of the ESM ensemble and for most members of the RCM ensemble. Figures 3 and 4 show the simulated hydrographs in the reference period, comparing UC and BC simulations with WFD and using R² and PBIAS to indicate the discharge performance of the annual cycle.
All UC discharge simulations using ESM climate input, except the one based on GFDL, underestimate average annual discharges, as indicated by negative PBIAS values (Fig. 3). IPSL shows the largest deviation, with a PBIAS of −84 %. All other models deviate by less than 30 % from WFD discharges. R² values indicate that seasonal discharge patterns are more or less adequately represented by all models except NorESM, which simulates a bimodal regime with a small peak in June and a high peak in October instead of a single major peak between August and September. Peak discharges simulated with GFDL and MIROC climate input occur approximately 4 weeks later than the peak simulated with WFD. Discharges simulated with HadGEM achieve an R² of 0.98 but are too low during the high flow season. Another example is the UC IPSL model, which achieves an R² of 0.9 although it underestimates discharge by 84 %. Hence, high R² values can be misleading if they are not combined with a volumetric criterion such as PBIAS.

Figure 3. Annual cycle of average daily uncorrected (UC) and bias-corrected (BC) simulated discharges at gauge El Diem using Earth system model input and WATCH Forcing Data (WFD) in the reference period.
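Why a high R² can coexist with a large volumetric bias can be shown with a small synthetic example. The sketch assumes that R² is the squared Pearson correlation of the two annual cycles and that PBIAS is defined so that negative values mean underestimation, as in the text above; the hydrograph itself is invented and does not represent El Diem.

```python
import numpy as np

def r_squared(sim, ref):
    """Squared Pearson correlation between two annual cycles."""
    r = np.corrcoef(sim, ref)[0, 1]
    return r ** 2

def pbias(sim, ref):
    """Percent bias; negative values indicate underestimation (assumed sign convention)."""
    return 100.0 * (np.sum(sim) - np.sum(ref)) / np.sum(ref)

# Synthetic unimodal annual cycle (n = 365) with a single peak in late August.
days = np.arange(365)
ref = 500.0 + 5500.0 * np.exp(-0.5 * ((days - 240) / 30.0) ** 2)

# A simulation with perfect timing and shape but only 16 % of the volume.
sim = 0.16 * ref

print(f"R2 = {r_squared(sim, ref):.2f}, PBIAS = {pbias(sim, ref):.1f} %")
```

Because correlation is insensitive to a constant scaling factor, the scaled series reaches an R² of 1 while its PBIAS is −84 %, which is why the two criteria are used together.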
In contrast to ESMs, the majority of discharge simulations based on UC RCMs overestimate average annual discharges in the reference period (Fig. 4). The deviations of six UC RCMs are larger than 30 %. However, seasonal discharge patterns are generally better represented using UC RCM climate input than UC ESM input. The lowest UC RCM R² value is 0.93, compared to an R² of 0.49 for NorESM in the UC ESM ensemble. Hence, bias correction improved R² values only slightly for 50 % of RCMs. In 60 % of the cases, the volumetric deviation (PBIAS) of BC RCMs is significantly lower than in the corresponding UC models. Based on these two indicators, the performance of BC RCM simulations is generally better than that of UC RCMs. However, there is a strong tendency towards peak flow overestimation in six out of ten BC RCMs, which is not captured by R² and PBIAS. Therefore, a visual assessment of hydrographs is important, as is an analysis of daily discharge characteristics using FDCs (see following section).
Taylor diagrams (Taylor, 2001) are another way to visualize model performance, showing three performance indicators (R², normalized SD, and SD_D) in a single plot (see Fig. 5). They facilitate the visual assessment of model performance because outliers can be easily identified. A model with statistical characteristics similar to the reference dataset would be represented by a point at 1.0 on the x-scale and 0.0 on the y-scale. However, interpretation of normalized values is difficult in terms of numerical thresholds, though Fig. 5a clearly identifies UC IPSL and UC NorESM as outliers. IPSL is, for instance, an outlier because it shows deficiencies in representing SD (0.25 where 1.0 would be ideal) and SD_D (0.79 where 0.0 would be ideal). UC NorESM performs poorly in terms of all indicators. After bias correction all ESMs show rather good performance (see Fig. 5b). Except for BC IPSL, all models have a lower SD than WFD. The characteristics of RCMs are different. Half of the UC RCMs' SDs (Fig. 5c) deviate by more than ±0.25 from standardized WFD, but the RCMs perform much better in terms of R². Interestingly, after bias correction (Fig. 5d), all models show a higher SD than WFD, which is consistent with the higher SD of daily rainfall described in the previous section.

Figure 4. Annual cycle of average daily uncorrected (UC) and bias-corrected (BC) simulated discharges at gauge El Diem using regional climate model input and WATCH Forcing Data (WFD) in the reference period.
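The three quantities shown in a Taylor diagram are linked geometrically, which the following sketch makes explicit. It assumes that SD_D is the standard deviation of the discrepancies (simulation minus reference) normalized by the reference SD; the function name and the use of the population standard deviation are illustrative choices, not taken from this study.

```python
import numpy as np

def taylor_stats(sim, ref):
    """Return correlation r, normalized SD, and normalized SD of discrepancies."""
    sim = np.asarray(sim, dtype=float)
    ref = np.asarray(ref, dtype=float)
    sd_ref = ref.std()
    r = np.corrcoef(sim, ref)[0, 1]
    sd_norm = sim.std() / sd_ref             # 1.0 would be ideal
    sd_d_norm = (sim - ref).std() / sd_ref   # 0.0 would be ideal
    return r, sd_norm, sd_d_norm

# The law of cosines behind the diagram (Taylor, 2001):
#   sd_d_norm**2 == sd_norm**2 + 1 - 2 * sd_norm * r
```

Since the three indicators are tied together by this relation, a model that plots far from the reference point must be deficient in variability, in correlation, or in both, which is what makes outliers easy to spot visually.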

Flow duration curves
FDCs are employed here to analyse and characterize strengths and weaknesses of daily discharge simulations with regard to non-extreme discharge (NED) conditions, high flows, low flows, and their extremes. Figure S5 in the Supplement shows FDCs of all ensembles, where the black line represents simulations using WFD. At least one obvious outlier can be clearly identified in both UC ensembles (IPSL and CanESM2-RCA4). Apart from the outliers, NED characteristics are slightly better represented by the UC ESM ensemble (Fig. S5a) than by the UC RCM ensemble (Fig. S5c). Most of the UC RCMs tend to overestimate NED and low flows. At first glance, the biases were significantly reduced by the correction methods (Fig. S5b and d), especially for NED. However, compared to UC simulations, the correction led to higher biases in the high and low flow segments and especially in their extreme values. Note that a logarithmic y-scale is used, so large deviations in the extreme high flow section appear rather small on this plot although they are in fact extremely high.
Figure 6 overcomes this problem by showing relative deviations of FDCs between discharge time series simulated with climate model inputs and the baseline using WFD. The values corresponding to Fig. 6 are provided in Tables S5-S8 in the Supplement. Assuming that deviations in the range of ±30 % are tolerable, there is not a single UC model (Fig. 6a and c) which fulfils this requirement for all percentile values. The UC ESMs MIROC and HadGEM (Fig. 6a) show acceptable deviations (±30 %) in NED conditions, but there is not a single UC RCM representing NED conditions in the given range (Fig. 6c). The best UC RCM result was achieved with NorESM1-RCA4. Figure 6b and d show that bias correction was successful in correcting the biases of NED for all ESMs and seven out of ten RCMs. The correction method applied to ESMs leads to different patterns in the high and low flow sections compared to the method used to bias-correct RCMs.
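The relative FDC deviations described above can be sketched as follows. The sketch assumes that Q_p denotes the discharge exceeded p % of the time (so Q1 is a high flow and Q99 a low flow), uses the percentiles named in this section, and relies on NumPy's default percentile interpolation; none of these details are confirmed as the procedure behind Fig. 6.

```python
import numpy as np

# Exceedance percentiles named in the text (Q0.01 ... Q99.99).
EXCEEDANCE = (0.01, 0.1, 1, 10, 90, 95, 99, 99.9, 99.99)

def fdc_quantiles(q, exceedance=EXCEEDANCE):
    """Discharge exceeded p % of the time, for each p in exceedance."""
    q = np.asarray(q, dtype=float)
    return {p: np.percentile(q, 100.0 - p) for p in exceedance}

def fdc_relative_deviation(sim, ref, exceedance=EXCEEDANCE):
    """Relative deviation (%) of simulated from reference FDC values."""
    qs = fdc_quantiles(sim, exceedance)
    qr = fdc_quantiles(ref, exceedance)
    return {p: 100.0 * (qs[p] - qr[p]) / qr[p] for p in exceedance}

def within_tolerance(deviation, tol=30.0):
    """Flag the percentiles whose deviation lies within +/- tol %."""
    return {p: abs(d) <= tol for p, d in deviation.items()}
```

Working with relative deviations removes the visual compression of the logarithmic axis: an overestimation of several hundred percent at Q0.01 is reported as such instead of appearing as a small offset at the top of the curve.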
Between Q1 and Q10 (high flows), the BC ESMs tend to underestimate values (but within the given range of acceptable deviations), whereas BC RCMs overestimate the flows corresponding to these percentiles. There is not a single BC RCM that represents Q1 conditions in the given range of ±30 %. The smallest overestimation for Q1 is 52.4 %. All BC RCMs strongly overestimate the extreme high flows Q0.1 and Q0.01. The highest Q0.01 overestimation is 656.9 % and the lowest 100.4 % (Table S8). The BC ESMs perform better in the extreme high flow segments. However, only GFDL and HadGEM simulate Q0.1 values in the acceptable range and only HadGEM does so for Q0.01 (Table S6).
In the low flow section (between Q90 and Q99) there is no BC ESM that performs adequately for all percentile values. Except for HadGEM, which overestimates low flows, the models tend to underestimate values. Extreme low flows (Q99.9 and Q99.99) are only represented within the acceptable range by GFDL. The BC RCMs all underestimate low flows; four models are within the acceptable range of deviations for Q95, but only one model is within this range for Q99 (CanESM2-RCA4). Extreme low flow conditions (Q99.9 and Q99.99) are only represented adequately by EC-EARTH-RCA4; the other RCMs severely underestimate extreme low flows.
To summarize the evaluation of model performance based on FDCs, bias correction significantly improved the performance of simulated NED. However, with a few exceptions, neither bias correction method improved the performance of high and low flows. This is particularly true for extreme values, which are strongly exaggerated in most cases.

Figure 6. Relative deviations of FDCs from baseline discharge simulation at gauge El Diem using WATCH Forcing Data (WFD) in the reference period. Simulations based on uncorrected (UC) and bias-corrected (BC) Earth system model (ESM) input in the top row and regional climate model (RCM) input in the bottom row.

Temperature, precipitation, and evapotranspiration projections
Figures 7, 8, and 9 show precipitation, temperature, and actual evapotranspiration projections of the model ensemble (Sect. 4.5) for the 21st century for RCP 4.5 and RCP 8.5 as anomalies relative to the reference period in the UBN. They indicate the total range of change and the 5-year moving average (MA5) for both scenarios. The precipitation MA5 does not show a distinct trend over the century, but average annual precipitation is projected to be up to 100 mm (∼ 7 %) higher than in the reference period. The increase is only marginally higher in RCP 8.5 than in RCP 4.5.
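How anomalies and the MA5 are derived from an annual series can be sketched in a few lines. Whether the moving average in Figs. 7-9 is centred or trailing is not stated here, so the sketch simply returns one value per complete 5-year window; the function names are illustrative.

```python
import numpy as np

def anomalies(annual, years, ref_start, ref_end):
    """Annual values minus the mean over the reference period (inclusive bounds)."""
    annual = np.asarray(annual, dtype=float)
    years = np.asarray(years)
    in_ref = (years >= ref_start) & (years <= ref_end)
    return annual - annual[in_ref].mean()

def moving_average(x, window=5):
    """Moving average over complete windows only (output is window - 1 shorter)."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")
```

For a 130-year series, `moving_average(anomalies(p, years, 1970, 1999))` yields 126 values, each summarizing five consecutive years, which smooths inter-annual variability without hiding the total range of the ensemble.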
Figure S6 shows that at most three out of 15 UC climate models project decreasing average annual precipitation. The multi-model mean of the CMIP5 ESM ensemble projects an increase in annual precipitation of 5 % in 2030-2059 and 6 % in 2070-2099 under RCP 4.5 and of 8.4 % in 2030-2059 and 15.6 % in 2070-2099 under RCP 8.5. Figure S7 shows where the five ESMs used in this study are situated within the entire CMIP5 ensemble. It is noticeable that only three out of 26 ESMs show declining precipitation trends under RCP 8.5. Projected surface air temperatures show a clearly increasing trend over the 21st century in both RCPs. Compared to the reference period, the multi-model mean of the selected ensemble projects an increase of 1.7 K (1.5 to 1.9 K) in RCP 4.5 and 2.2 K (1.9 to 3.5 K) in RCP 8.5 in 2050. At the end of the century the increase in average temperature reaches 2.5 K (1.9 to 4.1 K) under RCP 4.5 and 4.9 K (3.0 to 6.5 K) under RCP 8.5. The multi-model mean of the CMIP5 ESM ensemble projects an increase in average annual temperature of 1.6 K in 2030-2059 and 2.3 K in 2070-2099 under RCP 4.5 and of 1.7 K in 2030-2059 and 3.9 K in 2070-2099 under RCP 8.5.
Although surface air temperature increases by up to 2.2 K until 2050 in both scenarios, actual evapotranspiration remains rather stable at the level of the reference period. Only in the second half of the 21st century do the projected values increase by up to 50 mm per annum. Hence, it can be concluded that actual evapotranspiration is already at its maximum and can only increase if water availability increases too, as is the case after 2050.

The UC IPSL model projects an increase of 95.4 % in average annual discharge. A visual assessment supports the previously made assumption that the IPSL model does not provide adequate climate simulations in the study area. This is true for both UC and BC climate simulations. Aich et al. (2014) applied the same five BC ESMs in four large African river basins and found that in the Niger basin (a climate zone comparable to that of the Blue Nile River) one of the five models also projects extreme and unexplainable changes although it performed adequately in the historical period. In the case of the Niger River basin, it was the MIROC model that behaved anomalously in the projection period, whereas IPSL stayed within the range of the other models. The HadGEM model is the only model where bias correction changed the sign of the discharge signal. The simulation with UC climate input projects a decrease in average annual discharges of 2.9 % and the BC simulation an increase of 2.2 %. The results of the NorESM1 model are interesting. The UC model simulates a bimodal rainfall and runoff system with a dry period during the rainy season in July to September. Although the model was forced by bias correction into a completely different system, by pushing the dry season into a rainy season, the projections do not seem anywhere near as disrupted as the IPSL simulation. Hence, the NorESM1 results do not support the assumption that strong bias correction necessarily results in unexpected behaviour in future periods. Looking at the change of average peak magnitudes between UC and BC ESM simulations in the reference and the future period, the change signals are of a similar order, except for simulations based on IPSL. They are also of the order of average peaks simulated with WFD input; compare with Fig. 3.

Impact of bias correction on discharge projections
Figure 11 shows that maximal discharge peaks simulated with RCM climate input are often much higher than average peaks simulated with WFD (∼ 6000 m³ s⁻¹). Whereas only two UC RCMs simulate higher peaks in the reference period (EC-EARTH-Hirham5 and EC-EARTH-RCA4), five BC RCMs simulate peaks higher than 7000 m³ s⁻¹. In the period 2030-2059 (RCP 8.5), nine out of ten BC RCM-driven and five UC RCM simulations project peaks higher than 7000 m³ s⁻¹. The projected changes in peak discharge magnitudes are significantly higher in BC than in UC simulations for 50 % of the models. This is not surprising because bias correction of RCMs already led to significant overestimation of high flows in the reference period, as discussed in Sect. 4.2.3. This behaviour is exaggerated in future periods.

Figure 10. Changes of average daily discharges at gauge El Diem based on uncorrected (UC) and bias-corrected (BC) Earth system model (ESM) input in the period (2030-2059) under RCP 8.5 relative to the models' reference period. R² and PBIAS values are computed to show the differences between the projection period and the reference period.

Hydrol. Earth Syst. Sci., 22, 2163-2185, 2018, www.hydrol-earth-syst-sci.net/22/2163/2018/

Selected model ensemble
Table 1 summarizes the performance criteria for all UC and BC simulations using R², PBIAS, deviations from FDC values, and the change rate. The seasonality criterion R² > 0.85 was achieved by all simulations except the one based on UC NorESM. Seven out of 30 simulations failed to meet the volumetric deviation criterion (PBIAS within ±30 %). Concerning the FDC criteria, 12 simulations passed the NED test, seven simulations the high flow criterion, and only one simulation the low flow criterion. The column "pre" (preselection) shows whether a model fulfilled the criteria in the first three columns. These models might be chosen for a qualitative impact assessment. However, four models that passed the preselection criteria were omitted from the selected model ensemble because they project very high changes in average annual discharges (column "Change"). Sometimes both the UC and BC simulations were judged to be suitable. In order not to put too much weight on the results of one model, only the better simulation (UC or BC) was selected for the final model ensemble and is denoted in the column "final". The latter column indicates that 10 out of 30 simulations passed all performance criteria and thus became members of the selected model ensemble. This ensemble consists of four BC ESMs, four BC RCMs, and two UC RCMs.
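The selection logic summarized for Table 1 can be written down as a short filter. Several details below are assumptions and not taken from the table: the three preselection criteria are taken to be R², PBIAS, and the NED test; the ±30 % limit on projected change follows the threshold stated for the selected ensemble; and the rule for choosing the "better" of a UC/BC pair (smaller absolute PBIAS) is purely hypothetical. The class and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Simulation:
    model: str      # climate model name
    kind: str       # "UC" or "BC"
    r2: float       # seasonality criterion
    pbias: float    # volumetric deviation in %
    ned_ok: bool    # passed the non-extreme discharge (NED) test
    change: float   # projected change in average annual discharge in %

def preselected(s):
    """Assumed preselection: R2 > 0.85, |PBIAS| <= 30 %, NED test passed."""
    return s.r2 > 0.85 and abs(s.pbias) <= 30.0 and s.ned_ok

def select_ensemble(sims, max_change=30.0):
    """Keep at most one simulation (UC or BC) per climate model."""
    best = {}
    for s in sims:
        if not preselected(s) or abs(s.change) > max_change:
            continue
        current = best.get(s.model)
        # Hypothetical tie-break: prefer the smaller absolute PBIAS.
        if current is None or abs(s.pbias) < abs(current.pbias):
            best[s.model] = s
    return list(best.values())
```

Keeping only one simulation per climate model is what prevents a single model from entering the final ensemble twice and thereby receiving double weight in the multi-model mean.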

Climate impacts on discharges
In this section, the similarities and differences of projected climate change impacts on Blue Nile discharges at gauge El Diem are discussed. The two UC and BC ESM and RCM ensembles and the selected model ensemble are considered (see Table 1, column "final"). In Figs. 12 and 13 and S8-S11, each model simulation is represented by a semi-transparent polygon, where bluish colours indicate an increase and reddish colours a decrease in monthly discharges. The more saturated the colour, the more models project the same rate of change. The figures show monthly changes relative to average annual discharges in the reference period. This method was chosen in order to avoid overemphasizing large relative changes in dry periods which are not significant compared to annual discharges.
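The effect of normalizing monthly changes by the average annual discharge, instead of by the discharge of the same month, can be illustrated with a small sketch. The function names are illustrative and the inputs are assumed to be 12 average monthly discharges for the reference and the future period.

```python
import numpy as np

def monthly_change_rel_annual(ref_monthly, fut_monthly):
    """Monthly change in % of the average annual discharge of the reference period."""
    ref = np.asarray(ref_monthly, dtype=float)
    fut = np.asarray(fut_monthly, dtype=float)
    return 100.0 * (fut - ref) / ref.mean()

def monthly_change_rel_month(ref_monthly, fut_monthly):
    """Monthly change in % of the reference discharge of the same month."""
    ref = np.asarray(ref_monthly, dtype=float)
    fut = np.asarray(fut_monthly, dtype=float)
    return 100.0 * (fut - ref) / ref
```

With a dry-season flow of 100 m³ s⁻¹ and an annual mean of 1500 m³ s⁻¹ (invented numbers), an increase of 50 m³ s⁻¹ appears as +50 % in the month-wise measure but as only about +3.3 % relative to the annual mean. A further property of the annual normalization, assuming months of equal length, is that the mean of the 12 monthly values equals the relative change in average annual discharge.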
Table 2 shows the total range of changes in average annual discharges projected by the multi-model means of UC and BC ESMs and RCMs and the selected model ensembles.In the near future (2030-2059) in both RCPs, the range of UC models is between 7.4 and 19 %, the range of BC models between 11.3 and 27.7 %, and the range of the selected ensemble between 5.8 and 11.3 %.In the far future (2070-2099) considering both RCPs, the range of UC models is between 7.5 and 21.6 %, the range of BC models between 20.3 and 56.7 %, and the range of the selected ensemble between 8.4 and 13.2 %.The following conclusions summarize the projected changes of average annual discharges more specifically.
-All ensembles in all RCPs and future periods have in common that they all project an increase of average annual discharges.An exception is the selected model ensemble of the UC ESMs under RCP 4.5 (2030-2059), which projects a decrease of −0.4 % (Fig. S8a).
-The multi-model means of both UC and BC RCM ensembles (all models) usually project a higher increase of average annual discharges than the ESM ensembles, except under RCP 8.5 (2070-2099); see Figs. S9d and S11d.
-The multi-model means of BC simulations (both RCPs and periods) always project higher increases in average annual discharges than the UC multi-model means.
-The magnitude of change signals projected by selected models in the respective ensemble is always lower than the magnitude of the whole ensemble. This is mainly caused by the fact that models projecting changes larger than ±30 % between the reference period and 2030-2059 under RCP 8.5 were omitted from the ensemble of selected models.
-A noticeable difference between the UC RCM and ESM ensembles is that projected average annual discharges in the far future are lower than in the near future for RCMs but higher for ESMs.
There are also general findings concerning changes in seasonality.
-There is a trend of decreasing discharges at the end of the dry season projected by all ensembles in both RCPs and periods.The period indicating a drying trend projected by the ESM ensemble tends to be longer and starts a bit earlier (June/July to August) than the trend projected by RCMs (only July).
-There is a trend of increasing discharges during the rainy season projected by all ensembles in both RCPs and periods.The period indicating higher discharges starts earlier in the RCM ensembles (August to November) than in the ESM ensembles (September to November).
-Both ensembles agree that there is almost no change projected in the dry period between December and May.

Discussion and conclusions
Are we using the right fuel to drive hydrological models? What are the likely impacts of climate change on future discharges in the UBN and is there a strong agreement of projected trends? How far does bias correction influence the results and can we trust models that require strong correction? These questions, posed in the introduction, are discussed in the following. The majority (≥ 80 %) of the 15 climate models used in this study agree that average annual discharges in the UBN are likely to increase in the future. The models project a trend towards decreasing discharges at the end of the dry period (June and July) and an increase during the rainy season (August to November). Due to the use of different climate model ensembles, downscaling approaches, study areas within the UBN, and periods of analysis, a direct comparison with other studies is difficult, but it clearly reveals that the selection of climate models predominantly influences the results and the conclusions drawn. Setegn et al. (2011) found, for instance, that the CMIP3 GCMs they used to investigate climate impacts on discharges in the Lake Tana catchment (Blue Nile headwaters) project decreasing trends, but they also state that "... it seems that, by chance, the nine GCMs used in this study are those that show a precipitation decrease ...". On the other hand, Dile et al. (2013) conclude that discharges may increase by up to 135 % in the same region. Taking the sometimes contradictory results of recent studies into account (Teklesadik et al., 2017; Dile et al., 2013; Mengistu and Sorteberg, 2012; McCartney and Menker Girma, 2012; Setegn et al., 2011; Conway and Schipper, 2011; Diro et al., 2011; Elshamy et al., 2009), one can conclude that climate impacts in the UBN are uncertain but that there is a bias towards a wetter future. The findings of this study, using the most recent global and regional climate models as well as precipitation projections of the entire CMIP5 ensemble, underline the latter statement.
Apart from the question of whether the future in the UBN will become generally wetter or drier, decisions on adapting land and water management to changing climatic conditions require not only qualitative information but also accurate quantitative information on seasonal changes. The value of uncorrected climate simulations for answering those questions is usually rather limited, because they lack spatio-temporal accuracy and do not reproduce statistically representative observed weather characteristics. Bias correction of climate simulations is an attempt to overcome at least some of these deficiencies.
The reference dataset used to bias-correct climate models and to calibrate and validate the hydrological model is another source of uncertainty.WFD were used in this study because bias correction on ESMs, provided by ISIMIP, was performed on the basis of this dataset.Moreover, WFD provide a sound basis as climate input, particularly in data-scarce regions, as was shown in various studies (Vetter et al., 2015;Aich et al., 2014;Liersch et al., 2013).The use of a different reference dataset would certainly require different calibration parameter settings and correction factors but would probably not impact the change signals.The most important issue in this connection is the consistency in using the same reference for calibration, validation, and bias correction.
As was shown in this study, monthly medians and average annual precipitation amounts of UC ESM and RCM simulations sometimes deviate strongly from the reference climate. Although bias correction improved the performance of average climate conditions, the range of monthly precipitation amounts increased critically in several models, producing some extreme outliers in both ensembles. This phenomenon was particularly observed in simulations where deviations of monthly medians between UC simulations and WFD were rather large in the reference period. Average daily precipitation and the number of rainy days were considerably improved by bias correction, but 13 out of 15 BC models overestimate daily precipitation maxima, many of them significantly. Hence, the bias correction methods applied to ESMs and RCMs in this study can be considered only partly successful. While they achieved a significant improvement in average daily, monthly, and annual precipitation characteristics, the result in many simulations was an increased variability of precipitation amounts and therefore under- and overestimation of extremes. This phenomenon is problematic for impact studies and the application of hydrological models, particularly if changes of extreme values are the subject of investigation. Large overestimation of precipitation on some days or in some months, for instance, which is balanced by dry months in the long term, can lead to large amounts of excess water that may be simulated almost entirely as surface runoff by the hydrological model. Therefore, it is reasonable to use hydrological performance indicators to evaluate the suitability of climate simulations, particularly for quantitative impact studies, and to create a subset of models for the impact assessment. Another way to deal with low performance in the simulation of extremes in impact studies is to analyse changes in return periods of extreme events (Hattermann et al., 2016).
Due to the fact that discharge simulations, based on climate simulations, cannot be compared to observed discharges on a real-time daily, monthly, or annual basis, the methods to evaluate discharge performance are limited.In this study, the annual cycle (daily time series averaged over the simulation period) was characterized by R 2 and PBIAS, where R 2 was a measure of seasonality and PBIAS a measure of volumetric deviations.Flow duration curves (FDCs) were used to characterize the distribution of average flow conditions, high and low flows, as well as their extremes, by using the whole time series of daily discharge simulations.Unsurprisingly, discharge simulations show similar deficiencies to precipitation simulations.Using bias-corrected climate simulations improved the performance of non-extreme discharges (NED) significantly but, with few exceptions, the performance of high and low flows did not improve; in fact, it worsened in most of the simulations.Many BC discharge simulations tend to exaggerate high (overestimation) and low flows (underestimation).Comparing peak discharges using UC and BC climate input, for instance, showed a tremendous increase in some BC simulations, although average monthly precipitation patterns of BC models achieved a much better fit than their UC counterparts.Moreover, the multi-model means of BC simulations (both RCPs and periods) always project higher increases in average annual discharges than the UC multi-model means.However, a hydrological impact study in the Danube River basin showed in turn that relative changes in average monthly discharges projected using UC and BC climate models are overall comparable (Stagl and Hattermann, 2015).
Knowing these limitations, one should carefully consider the model's suitability and the purpose it is being used for. An impact study focusing on relative changes of future water availability may have lower requirements in terms of model accuracy than a study investigating future extremes, such as floods and droughts, or a study addressing land and water management issues including irrigation and/or reservoir operations. Whenever complex water management is involved, bias correction is often unavoidable because the simulation of reservoir and irrigation operations requires rather accurate hydrological input. However, to trust climate input simply because it was bias-corrected would be naive. Therefore, the question of model selection is valid. Why should one use or trust models to assess changes in seasonal patterns, for instance, that have not represented those patterns in the past, or use a model to investigate future flood risk that completely fails to represent rainfall extremes? Again, bias correction may help to overcome some quality issues, but it was also found in this study that improving climate simulations in the reference period does not guarantee higher quality or reliability in simulating future periods. On the contrary, the greater the necessity to correct a particular model, the higher the risk that BC simulations will show unexpected behaviour in future periods, although there are exceptions. Examples confirming this assumption are the following models: IPSL, CanESM2-RCA4, CNRM-CM5-RCA4, and MIROC-RCA4. The NorESM1 model is an exception, because the BC simulation does not show extreme changes in future periods although strong bias correction was necessary in some months to force the model from a bimodal into a unimodal rainfall regime. It should be emphasized that the analysis of climate model performance in this study is only valid for the region of the UBN. It does not imply that a model which performed poorly in this study area will generally perform poorly in other regions, too.
The authors of this study conclude that a purpose-driven selection of a climate model subset is a reasonable approach, particularly in a regional context. To identify models that perform to a good level, the selection process should include an analysis of climate inputs, seasonal discharge patterns, volumetric deviations, and daily dynamics (FDCs), and an assessment of the magnitude of projected future changes. It is also worth mentioning that the thresholds defined to evaluate model performance have a subjective component and are based on statistical parameters, graphical data interpretation, and modelling expertise. If the thresholds had been set more strictly in this study, almost no climate model would have passed the evaluation process. The rather weak thresholds were a compromise and reveal that the performance of many climate models is still far from adequate for applied quantitative impact studies. This statement includes bias-corrected simulations and implies that the ability of bias correction can, depending on the approach, be rather limited and thus does not necessarily improve reliability per se. In another river basin with different characteristics, e.g. with a nival regime or a bimodal rainfall regime, the performance criteria and their thresholds might have to be defined differently. Hence, the model selection method can be applied to other river basins, but it is always necessary to consider region-specific characteristics that may require the introduction of new criteria adapted to the situation at hand. However, model selection for regional impact studies is only a reasonable, justifiable, and recommended approach if the uncertainties of the selected ensemble are communicated within the context of the whole model ensemble.
This study demonstrated that neither the trend-preserving method applied to the five ESMs nor the harmonic-based method used to bias-correct the 10 RCMs was able to generate fully satisfactory climate inputs for a regional hydrological impact study with high demands in terms of quantitative accuracy. Hence, further research is required to improve regional climate simulations and/or to investigate alternative correction methods or approaches that make climate simulations meaningful for application-oriented regional studies. Currently, the most promising solutions seem to be sophisticated delta-change methods, as suggested by Anandhi et al. (2011), Bosshard et al. (2011), and Chiew et al. (2009).
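For orientation, the basic idea behind delta-change approaches can be sketched in its simplest multiplicative form for precipitation. This is deliberately not one of the sophisticated variants proposed in the studies cited above, and it was not applied in this study; the function and parameter names are hypothetical.

```python
import numpy as np

def delta_change(obs, obs_month, sim_ref_monthly, sim_fut_monthly):
    """Scale an observed daily series by model-derived monthly change factors.

    obs             : observed (reference) daily precipitation in mm
    obs_month       : calendar month (1-12) of each observed day
    sim_ref_monthly : 12 long-term monthly means of the model, reference period
    sim_fut_monthly : 12 long-term monthly means of the model, future period
    """
    factors = (np.asarray(sim_fut_monthly, dtype=float)
               / np.asarray(sim_ref_monthly, dtype=float))
    # Each observed day is multiplied by the change factor of its month.
    return np.asarray(obs, dtype=float) * factors[np.asarray(obs_month) - 1]
```

In this form the temporal structure of the observations, including the sequence of wet and dry days, is preserved, and only the model's change signal is transferred. The known drawback of this simplest variant is that changes in variability and in the frequency of rainy days are not represented.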
Data availability. All input data used to set up, calibrate, and validate the hydrological model and to bias-correct the global and regional climate simulations are freely available and the corresponding sources are provided in Sect. 3.1. All discharge simulations produced in this study have been made available at https://doi.org/10.4121/uuid:05b9f40f-583d-479b-a79e-f961f72436db (Liersch, 2018). The bias-corrected CORDEX simulations are available here: https://doi.org/10.5880/PIK.2018.009 (Liersch et al., 2018).
Competing interests. The authors declare that they have no conflict of interest.
a. What are the likely impacts of climate change on future discharges in the UBN?
b. Is there an agreement on the signal of climate change impacts in the 21st century using different climate model ensembles?
c. To what extent can bias correction alter the magnitudes of change signals in hydrological simulations in the study area?

Figure 1. Map of the Blue Nile River basin. The Upper Blue Nile (UBN) catchment (172 000 km²) is enclosed by the red line. The three gauges used for model calibration and validation are represented by white circles.
of daily and monthly precipitation
Monthly medians and average annual precipitation sums of UC ESM and RCM simulations sometimes deviate strongly from WFD (see Figs. S2, S3, and S4 in the Supplement). The underlying data for the box plots are monthly precipitation sums of the 30-year reference period averaged over the UBN catchment area. Bias correction improved the performance of both indicators considerably in both model ensembles. Deviations of average annual precipitation of all BC ESMs are lower than ±2 %. The results for the BC RCM ensemble are more diverse. Five RCMs deviate ≤ ±2 %, three RCMs ≤ ±5 %, and two RCMs ≤ ±7 %.

Figure 2. Simulated discharges for calibration (a, b) and validation (c, d) periods at gauge El Diem (Sudan border) using WATCH Forcing Data (WFD). The annual cycle is shown in the top row and average monthly discharges in the bottom row.

Figure 5. Taylor diagram of average daily discharges at gauge El Diem in the reference period. It shows R², the standard deviation (SD) normalized by SD_ref, and the normalized SD of discrepancies (SD_D) for Earth system model (ESM) input in the top row and regional climate model (RCM) input in the bottom row.
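The three statistics summarized in a Taylor diagram can be sketched as follows. The code assumes the standard Taylor-diagram definitions (population standard deviations, Pearson correlation, and the standard deviation of the mean-removed differences); whether the paper uses exactly these conventions is an assumption, and the function name is hypothetical.

```python
import math


def taylor_statistics(sim, ref):
    """Correlation, normalized SD, and normalized SD of discrepancies
    between a simulated and a reference series of equal length."""
    n = len(ref)
    mean_s = sum(sim) / n
    mean_r = sum(ref) / n
    ds = [s - mean_s for s in sim]   # simulated anomalies
    dr = [r - mean_r for r in ref]   # reference anomalies
    sd_s = math.sqrt(sum(d * d for d in ds) / n)
    sd_r = math.sqrt(sum(d * d for d in dr) / n)
    r = sum(a * b for a, b in zip(ds, dr)) / (n * sd_s * sd_r)
    # SD of discrepancies: centered root-mean-square difference
    sd_d = math.sqrt(sum((a - b) ** 2 for a, b in zip(ds, dr)) / n)
    return {
        "r": r,
        "r2": r * r,
        "sd_norm": sd_s / sd_r,
        "sd_d_norm": sd_d / sd_r,
    }
```

Under these definitions the three quantities are linked by sd_d_norm² = sd_norm² + 1 − 2 · sd_norm · r, which is what allows them to be shown in a single diagram.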

Figure 7. Anomalies of annual precipitation amounts relative to the reference period. Range of selected model ensemble.

Figure 8. Anomalies of average annual mean air temperature relative to the reference period. Range of selected model ensemble.

Figures 10 and 11 show projected discharge changes of each single model under RCP 8.5 in the period 2030-2059. The changes are relative to the models' reference period. The figures allow the changes between the reference and the future period of UC and BC models to be investigated, as well as the differences of projected changes between UC and BC simulations. The indicators R² and PBIAS are not used to measure the performance, but they indicate the magnitude of change between the reference and the projection period. The IPSL model shows the largest deviations between the future and the reference period (Fig. 10) for both UC

Figure 11. Changes of average daily discharges at gauge El Diem based on uncorrected (UC) and bias-corrected (BC) regional climate model (RCM) input in the period (2030-2059) under RCP 8.5 relative to the models' reference period. R² and PBIAS values are computed to show the differences between the projection period and the reference period.
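Used as change indicators rather than performance measures, PBIAS and R² can be computed between the average daily discharges of the projection period and those of the reference period. The sketch below assumes one common PBIAS convention (positive values mean higher discharge in the projection period) and takes R² as the squared Pearson correlation; both conventions and the function names are assumptions, not taken from the paper.

```python
import math


def pbias(future, reference):
    """Percent bias of the projection-period series relative to the
    reference-period series (positive = more discharge in the future)."""
    return 100.0 * (sum(future) - sum(reference)) / sum(reference)


def r_squared(x, y):
    """Squared Pearson correlation between two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov * cov / (var_x * var_y)
```

A PBIAS near zero combined with a low R² would point to a change in the shape or timing of the annual cycle rather than in its volume.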

Figure 13. Monthly discharge changes of the selected model ensemble (10 models) relative to average annual discharge in the reference period at gauge El Diem.
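One reading of "monthly changes relative to average annual discharge in the reference period" is that each monthly difference is divided by the mean of the twelve reference-period monthly discharges, rather than by the reference value of the same month. The sketch below follows that interpretation, which is an assumption, as is the function name.

```python
def monthly_change_pct(future_monthly, reference_monthly):
    """Monthly discharge change in %, normalized by the mean of the
    reference-period monthly discharges (12 values per series)."""
    ref_annual_mean = sum(reference_monthly) / len(reference_monthly)
    return [
        100.0 * (fut - ref) / ref_annual_mean
        for fut, ref in zip(future_monthly, reference_monthly)
    ]
```

Normalizing by the annual mean keeps low-flow months from producing very large percentages, so the changes of different months stay comparable in volume terms.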

Figure 12. Monthly discharge changes of uncorrected (UC) and bias-corrected (BC) Earth system model (ESM) and regional climate model (RCM) simulations in % under RCP 8.5. Monthly changes are relative to average annual discharge in the reference period at gauge El Diem.

Table 2. Projected changes in average annual discharges relative to 1970-1999 in %.