Empirical streamflow simulation for water resource management in data-scarce seasonal watersheds

Introduction Conclusions References


Introduction
Hydrologists and water managers have made use of observed relationships between rainfall and runoff to predict streamflow ever since the creation of the rational method in the 19th century (Beven, 2011).However, the development of increasingly so-Figures

Back Close
Full phisticated machine learning techniques, combined with rapid increases in computational ability, has prompted extensive research into advanced methods for data-driven streamflow prediction in the past decade.Artificial neural networks (ANNs), regression trees, and genetic algorithms have been shown to be powerful tools for predictive modeling and exploratory data analysis, particularly in systems that exhibit complex, non-linear behavior (Solomatine and Ostfield, 2008;Abrahard and See, 2007).While distributed physical models that accurately represent hydrologic processes can still be considered the gold standard for rainfall runoff modeling, empirical models can be a useful tool in contexts where there is limited data on physical watershed processes but long time-series of precipitation and streamflow (Iorgulescu and Beven, 2004).
While many criticize these approaches as "black boxes" with no relationship to underlying physical processes (See et al., 2007), a number of studies have demonstrated how empirical approaches can be used to gain insights about physical system function (e.g., Han et al., 2007;Galelli and Castelletti, 2013).Additionally, improvements in interpretation and visualization methods can make complex models more easily interpretable (Sudheer and Jain, 2004;Jain et al., 2004).Finally, data-driven models can be useful in identifying situations where observed data disagree with what would be predicted based on conceptual models, and thus identify assumptions regarding runoff generation processes that may be incorrect (Beven, 2011).While there have been some applications of alternative machine learning methods, such as support vector machines (Asefa et al., 2006;Lin et al., 2006) and regressiontree based approaches (Iorgulescu and Beven, 2004;Galelli and Castelletti, 2013) for streamflow simulation, the vast majority of research has focused on artificial neural networks (Solomatine and Ostfield, 2008).While they have demonstrated impressive predictive accuracy in a number of different contexts, excessive parameterization of ANNs can result in overfit models that are not generalizable to unseen data (Iorgulescu and Beven, 2004;Gaume and Gosset, 2003).This can lead to complex models that only result in modest improvements (or no improvements at all) over much simpler approaches (Gaume and Gosset, 2003;Han et al., 2007).Even outside of a hydrology Figures context, it has been argued that ANNs are better suited for problems aimed at prediction without any need for model interpretation, rather than those where understanding the process generating predictions and the role of input variables is important (Hastie et al., 2009).Given the importance that this interpretation plays in understanding the contexts in which a hydrologic model is appropriate and reliable, the strong opinions surrounding the use of ANNs for water resource management are perhaps not surprising.While a number of comparison studies exist that apply multiple empirical models to a given problem, finding generalizable insights from these studies is hindered because of the limited number of models and datasets evaluated.Perhaps the most comprehensive comparison to date is that of Elshorbagy et al. (2010a, b), who compared six methods for data-driven modeling of daily discharge in the Ourthe River in Belgium.This work found that linear models were able to perform comparably to much more complex methods when the data content of the models were limited, or when system input-output behavior was close to linear.However, other studies have demonstrated the value of using more complex approaches when modeling more complex rainfallrunoff behavior (e.g., Abrahart and See, 2007;Asefa et al., 2006).The differing results obtained across these studies indicate that no single method is likely to be suitable for all basins, timescales, or applications.However, it is important to recognize that predictive accuracy alone is not necessarily sufficient justification for applying a model to a given problem.Models should not only be accurate, but also be fit-for-purpose (Beven, 2011;Van Griensven et al., 2012).For instance, accurate representation of low return period flows is more important in a flood forecasting model than one aimed at predicting average amounts of water available for withdrawal and human consumption.Similarly, the ability to provide insights into physical watershed function may be more important in basins where land-use change could alter the hydrologic regime, compared to a basin that is heavily urbanized and expected to remain so.This is particularly true for data-driven models; as pointed out by Solomatine and Ostfield (2008), there is a need for additional research on making models more understandable and Introduction

Conclusions References
Tables Figures

Back Close
Full In this work, we compare six methods for empirical streamflow prediction (linear models, generalized additive models, multivariate adaptive regression splines, random forests, M5 model trees and ANNs) in their ability to predict monthly streamflow in five rivers in the Lake Tana basin in Ethiopia.This study region was selected to provide a counterpart to previous comparative studies that have largely focused on rivers in temperate regions.Furthermore, physically-based hydrological process models are a challenge in these basins due to data limitations -soil and vegetation parameters are poorly characterized and high frequency, spatially-distributed precipitation estimates are highly uncertain -and complex hydrodynamic processes, including lake backwater effects, that are neglected by most watershed models.There are, however, relatively long time series of streamflow available, and estimates of historical precipitation and temperature are available at a monthly timescale.These data, combined with information on relevant landscape change, can be leveraged to create reasonably accurate empirical models.Models are compared not only in terms of their predictive accuracy, but also in terms of model error structure and the implications that this structure may have for water resource applications.Additionally, we evaluate the methods by which model structure and predictor variable influence can be evaluated to gain insights into physical system function for each model type.Finally, we assess the suitability of using different model types for climate change impact assessment by comparing model uncertainty in projections made for increasingly extreme climate conditions.The overall objective of this research is not to identify a single "best" model, but rather to highlight some of the strengths and limitations of different approaches, as well as demonstrate important issues that should be kept in mind for model comparisons in the future.Figures

Back Close
Full

Study area
Lake Tana is located at an elevation of approximately 1800 m in the highlands of northwest Ethiopia (Fig. 1).The catchment draining to the lake encompasses approximately 12 000 km 2 , and the four main tributaries providing water to the lake are the Gilgel Abbay (including its tributary, the Koga River), Ribb, Gumara, and Megech Rivers.Collectively, these rivers account for 93 % of the inflow to the lake (Alemayehu et al., 2010).Ninety percent of rainfall in the basin occurs during the wet season from May until October, and there is significant interannual variability in precipitation with annual rainfall levels ranging from below 1000 mm to over 1800 mm (Achenef et al., 2013).Population growth and expansion of agricultural and pastoral land use in the region has resulted in substantial deforestation and land degradation, with agricultural, pastoral and settled land cover comprising over 70 % of the basin's surface area (Rientjes et al., 2011;Garede and Minale, 2014;Gebrehiwot et al., 2010).There is some evidence that this has impacted the hydrology of the rivers draining into the lake (Gebrehiwot et al., 2010).
A summary of basin characteristics for the evaluation period of 1960-2004 is presented in Table 1.
Approximately 2.6 million people live in the basin, and are largely settled in rural areas and reliant on rainfed subsistence agriculture.This makes the region quite vulnerable to climate variability and change, and a number of water resources infrastructure projects are planned to better manage this vulnerability and support economic development (Alemayehu, 2010).To better understand the potential implications of this development, extensive effort has been put towards developing rainfall-runoff models for the Lake Tana basin, as well as other areas of the Ethiopian highlands with similar characteristics (van Griensven et al., 2012).Many of these studies rely on Soil and Water Assessment Tool (SWAT) models, although there are some that use water balance approaches ( Van Griensven et al., 2012).While these models have in some cases demonstrated reasonably high accuracy, previous evaluations were largely based on 11088 Figures

Back Close
Full Nash-Sutcliffe Efficiency (NSE; Nash and Sutcliffe, 1970) which can be a flawed performance metric in highly seasonal watersheds (Schaefli and Gupta, 2007;Legates and McCabe, 1999).More importantly, the limited data available for physical parameterization of these models required a heavy reliance on model calibration, which sometimes resulted in parameterization schemes that are inconsistent with physical understanding of the region's hydrology (Steenhuis et al., 2009;van Griensven et al., 2012).Furthermore, a number of studies relied on empirical relationships such as curve numbers and the Hargreaves equation that were developed for temperate regions (e.g., Mekonnen et al., 2009;Setegne et al., 2009).While these limitations are likely to introduce considerable uncertainty into model projections, particularly in situations where climatic or environmental conditions differ from those experienced in the calibration period, few studies include any sort of uncertainty analysis in model predictions.Empirical models could provide a useful complement to physical models developed for the region by providing insights into physical system function and allowing for more comprehensive uncertainty analysis.

Model development
Models were developed using monthly streamflow, climate, and land cover data for the period from 1961 to 2004.In each of the five major rivers in the basin, we developed empirical models that estimated monthly streamflow as a function of climate conditions and agricultural land cover in each basin.Monthly streamflow data was taken from historic stream gauge records for each basin, as reported in feasibility studies developed for proposed irrigation projects (Alemayehu, 2010).Historic data for monthly daily average temperature, monthly total precipitation, and monthly wet days in each river basin were derived from the University of East Anglia Climate Research Unit (CRU) TS3.10 gridded meterological fields (Harris et al., 2014), which are based on meteorological station observations.Historic estimates of rainfall intensity were also calculated by dividing monthly total precipitation by CRU TS3.10 records of the number of wet days in that month.However, this data was found to be highly correlated with monthly precipi-Figures

Back Close
Full tation and did not result in significant improvements to the predictive accuracy of tested models, and was thus not included in the final model formulations.Finally, to account for historic increases in agricultural and pastoral land cover that have occurred in the basin, the percentage of land cover used for any crop or grazing was estimated from historic land cover analyses described by Rientjes et al. (2011), Gebrehiwot et al. (2010), and Garede and Minale (2014).These studies used historic aerial photos and satellite images to estimate land cover changes in the Ribb, Gilgel Abbay, and Koga basins from the periods of 1957 to 2011.The percentage of agricultural land cover was interpolated for years when data wasn't available, and the value of agricultural land cover in the two basins without data was assumed to be equal to average agricultural land cover in the basins with data.While this approach is prone to errors that could stem from differing rates of land use change through time and between basins, it does provide a mechanism for capturing the long-term trend of expanding agricultural land cover that has been observed throughout the Ethiopian highlands when detailed land-cover data is unavailable.Including this data improved out-of-sample predictive accuracy of the models, further suggesting that it was a valuable addition.Two general formulations for the empirical models were evaluated.The first (referred to below as the standard model formulation) was where Q b,t is the monthly streamflow in river b at time period t, P b,t and T b,t are the monthly total precipitation and average temperature in river basin b at time period t, AgLC b,t is the total percentage of agricultural land cover in basin b at time t, and ε b,t is the model error.The subscripts t − 1 and t − 2 indicate lagged measurements from one and two months prior, and were included to roughly account for storage times longer than one month that could impact streamflow in each river.The function f represents a general function that differed between the specific models assessed and is discussed in more detail below.The logarithm of monthly streamflow was used as a response variable to keep model predictions positive and make the response variable 11090 Introduction

Conclusions References
Tables Figures

Back Close
Full better match a normal distribution, a requirement for the use of Gaussian linear models and generalized additive models.
In the second formulation, streamflow and climate anomalies were used as the response and predictor variables to better account for the highly seasonal nature of streamflow and precipitation in the region.Streamflow anomalies were calculated for each observation by subtracting the long-term average streamflow for that month (m) from the observed value and dividing this number by the long-term standard deviation of that month's streamflow as in Eq. ( 2).This procedure was repeated for precipitation and temperature, and these values were then used to fit models of the form described in Eq. ( 3).
Seven different types of models were compared using each formulation in each basin: 1.A Gaussian linear regression model (GLM) using the basic stats package in the R statistical computing software (R Development Core Team, 2014).
2. Gaussian generalized additive model (GAM): GAMs are a semi-parametric regression approach where the response variable is estimated as the sum of smoothing functions applied over predictor variables.These functions allow the model to capture non-linear relationships between the predictor and response variables without a priori assumptions about the form (e.g., quadratic, logarithmic) of these functions, and are fit using penalized likelihood maximization to prevent model overfitting (Hastie and Tibshirani, 1990).GAMs were fit using the mgcv package in R (Wood, 2011).
3. Multivariate adaptive regression splines (MARS): MARS are a non-parametric regression approach where the response variable is estimated as the sum of basis Introduction

Conclusions References
Tables Figures

Back Close
Full functions fit to recursively partitioned segments of the data (Friedman, 1991).MARS models were fit using the earth package in R (Milborrow, 2015).
4. Artificial neural network (ANN): ANNs are a non-parametric regression approach represented by a network of nodes and links that connects predictor variables to the response variable.Each link in the network represents a function that maps the input nodes into the output node (Ripley, 1996).ANN models were fit using the nnet package in R (Venables and Ripley, 2013).
5. Random forest (RF): random forests are a rule-based, non-parametric regression approach where the model prediction is created by averaging the predicted value from multiple regression trees which are trained on separate bootstrapped resamples of the data.Each tree is fit using a small, randomly selected subset of predictor variables, resulting in reduced correlation between trees (Breiman, 2001).Random forest models were fit using the randomForest package in R (Liaw and Wiener, 2002).
6. M5 model: M5 models are a rule-based, non-parametric regression approach that fits a linear regression model to each terminal node of a regression tree (Quinlan, 1992).M5 models were fit using the Cubist package in R (Kuhn et al., 2014).
7. Climatology model: a climatology model that simply predicted each month's streamflow as equivalent to the long-term average streamflow for that month was included for comparison purposes.

Model evaluation
When using non-parametric regression approaches, it is important to avoid overfitting a model to a given dataset because this can result in large errors in out-of-sample predictions (Hastie et al., 2009).To avoid model overfit, the caret package in R (Kuhn, 2015) was used to determine model parameters for the MARS, ANN, RF and M5 models.This package uses resampling to evaluate the effect that model parameters have Introduction

Conclusions References
Tables Figures

Back Close
Full on the model's predictive performance and chooses the set of parameters that minimizes out-of-sample error (Kuhn, 2015).In this evaluation, 25 bootstrap resamples of the training dataset were generated for each parameter value to be assessed.A model was fit using each bootstrap sample and used to predict the remaining observations, and the parameter values that minimized average RMSE across all resamples.Details on the specific parameters evaluated for each model are presented in Table 1.While the development of more complex structures are possible for some models, this process can result in over-parameterization and poor model performance (Gaume and Gosset, 2003;Han et al., 2007).Additionally, the use of a standardized parameterization procedure allows for a more even comparison between different model types.
The predictive ability of each model was assessed using 50 random holdout crossvalidation samples.In each sample, a random selection of years were selected, and observations from these years were removed ("held-out") from the dataset.The size of the held-out sample ranged from 1 to 9 years.Each model was then fit to the remaining portion of the data, using the caret package described above to determine model parameters for the MARS, ANN, RF and M5 models.These models were then used to predict streamflow for the held-out portion of the data, and both the mean absolute error (MAE) and NSE were calculated after transforming model predictions after back to the original streamflow units.While NSE values are acknowledged to be a flawed performance metric in highly seasonal watersheds (Schaefli and Gupta, 2007;Legates and McCabe, 1999), this metric was included to provide a rough comparison of how empirical model performance compared to the performance of physical models developed for the region.Mean MAE and NSE were calculated for each model across the 50 cross-validation samples and used to choose the model with the highest predictive accuracy in each basin.This cross-validation procedure provides a mechanism for evaluating how well a model will generalize to an unseen set of data while avoiding some of the problems that can arise from the use of a single calibration and validation dataset (Elshorbagy et al., 2010a;Han et al., 2007).Introduction

Conclusions References
Tables Figures

Back Close
Full As a rough point of comparison for the statistical models developed in this research, we also evaluated discharge estimates derived from a process-based hydrological model.The model used in this application is the Noah Land Surface Model version 3.2 (Noah LSM; Ek et al., 2003;Chen et al., 1996).Noah LSM was implemented for offline simulations of the Lake Tana basin at a gridded spatial resolution of 5 km for the period 1979-2010 using a time step of 30 min.Meteorological forcing was drawn from the Princeton 50 year reanalysis dataset (Sheffield et al., 2006), downscaled to account for Ethiopia's steep terrain using MicroMet elevation correction equations (Liston and Elder, 2006).The Princeton reanalysis was selected because it provides relatively high resolution meteorological fields, including all variables required to run a water and energy balance LSM like Noah, for the period 1948-present.While higher resolution and possibly higher quality datasets are available for recent years, this longer dataset was utilized to compare the process-based model to statistical models developed for a long historical period.Soil parameters for the Noah simulation were drawn from the FAO global soil database, land use was defined according to the United States Geological Survey (USGS) global 1 km land cover product, and vegetation fraction was derived from MODerate Imaging Spectroradiometer (MODIS) imagery.Land cover was treated as a static parameter over the full length of the simulation, as spatially complete estimates of historical land use were not available at the required resolution and specificity.
The highest performing model in each basin based on MAE was retained for more detailed evaluation of model error structure, covariate influence, and uncertainty in climate change sensitivity analysis.To generate a complete time-series of out-of-sample model predictions for error analysis, the holdout cross validation procedure was repeated for the highest performing standard-formulation and anomaly-formulation models for each basin, but this time holding out a single year of observations in each iteration.The predictions from this cross validation were used to evaluate the how model error structure might impact model predictions used for water resource applications.
The influence of different predictor variables on model predictions was also assessed for the highest performing model in each basin after being fit to the complete dataset.Introduction

Conclusions References
Tables Figures

Back Close
Full Each predictor variable was assessed using metrics for covariate importance and influence that are unique to that model type, demonstrating how models could be used to gain physical insights about data-scarce regions and the mechanisms for generating these insights for each type of model.Partial dependence plots (Hastie et al., 2009) were also generated for each covariate for the highest performing model in each basin to provide insights about how covariate influence compared across different basins and model types.
Finally, two evaluations were conducted to assess uncertainty in model projections of streamflow under increasingly extreme climate conditions to better understand the implications of using different model formulations for climate change impact studies.
Model projections of streamflow in different climate conditions are likely to be accompanied by considerable uncertainty, particularly when climate conditions exceed those experienced historically.To assess this uncertainty, the best performing model in each basin was used to generate streamflow predictions for (1) changes in temperature from 0 to 5 • C, (2) changes in precipitation from −30 to +30 %, (3) an increase in temperature to 5 • C combined with a decrease in precipitation to −30 %, and (4) an increase in temperature to 5 • C combined with an increase in precipitation to +30 %.For each of the four assessments, the models generated predictions for the 45 year historic climate record adjusted for a given degree of climate change using the delta-change method (Gleick, 1986), while holding agricultural land cover constant at 60 %.Model predictions for the altered climate record were then used to calculate the average annual streamflow in each river.These should not be interpreted as a prediction of actual climate change impacts, but rather a measurement of the sensitivity of streamflow in the basin to different climate conditions.This process was repeated 100 times for models fit on random bootstrap resamples of the historic dataset to generate uncertainty bounds surrounding model predictions and evaluate how the uncertainty in these predictions increased as climate conditions became more extreme.Introduction

Conclusions References
Tables Figures

Back Close
Full

Model accuracy and error structure
Table 2 shows the out-of-sample cross validation errors for each model assessed in each basin.The random forest model had the lowest mean absolute error for the standard-formulation model in four of the five basins, with the M5 model performing best for the Koga basin.These models outperformed the Noah LSM simulations in all basins assessed.The Noah LSM errors are for a single period of analysis and thus don't present an exact corollary to the cross validation performed for the empirical models.Nevertheless, the significant increases in errors associated with the Noah LSM model demonstrates the difficulty associated with the use of process-based models in the region, particularly when relying on global datasets that may be unreliable at the spatial and temporal resolutions required for physical modeling.Physical models developed for monthly streamflow prediction in other basins within the Ethiopian highlands have reported NSE values ranging from 0.53 to 0.92 (van Griensven et al., 2012), compared to values ranging from 0.71 to 0.87 for the random forest models developed here.If this measure alone was used for model evaluation, these empirical models would generally be classified as having good performance based on the guidelines suggested by Moraisi et al. (2007).However, the climatology model outperforms the best standard formulation models in all basins except Megech, indicating that in the majority of basins the errors from the fitted empirical models are higher than those that result from simply using the long-term monthly average for each month's prediction.This demonstrates how high NSE values can be quite easy to obtain in seasonal basins.
Evaluation of anomaly model errors indicates that the models using this formulation achieve better predictive accuracy than those using the standard formulation, and are able to outperform the climatology model based on both NSE and MAE in all basins.However, the highest performing models in each basin varies more when the anomaly formulation is used, with the GLM, GAM, random forest, and M5 models all minimizing 11096 Introduction

Conclusions References
Tables Figures

Back Close
Full  2, and demonstrate that a positive autocorrelation exists at the 12 month time lag.For brevity, only plots for two rivers are shown, although this autocorrelation existed in the standard-formulation models for all basins except Megech (Table 3).This autocorrelation occurs because the standard-formulation models consistently underestimate wet-season streamflow while overestimating dry-season flows, as is apparent in hydrographs of observed and predicted streamflow (Fig. 3).Because wet-season flows contribute such a large portion of the total annual flow volume, this results in regular underestimation of aggregate values such as mean annual flow (Table 3).This autocorrelation is reduced in the anomaly-formulation models, meaning that they are better able to capture the peak flow volumes experienced in the wet season and do not underestimate mean annual flow to the same degree that the standard formulation models do.

Model structure and covariate influence
Evaluating the relationship between predictor covariates and streamflow response can lend insight into the physical processes underlying runoff generation in each basin.There are two components of this relationship that can be evaluated: how much each covariate contributes to model accuracy (covariate importance), and the direction and nature of the relationship between covariate values and model response (covariate influence).In many machine-learning models, complete description of the all of the mathematical relationships within the model (for instance, through description of each tree comprising a random forest model) is infeasible, requiring the use of other mechanisms for understanding covariate importance and influence.However, because each 11097 Figures

Back Close
Full model type is structured in a different way, these mechanisms differ.This section first describes the mechanisms available for obtaining insights about covariate influence in each of the highest performing models.To provide a mechanism for comparing results across different basins, each basin model is then assessed using the general approach of partial dependence plots.
In the Gilgel Abbay and Koga basins, the highest performing model was a simple linear regression model.These models can be evaluated by reviewing model coefficients and associated p values, as shown in Table 4.In a standard linear regression, model coefficients can be interpreted as the mean change in the response variable that results from a unit change in that covariate when all others are held constant.These coefficients are for streamflow anomalies rather than raw values, making their immediate interpretation less intuitive.For instance, in the Gilgel Abbay model an increase of one standard deviation in precipitation results in an increase of 0.22 standard deviations in flow.The associated p value for each coefficient evaluates a null hypothesis that the true coefficient value is equal to zero given the other covariates in the model, and thus has no influence on the response variable.
Evaluating model structure based on regression coefficients is appealing due to their simplicity and familiarity.However, it is important to keep in mind that the above interpretations rely on specific assumptions regarding model error distributions.Examination of fitted model residuals from both basins indicate that errors are autocorrelated in the Koga basin and not normally distributed due to the presence of outliers in both basins.Non-normality and autocorrelation both impact the t statistics and f statistics used to test for the significance of model coefficients, and thus the p values for these models are likely biased (Montgomery et al., 2012).
Interpretation of variable influence in GAMs is based on the estimated degrees of freedom (EDF) a covariate's smoothing function s(X i ) uses within a model (Hastie and Tibushini, 1986).An EDF value of one or below indicates a linear function relating the response variable to that covariate, while values greater than one represent a nonlinear smoothing function.An EDF value of zero indicates that the covariate smoothing Introduction

Conclusions References
Tables Figures

Back Close
Full function is penalized to zero (meaning it has no influence on model predictions).In the model for the Megech River, the terms for lagged temperature at one and two months, as well as precipitation lagged at two months were all smoothed to zero.Of the remaining covariates, lagged precipitation has a linear impacts on model response, while precipitation, temperature and land cover have non-linear impacts.Smoothing functions can be plotted to gain more insight about these relationships (Fig. 4).The functions for precipitation anomaly, lagged (one month) precipitation anomaly, and agricultural land cover show a positive relationships with streamflow, while the function for temperature anomaly predicts low streamflow at both high and low anomalies.p values test the null hypothesis that a covariate's smoothing function is equal to zero, but rest on the assumption that model residuals are homoscedastic and independent (Wood, 2012).Similar to the linear models, residuals in the Megech GAM model appear to be both autocorrelated and heteroscedastic, meaning that a formal statistical interpretation of this value may be inappropriate and that confidence bounds around smoothing functions might be misleading.
The M5 cubist model fit for the Gumara basin is an ensemble of 100 small M5 regression trees.In each tree, the model splits observations based on logical rules related to one or more covariates and fits a linear regression model to each set of observations.The final model prediction is the average across all of the individual trees.Using this sort of ensemble approach can reduce model variance and improve accuracy if the individual trees are unbiased, uncorrelated predictors (Breiman, 1996).This can be useful in avoiding models that are overfit to the data, but can reduce model interpretability since direct visualization of model structure becomes impractical as the number of trees increases.However, the frequency with which individual covariates are used as splitting points within trees and as regression coefficients can provide some insights about covariate importance (Table 4; note that because multiple covariates can be used for rules and linear models, these don't necessarily add to 100 %).Model rules were largely based on land cover, with some rules based on precipitation.These two covariates were also used most frequently in linear regressions at model nodes, Introduction

Conclusions References
Tables Figures

Back Close
Full followed by temperature (current and 1 month lag) and 1 month lagged precipitation.Notably, climate data from 2 months lagged was not used at all.While this can be useful in identifying which covariates have the largest impact on model predictions, it does not provide any information regarding the nature or direction of that influence.
Similarly, the random forest model developed for the Ribb basin is an ensemble of regression trees in which the final model prediction is the average of the predictions from each individual tree.However, random forests use standard regression trees that do not incorporate linear regression models at terminal nodes.Variable importance within the final model is measured by recording the increase in out-of-sample MSE that results when a covariate is randomly permuted for each tree in the ensemble.This increase in error is then averaged across all trees in the ensemble.In our model, the largest increases in error resulted from permutation of land cover and temperature, followed by 2 month lagged temperature and precipitation.Covariate influence can be evaluated through the use of partial dependence plots, which measure the change in model predictions that result from changing the value of one parameter while leaving all other covariates constant (Hastie et al., 2009).Partial dependence plots indicate that model predictions of streamflow are higher when the percent of agricultural land cover is greater than approximately 75 %, when temperatures anomalies are low, and when precipitation anomalies are high.However, it appears that the plot for lagged temperature might be sensitive to outliers at high temperature anomalies as evidenced by the large increase that occurs above an anomaly of +2, in a region where very few data points are present.
Many of the measures used to evaluate covariate importance and influence are model specific, making inter-basin and inter-model comparisons difficult.However, the partial dependence plots used in the randomForest R package can be developed for any model and provide a mechanism for comparing the influence that covariates have in the different models and basins (Shortridge et al., 2015).Partial dependence plots were generated for each basin's best performing model and results are shown for climatic variables in Fig. 6.As expected, models generally respond positively to increases Figures

Back Close
Full in precipitation and negatively to increases in temperature, with the greatest influence in the current month and decreasing influence at one and two months prior.The influence of the current month's precipitation is linear in three of the five basins; while this is constrained to the be the case in the Gilgel Abbay and Koga basins due to the use of a linear model, the linear response in Gumara is not required from the M5 model structure.Interestingly, both Megech and Ribb demonstrate a linear response to negative precipitation anomalies, but little response to positive anomalies.Streamflow response to temperature is strongest in the Gumara basin; interestingly, this is the basin with the smallest response to precipitation.The partial dependence plots for the percentage of the basin classified as agricultural land cover indicates a positive relationship between agricultural land cover and streamflow in all basins except for the Gilgel Abbay (Fig. 7).This would be expected if deforestation had contributed to a decrease in evapotranspiration in the contributing watersheds.The exact nature of this response differs across the different rivers, with the relatively minor responses in Koga and Ribb, and much stronger responses in the Gumara and Megech basins.However, this plot also demonstrates some of the limitations associated with different model structures.The plot for Gumara is highly erratic, indicating that the M5 model might be overfit to the training dataset, despite the use of model averaging to reduce model variance.Additionally, the GAM used in the Megech basin was only trained on agricultural land cover values up to 77 %; while this model may be accurately representing the impact of land cover changes within this range, extrapolating this relationship to higher values leads to predictions that may not be physically realistic.increasing precipitation results in higher flows.However, the uncertainty surrounding temperature sensitivity increases at higher changes in temperature, while the uncertainty surrounding precipitation sensitivity remains relatively constant, even at extreme changes in annual precipitation.The bottom panels of the figure show the sensitivity of total inflows to concurrent changes in temperature and precipitation.Unsurprisingly, decreasing precipitation combined with higher temperatures results in greater decreases in total flow than when temperature and precipitation are varied independently.However, even if temperature increases are combined with higher precipitation, total flows decline in the majority of bootstrap resamples.

Climate change sensitivity and uncertainty assessment
The uncertainty surrounding temperature sensitivity is a key limitation to using datadriven approaches for climate impact assessment.To better understand which models and basins are contributing to this uncertainty, Fig. 9 shows how the coefficient of variation (the standard deviation of predictions from all bootstrap samples divided by the mean of these predictions) varies as a function of temperature change in each basin.
From this figure, it is apparent that the Megech model is by far the largest contributor to model uncertainty; however, it is not clear whether this contribution is due to model structure (the GAM model used for the Megech River) or characteristics associated with the basin itself.To investigate how different model structures contributed to this uncertainty, the bootstrap resampling procedure was used to assess uncertainty in streamflow predictions in the Gumara River from all model types.This basin was chosen because all six models were able to outperform the climatology model, and thus could be considered good choices for model selection based on predictive accuracy alone.The results indicate that the increase in uncertainty is highest, and increases non-linearly, in the GLM, GAM, and MARS models.Uncertainty increases more slowly in the ANN and M5 models, and no noticeable increase in uncertainty is apparent in the random forest model.Figures

Back Close
Full

Discussion
The objective of this study was not to identify the "best" approach for empirical rainfallrunoff modeling, as this is likely to be highly specific to the basin and problem to which a model is applied.However, we hope that the comparison conducted here can highlight some of the strengths and limitations of different approaches, as well as demonstrate some important issues that should be kept in mind for model comparisons in the future.One important finding was the limitation with using NSE as an error metric.
Our results confirm previous studies that found that even uninformative models able to capture basic seasonality are able to achieve high NSE values (Legates and Mc-Cabe, 1999;Schaefli and Gupta, 2007), and provide further evidence indicating that high NSE values should be considered a necessary but not sufficient requirement for model usage in planning situations.In particular, understanding error structure can be valuable in evaluating whether model biases might undermine the model's suitability for management activities.In our example, the autocorrelation present in the standardformulation models meant that these models were consistently underestimating wetseason flows, resulting in low estimates of the total annual flow in the rivers.Since multiple reservoirs are planned for construction on these rivers to support irrigation activities, this bias could lead to poor estimates of how much water is available for agricultural use in the short term (i.e., seasonal forecasting) and long-term (due to climate change).Interestingly, difficulties in accurately capturing high flows has been observed in physical hydrologic models for Ethiopia (e.g., Setegne et al., 2011;Mekonnen et al., 2009) and more generally (e.g., Wilby, 2005).The implications of this limitation should be carefully evaluated before using models for water resource planning or (more importantly) flood risk evaluation.could shed light on underlying physical processes.The easy manner in which covariate relationships within the GAM and random forest models can be visualized using a single command within their respective R packages is a strong advantage to these approaches compared to methods such as M5 model trees and artificial neural networks.Of course, partial dependence plots can be developed for any model type (as was done in this research), but code must be written by the user and thus requires a higher degree of effort than is necessary for in-package functions.A downside to most machine-learning models is that they do not support the statistical formalism in assessing variable importance that is possible when linear models and GAMs are used.However, this formalism often rests on assumptions regarding model residuals that are unlikely to be met in many hydrologic models (Sorooshian and Dracup, 1980).Within the Lake Tana basin, evaluation of covariate influence indicates that each basin's model is performing in a physically realistic manner, with a runoff increasing with higher precipitation levels and decreasing with higher temperatures.The influence of precipitation and temperature is greatest in the current month, and progressively declines to a very small influence after two months.This suggests that long-term (multimonth) storage does not significantly contribute to variability in flow volumes.One interesting finding is the non-linear relationship between concurrent month precipitation and runoff that exists in the Megech and Ribb basins, which suggests that above a certain point increasing rainfall does not result in a commiserate increase in streamflow.Other studies have noted the dampening effect that wetlands and floodplains have had on river flows in the region (Dessie et al., 2014;Gebrehiwot et al., 2010); this phenomenon could explain the non-linear relationship identified in this work.The clearly negative relationship between temperature and runoff demonstrates the degree to which upstream evapotranspiration impacts streamflow and suggests that evapotranspiration is largely energy-limited, rather than water-limited.Increasing agricultural land-use appears to be associated with higher runoff in all rivers except for Gilgel Abbay (where no clear relationship between land cover and runoff was observed), and suggests that agricultural expansion at the expense of forest cover has reduced the evaporative component of Introduction

Conclusions References
Tables Figures

Back Close
Full the water balance in these basins.Finally, the relative performance of different model formulations themselves can also be informative.For instance, the improved performance of the anomaly-formulation models indicates that the relationship between precipitation and runoff varies throughout the year and could point towards differences in runoff-generating mechanisms in the wet and dry seasons that have been observed in other case studies (Wilby, 2005).
One limitation with data-driven approaches for streamflow prediction is that the relationships they model can only generate reliable predictions for conditions that are comparable to those experienced historically.Using these models to generate predictions for conditions that exceed historic variability is likely to introduce considerable uncertainty into their projections.Our results indicate that uncertainty in projections of streamflow under changing precipitation is relatively constant, whereas uncertainty increases markedly in projections of streamflow under increasing temperature.This result is not surprising when one considers the basin's climate, which is characterized by highly variable rainfall but fairly consistent temperatures (Table 5).A temperature increase of 3 • C equates to almost two standard deviations beyond historic variability, whereas a change in precipitation of 30 % is well within the range of conditions experienced historically.One would expect that in other climates (for example, temperate watersheds with only minor changes in rainfall throughout the year), this relationship could be reversed.Despite the uncertainty that exists in projections of streamflow under changing temperature, total annual flow appears to be quite sensitive to increasing temperatures.In fact, the decreases in streamflow due to increasing temperature appears likely to be more than enough to counteract any increases in streamflow resulting from higher precipitation that is projected for the region in some global circulation models (GCMs).This is consistent with the work of Setegne et al. ( 2011), who used projections from multiple GCMs as input for a SWAT model developed for the region and found that streamflow decreased in the majority of emissions scenarios and models, even when precipitation increased.Unfortunately, this suggests that any hopes for a Introduction

Conclusions References
Tables Figures

Back Close
Full "windfall" of additional water to support agriculture and hydropower in the region under climate change may be unfounded.
Repeating the climate change sensitivity experiment with multiple models fit to the Gumara watershed indicated that the MARS, GAM, and linear models all result in the largest increase in uncertainty at high temperatures.This indicates that when models are fit to slightly different bootstrap resamples of the historic dataset, the projected changes in streamflow at high temperature changes can be highly erratic.This is likely due to the fact that extrapolating the relationships that are observed between historic temperature and streamflow to higher temperatures can lead to very large changes in streamflow.Fitting the models to bootstrap resamples of the data results in minor changes to these relationships that can result in widely varying projections when the models are used to predict streamflow at higher temperatures, particularly when these relationships are nonlinear (as in the GAM).At the other end of the spectrum, the random forest model exhibits almost no increase in uncertainty at high temperatures, meaning that projections of streamflow at high temperatures are consistent across the bootstrap resamples.This is likely the result of the random forest model structure.The predicted value for each of a regression tree's terminal nodes is the average of all observations that meet the conditions described for that node.Thus, the model will not predict values beyond those experienced historically, even if covariate values exceed those contained within the historic dataset.Thus, this model is likely to underestimate the change in streamflow that results from increasing temperatures.

Conclusions
In this work, we compared multiple methods for data-driven rainfall-runoff modeling in their ability to simulate streamflow in five highly-seasonal watersheds in the Ethiopian highlands.Despite the popularity of ANNs in research on streamflow prediction to date, relationships effectively and lend themselves to simpler visualization of model structure and covariate influence, making it easier to gain insights on physical watershed functions and confirm that the model is operating in a physically realistic manner.However, it is important to carefully evaluate model structure and residuals, as these can contribute to biased estimates of water availability and uncertainty in estimating sensitivity to potential future changes in climate.In particular, autocorrelation in model residuals can result in underestimation of aggregate metrics such as annual flow volumes, even in models with high NSE performance.Uncertainty in GAM projections was found to rapidly increase at high temperatures, whereas random forest projections may be underestimating the impact of high temperatures on river flows.Thorough consideration of this uncertainty and bias is important any time that models are used for water planning and management, but especially crucial when using such models to generate insights about future streamflow levels.By considering multiple model formulations and carefully assessing their predictive accuracy, error structure and uncertainties, these methods can provide an empirical assessment of watershed behavior and generate useful insights for water management and planning.This makes them a valuable complement to physical models, particularly in data-scarce regions with little data available for model parameterization, and warrants additional research into their development and application.Figures  Full Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | useful for water managers.More comprehensive consideration of model strengths and limitations should be standard practice in model development and selection, rather than simply evaluating global error metrics.
Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Screen / Esc Printer-friendly Version Interactive Discussion Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Screen / Esc Printer-friendly Version Interactive Discussion Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | MAE in different basins.In all basins except Koga, the highest performing model significantly outperformed the climatology model based on paired Wilcoxon rank-sum tests (Bonferroni-corrected p value < 0.01).Further exploration of model residuals indicates another important advantage of using the anomaly model formulation.In the standard model formulation, model residuals appear to be non-random.Example autocorrelation plots are shown for the Gilgel Abbay and Ribb Rivers in Fig.
Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper |

Figure 8
Figure8shows the results of the climate change sensitivity analysis for total flow from all five tributaries, with dashed lines representing 95 % confidence intervals obtained through 100 bootstrapped resamples of the data set.As would be expected, increasing temperature independently of precipitation results in decreasing total flows while 11101 Discussion Paper | Discussion Paper | Discussion Paper | Depending on the model type used, different mechanisms are available to evaluate covariate importance and influence within the model.This evaluation can be useful in both confirming that the model is replicating physically realistic relationships between input and output variables, and in characterizing watershed behavior in a manner that Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | ANNs were not found to be the most accurate model in any of the five basins evaluated.Other methods, in particular GAMs and random forests, are able to capture non-linear Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | ings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding sources.Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper |

Table 3 .
Cross validation errors for each assessed model.

Table 4 .
Residual autocorrelation factors at a 12 month lag for the standard formulation and anomaly formulation models, and resulting mean annual observed and predicted flow.