Extreme flooding impacts millions of people that live within the Amazon floodplain. Global hydrological models (GHMs) are frequently used to assess and inform the management of flood risk, but knowledge on the skill of available models is required to inform their use and development. This paper presents an intercomparison of eight different GHMs freely available from collaborators of the Global Flood Partnership (GFP) for simulating floods in the Amazon basin. To gain insight into the strengths and shortcomings of each model, we assess their ability to reproduce daily and annual peak river flows against gauged observations at 75 hydrological stations over a 19-year period (1997–2015). As well as highlighting regional variability in the accuracy of simulated streamflow, these results indicate that (a) the meteorological input is the dominant control on the accuracy of both daily and annual maximum river flows, and (b) groundwater and routing calibration of Lisflood based on daily river flows has no impact on the ability to simulate flood peaks for the chosen river basin. These findings have important relevance for applications of large-scale hydrological models, including analysis of the impact of climate variability, assessment of the influence of long-term changes such as land-use and anthropogenic climate change, the assessment of flood likelihood, and for flood forecasting systems.
Flooding is notably the most common and damaging natural hazard affecting millions of people worldwide every year, producing economic losses exceeding billions of dollars (Hirabayashi et al., 2013). Flood risk associated with a particular location can be highly variable depending on levels of exposure, resilience and preparedness (Alfieri et al., 2018), in addition to the increased uncertainty surrounding trends of hydrological extremes in a warming climate (Arnell and Gosling, 2016). For the Amazon basin, flood risk is considered to have increased, with a greater frequency of extreme flood events (e.g. in 2009, 2012, and 2014; Marengo and Espinoza, 2016) coinciding with a hypothesized intensification of the hydrological cycle since the 1980s (Gloor et al., 2013). Floods in Amazonian communities are known to have large socioeconomic consequences impacting ecosystems, health, and transport links, and are particularly damaging to agricultural and fishery practices (Schöngart and Junk, 2007; Marengo et al., 2012, 2013; Correa et al., 2017). Single flood events (e.g. 2012 in the Amazonian city of Iquitos, Peru) have impacted the lives of over 73 000 people (IFRC, 2013), with average annual damages estimated at USD 1.4 billion over a 4-year period (2008–2011) in the Brazilian Rio Branco basin alone (Mundial Grupo Banco, 2014).
In its simplest form, a hydrological model can be considered a representation of a real-world hydrological system used to better understand various water and environmental processes, predict system behaviour, and provide consistent impact assessment (Devia et al., 2015). They work by simulating the hydrological response to meteorological variations incorporating run-off generation and river routing processes (Sutanudjaja et al., 2018). As such, global hydrological models (GHMs) have been used in a wide range of applications, including short- to extended-range flood forecasting (Alfieri et al., 2013; Emerton et al., 2018), climate assessment (Hattermann et al., 2017), hazard and risk-mapping (Ward et al., 2015), drought prediction (van Huijevoort et al., 2014), and water resource assessment (e.g. water availability models; Meigh et al., 1999; Sood and Smakhtin, 2015).
Depending on the application and the needs of decision makers, different properties of the hydrograph simulated by hydrological models are important. For example, an accurate representation of peak river flows and their likelihood is key for decision-makers who wish to understand the area at risk of flooding. In contrast, estimates of daily streamflow may be more beneficial for the assessment of water resources such as irrigation requirements.
The availability of GHMs has grown in recent years thanks to increased efforts in addressing water-related issues in developing countries (De Groeve et al., 2015; Ward et al., 2015; Trigg et al., 2016), the development of flood forecasting systems (Aliferi et al., 2013; Werner et al., 2013; Emerton et al., 2018), improvements within precipitation datasets (Mittermaier et al., 2013; Novak et al., 2014; Forbes et al., 2015), the emergence of new global satellite and remote sensing datasets, and advancements in numerical modelling techniques (Yamazaki et al., 2014a; Sampson et al., 2015; Andreadis et al., 2017; Balsamo et al., 2018). For an overview of available GHMs, see Bierkens et al. (2015), who have provided the details of 22 large-scale hydrological models, with those used for operational flood forecasting being summarized in Emerton et al. (2016).
GHMs have differing spatial and temporal resolutions, parameter estimation approaches, number of parameters, calibration methods, input–output variables, and overall structures (Sood and Smakhtin, 2015). Their set-ups can generally be divided into two categories: land surface models (LSMs) and hydrological models (Gudmundsson et al., 2012). The majority of LSMs and hydrological models share the same conceptualization of the water balance (Haddeland et al., 2011) but differ in their objective. LSMs evolve from coupled land–atmosphere models with the purpose of solving the surface energy balance equations to provide the necessary lower boundary conditions to the atmosphere (Wood et al., 2011). In contrast, hydrological models tend to focus less on the partitioning of radiation and more on hydrological resources and understanding the lateral movement and transport of water along the land surface.
In terms of differences in model performance, the Gudmundsson et al. (2012) intercomparison study of six LSMs and five GHMs (i.e. hydrological models) concluded that the main differences were due to the snow scheme implemented with snow water equivalent values and mean runoff fractions lower in LSMs. No significant differences between LSMs and hydrological models were found for runoff and evapotranspiration globally, but rather the differences between the models themselves created large sources of uncertainty, highlighting the importance of analysing a range of different GHMs rather than a group consisting of a specific model type. For the purposes of this study, we categorize both LSM and hydrological models as GHMs.
For GHMs to be considered effective, end users need to know their accuracy and reliability (Ward et al., 2015). Thus, the evaluation of these models against observed data is an important procedure in efforts to reduce flood risk. Currently, no intercomparison analysis of GHMs has been conducted specifically for the Amazon basin, with previous studies focusing solely on the performance of individual models for the Amazon (e.g. Yamazaki et al., 2012; Paiva et al., 2013; Hoch et al., 2017a, b) or as part of a global study (e.g. Gudmundsson et al., 2012; Alfieri et al., 2013; Hirpa et al., 2018), which lack an in-depth focus on skill within the Amazon basin.
Finally, many of the GHMs (or their components) analysed in this study are used for specific applications, for instance, in water resources management (PCRaster Global Water Balance; PCR-GLOBWB), flash flood forecasting (Ensemble Framework for Flash Flood Forecasting; EF5), and extended-range flood forecasting (Global Flood Awareness System; GloFAS). Investigating the performance of hydrological simulations therefore can provide valuable information to researchers and model developers with which to better understand some of the strengths and weaknesses which exist within the model set-ups and help to distinguish how different parts of the hydrological chain can cause particularly “good” or “bad” model performance, thus having implications for their different applications.
In this study, the main objective is to assess the ability of different GHMs
freely available from collaborators within the Global Flood Partnership (GFP), identifying which approaches are most suitable in different areas of the Amazon basin for simulating flood peaks. To pursue this objective, the analysis is designed to answer the following research questions.
How well do GHMs represent the annual hydrological regime in terms of the Kling–Gupta efficiency (KGE) and its individual components? Which model set-up best represents annual maximum river flows? Which hydrological routing model allows the best representation of daily and peak river flows? Which precipitation dataset allows the best representation of daily and peak river flows? How do results differ when using a LSM as opposed to a hydrological model? By how much does calibration of groundwater and routing model parameters improve performance?
The experimental design involves comparing the output of daily and annual maximum discharge estimates produced by different GHMs forced using atmospheric reanalysis or satellite precipitation datasets against observations of streamflow. The common validation period is 1997–2015, with results also analysed for the shorter period of 2004–2015 to account for the shorter record length of one simulation.
Observed daily discharge data are used to evaluate each of the model runs. The network of hydrometric gauges is controlled and maintained by the national institutions responsible for hydrological monitoring in countries situated within the Amazon basin. These include the Agência Nacional de Águas (Water National Office – ANA, Brazil), Servicio Nacional de Meteorología e Hidrología (National Meteorology and Hydrology Service – SENAMHI, Peru and Bolivia), Instituto Nacional Meteorologia e Hidrologia (Institute to Meteorology and Hydrology, INAMHI, Ecuador), and the Instituto de Hidrología, Meteorología y Estudios Ambientales (Institute of Hydrology, Meteorology and Environmental Studies – IDEAM, Colombia).
Daily water level values are collected by the respective institutions and are sourced through the ORE-HYBAM observational service (
Eight GHMs composed of different meteorological datasets, hydrological models/LSMs, and river routing models are used to each simulate river discharge across the Amazon basin. Four meteorological products (ERA-Interim Land re-analysis, ERA-5 re-analysis, European Centre for Medium-range Weather Forecasts (ECMWF) 20-year control reforecasts (hereafter defined as reforecasts), and the real-time TRMM TMPA 3B42 v.7), three hydrological models/LSMs (PCR-GLOBWB, the Hydrology-Tiled ECMWF Scheme for Surface Exchanges over Land; H-TESSEL, EF5), and three river routing models (Catchment-based Macro-scale Floodplain model, CaMa-Flood; Lisflood; and the Coupled Routing and Excess Storage, CREST) are employed. While the focus of this study is on GHMs made available by the GFP community, other models are available within the Amazon basin. Some examples include MGB-IPH (Paiva et al., 2013), LPJmL (Lund–Potsdam–Jena managed Land; Bondeau et al., 2007), WaterGAP (water – global analysis and prognosis; Döll et al., 2003), and MAC-PDM.09 (the Macro-scale-Probability-Distributed Moisture model.09; Gosling and Arnell, 2011).
As a result of using freely available datasets from collaborators within the GFP, simulations are composed of a combination of routing models and meteorological datasets and do not all use the same precipitation input or hydrological set-up. However, the available combinations allow enough insight into the model components to draw conclusions for the objectives stated. For example, to analyse the performance of precipitation inputs, ERA-Interim Land, ERA-5, and the reforecasts are forced through the calibrated version of Lisflood, whereby the routing and LSM remain consistent. To evaluate the differences between using the Lisflood and CaMa-Flood routing models, two simulations which use ERA-Interim Land precipitation and the H-TESSEL LSM are compared. To identify the differences between employing a hydrological model (PCR-GLOBWB) or LSM (H-TESSEL), two set-ups which use the ERA-Interim Land precipitation reanalysis and the CaMa-Flood river routing model are directly compared. Finally, to see how much benefit model calibration within Lisflood provides, ERA-Interim Land and ERA-5 are forced through the calibrated and un-calibrated Lisflood model versions. The CREST EF5 run is the sole simulation to have a unique hydrological model and meteorological input, and although it is more challenging to analyse the performance of specific components of the model set-up against other simulations, it was included in the analysis for completeness.
An alternative approach would be to implement a full intercomparison experiment and run a new set of simulations which included all combinations of precipitation input, GHM, and routing scheme. However, this is a very large undertaking, and the time and computational expense to achieve this are prohibitive. Instead, by using freely available datasets with different hydrological set-ups, our method allows a first analysis providing enough evidence of dataset reliability and accuracy in order to determine the utility of the differing approaches for climate studies and to forecast applications. Moreover, by using iterative runs of similar model set-ups (i.e. changing a specific part of the hydrological model chain), it allows us to make conclusive statements regarding the differences in skill. Finally, short descriptions of each model and atmospheric product are outlined below, with a summary of each simulation provided in Table 1.
Characteristics of the eight global hydrological models (GHMs) used to produce estimates of daily river discharge.
ERA-Interim Land is a global reanalysis of land surface parameters produced
by the ECMWF with a T255 spectral resolution (
ERA-5 is the latest reanalysis product of the ECMWF producing consistent
estimates of atmospheric, land, and ocean variables at a horizontal
resolution of
ECMWF reforecasts are a collection of historical forecasts from start dates
at the same day of the year going back for a specific number of years to
provide a consistent model climatology from which to compare forecasts
(ECMWF, 2016). In this study we use the control member of the reforecasts
which are created based on a retrospective run of the most recent version of
the ECMWF's IFS to provide surface and subsurface runoff as input to the
Lisflood routing model at a resolution of 0.1
TRMM TMPA 3B42 RT v7 is a global merged multi-satellite precipitation
product generated at the National Aeronautics and Space Administration (NASA). TMPA is computed for two products: a near-real-time version (TMPA 3B42RT v7) and a post-real-time gauged adjusted research version (TMPA 3B42 v7), both of which run at resolution of 3-hourly
H-TESSEL provides the land surface component of the ECMWF IFS (van den Hurk et al., 2000; van den Hurk and Viterbo, 2003; Balsamo et al., 2009). H-TESSEL simulates the land surface response to atmospheric conditions estimating water and energy fluxes (heat, moisture, and momentum) on the land surface (Zsoter et al., 2019). H-TESSEL is predominately used within the operational set-up of short- to seasonal-range weather forecasts coupled with the atmosphere, but it can also be used in an “offline mode” to calculate the land surface response to atmospheric forcing, whereby input data (e.g. near-surface meteorological conditions) are provided on a 3-hourly time step (Pappenberger et al., 2012). In this study, H-TESSEL receives boundary conditions from the atmospheric input provided by either the ERA-5 reanalysis, ERA-Interim Land reanalysis, or the reforecasts providing total runoff for the CaMa-Flood routing model, and the surface and sub-surface water fluxes for Lisflood. Runs forced using the ERA-Interim Land reanalysis are run in the offline mode. For a detailed description of H-TESSEL, see Balsamo et al. (2009).
PCR-GLOBWB is a global hydrological and water resource model developed at
the Department of Physical Geography, Utrecht University, Netherlands
(Sutanudjaja et al., 2018). For each grid cell and time step, PCR-GLOBWB
simulates moisture storage in two vertically stacked upper soil layers, as
well as the water exchange among the soil, the atmosphere, and the
underlying groundwater reservoir. Besides, water demands for irrigation,
livestock, industry, and households can be integrated within the model.
Run-off is routed along a local drainage direction (LDD) network using the
kinematic routing wave equation. PCR-GLOBWB was applied at a resolution of
30 arcmin (
EF5 is an open-source software package developed at the University of Oklahoma (OU) that consists of multiple hydrological model cores producing outputs of streamflow, water depth, and soil moisture (Clark et al., 2016). Since 2016, EF5 has been used operationally for local forecasts across the US National Weather Service (NWS) for flash flooding purposes (Gourely et al., 2017). EF5 incorporates CREST, which is a distributed hydrological model created by OU and NASA (Wang et al., 2011). Within CREST, runoff generation, evapotranspiration, infiltration, and surface and subsurface routing are computed at each grid cell within the model domain, with surface and subsurface water routed using a kinematic wave assumption. Four excess storage reservoirs characterize the vertical profile within a cell representing interception by the vegetation canopy and subsurface water storage in the three soil layers (Meng et al., 2013). In addition, the representation of sub-grid cell routing and soil moisture variability is made through the use of two linear reservoirs for overland and subsurface runoff individually (Wang et al., 2011). Locations of major streams, flow direction maps, and flow accumulation are all derived from the HydroSHEDS (Hydrological Data and Maps Based on Shuttle Elevation Derivatives at Multiple Scales) dataset (Lenhner et al., 2008).
In this study, an un-calibrated version of EF5 was run using CREST version 2.0 (Xue et al., 2013; Zhang et al., 2015) for 13 years (2003–2015), with a 1-year spin-up at a spatial resolution of 0.05
Lisflood is a global spatially distributed, grid-based hydrological and
channel routing model commonly used for the simulation of large-scale river
basins (van Der Knijff et al., 2010). It is currently used as an operational
rainfall–runoff model within the European Flood Awareness System (EFAS) for
streamflow forecasts over Europe (Smith et al., 2016). Unlike EFAS, which
uses the full Lisflood set-up, GloFAS and the simulations included in this
study use only the routing component of the Lisflood set-up, with surface and
sub-surface input fluxes (e.g. vertical water, water/snow storage) provided
by the H-TESSEL module of the IFS at a resolution of 0.1
Lisflood also represents lakes and reservoirs as simulated points on the river network (Zajac et al., 2017). The outflows of lakes and reservoirs are based on (a) upstream inflow, (b) precipitation over the lake or reservoir, (c) evaporation from the lake or reservoir, (d) the lakes' initial level, (e) lake outlet characteristics, and (f) reservoir-specific characteristics. For further details on the parameterization of lakes and reservoirs within Lisflood, see Appendix A within Zajac et al. (2017). In the Amazon, represented lakes are predominately located along the main stem, with very few reservoirs throughout the basin. For exact lake and reservoir locations within the global Lisflood model, see Zajac et al. (2017).
In this study, two set-ups of Lisflood are used (Lisflood_uc and Lisflood_c). Lisflood_c represents the calibrated set-up of the Lisflood routing and groundwater parameters (see Hirpa et al., 2018), while Lisflood_uc represents the uncalibrated model run. Parameters were calibrated with the reforecasts initialized with the ERA-Interim land reanalysis from 1995 to 2015 as forcing against observed discharge data at 1278 gauging stations worldwide. All but one station (40; see Fig. 1a and Table S1) used in this study were included within the calibration. An evolutionary optimization algorithm was used to perform the calibration, with the KGE used as the objective function. The calibration was carried out for parameters controlling the time constants in the upper and lower zones, percolation rate, groundwater loss, channel Manning's coefficient, the lake outflow width, the balance between normal and flood storage of a reservoir, and the multiplier used to adjust the magnitude of the normal outflow from a reservoir. The results were validated by Hirpa et al. (2018) using the KGE (Gupta et al., 2009) over the period 1995–2015. In calibration (validation) KGE skill scores were greater than 0.08 compared to the default Lisflood simulation for 67 % (60 %) of stations globally. For a detailed description of the calibration of the Lisflood parameters and the range of values used for each parameter, see Hirpa et al. (2018). Further details of the Lisflood model are described in van Der Knijff et al. (2010).
CaMa-Flood is a global distributed river routing model which is forced by
runoff input from a LSM or hydrological model to simulate water storage
where further hydrological variables (i.e. river flow, water level, and
inundated area) can be derived along a prescribed river network. Horizontal
water transport along the river network is calculated using the local
inertia equations (Yamazaki et al., 2011). The backwater effect (i.e. upstream water levels which affect flow velocity downstream; see Meade et al., 1991) is represented by estimating flow velocity based on water slope
(Yamazaki et al., 2011). Moreover, floodplain inundation is represented
within CaMa-Flood as a subgrid-scale process by discretizing the river basin
into unit catchments which consist of subgrid river and floodplain
topography parameters (Yamazaki et al., 2014b). These parameters describe
the relationship between the total water storage in each grid point and
water stage and are automatically generated using the Flexible Location of
Waterways (FLOW) method with the generation of the river map created by
upscaling the HydroSHEDS flow direction map (Lehner et al., 2008). For
further information about the CaMa-Flood model, see the aforementioned
references. In this study, daily river discharge was obtained using
CaMa-Flood version 3.6.1 at a spatial resolution of 0.25
The non-parametric Spearman
The KGE (Gupta et al., 2009) measures the goodness-of-fit between estimates
of simulated discharge and gauged observations and is a modified version of
the dimensionless Nash–Sutcliffe efficiency (NSE; Nash and Sutcliffe, 1970).
The metric decomposes the NSE into three independent hydrograph components
– linear correlation ( “Good” (KGE “Intermediate” (0.75 “Poor” (0.5 “Very poor” (KGE
Although originally for the modified version of the KGE, these categories
provide an informative benchmark by which to evaluate results. A similar
study (Thiemig et al., 2013) assessing the performance of satellite-based
precipitation products for hydrological evaluation also adopted the same
approach.
When analysing the results, each component of the KGE is also considered
independently, enabling model errors to be directly related to either the
variability (KGE_
To evaluate the relative improvement of using one model set-up relative to
another (e.g. using the calibrated Lisflood routing model as opposed to the
uncalibrated model version), metrics are calculated as skill scores:
To allow for easier interpretation, the results and discussion are separated into six sections which match the research questions presented in Sect. 1.5, in addition to an outline of potential future work. Due to similar results between the two validation periods (1997–2015 and 2004–2015), only results for 1997–2015 are shown. For 2004–2015 results, see Figs. S1 and S2 in the Supplement. Results and discussions for individual stations are commonly referred to by the station numbers in italics and are presented in Fig. 1a and Table S1.
Full Kling–Gupta efficiency (KGE) scores at the 75 hydrological gauging stations for all simulations. For the periods 1997–2015 and 2004–2015 for the Coupled Routing and Excess Storage, Ensemble Framework for Flash Flood Forecasting (CREST EF5) run (g). Values greater than 0.75 are considered to indicate good performance (i.e. dark blue circles). To allow for easier model comparisons, plots are arranged by the different precipitation datasets (rows) and routing models (columns), with the exception of CREST EF5 (g). For example, the final column consists of model runs using the calibrated Lisflood routing model.
The annual hydrological regime on average is well represented by all models (Fig. 2), with the rationale for poorer performance at specific gauges dependent on either the temporal correlation, bias ratio, or variability ratio components of the KGE (Figs. 3–5). An average of 50 % of stations note scores above 0.5 for the KGE metric across all eight simulated runs, with a maximum value of 0.92 observed at the Santa Rosa gauging site (48, Fig. 1a) for the ERA-5 Lisflood_c simulation (Fig. 2f). The two CaMa-Flood set-ups using the PCR-GLOBWB hydrological model and the H-TESSEL LSM show the lowest skill, with 19 and 18 stations noting scores greater than 0.5 respectively. By contrast, the best performance is from the calibrated Lisflood set-ups, with median scores across stations of 0.56, 0.63, and 0.64 for runs forced with ERA-Interim Land, the reforecasts, and ERA-5 respectively. Such results are unsurprising given that the KGE was used as the objective function in the calibration algorithm of the Lisflood routing model.
In terms of spatial distribution, the poorest performance is consistent for the majority of simulations at the Arapari (55), Boca Do Inferno (56), and Base Alalau (61) gauging stations located north of Manaus, at the Fazenda Cajupiranga gauge (64) in the northernmost Branco catchment, and at the Fontanilhas (35) and Indeco (49) stations in the south-eastern Brazilian Amazon (Fig. 2). In the south-eastern Amazon, particularly in the Madeira and Tapajos sub-basins, the number of existing or under-construction dams is at its highest (Fig. 1b). Damming of rivers is known to have impacts on different aspects of the flow regime, with possible alterations in the timing, magnitude, and frequency of low and high flows (Magilligan and Nislow, 2005). Indeed, the frequency and duration of low- and high-flow pulses at stations downstream of dams have been shown to be particularly affected by the construction of cumulative dams (Timpe and Kaplan, 2017). Thus, discrepancies between observed and modelled data shown in Fig. 2 could be due to alterations to key features of the flow regime.
The highest scoring stations (KGE score
Correlation component (Pearson's) of the KGE at the 75 hydrological gauging stations for all simulations. For the periods 1997–2015 and 2004–2015 for the Coupled Routing and Excess Storage, Ensemble Framework for Flash Flood Forecasting (CREST EF5) run (g). Values greater than 0.6 are considered skilful (i.e. blue circles).
Figures 3–5 show the breakdown of the KGE scores for each
hydrological component to evaluate differences in performance with respect
to the correlation (i.e. timing), flow variability (
Alpha (i.e. variability ratio) component of the KGE at the 75 hydrological gauging stations for all the simulations. For the periods 1997–2015 and 2004–2015 for the Coupled Routing and Excess Storage, Ensemble Framework for Flash Flood Forecasting (CREST EF5) run (g). Blue circles indicate that the variability in the simulated time series is higher than that of the observed one, while red circles show the opposite effect. Values closer to one indicate better model performance (i.e. grey circles).
Beta (i.e. bias ratio) component of the KGE at the 75 hydrological gauging stations for all the simulations. For the periods 1997–2015 and 2004–2015 for the Coupled Routing and Excess Storage, Ensemble Framework for Flash Flood Forecasting (CREST EF5) run (g). Blue circles indicate that the bias in the simulated time series is higher than that of the observed one, while red circles show the opposite effect. Values closer to one indicate better model performance (i.e. grey circles).
The variance of modelled river flow is on average higher than the observed time series in all of the simulations, with the exception of the ERA-Interim Land PCR-GLOBWB CaMa-Flood simulation. For this run, 85 % of stations observe values of less than one, with stations situated in the Peruvian Amazon (2–5) the notable exception (Fig. 4b). In contrast, 79 % of stations for the CaMa-Flood set-up using the H-TESSEL LSM note values greater than one (Fig. 4a). All runs tend to underestimate river flows relative to the observed time series, with the majority of stations observing a beta value of less than one (Fig. 5). In the calibrated Lisflood simulation forced with the reforecasts, almost half of all the stations observe scores between 0.9 and 1.1 (i.e. grey circles), with a median of 0.99 (Table 2). These results are not replicated in the other two calibrated runs when using either ERA-Interim Land or ERA-5 as the precipitation input (Fig. 5d and f). For both of these runs a decrease is found in the number of stations achieving scores between 0.9 and 1.1 relative to the associated uncalibrated Lisflood set-ups (Fig. 5c and e). This is also highlighted by a decrease in the median scores of the two respective runs (Table 2), meaning that a greater water deficit exists in the calibrated set-ups.
Median scores for the 75 hydrological gauging stations for all metrics.
Stations in the south-eastern Amazon, particularly in the upper reaches of the Teles Pires River (37, 38, and 49), tend to underestimate river flow for most simulations (Fig. 5). In this region of the basin precipitation is controlled by frontal systems in the South Atlantic Convergence Zone (SACZ), which is prevalent during austral summer (Ronchail et al., 2002; Espinoza et al., 2009). In addition, rainfall variability in the Amazon is strongest in the south-east, with a distinct dry season (Paiva et al., 2012; Espinoza et al., 2009). Further analysis could be useful in evaluating seasonal patterns of model performance to establish whether climatological features such as the SACZ are accurately represented within the precipitation datasets. Other factors impacting performance in the south-east could be associated with the geology and topography (Fig. 1c and d). Stations in this area of the basin are located within the Brazilian Shields, composed predominately of Precambrian rock, and are characterized by gentle slopes and low erosion rates (Filizola and Guyot, 2009). Paiva et al. (2012) demonstrated the importance of accurate initial conditions of groundwater state variables in the Tapajos and Xingu river basins, particularly for low flows. In comparison, the majority of the central parts of the basin are characterized by tertiary rocks, flat terrain, large floodplains, and high sediment yields. In these regions (e.g. in the south-western Brazilian Amazon), KGE scores are generally higher (Fig. 2), with surface water variables (e.g. water levels, surface runoff, and floodplain storage) considered more important in hydrological prediction uncertainties (Paiva et al., 2012).
The KGE allows us to make explicit interpretations of the hydrological
performance of each model owing to decomposition into correlation, bias, and
variability terms (Kling et al., 2012). The results indicate that the
required developments to improve the representation of daily river flows are
specific to each individual model and to the area of interest. For instance,
for the ERA-Interim Land PCR-GLOBWB run, daily correlation scores (Fig. 3b)
showed the model suffers in reproducing the temporal dynamics of flow (as
measured by
Spearman's ranked correlation coefficients for observed against simulated annual maximum discharge values at the 75 hydrological gauging stations for all simulations. For the periods 1997–2015 and 2004–2015 for the Coupled Routing and Excess Storage, Ensemble Framework for Flash Flood Forecasting (CREST EF5) run (g). Values exceeding 0.6 are considered skilful (i.e. blue shapes). The number of overlapping years of data between observations and simulations are denoted by different shapes. A triangle represents 5–9 years, a square 10–14 years, and a circle 15–19 years of overlapping data.
Both the calibrated and uncalibrated versions of Lisflood simulations forced with the ERA-5 reanalysis are the best-performing runs, with median scores of 0.53 and 0.54 for the uncalibrated and calibrated simulations respectively (Fig. 7 and Table 2). However, a large deterioration in skill is evident for all simulations for Spearman's ranked coefficients between observed and predicted annual maximum river flows (Fig. 6), with only 21 % of stations on average observing scores exceeding 0.6 across all simulations. Here, it is important to note that due to the length of some station time series the number of overlapping data points can be small, and therefore the spatial distribution of model performance should be interpreted with caution. To provide a certain level of confidence between results, stations whose time series equals or exceeds 15 years are denoted using a circle, whereas those between 10–14 and 5–9 are represented using a square and triangle respectively.
Boxplots showing the distribution of scores for the
The highest scores are generally located towards the eastern side of the basin and along the main Amazon River where the terrain is predominately flat, and rivers drain extensive floodplains. These are constrained to runs using the Lisflood routing model with either ERA-Interim Land or ERA-5 as forcing (Fig. 6c–f). Interestingly, the calibrated Lisflood set-up forced using the reforecasts does not replicate good performance in these regions (Fig. 6h), indicating that the error between simulated and observed peak river flows could be associated with the precipitation input. When observing daily mean precipitation totals over the validation period (1997–2015), the reforecasts observe lower precipitation totals over central to northern areas of the basin relative to both of the climate reanalysis datasets (Fig. 8). However, when comparing the results of the ERA-Interim Land H-TESSEL CaMa-Flood and ERA-Interim Land H-TESSEL Lisflood_uc set-ups, correlations are much lower in the CaMa-Flood simulation, suggesting that both precipitation and routing processes are equally important (Fig. 6a and c).
Mean daily precipitation totals throughout the Amazon basin. For
Low agreement between peaks is consistent in the south-east and north-west of the basin across all simulations (Fig. 6). In the south-east, a lack of skill could again be associated with the abundance of hydroelectric dams in the region or with the poor representation of the SACZ rainfall regime. Evaluating the ability to represent the timing and magnitude of the annual flood wave has important implications for models predicting flood hazard and for practices providing early warning information. These results identify that while the representation of daily river flows improves upon model calibration of the Lisflood routing model (Sect. 3.1), the influence of routing calibration for simulating flood peaks has no impact.
We assessed the performance of the CaMa-Flood and Lisflood_uc routing models by comparing the two runs which are forced using the ERA-Interim Land reanalysis dataset. On average the uncalibrated Lisflood run outperforms CaMa-Flood for all metrics analysed (Fig. 7 and Table 2). Results from the CREST EF5 model are also discussed but are not directly comparable due to using differing meteorological inputs.
The median score of the correlation component of the KGE (i.e. Pearson's correlation coefficient) is found to increase by 0.19 when using the un-calibrated Lisflood model relative to CaMa-Flood, with 28 more stations achieving a correlation score of 0.6 or higher (Fig. 3a and c). This number increases when considering correlation scores greater than 0.8, with 38 and 7 stations reaching this value for Lisflood and CaMa-Flood respectively. The most notable increase in skill is found in Peru along the Marañón and Napo rivers (2 and 5), which note increases of 0.85 and 0.71 respectively when using the Lisflood model. In comparison, the CREST EF5 simulation fits between the CaMa-Flood and Lisflood runs with a median daily correlation score of 0.71 and notes 12 stations which have scores greater than 0.8 (Fig. 3g).
For the overall KGE metric, 24 % and 3 % of stations have values exceeding 0.5 and 0.75 for CaMa-Flood. These figures rise to 52 % and 11 % respectively in the uncalibrated Lisflood run. Large differences are particularly notable at stations situated in the upper reaches of the Solimões River (2–6) and within a cluster of stations situated towards the Colombian Amazon in the north-west (Fig. 2c). Significant differences are identified for peak flow correlations, with only three stations (27, 17, and 22) achieving scores exceeding 0.6 for the CaMa-Flood simulation compared to 22 using the uncalibrated Lisflood routing scheme (Fig. 6a and c). In comparison, the CREST EF5 simulation has 11 stations exceeding this threshold, with no distinguishable spatial pattern (Fig. 6g). For this run, the time series of modelled data is shorter (2004–2015), and so peak flow correlations should be interpreted with caution.
Stations located in and around the main Amazon River observe better performance for representing flood peaks in the Lisflood simulation (Fig. 6c), aligning with the locations of lakes included within the Lisflood set-up (see Zajac et al., 2017). This level of skill was not replicated in the CaMa-Flood simulation, where the representation of lakes is not included (Fig. 6a), suggesting the potential importance of lake parameterization for accurate peak flow estimations. However, Zajac et al. (2017) demonstrated that although the inclusion of lakes in Lisflood was found to generally improve the representation of extreme discharge for the 5- and 20-year return periods on the global domain, the change in skill upon the inclusion of lakes and reservoirs in the Amazon was minimal for several metrics. Very few reservoirs are included within Lisflood in the Amazon, and therefore the estimated effects on simulated streamflow are restricted.
Zhao et al. (2017) concluded the importance in the choice of different river routing schemes for simulating peak discharge across the globe, while the Hoch et al. (2017b) comparison of two routing models found results to differ despite having identical boundary conditions. It is therefore of interest to evaluate not only the entire GHM set-up, but also to assess the suitability of each model component of the hydrological chain in order to determine which routing model is most suitable for certain applications within the Amazon basin. Results suggest that adjustments of certain parameters such as Manning's channel coefficient could potentially improve the performance of the CaMa-Flood model, with the default coefficient higher in the uncalibrated Lisflood set-up (0.10 as opposed to 0.03; see Hirpa et al., 2018, for all default parameter values).
Three precipitation products (ERA-Interim Land, ERA-5, and the reforecasts) are used to force the calibrated Lisflood routing model, with the most recent ERA-5 reanalysis product the best-performing dataset. Figure 8 displays mean daily precipitation totals for each dataset over the main validation period (1997–2015). The main differences can be seen in the far west of the basin towards the Andes mountains, where precipitation is higher in ERA-5 compared to ERA-Interim Land, and in the north-west, where average daily precipitation totals are smaller in the reforecasts. On the other hand, values in the south-eastern corner of the basin are very similar between the three datasets. When comparing observed and simulated annual peak flows, median correlation scores improve by 0.12 and 0.22 when using ERA-5 compared to when using ERA-Interim Land and the reforecasts respectively (Table 2); 28 stations reach the 0.6 threshold relative to 22 and 9 stations for ERA-Interim Land and the reforecasts respectively, with the range of coefficients smaller for ERA-5 (Fig. 7a).
Figure 9e and f highlight the relative gain or loss in skill when using ERA-5 compared to ERA-Interim Land. The greatest improvements for each metric are observed within the upstream reaches of the Solimões River, particularly for stations located within the Peruvian Amazon (2, 4, and 5). In the main western headwater to the Solimões River (the Marañón River) at the San Regis gauging site (2) and at Tamshiyacu (4) near to the city of Iquitos, annual maximum correlation skill scores are 0.51 and 0.59 respectively. These results highlight that poor performance found in upstream reaches of the Solimões River (Fig. 6c and d) is likely due to the representation of rainfall rather than routing performance.
Relative improvement in skill at each gauging station for Spearman
annual maximum correlations and KGE values (i.e. skill scores).
In the other main tributary to the Solimões River, the Ucayali River, simulated annual peak flows show little agreement with observed data, with a decrease in skill identified when using ERA-5 as opposed to ERA-Interim Land (Fig. 9e). Despite the lack of agreement between observed and modelled data in the Ucayali River, the higher correlation scores identified downstream at Tamshiyacu suggest that better representation of high-water periods at the start of the Solimões River is likely modulated by the larger Marañón River. Therefore, the ability to represent flood hazard in communities near to the city of Iquitos is more dependent on how well we can predict river flow in the Marañón River.
All three runs perform well for the KGE metric, with little difference in results spatially (Fig. 2d, f, h). The reforecast simulation used within the Lisflood calibration is found to be superior, with 75 % of stations achieving scores which exceed 0.5 relative to 71 % and 59 % for ERA-5 and ERA-Interim Land respectively. Increased skill in the Peruvian Amazon is again the most noteworthy (Fig. 9f), with KGE skill scores of 0.67 for the Requena (3) (Ucayali River) and San Regis (2) (Marañón River) stations and 0.71 for Tamshiyacu (4) (Solimões River) when using ERA-5 relative to ERA-Interim Land. This increase in KGE skill can be attributed to an improvement in the variability and bias ratios found between the simulated and observed time series. Daily correlation scores for the three stations (2–4) are near identical to the variance and bias ratios underestimated for ERA-Interim Land while being much closer to the observed data for ERA-5 (Figs. 4d, f and 5d, f).
The Tamshiyacu gauging station (4) is used to measure flood hazard in the
city of Iquitos at the start of the Solimões River (Espinoza et al.,
2013) and is therefore of particular interest. At this important location,
scatterplots of observed against simulated river discharge (Fig. 10) show
that the negative bias observed when using ERA-Interim Land is corrected for
when using ERA-5, with the magnitude of the 90th percentile of river
flows almost identical to that of the observed dataset. Improvement is
likely associated with the increased resolution of the ERA-5 reanalysis,
which observes higher daily mean precipitation totals in regions towards the
Andes in the far north-west of the basin (Fig. 8b). Waters found at
Tamshiyacu are of Andean origin, meaning that the representation of rainfall
in the Andes Mountains is fundamental to accurately predicting streamflow.
ERA-5 runs at a horizontal resolution of
Scatterplots of observed against simulated river flow at the
Tamshiyacu gauging site, Peru (4). For
The success of GHMs in producing adequate estimates of river flow is underpinned by uncertainties within the meteorological input (Butts et al., 2004; Beven, 2012; Sood and Smakhtin, 2015). These results have particular importance for flood forecasting applications and research concerning extreme floods, with the higher-resolution ERA-5 dataset providing closer agreement between observed and simulated annual maximum river flows, particularly for the Peruvian Amazon. With the time series of observed data often beginning in the 1980s in the Amazon, ERA-5 could provide a useful tool for analysing historical flows and establishing links to climate variability. Upon completion, ERA-5 will date back to 1950 (Zsoter et al., 2019), meaning locations in which model skill is considered high could benefit from up to 30 years' worth of additional data for use in climate studies, thus allowing for more robust analysis. In future work, it could be of interest to compare the performance of ERA-5 against a wider range of precipitation datasets, such as the Multi-Source Weighted-Ensemble Precipitation (MSWEP) product that carefully integrates gauge, satellite, and reanalysis-based estimates. The Beck et al. (2017b) evaluation of 22 precipitation datasets previously demonstrated the advantages of using merged products for hydrological modelling purposes.
The H-TESSEL LSM and the PCR-GLOBWB hydrological model are directly compared whereby the precipitation forcing (ERA-Interim Land) and river routing scheme (CaMa-Flood) are consistent. Overall, it appears that the choice between using a LSM or a hydrological model in the Amazon basin is dependent not only on the specific region of interest, but also on the application and needs of the user. Previous studies (Zhang et al., 2016; Beck et al., 2017a) have found that LSMs, on average, perform better in rainfall-dominant regions, whereas hydrological models tend to achieve better results in snow-dominated regions owing to the use of complex energy balance equations introducing additional uncertainties. For the Amazon basin, Spearman's rank correlation coefficients between simulated and observed peak river flow are closely matched, with medians of 0.24 and 0.23 for H-TESSEL and PCR-GLOBWB respectively (Table 2). However, the number of stations with Spearman's maximum correlation scores exceeding 0.6 is slightly higher in PCR-GLOBWB at seven compared to three with H-TESSEL (Fig. 6a and b).
To illustrate the gain or loss in skill when using H-TESSEL relative to
PCR-GLOBWB, Spearman's annual maximum correlation and KGE skill scores
were calculated for each station (Fig. 9g and h). Overall, 68 % of
the stations show improved skill for peak river flow correlations when using the
LSM, though the gain in skill is minimal (median correlation skill
score
A large loss in performance for the KGE is observed when using H-TESSEL for stations in the Peruvian Amazon at the confluence point to the Solimões River (Fig. 9h). Model performance in this region can largely be attributed to the failure of the H-TESSEL CaMa-Flood run to accurately represent the variance of flow and the temporal correlation component of the KGE, with the variability of modelled flow far higher than in the observed data (Fig. 4a). Northern regions in the Branco basin and stations situated towards the Colombian Amazon show the opposite effect with higher KGE coefficients found for the H-TESSEL CaMa-Flood run (Fig. 2a), indicating that model suitability is regionally specific.
Calibration of hydrological models is known to be a useful tool in providing more accurate estimates of river flow (Beck et al., 2017a). However, due to a lack of data and the computational expense required in the calibration of GHMs, many remain uncalibrated (Bierkens, 2015; Sood and Smakhtin, 2015). Both Gupta et al. (2009) and Mizukami et al. (2019) demonstrate that square error-type metrics are unsuitable for model calibration when the model in question requires robust performance for high river flows. Improvement of flow variability estimates was documented in both studies when switching the calibration metric from the NSE to the KGE for both a simple rainfall–runoff model (similar to the HBV model; Bergström, 1995) and for two more complex hydrological models (Variable Infiltration Capacity and mesoscale Hydrologic Model), suggesting similar results are likely to be achieved for other hydrological models. To investigate the potential benefits of routing model calibration, whereby the KGE was used as the objective function, the time series of river discharge for the calibrated Lisflood runs forced using the ERA-Interim Land and ERA-5 reanalysis datasets were compared against the associated default set-ups without routing calibration.
Overall, hydrological performance improves upon model parameter calibration,
with positive KGE skill scores (i.e. an increase in skill) at 61 %
(59 %) of gauging stations for simulations forced with ERA-Interim Land (ERA-5) (Fig. 9c and d). The influence of calibration is stronger for the simulation forced with ERA-5, with the number of stations achieving
“intermediate” KGE scores (i.e. 0.75
Three stations (2–4) in the Peruvian Amazon show increased KGE skill scores when using the calibrated ERA-5 run relative to the similar uncalibrated set-up (Fig. 9d). Conversely, a loss in skill is observed at each station for the calibrated run forced using ERA-Interim Land (Fig. 9c). These results are likely associated with a larger negative runoff bias within the ERA-Interim Land Lisflood_uc run relative to the ERA-5 Lisflood_uc simulation for the three stations (Fig. 5c and e). This is supported by Hirpa et al. (2018), who concluded that stations which have a negative streamflow bias in the default run (i.e. Lisflood_uc) also have a negative KGE skill score in the calibrated simulation owing to the challenge of correcting for a water deficit within the routing component. Thus, for GHMs which tend to underestimate runoff, adjustments of parameters within the LSM or hydrological model (e.g. those responsible for the portioning of precipitation into runoff) or through bias-correction measures within the precipitation dataset may be advantageous in efforts to accurately represent floods.
No significant differences between calibrated and uncalibrated Lisflood annual maximum correlation scores are identified (Fig. 7a and Table 2). In total, the number of stations exceeding the 0.6 threshold for peak flow correlations remains the same for runs involving ERA-5 and decreases by one for ERA-Interim Land, meaning that the routing model calibration has very little impact on the ability to capture annual peaks. This suggests that calibrated parameters controlling flow timing (e.g. Manning's channel coefficient) are not as important for simulating the magnitude of higher flows in the Amazon basin and that bias correction of the precipitation or calibration of parameters associated with runoff and evapotranspiration might be more useful. As previously highlighted by Hirpa et al. (2018), the inclusion of an objective function that is explicitly based on flood peaks could improve the ability of Lisflood to simulate floods. This is supported by previous studies (Greuell et al., 2015; Beck et al., 2017a; Mizukami et al., 2019) which have also identified that improved performance in calibrated models is predominately specific to metrics which are incorporated into the objective function used within the calibration. For instance, in Mizukami et al. (2019), they find that when using an application-specific metric (annual peak flow bias; APFB) for the calibration of two hydrological models, it produced the best peak flow annual estimates compared to using the NSE, KGE, and its components. However, despite this improvement, flood magnitudes were still underestimated for all metrics used in calibration, and the use of the APFB as the calibration metric resulted in poorer performance across the individual KGE components upon evaluation.
While estimating the magnitude of peak river flows is fundamental, more evaluation is required in assessing the ability to represent the timing of flood peaks. Modelled flood peaks have been known to occur too early in large Amazonian rivers (Alfieri et al., 2013; Hoch et al., 2017b), with accurate flow timing of significant importance in the Amazon basin. For example, the time displacements between peak flows in coinciding tributaries are known to play a major role in the dampening of the Amazon flood wave (Tomasella et al., 2010) and in the synchronization of flood peaks, commonly associated with exceptional flood events (e.g. Marengo et al., 2012; Espinoza et al., 2013; Ovando et al., 2016). Additional evaluation using metrics which focus specifically on the timing aspect, such as the delay index (Paiva et al., 2013), would enable a more complete assessment of the hydrological modelling regime.
A limitation of this type of study is due to the intercomparison being restricted to the macroscale (i.e. only a subset of potential modelling configurations is considered). In future work it would be useful to increase the granularity of the modelling decision matrix to allow conclusions to be more generalized across the modelling community. For instance, when comparing the performance of the Lisflood and CaMa-Flood routing models, the results are specific to the simulations forced using the ERA-Interim Land reanalysis dataset. Although useful in providing a general indication of routing performance for each model when using a climate reanalysis dataset, the conclusions are specific to that particular comparison, with differing results possible when using another precipitation input. Future work could investigate one of the research questions stated in the objectives (Sect. 1.5) at a finer resolution, for example by comparing several different runs which use the Lisflood and CaMa-Flood routing models, whereby a greater variety of precipitation inputs are considered (e.g. MSWEP, CHIRP V2.0, ERA-5, TRMM v.7). Such analysis would allow more general conclusions and recommendations to be made to the modelling community, who are interested in those particular routing schemes. A similar approach could be adopted for the assessment of other components of the hydrological modelling chain.
In this paper, eight different GHMs were employed in an intercomparison analysis using two verification metrics to assess model performance against gauged river discharge observations. The motivation for this work stemmed from the need to evaluate the ability of GHMs to reproduce historical floods in the Amazon basin for use in climate analysis and to identify the strengths and weaknesses which exist along the hydrological modelling chain in order to provide insight to model developers. The implications of these results suggest that the choice of precipitation dataset is the most influential component of the GHM set-up in terms of our ability to recreate annual maximum river flows in the Amazon basin. This is evident with average station correlations between observed and simulated annual maximum river flows increasing when using the new ERA-5 reanalysis dataset, with significant improvements in locations of the Peruvian Amazon. In this region, waters are sourced from Andean origins where rainfall can often be poorly represented due to topographically complex terrains (Paiva et al., 2013). Thus, those wishing to simulate higher flows in the upper reaches of the Amazon may benefit from choosing a precipitation dataset which has a high spatial resolution, whereby the upper atmosphere is discretized at finer scales. Although an exact recommended spatial resolution cannot be provided based on the results of this study alone, previous works (e.g. Beck et al., 2017b) support the need for a comparatively high-resolution dataset in addition to other advantageous factors such as a long temporal record and the inclusion of daily gauge corrections.
Although parameter calibration of the Lisflood routing model improved the representation of the whole hydrological regime across the basin, the agreement between observed and simulated peak discharge values saw no change upon calibration. This indicates that the benefit of calibration is confined to the objective function used, in this case the KGE, and highlights that further model calibration using an objective function that fits the purpose of the application (e.g. RMSE of flood peaks or APFB for flood forecasting systems) could be worth considering. It is important to reiterate however that thoughtful consideration is required if choosing application-specific metrics, with the potential to degrade performance in other aspects of the hydrological regime (e.g. bias and flow variability ratios) a concern (Mizukami et al., 2019). The relative importance of good performance in the specific target metric compared to better performance for a range of metrics should be assessed on a model-by-model and circumstantial basis, taking into account the needs of potential users.
All of the data and models used in this study were obtained from collaborators of the Global Flood Partnership (GFP) and are freely available. Access to these sources is mentioned in Sect. 2.
The supplement related to this article is available online at:
EZ provided data and information for all simulations incorporating Lisflood and for the ERA-Interim Land H-TESSEL CaMa-Flood set-up. ZF and JM provided data and information for the TRMM CREST EF5 and ERA-Interim Land PCR-GLOBWB CaMa-Flood runs respectively. ES, HC, JB, and EC supervised the research and provided important advice. ES, HC, and JT designed the analysis and JT undertook the research in addition to writing the paper. All the authors were involved in discussions throughout the development and commented on the manuscript.
The authors declare that they have no conflict of interest.
Jamie Towner is grateful for financial support from the Natural Environment Research Council (NERC) as part of the SCENARIO Doctoral Training Partnership (grant agreement NE/L002566/1). The first author is grateful for travel support and funding provided by the Red Cross Red Crescent Climate Centre, to the research and national services, SO-HYBAM, IRD, SENAMHI, ANA, and INAMHI, for providing observed river discharge data, and to the ECMWF for computer access and technical support. Finally, specific thanks go to Christel Prudhomme and the Environmental Forecasts team in the Evaluation Section at the ECMWF for their advice and support throughout the analysis and writing of the manuscript.
This research has been supported by the SCENARIO NERC (grant no. NE/L002566/1).
This paper was edited by Stacey Archfield and reviewed by Gemma Coxon and Andrew Newman.