A pan-African medium-range ensemble flood forecast system

V. Thiemig¹,², B. Bisselink¹, F. Pappenberger³,⁴,⁵, and J. Thielen¹

¹Institute for Environment and Sustainability, Joint Research Centre, Ispra, Italy
²Utrecht University, Faculty of Geosciences, Utrecht, The Netherlands
³European Centre for Medium-Range Weather Forecasts (ECMWF), Reading, UK
⁴School of Geographical Sciences, University of Bristol, Bristol, UK
⁵College of Hydrology and Water Resources, Hohai University, Hohai, China

Correspondence to: V. Thiemig (vera.thiemig@jrc.ec.europa.eu)


Introduction
Riverine floods rank as the second most deadly natural disaster in Africa, surpassed only by droughts (Vos et al., 2009). The number of flood-related casualties, the number of affected people and the associated economic losses have increased significantly in Africa since the mid-1990s (Guha-Sapir et al., 2012), owing to the growth of human settlements in flood-prone areas rather than to possible effects of climate change (Di Baldassarre et al., 2010). A further influencing factor is that most medium- to large-size African river basins are transnational: Bakker (2009) reported that floods in transnational river basins result in larger losses than floods in national basins. As a result, flood risk management in Africa has recently gained increased attention in the political and scientific arenas (Portuguese Space Office, 2007). Both the Hyogo Framework (United Nations, 2005) and Rio+20 (UNCSD Secretariat, 2012) promote strengthening the resilience of African nations so that they can withstand and recover quickly from the impacts of events of hydrometeorological origin. Their prime focus is the substantial reduction of disaster losses, in lives as well as in social, economic and environmental assets. The development of effective early warning systems is therefore fundamental.
An inventory of the "current status on flood forecasting and early warning in Africa", based on a review of literature, institutional websites and a questionnaire (Thiemig et al., 2011), revealed a large number of institutional initiatives presently active in flood risk management. An increasing number of them focus on the development of hydrological forecasting systems. Most forecasting endeavours target either short-range (< 3 days) or long-range (> 2 weeks) forecasts, but hardly any target the medium range (3-15 days). However, medium-range forecasts are crucial for reducing flood-related losses, as they provide more time for decision-making and preparation than short-range forecasts and more accurate estimates than seasonal forecasts (Thielen et al., 2009a). In particular, probabilistic medium-range flood forecasts based on a meteorological EPS (Ensemble Prediction System), also called HEPS (Hydrological Ensemble Prediction System), add value because they increase the capability to issue flood warnings earlier and with more confidence than deterministic forecasts, given that they account for the associated uncertainties (see http://www.hepex.org).
Large research efforts by numerous flood working groups have resulted in an assortment of operational HEPS for various spatial scales (Table 1; Pappenberger et al., 2014). Over the past decade, these systems have demonstrated their potential to contribute substantially to the prevention and mitigation of flood-related losses by giving additional decision and preparation time prior to a flood event (Dale et al., 2014; He et al., 2010; Pappenberger et al., 2011; Roulin, 2007). A pan-African HEPS could bridge the gap between the short- and long-range flood forecasting systems that already exist in part.
An example of a HEPS operating at the continental scale is the European Flood Awareness System (EFAS; Pappenberger et al., 2011; De Roo et al., 2011; Thielen et al., 2009b). EFAS uses multiple meteorological weather forecasts, both deterministic (DET) and probabilistic (EPS) (i.e. ECMWF-DET, ECMWF-EPS, German Weather Service-DET and COSMO-LEPS), as input to the hydrological model LISFLOOD (Van Der Knijff et al., 2010). Using the same model and parameters for long-term simulations of the hydrological conditions in previous decades allows the calculation of flood-warning thresholds such as the 5-, 10- and 20-year return periods. Applying these thresholds to the forecasts converts the ensemble streamflow calculations into effective flood forecasts with up to 10 days' lead time. The transferability of the EFAS methods to other climatic regions and flood types has been extensively and successfully tested by Alfieri et al. (2012, 2013) and Thiemig et al. (2010). Additionally, Trambauer et al. (2013) confirmed LISFLOOD's suitability as a hydrological forecasting model at the pan-African scale, mainly owing to its comprehensive representation of the most relevant hydrological processes and its applicability as an operational forecasting system with the available data. Therefore, to set up an African flood forecasting system we adopted the methodologies developed for EFAS and calibrated LISFLOOD for African conditions. The resulting African Flood Forecasting System (AFFS) has the potential to be the first system providing probabilistic medium-range hydrological predictions for the entire African continent.
The aim of this study is to investigate the capability of AFFS to predict flood events, in order to assess its potential as an operational flood forecasting system that could in future help reduce flood-related losses by providing national and international aid organizations with timely flood forecast information. The predictive capability is assessed in hindcast mode: for every day of the flood-intense year 2003, 50 hydrological forecasts are calculated over a lead time of 10 days. Applying hydrological thresholds to the resulting ensemble of hydrological predictions yields spatially distributed flood signals. The forecasting capacity of AFFS is assessed from two perspectives: its ability to detect and predict flood events, and its overall performance in predicting streamflow. The first is of paramount importance for assessing AFFS as a flood forecasting system. It is done through an event-based analysis that compares the AFFS flood signals against information collected from various disaster databases, such as the Dartmouth Flood Observatory, the Emergency Events Database EM-DAT, the NASA Earth Observatory and Reliefweb, to determine the number of hits, false alarms and misses as well as the probability of detection (POD), false alarm rate (FAR) and Critical Success Index (CSI). Further, to illustrate the flood forecast performance of AFFS and to give an example of its potential output, the hindcast for the March 2003 flood event in the Sabi Basin (Zimbabwe) is presented in detail. The second part of the analysis, which concerns streamflow in general, is of secondary importance for the assessment of AFFS, as it does not target flood events specifically. However, for the sake of completeness, a basic insight into the prediction of general streamflow is given.
This is done by calculating the Continuous Ranked Probability Skill Score (CRPSS), a statistical indicator for probabilistic forecasts, together with the limit of predictability and the reliability, for 36 key locations across Africa, to gain an understanding of the general accuracy and the reliable time span of the streamflow forecasts. The two analyses complement each other in revealing the strengths and shortcomings of AFFS.
The remainder of this article is structured as follows: Sect. 2 outlines the material and methods used, including the study area, the input data, the structure and functionality of AFFS and the hydrological model LISFLOOD, the setup of the pan-African hindcast and the verification metrics. Sect. 3 presents the results on LISFLOOD's model performance and on the forecast capability of AFFS, while Sect. 4 contains a detailed discussion of the results and study limitations, as well as a final conclusion.

An overview of the topographical, meteorological and hydrological conditions, including the delineation of the hydrological basins, altitude and river basin size, time period and length of the wet season, mean annual precipitation, mean annual river discharge, discharge station network and the dominant land use/cover, is presented in Fig. 1.

Hydrological reference data
Discharge observations are required for the optimization of LISFLOOD, whereas information about floods, in particular on when, where and with what magnitude a flood event occurred, is required to verify the performance of AFFS. Therefore, discharge observations and flood information retrieved from various flood archives were employed as hydrological reference data.

Flood archives
Various disaster databases, such as the Dartmouth Flood Observatory (Brakenridge, 2013), the Emergency Events Database EM-DAT (Guha-Sapir et al., 2012), the NASA Earth Observatory (NASA, 2003) and Reliefweb, were used to compile a list of flood events reported for Africa in 2003. Excluding flash floods, 39 medium- to large-scale flood events were identified. Information on the location and time period of these events, together with the outline of the affected area, was compiled into a database (see Fig. 2) and used as the reference for the event-based verification of the hindcasting performance of AFFS.

Discharge observations
Daily discharge records were collected from various national hydrological centres and databases, such as the Ethiopian Ministry of Water and Energy, the GLOWA Volta Project, FAO Somalia Water and Land Information Management, the Global Runoff Data Centre (GRDC) and the South African Department of Water Affairs and Forestry (DWAF). The resulting ground observation network comprises 36 discharge measuring stations holding daily observations between 2003 and 2008 (Fig. 1e). The distribution of stations is not homogeneous but clustered in certain regions, such as southern Africa, the Zambezi basin and western Africa.

Meteorological data
Two meteorological data sources were used: GPCP-corrected (Global Precipitation Climatology Project-corrected) ERA-Interim and ECMWF-ENS. Technical specifications are given in Table 2. The first provided historical meteorological data for the model calibration as well as near real-time meteorological data for the calculation of the initial conditions. The second, the ensemble meteorological forecasts, was used to calculate the hydrological forecasts, i.e. the hindcasts.
Using GPCP-corrected ERA-Interim data as a proxy for near real-time meteorological data is only possible in hindcast mode; during real-time forecasting, the first day of each ECMWF deterministic forecast could be used instead.

Other data
Information on topography, river channel geometry, land use, and soil and vegetation properties was extracted from different data sources, such as the Harmonized World Soil Database 1.0, the VGT4AFRICA project and the SRTM (Shuttle Radar Topography Mission) elevation model. A list of all the required input maps is given in Burek et al. (2013), and a more detailed description of the sources of the input maps for Africa is given by Bodis (2009). In the following, we refer to this collection of thematic layers as the African GIS data set.

Structure and functionality
The African Flood Forecasting System (AFFS) aims to produce accurate probabilistic, medium-range flood forecast information at the pan-African scale up to 10 days in advance, which could in future support African water authorities with timely information to reduce flood-related losses by increasing preparation time.
A schematic overview, illustrating the structure and functionality of AFFS, is given in Fig. 3.
For the calculation of flood forecasts, AFFS requires a hydrological model, five main data sources and four main processes. The model selected for AFFS is the physically based hydrological model LISFLOOD, which is described in detail in Sect. 2.3.2. The five main data sources on which AFFS relies are: historical hydrological observations (see Sect. 2.2.1), historical meteorological observations and near real-time meteorological observations (see Sect. 2.2.2), real-time meteorological forecasts (see Sect. 2.2.2) and an African GIS data set (see Sect. 2.2.3). The four main processes AFFS runs are the calculation of hydrological thresholds, the computation of the initial hydrological conditions, the computation of the ensemble hydrological predictions and the identification of flood events. Each is described in the following.

1. The calculation of hydrological thresholds. Hydrological thresholds make it possible to distinguish between flood and no-flood situations, as well as between various flood magnitudes, when applied to the hydrological EPS predictions (step 4). The hydrological thresholds used within AFFS are the 2-, 5-, 10- and 30-year return periods. These are derived for each 0.1° pixel from a long-term discharge simulation obtained by forcing LISFLOOD with the African GIS data set and daily historical meteorological data (here over 21 years; 1989-2010).

2. The computation of the initial hydrological conditions. Information about the current hydrological conditions, i.e. all state variables of the water cycle, is required for each day of the forecasting period to initialize LISFLOOD prior to calculating the hydrological predictions.

4. The identification of flood events. The flood forecast itself results from comparing the ensemble of hydrological predictions (step 3) against the hydrological thresholds (step 1). A flood signal is identified if all of the following conditions are satisfied: first, at least 30 (or 15) of the 50 hydrological predictions exceed the 2-year (or 10-year) return-period threshold for at least 3 consecutive days; second, the upstream area is larger than 15 000 km²; and third, more than 40 clustered river pixels are affected.
The results are visualized in so-called "threshold exceedance maps", as well as ensemble quantile plots at key locations.
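The flood-signal criteria of step 4 can be sketched in Python as follows. This is a minimal, hypothetical illustration, not the AFFS implementation: all function names are invented, the exceedance test assumes that the member count (30 of 50, or 15 of 50) must be reached on each of at least 3 consecutive lead days, and the pixel clustering assumes 8-connectivity. Neither interpretation is specified in the text above.

```python
from collections import deque

# Criteria from step 4; everything else in this sketch is illustrative.
MIN_MEMBERS_2YR = 30       # of 50 members above the 2-year threshold
MIN_MEMBERS_10YR = 15      # of 50 members above the 10-year threshold
MIN_CONSECUTIVE_DAYS = 3
MIN_UPSTREAM_AREA_KM2 = 15_000
MIN_CLUSTER_PIXELS = 40    # "more than 40" clustered river pixels


def persistent_exceedance(ensemble, threshold, min_members, min_days=MIN_CONSECUTIVE_DAYS):
    """ensemble: list of members, each a list of daily discharges over the lead time.
    True if at least `min_members` members exceed `threshold` on each of at
    least `min_days` consecutive lead days (assumed interpretation)."""
    run = 0
    for day in range(len(ensemble[0])):
        count = sum(1 for member in ensemble if member[day] > threshold)
        run = run + 1 if count >= min_members else 0
        if run >= min_days:
            return True
    return False


def pixel_signal(ensemble, q2, q10, upstream_area_km2):
    """Per-pixel test: exceedance of either threshold plus the upstream-area criterion."""
    exceeds = (persistent_exceedance(ensemble, q2, MIN_MEMBERS_2YR)
               or persistent_exceedance(ensemble, q10, MIN_MEMBERS_10YR))
    return exceeds and upstream_area_km2 > MIN_UPSTREAM_AREA_KM2


def flood_clusters(flags, min_pixels=MIN_CLUSTER_PIXELS):
    """flags: 2-D grid of booleans from pixel_signal. Returns the connected
    clusters (8-connectivity, an assumption) with more than `min_pixels` pixels."""
    rows, cols = len(flags), len(flags[0])
    seen = [[False] * cols for _ in range(rows)]
    clusters = []
    for r in range(rows):
        for c in range(cols):
            if not flags[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, cluster = deque([(r, c)]), []
            while queue:  # breadth-first flood fill over neighbouring flagged pixels
                i, j = queue.popleft()
                cluster.append((i, j))
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ni, nj = i + di, j + dj
                        if (0 <= ni < rows and 0 <= nj < cols
                                and flags[ni][nj] and not seen[ni][nj]):
                            seen[ni][nj] = True
                            queue.append((ni, nj))
            if len(cluster) > min_pixels:
                clusters.append(cluster)
    return clusters
```

The grid passed to `flood_clusters` is treated as river pixels only, which is a further simplification of how threshold exceedance maps would be built.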

Hydrological modelling framework LISFLOOD
LISFLOOD is a fully distributed, physically based hydrological model (Van Der Knijff et al., 2010) that simulates the spatial and temporal pattern of catchment responses in medium- to large-scale river basins as a function of spatial information on meteorology, topography, soil and land cover. LISFLOOD was originally developed to simulate hydrological processes in large river basins and was later optimized for flood forecasting at the European scale within the framework of the European Flood Awareness System (www.efas.eu; Pappenberger et al., 2011; Ramos et al., 2007; Thielen et al., 2009b). Since then, its range of application has been successfully extended to studies of climate change impacts (…Feyen, 2008, 2009; Feyen et al., 2009; Rojas et al., 2012), flash flood forecasting (Alfieri et al., 2012) and water resources (Mubareka et al., 2013; Sepulcre-Canto et al., 2012). For a full description of the model structure and equations, the reader is referred to Burek et al. (2013). For AFFS, LISFLOOD was set up at the pan-African scale with a spatial resolution of 0.1°. The model structure was extended to account for large reservoirs and for transmission losses along the river channel, which are very significant in large river systems in semi-arid areas (Haddeland et al., 2011). All GIS-based model parameters were either extracted or derived from the African GIS data set (see Sect. 2.2.3).
In the current setup, layers of water use information from the Global Crop Water Model (GCWM; …Döll, 2008, 2010) are dynamically coupled with LISFLOOD. Water is assumed to be abstracted solely from the river discharge, not from internal storage.
The local drain direction network (LDD) of the African river basins was developed by a sequence of upscaling operations performed on the flow network derived from a high-resolution elevation model of Africa based on the Shuttle Radar Topography Mission (SRTM). Upscaling from a fine to a coarser scale can degrade the accuracy of the hydrography data, which usually calls for manual corrections. In the current pan-African setup, we applied the new algorithm for automatic upscaling of river networks developed by Wu et al. (2011), which addresses many of these upscaling issues.
Meteorological variables were obtained from the ERA-Interim and ECMWF-EPS fields (Simmons et al., 2007) from the European Centre for Medium-range Weather Forecasts (ECMWF). Parameters related to groundwater response, infiltration, groundwater losses, channel routing and reservoir operating rules were determined through model calibration.

Calibration
For the pan-African setup, LISFLOOD was calibrated at the 36 locations shown in Fig. 1e (black dots) using daily discharge records over a period of five years (2004-2008; 2003 used as warm-up). These 36 sub-catchments belong to 11 hydrological basins. To drive LISFLOOD during calibration, ECMWF ERA-Interim precipitation corrected with the Global Precipitation Climatology Project (GPCP) data set was used, because Balsamo et al. (2010) and Di Giuseppe et al. (2013) reported systematic biases in the ERA-Interim precipitation data. Details of the rescaling method can be found in Balsamo et al. (2010). The calibration was performed with hydroPSO (Zambrano-Bigiarini and …), a state-of-the-art particle swarm optimization (PSO) algorithm designed specifically for hydrological applications, which has recently been applied successfully to optimize LISFLOOD over various African river basins (Thiemig et al., 2013). The model parameters selected for calibration are listed in Table 3, together with their physically reasonable ranges.
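To illustrate the principle behind the calibration, the following is a minimal global-best particle swarm optimizer in Python. It is not hydroPSO, whose algorithmic details and defaults are not described here: the inertia weight, the acceleration coefficients, the clamping of positions to the parameter ranges (cf. Table 3) and the toy linear-reservoir model are all assumptions. The sketch minimizes a sum of squared errors for brevity; in the calibration described above the objective would be based on KGE' (e.g. minimizing 1 − KGE').

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iter=100,
                 inertia=0.7, c_cognitive=1.5, c_social=1.5, seed=1):
    """Global-best particle swarm optimization within box bounds.
    objective: maps a parameter list to a value to minimize.
    bounds: list of (lower, upper) tuples, one per parameter.
    Returns (best_parameters, best_value)."""
    rng = random.Random(seed)
    dim = len(bounds)
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_val = [objective(p) for p in positions]
    g = min(range(n_particles), key=lambda i: personal_val[i])
    global_best, global_val = personal_best[g][:], personal_val[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                velocities[i][d] = (inertia * velocities[i][d]
                                    + c_cognitive * r1 * (personal_best[i][d] - positions[i][d])
                                    + c_social * r2 * (global_best[d] - positions[i][d]))
                # Keep each parameter inside its physically reasonable range.
                positions[i][d] = min(max(positions[i][d] + velocities[i][d], lo), hi)
            value = objective(positions[i])
            if value < personal_val[i]:
                personal_best[i], personal_val[i] = positions[i][:], value
                if value < global_val:
                    global_best, global_val = positions[i][:], value
    return global_best, global_val


def linear_reservoir(params, rain):
    """Hypothetical toy model: storage constant k [days], runoff coefficient c [-]."""
    k, c = params
    storage, flows = 0.0, []
    for p in rain:
        storage += c * p
        q = storage / k
        storage -= q
        flows.append(q)
    return flows


rain = [0, 10, 30, 5, 0, 0, 20, 0, 0, 0, 15, 0, 0, 0]
observed = linear_reservoir((5.0, 0.6), rain)  # synthetic "observations"


def sse(params):
    return sum((s - o) ** 2 for s, o in zip(linear_reservoir(params, rain), observed))


best_params, best_sse = pso_minimize(sse, bounds=[(1.0, 20.0), (0.1, 1.0)])
```

Bounding each particle to its parameter range mirrors the idea of calibrating only within physically reasonable ranges; real calibration tools add further safeguards (e.g. velocity limits, convergence criteria) that are omitted here.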

Test: Pan-African hindcast
The potential of AFFS as a future pan-African flood forecasting system for medium- to large-scale river basins in the medium range (up to 10 days' lead time) is tested in a retrospective analysis, in which hydrological predictions, so-called hindcasts, are calculated for a past period for which the true hydrological situation is already known. Comparing the hindcasts against available information on the true hydrological situation makes it possible to assess the predictive capabilities of AFFS. A pan-African hindcast was therefore computed for the whole year 2003.
The hindcast was computed with AFFS using the calibrated LISFLOOD setting (Sect. 2.3.2) and following the workflow as described in Sect. 2.3.1.
The hydrological thresholds (2- and 10-year return periods) were derived for each 0.1° pixel from a long-term discharge simulation obtained by forcing LISFLOOD with daily GPCP-corrected ERA-Interim data over a period of 21 years. The initial hydrological conditions, i.e. all state variables, were computed for each forecasting date between 1 January and 31 December 2003 by running LISFLOOD with daily GPCP-corrected ERA-Interim data. The ensemble of hydrological predictions was computed by forcing LISFLOOD for each forecasting date with the previously determined daily initial conditions and the respective real-time meteorological forecast. Here, we employed the 10-day probabilistic ECMWF-ENS (Buizza et al., 2007, 2008; Leutbecher and Palmer, 2008) as the real-time meteorological forecast, since the 15-day ECMWF-ENS (Buizza et al., 2007) was only available after March 2003. Flood events were identified by comparing the ensemble of hydrological predictions against the critical thresholds.
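The derivation of return-period thresholds from the long-term simulation can be sketched as below. The text does not state which extreme-value distribution is fitted; this sketch assumes a Gumbel distribution fitted by the method of moments to the annual maxima of the simulated daily discharge at a single pixel, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev

EULER_GAMMA = 0.5772156649


def annual_maxima(daily_by_year):
    """daily_by_year: dict mapping year -> list of simulated daily discharges [m3 s-1]."""
    return [max(values) for _, values in sorted(daily_by_year.items())]


def gumbel_return_level(maxima, return_period_years):
    """Discharge with the given return period from a Gumbel fit (method of moments)."""
    scale = math.sqrt(6.0) * stdev(maxima) / math.pi
    location = mean(maxima) - EULER_GAMMA * scale
    non_exceedance = 1.0 - 1.0 / return_period_years
    return location - scale * math.log(-math.log(non_exceedance))


# Usage with three (made-up) annual maxima: thresholds for the 2- and 10-year return periods.
maxima = [100.0, 200.0, 300.0]
thresholds = {T: gumbel_return_level(maxima, T) for T in (2, 10)}
```

With a 21-year simulation, each pixel would supply 21 annual maxima; with so few values, the fitted tails are sensitive to the choice of distribution and estimator.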

Calibration
The performance of each calibration iteration was assessed using the modified Kling-Gupta Efficiency (KGE') (Kling et al., 2012).
The KGE' is a recent performance indicator based on the equal weighting of linear correlation (r), bias ratio (β) and variability ratio (γ) between simulated (s) and observed (o) discharge:

$$\mathrm{KGE}' = 1 - \sqrt{(r-1)^2 + (\beta-1)^2 + (\gamma-1)^2}, \qquad \beta = \frac{\mu_s}{\mu_o}, \qquad \gamma = \frac{CV_s}{CV_o} = \frac{\sigma_s/\mu_s}{\sigma_o/\mu_o} \tag{1}$$

where r is the Pearson product-moment correlation coefficient, µ is the mean discharge [m³ s⁻¹], CV is the coefficient of variation and σ is the standard deviation of the discharge [m³ s⁻¹]. KGE', r, β and γ are dimensionless, and their optimum is at unity. Because every squared term is non-negative, KGE' is bounded by its poorest sub-component: KGE' ≤ 1 − |x − 1| for each x ∈ {r, β, γ}. The hydrological performance can be classified into performance classes based on KGE' (Kling, 2012). The benefits of using the modified Kling-Gupta Efficiency (KGE') over the original version (KGE) or the Nash-Sutcliffe Efficiency are discussed by Gupta et al. (2009) and demonstrated by Thiemig et al. (2013).
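A direct transcription of the KGE' definition into Python can help when checking values by hand. It is a sketch, not the calibration code used in this study; the population standard deviation is used, which does not affect γ because the same normalization appears in its numerator and denominator.

```python
import math
from statistics import mean, pstdev


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def kge_prime(sim, obs):
    """Return (KGE', r, beta, gamma) for simulated and observed discharge series."""
    r = pearson_r(sim, obs)
    beta = mean(sim) / mean(obs)                                     # bias ratio
    gamma = (pstdev(sim) / mean(sim)) / (pstdev(obs) / mean(obs))   # ratio of CVs
    kge = 1.0 - math.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)
    return kge, r, beta, gamma


# A simulation that is a uniform 20 % overestimate has r = 1 and gamma = 1 but beta = 1.2.
obs = [10.0, 20.0, 35.0, 15.0, 50.0]
kge, r, beta, gamma = kge_prime([1.2 * q for q in obs], obs)
```

The uniform-overestimate example shows why KGE' separates error sources: correlation and variability are perfect, so the whole penalty comes from the bias ratio.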
After the calibration, a unique "best" parameter set was obtained. For catchments lacking sufficient data for model calibration, default values without calibration were used for the model parameters (Table 3).

Hindcast
The capability of AFFS to predict streamflow in general, and flood events in particular, is assessed by comparing the hindcasting results with available ground observations and information from disaster databases respectively, using various evaluation methods presented in detail in the following.

General streamflow
The performance in predicting streamflow is evaluated using the Continuous Ranked Probability Skill Score (CRPSS). The CRPSS compares the CRPS (Continuous Ranked Probability Score) of the forecast, which measures the distance between the cumulative distribution function of a probabilistic forecast (P_hydEPS) and that of the observation (P_obs), with the CRPS of a benchmark:

$$\mathrm{CRPS} = \int_{-\infty}^{\infty} \left[ P_{\mathrm{hydEPS}}(x) - P_{\mathrm{obs}}(x) \right]^2 \, \mathrm{d}x, \qquad P_{\mathrm{obs}}(x) = H(x - x_{\mathrm{obs}}) \tag{2}$$

$$\mathrm{CRPSS} = 1 - \frac{\mathrm{CRPS}_{\mathrm{hydEPS}}}{\mathrm{CRPS}_{\mathrm{benchmark}}} \tag{3}$$

where x_obs is the observed discharge and H is the Heaviside function (Hersbach, 2000). The CRPSS must be computed rather than the CRPS, because the latter depends on the magnitude of discharge and therefore does not allow spatial comparison across catchments; a normalized version of the CRPS is needed (Trinh et al., 2013). Values of the CRPSS range from minus infinity to 1, where 1 represents the optimum and negative values indicate a non-skilful forecast. In this study, the CRPSS was calculated for each lead time at the 36 key locations across Africa. Two benchmarks were considered: first, the seasonal mean (i.e. the moving average over the 30 days before and after the respective observation) and second, persistence (i.e. the last observation is kept constant over the forecasting range; Bauer-Gottwein et al., 2015). Hence, the CRPSS computed here evaluates the advantage of using the AFFS flood forecast instead of the seasonal mean or persistence as an alternative approximation to a forecast. An average CRPSS was computed over all stations and for different regions (eastern, southern and western Africa).
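The CRPS of a finite ensemble and the CRPSS against the two benchmarks can be sketched in Python as follows. The ensemble CRPS uses the standard identity CRPS = E|X − y| − ½E|X − X'| over the empirical ensemble distribution, which equals the integral of the squared CDF difference for a step-function CDF; a deterministic benchmark (a single value) reduces to the absolute error. Averaging the CRPS over all forecast dates before forming the ratio, and the exact window handling of the seasonal mean, are assumptions of this sketch.

```python
def crps_ensemble(members, obs):
    """CRPS of an ensemble forecast against one observation (empirical CDF)."""
    n = len(members)
    mean_abs_error = sum(abs(x - obs) for x in members) / n
    half_mean_spread = sum(abs(a - b) for a in members for b in members) / (2 * n * n)
    return mean_abs_error - half_mean_spread


def seasonal_mean(obs, t, half_window=30):
    """Benchmark: moving average of the observations 30 days before and after day t
    (assumed to include day t and to be truncated at the ends of the series)."""
    lo, hi = max(0, t - half_window), min(len(obs), t + half_window + 1)
    return sum(obs[lo:hi]) / (hi - lo)


def persistence(obs, issue_day):
    """Benchmark: last observation at issue time, held constant over the lead time."""
    return obs[issue_day]


def crpss(forecast_crps, benchmark_crps):
    """Skill score from CRPS values averaged over all forecast dates of one lead time."""
    mean_fc = sum(forecast_crps) / len(forecast_crps)
    mean_bench = sum(benchmark_crps) / len(benchmark_crps)
    return 1.0 - mean_fc / mean_bench


# Benchmark CRPS for a deterministic value is simply crps_ensemble([value], obs).
```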
The range of days in which the forecast is skilful is expressed by the "limit of predictability": the number of lead days before the ensemble of hydrological forecasts deviates, on average, more from the actual observation than the benchmark does. It marks the point up to which the forecasts add value compared to the seasonal mean or the last observation, and corresponds to the lead time at which the CRPSS drops to 0.
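Given the CRPSS per lead day, the limit of predictability can be read off as sketched below. This assumes that it is counted in whole lead days from day 1, without interpolating to the exact zero crossing of the CRPSS; the function name is hypothetical.

```python
def limit_of_predictability(crpss_by_lead_day):
    """Number of consecutive lead days, starting at day 1, with positive CRPSS."""
    days = 0
    for skill in crpss_by_lead_day:
        if skill <= 0:  # forecast no longer beats the benchmark
            break
        days += 1
    return days


# Skill that falls below the benchmark on day 4 gives a limit of 3 days.
example = limit_of_predictability([0.4, 0.2, 0.05, -0.1, 0.2])
```

A station whose CRPSS is already negative on day 1 gets a limit of predictability of zero, which matches the "no skilful predictions" case discussed in the results.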
A reliability diagram is used to assess how closely the forecast probabilities correspond to the actual chance of observing the event. It plots the frequency with which the event was observed for various sub-groups of forecast probabilities. A forecast system is perfectly reliable if the forecast probability equals the observed frequency of occurrence, i.e. if the plotted points lie on the identity line. Using the CRPSS, the reliability was calculated for each lead time at the 36 key locations.
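The construction of the points of a reliability diagram can be sketched as follows. This uses the classical binning approach rather than the CRPSS-based computation mentioned above, and it assumes that the event is a discharge exceeding a fixed threshold, with the forecast probability taken as the fraction of ensemble members above that threshold and ten equal-width probability bins; the function names are hypothetical.

```python
def exceedance_probability(members, threshold):
    """Forecast probability of the event: fraction of members above the threshold."""
    return sum(1 for x in members if x > threshold) / len(members)


def reliability_points(forecast_probs, observed, n_bins=10):
    """Group forecasts into equal-width probability bins and return, per non-empty bin,
    (mean forecast probability, observed relative frequency, number of forecasts)."""
    bins = [[] for _ in range(n_bins)]
    for p, occurred in zip(forecast_probs, observed):
        k = min(int(p * n_bins), n_bins - 1)  # p = 1.0 falls into the last bin
        bins[k].append((p, occurred))
    points = []
    for entries in bins:
        if entries:
            mean_p = sum(p for p, _ in entries) / len(entries)
            freq = sum(1 for _, occurred in entries if occurred) / len(entries)
            points.append((mean_p, freq, len(entries)))
    return points
```

Points close to the identity line (mean forecast probability ≈ observed frequency) indicate a reliable forecast; the count per bin shows how much each point can be trusted.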
The progression of the average CRPSS over the 10-day lead time is presented together with the limit of predictability and average reliability in Sect. 3.2.1.

Flood events
The ability of AFFS to detect flood events is assessed using a contingency table in combination with several skill scores such as the probability of detection (POD), the false alarm rate (FAR) and the Critical Success Index (CSI) that can be derived based upon that table.
The contingency table summarizes all possible forecast-observation combinations: hits (H; event forecasted and observed), misses (M; event observed but not forecasted), false alarms (FA; event forecasted but not observed) and correct negatives (CN; event neither forecasted nor observed) (see Table 4). The POD, FAR and CSI (Eqs. 4 to 6) further quantify the ability of AFFS to identify flood events by providing success and failure rates. The POD gives the proportion of observed flood events that were successfully forecasted, and the CSI gives the proportion of successfully forecasted events among all observed and forecasted events (hits, misses and false alarms); the FAR gives the proportion of forecasted flood events that were false alarms. They are calculated as follows:

$$\mathrm{POD} = \frac{H}{H + M} \times 100\,\% \tag{4}$$

$$\mathrm{FAR} = \frac{FA}{H + FA} \times 100\,\% \tag{5}$$

$$\mathrm{CSI} = \frac{H}{H + M + FA} \times 100\,\% \tag{6}$$

The optimum value is 100 % for POD and CSI and 0 % for FAR.

Results

Model calibration
Figure 5a presents the model performance of LISFLOOD in terms of KGE' during the calibration period (2004-2008) for the 36 catchments. For 31 of the 36 catchments (86 %), KGE' is greater than 0.5, and for 50 % it is greater than 0.75, indicating very good hydrological performance for most catchments. Poorer performances (KGE' < 0.5) are clustered in smaller tributaries in the arid area of South Africa and at a station on the Niger River located downstream of the Inner Niger Delta, where the observed discharge has different characteristics that are not captured by the simulations. The hydrological performance during the validation period (1998-2003) is illustrated in Fig. 5b. It shows KGE' for only 34 catchments, as no observations were available for the remaining two stations in this period. More than half of the KGE' values are greater than 0.5, and 29 % are greater than 0.7. The difference in KGE' between the calibration and validation periods is largest in the Zambezi catchment. Possible reasons can be seen in Fig. 6b: at this location data are lacking in the calibration period; moreover, the flow during the few years with available data was relatively low compared to that in the validation period, so the calibration did not cover the full range of flow conditions, which likely contributes to a suboptimal calibration. Figure 6 compares simulated and observed hydrographs for four selected locations in Africa (see Fig. 1e). For the Niger River (Fig. 6a), the flow dynamics are well reproduced during both calibration and validation, whereas the flow volume is well captured only during calibration and is underestimated during validation.
One reason for this could be related to the length of the calibration period for this catchment, which might be too short to determine the optimum value for the calibration parameters. Also in the Kafue River (Fig. 6b) the parameter optimization is only based on a 2-year period. However, the discharge is reproduced well during both calibration and validation, with the exception of the year 2001, in which the discharge is largely overestimated, resulting in a decreased KGE' of 0.36 during validation. For the Olifants River (Fig. 6c) the tendencies during both calibration and validation are similar, showing a fairly well captured flood dynamic with some extreme overestimations in flood volume resulting in a KGE' of 0.34 (calibration) and 0.56 (validation). For the Juba River (Fig. 6d), the KGE' indicates a satisfactory reproduction of discharges during calibration, but not during validation in which the KGE' is negative. This is due to the combination of the extreme overestimation in the year 2003 and the short length of the validation period.

General streamflow
The overall performance of the forecast is analysed by comparing the hydrological forecasts against ground observations using the CRPSS, the limit of predictability and the reliability.
In Fig. 7 the two sets of CRPSS are plotted over the 10 days' lead time. With the seasonal mean as the benchmark, the average CRPSS decreases as the lead time advances, meaning that AFFS' skill in forecasting streamflow decreases (Fig. 7a). This is confirmed by the number of stations with a positive CRPSS, which decreases continuously over the 10 days' lead time (Fig. 7b). Decomposing the CRPSS by region shows that only a small fraction of stations in eastern Africa (20 %) have skilful streamflow predictions, whereas in western Africa the majority of stations (70-90 %) do. With the last observation as the benchmark, however, the skill of the AFFS predictions increases steeply at day 2 and remains high until the end of the forecast (day 10). Thus, beyond 2 days' lead time, the AFFS forecasts are much more skilful than assuming that the last observation remains unchanged throughout the forecasting period. Figure 8 compares the forecast to the two benchmarks (seasonal mean and persistence, see Sect. 2.4.2) and indicates the number of days for which the forecast is skilful, i.e. the limit of predictability. At a few stations a skilful forecast is achieved up to day 10, whereas at some stations no skilful predictions were made for 2003 relative to the seasonal mean or the last observation.
Both the positive CRPSS and the limit of predictability > 0 show that hydrological calculations based on AFFS are on average more skilful than using the last observation (i.e. persistence) or the seasonal mean as the forecast. This is true in particular for the forecasting range day 2-10 in which the CRPSS is remarkably high (persistence as the benchmark).
As the skill of the conventional ESP (not shown here) decreases with increasing lead time in a similar way to the skill of AFFS, the decrease in forecasting performance cannot be attributed solely to possible inaccuracies of the meteorological ensemble predictions; other influencing factors must also play a role. Establishing the sources of predictability is beyond the scope of this paper and is a subject of future research. Nevertheless, cross-comparing the CRPSS and the limit of predictability with the KGE' obtained during calibration (Fig. 5a) suggests that the skill of AFFS in predicting streamflow depends strongly on the optimization of the hydrological model. Where LISFLOOD seems to be well fitted, as expressed by a good hydrological performance (KGE' > 0.6), the forecasts were mostly skilful (positive CRPSS), whereas forecasts without skill (negative CRPSS and a limit of predictability of zero) occurred exclusively at locations where KGE' was below 0.6 during calibration.
Hydrol. Earth Syst. Sci., 19, 3365-3385

Studies on global seasonal streamflow prediction (Yossef et al., 2013; van Dijk et al., 2013) show that the source of forecast skill varies from basin to basin. Their results suggest that forecast skill in monsoonal and semi-arid basins depends mainly on the skill of the meteorological predictions, whereas in large basins it depends more on the skill of the initial conditions. Among the catchments, AFFS showed particular skill in predicting streamflow for the Volta, Baro-Akobo, Kunene and Upper Zambezi river basins. Figure 9 illustrates the average reliability of AFFS. Each boxplot summarizes the median reliability of the 36 key stations over all lead times. The diagram shows that the observed frequency of occurrence increases with the forecast probability, closely following the identity line (grey line), which indicates good overall reliability of the forecasts. However, a slight underestimation of frequently occurring events is notable. A possible explanation is that flood events with short durations and/or small affected areas are more frequent than large-scale and long-lasting events, but at the same time more difficult to capture because of constraints imposed by the resolution of the input data and the model.

Flood events
Information on observed flood events was retrieved from several disaster databases (Fig. 2), while forecasted flood events were identified by inspecting the threshold exceedance maps (see step 4 in Sect. 2.3.1). In total, 40 flood events were forecasted for the year 2003; their time periods and locations are compiled in Fig. 4. Table 5 summarizes the ability of AFFS to identify flood events. Comparing the 39 reported flood events (Fig. 2) with the 40 forecasted ones (Fig. 4), 27 of the reported events were forecasted correctly by AFFS, 12 were missed, and 11 forecasted events were not reported. This results in an overall probability of detection (POD) of 69 %, a false alarm rate (FAR) of 29 % and a critical success index (CSI) of 54 %.
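These scores follow from a 2 × 2 contingency table of hits (reported and forecasted), misses (reported but not forecasted) and false alarms (forecasted but not reported). As a check on the arithmetic, the Python sketch below recomputes them from the counts given in the text. FAR is taken here as false alarms / (hits + false alarms), which in verification terminology is often called the false alarm ratio and is the definition that reproduces the 29 %; the function name is illustrative.

```python
def contingency_scores(hits, misses, false_alarms):
    """POD, FAR and CSI from the event counts of a 2 x 2 contingency table."""
    pod = hits / (hits + misses)                  # share of reported events that were forecast
    far = false_alarms / (hits + false_alarms)    # share of matched forecasts that were not reported
    csi = hits / (hits + misses + false_alarms)   # hits relative to all non-null outcomes
    return pod, far, csi


pod, far, csi = contingency_scores(hits=27, misses=12, false_alarms=11)
print(f"POD = {pod:.0%}, FAR = {far:.0%}, CSI = {csi:.0%}")
# POD = 69%, FAR = 29%, CSI = 54%
```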
To better understand which factors determine the strengths and limitations of AFFS in identifying flood events, the analysis was repeated for different flood durations (more or less than a week), climatic conditions (more or less than 600 mm average annual precipitation), estimated sizes of the affected area (more or less than 10 000 km²) and average annual discharges of the affected area (more or less than 10 km³ yr⁻¹); lastly, it was repeated for different African regions (northern, western, eastern and southern Africa), as these might be of particular interest to potential future users of AFFS (see Table 5). The analysis shows that the probability of AFFS detecting a flood event is particularly high for floods with a large affected area (> 10 000 km²), a long duration (> 1 week) and annual precipitation that is not very high (≤ 600 mm a⁻¹), whereas the probability of missing a flood event is notably higher if the flood is of short duration (≤ 1 week) or the affected area is relatively small (≤ 10 000 km²). The false alarm rate indicates that AFFS forecasts more unreported flood events in regions with a mean annual discharge below 10 km³ yr⁻¹ and for flood events with large affected areas. However, it cannot be claimed with certainty that these flood events were falsely predicted, as it is also possible that they were simply not reported. Finally, the critical success index is similar for all categories, ranging from 46 to 65 %. Comparing the regions, the high POD for eastern Africa and the low FAR for western Africa stand out, while the performance in the other regions is similar. In summary, AFFS generally has a good ability to forecast the occurrence of flood events, as the POD is always much higher than the FAR and the CSI is generally above 50 %.
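The stratified analysis amounts to recomputing the same contingency scores within each category. A minimal Python sketch, assuming each event has already been labelled with its category and outcome; the function name is hypothetical, and the records below are invented toy data for illustration only, not the events analysed in this study.

```python
from collections import Counter, defaultdict


def scores_by_category(events):
    """events: iterable of (category, outcome), outcome in {"hit", "miss", "false_alarm"}.

    Returns {category: (POD, FAR, CSI)}; a score is None if its denominator is zero.
    """
    counts = defaultdict(Counter)
    for category, outcome in events:
        counts[category][outcome] += 1
    result = {}
    for category, c in counts.items():
        h, m, f = c["hit"], c["miss"], c["false_alarm"]
        pod = h / (h + m) if h + m else None
        far = f / (h + f) if h + f else None
        csi = h / (h + m + f) if h + m + f else None
        result[category] = (pod, far, csi)
    return result


# Toy records for illustration only (not the events of this study)
toy = [("> 1 week", "hit"), ("> 1 week", "hit"), ("> 1 week", "false_alarm"),
       ("<= 1 week", "hit"), ("<= 1 week", "miss"), ("<= 1 week", "miss")]
for category, (pod, far, csi) in scores_by_category(toy).items():
    print(category, pod, far, csi)
```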
Figure 10 presents the flood forecast for the March 2003 event in the Sabi Basin (for location see Fig. 1a) as a visual example of a flood forecast obtained with AFFS; it is one of the better forecasts achieved with the system. Note that no ground observations were available to optimize LISFLOOD for this basin; hence the model was run with the default parameterization. The threshold exceedance maps (Fig. 10a) show the number of hydrological ensemble members exceeding a critical threshold for a specific calendar date and lead time; here the 2-year return period is chosen as the critical threshold. Forecasts are shown for 3, 5, 7, 9 and 12 March with lead times of 3, 5 and 8 days. Additionally, ensemble quantile plots (Fig. 10b) illustrate the 10-day probabilistic hydrological prediction for a specific location, including the EPS median, first and third quartiles, and critical hydrological thresholds (2-, 5-, 10- and 30-year return periods). Here, the 10-day forecasts obtained on 2 and 3 March for one reporting point are shown (for the location, see the red star in the upper left panel of Fig. 10a). Based on these AFFS output products, the onset of the flood event on 5 March is forecasted with a lead time of 8 days, which coincides with the information from the Dartmouth Flood Observatory, which reported flooding in the Sabi and its tributaries between 5 and 16 March 2003 (Fig. 2, obsID10). At the reporting point, the flood magnitude was forecasted (according to the EPS median) to exceed the 10-year return period, which also agrees with the severity classification of the observed flood event given by the Dartmouth Flood Observatory: "Class 1: large flood events: significant damage to structures or agriculture; fatalities; and/or 1-2 decades-long reported interval since the last similar event".
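Both output products can be derived directly from the ensemble discharge array. The Python sketch below assumes hypothetical array layouts, (members × grid cells) for a single date and lead time and (members × lead times) for a single reporting point, and a precomputed return-period threshold per cell; it illustrates the idea rather than reproducing the AFFS post-processing.

```python
import numpy as np


def exceedance_count(ensemble, threshold):
    """Number of ensemble members above a critical threshold, per grid cell.

    ensemble: array (n_members, n_cells) of forecast discharge for one date
    and lead time; threshold: scalar or array (n_cells,), e.g. the 2-year
    return-period discharge of each cell.
    """
    return np.sum(np.asarray(ensemble) > np.asarray(threshold), axis=0)


def quantile_bands(ensemble):
    """First quartile, median and third quartile across members, per lead time.

    ensemble: array (n_members, n_lead_times) for one reporting point.
    """
    q1, median, q3 = np.percentile(ensemble, [25, 50, 75], axis=0)
    return q1, median, q3
```

Mapping `exceedance_count` over all cells gives a threshold exceedance map, while plotting the three bands from `quantile_bands` against the return-period discharges of the reporting point gives a quantile plot of the kind shown in Fig. 10b.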
This example demonstrates that AFFS is capable of producing timely and accurate flood forecasts even for a basin without available ground observations. Although this is only a single case study, the results indicate that AFFS has the potential to support national and international organisations in preventing and/or mitigating flood-related damages and losses.

Discussion and conclusion
The predictive capability of the African Flood Forecasting System (AFFS) was investigated in hindcast mode to estimate its potential as an operational flood forecasting system for the whole of Africa. AFFS correctly detected the majority of reported flood events. The system showed particular strength in predicting riverine flood events of long duration (> 1 week) and large affected area (> 10 000 km²). This type of flood can impact the socio-economic structures of a country to the extent that it may cause setbacks in the country's development (UNCSD Secretariat, 2012; United Nations (UN), 2005). The flood forecast for the Sabi River illustrated the precision of AFFS, gave an example of output products that could provide end-users with clear and concise information about the possible future hydrological situation, and showed that AFFS is capable of producing flood warnings even in ungauged river basins, i.e. in river basins for which no observations are in the public domain. Hence, AFFS demonstrated a good potential to predict large-scale and long-duration flood events several days in advance.
It should be noted that the performance of AFFS in operational mode might differ from the one evaluated here, because the meteorological input data used to calculate the initial conditions differ between hindcasting and operational forecasting (see Sect. 2.2.2). Along the same lines, one might raise concerns about the FAR, which suggests that 29 % of all flood events predicted by AFFS did not happen. However, the fact that these floods were not reported in any of the disaster databases does not necessarily mean that they did not happen: there is no certainty that every flood that occurred was also reported, so the database of observed events (Fig. 2) might be incomplete. An incomplete database also allows for the case that a flood event happened but was neither forecasted nor reported, which would result in a lower POD. This issue cannot be resolved unless more information becomes available.
The limitations of AFFS centre on the detection of flood events with short durations (≤ 1 week) and/or small affected areas (≤ 10 000 km²). The difficulties in detecting relatively small and/or short flood events are most likely due to a combination of (a) the limited ability of the meteorological input data to capture small-scale meteorological events at the correct time and place, and (b) the relatively coarse grid size of 0.1° × 0.1° on which AFFS operates, which might be too coarse for this type of flood. During the analysis it was also noticed that flood events occurring close to the boundaries of the Intertropical Convergence Zone (ITCZ) were not captured well (not shown). Forecasts in those areas may suffer from a displacement of the ITCZ, which controls the onset and spatial extent of the West African monsoon, a conclusion also reached by Di Giuseppe et al. (2013).
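To put argument (b) in numbers: on a sphere, a cell spanning Δλ in longitude between latitudes φ₁ and φ₂ has area R²·Δλ·(sin φ₂ − sin φ₁). The Python sketch below assumes a spherical Earth with a mean radius of 6371 km (an approximation, not part of the AFFS set-up) and gives roughly 124 km² for a 0.1° cell at the equator, decreasing towards the poles. A 10 000 km² flood therefore spans about 80 cells, whereas a flood affecting a few hundred km² is resolved by only a handful of cells.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def cell_area_km2(lat_deg, res_deg=0.1):
    """Area of a res_deg x res_deg cell whose southern edge lies at lat_deg,
    using A = R^2 * dlon * (sin(lat2) - sin(lat1))."""
    dlon = math.radians(res_deg)
    lat1 = math.radians(lat_deg)
    lat2 = math.radians(lat_deg + res_deg)
    return EARTH_RADIUS_KM ** 2 * dlon * (math.sin(lat2) - math.sin(lat1))


area = cell_area_km2(0.0)      # 0.1 degree cell at the equator
print(round(area, 1))          # area of one cell in km²
print(round(10_000 / area))    # cells needed to cover 10 000 km²
```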
This study has illustrated the structure and workflow of AFFS and given a first evaluation of its performance. The results indicate that system improvements and a more detailed calibration are needed. However, despite the limitations of the current set-up, the system correctly detected the majority of reported floods even though LISFLOOD was optimized using only a relatively small number of hydrological records (36 over the whole of Africa). This shows that the system works well with a minimum number of ground observations, and at the same time indicates a good potential for further improvement once more observational records become available. In this context, remote sensing might become a valuable alternative source of observed land surface hydrological fluxes. Surface-water-related signals can be translated into estimates of, e.g., streamflow, soil moisture, land-surface temperature, surface water height and inundation extent. Research on the potential benefit of assimilating these estimates into hydrological applications is increasing steeply and has already shown promising results (Andreadis et al., 2007; Gleason et al., 2014; Hirpa et al., 2013, 2014; Munier et al., 2015; Pedinotti et al., 2014; Revilla-Romero et al., 2014; Tarpanelli et al., 2013; Wanders et al., 2014). As assimilating these data might improve the forecast ability of AFFS, e.g. by improving estimates of the initial conditions or the timing of the flood peak, it should be a focus of future research. In this context, Fig. 7 (benchmark: last observation) provides evidence that assimilating real-time discharge data would improve the forecast skill of AFFS in the short range (0-2 days). Furthermore, in areas where the limit of predictability is currently 10 days, the potential lead time might be extended to up to 15 days by calculating 15-day hydrological predictions with the ECMWF-ENS forecasts, which are available for the period after March 2003.
Additionally, a cross-comparison of AFFS with other (global) flood forecasting systems covering the African continent (e.g. GloFAS) is necessary to gain a deeper understanding of the particular strengths and limitations of AFFS. Note that a valid comparison requires an identical set-up: the systems have to be run over the same time period and the same spatial domain, and have to be evaluated against the same flood events. As GloFAS did not exist in 2003, a cross-comparison within this study was not feasible, but it will be the focus of future research. Also, based on information from Pappenberger et al. (2011), the performance of AFFS might have been even better in recent years as a consequence of the continuous improvement in the quality of the meteorological data used as input to AFFS; this also needs to be addressed in future research. The HEPEX initiative (www.hepex.org) and the recently launched Global Flood Partnership (http://portal.gdacs.org/Global-Flood-Partnership) will be explored as possibilities for further testing of AFFS in research and in an experimental real-time mode. Lastly, this study only evaluated the technical feasibility of AFFS; issues related to practical implementation, such as potential implementing institutes, funding and availability of technical expertise, were beyond its remit but would be highly relevant to future research.
In conclusion, this study has demonstrated that AFFS has a large potential to contribute to the reduction of flood-related losses in Africa by providing national and international aid organizations with timely medium-range flood forecast information.