"A meta-analysis and statistical modelling of nitrates in groundwater at the African scale"

Abstract. Contamination of groundwater with nitrate poses a major health risk to millions of people across Africa. Assessing the space-time distribution of this contamination, as well as understanding the factors that explain it, is important for managing drinking water sustainably at the regional scale. This study aims to assess the variables that contribute to nitrate pollution in groundwater at the pan-African scale by statistical modelling. We compiled a literature database of nitrate concentrations in groundwater (around 250 studies) and combined it with digital maps of physical attributes such as soil, geology, climate, hydrogeology and anthropogenic data for statistical model development. The maximum, mean and minimum observed nitrate concentrations were analysed. In total, 13 explanatory variables were screened to explain observed nitrate pollution in groundwater.
For the mean nitrate concentration, four variables are retained in the statistical explanatory model: (1) depth to groundwater (shallow groundwater, typically <50 m); (2) recharge rate; (3) aquifer type; and (4) population density. The first three variables represent the intrinsic vulnerability of groundwater systems to pollution, while the last is a proxy for anthropogenic pollution pressure. The model explains 65% of the variation of mean nitrate contamination in groundwater at the pan-African scale. Using the same proxy information, we could develop a statistical model for the maximum nitrate concentrations that explains 42% of the nitrate variation. For the maximum concentrations, other environmental attributes such as soil type, slope, rainfall, climate class and region type improve the prediction at the pan-African scale. As to the minimum nitrate concentrations, because the data set does not satisfy the normality assumption, we do not develop a statistical model for these data. The data-based statistical model presented here represents an important step toward developing tools that will allow us to accurately predict nitrate distribution at the African scale.

… and 13 lithological coverages: evaporites (0.6%), metamorphic rocks (27.6%), acid plutonic rocks (1.1%), basic plutonic rocks (0.2%), intermediate plutonic rocks (0.1%), carbonate sedimentary rocks (9.4%), mixed sedimentary rocks (6.4%), siliciclastic sedimentary rocks (16.4%), unconsolidated (35.1%), (0.1%), (3.3%), rocks (0.6%) (0.9%) … multiple sources of information.

The attributes are related to recharge, geology, hydrogeology, soil texture, land use, topography and pollution pressure and were partially inspired by the DRASTIC vulnerability mapping approach. We compiled all explanatory variables in a common GIS environment (ArcGIS 10.3™), using a common projection and resolution (15 km × 15 km) at the 1 : 60 000 000 scale.
Finally, further development may include the use of non-linear modelling techniques such as Random Forest to identify the causal mechanisms behind autocorrelation and heteroscedasticity in nitrate distributions over large extents such as Africa. Such techniques have the potential to improve the quality of explanation, and eventually of prediction, by incorporating spatial autocorrelation, but they complicate the physical explanation of observed trends. In addition, the model should be further validated using more homogeneous data sets. There is a need for a process-based continental-scale nitrate estimate that uses a consistent approach and data as the basis for studying the impacts of potential environmental factors on groundwater resources in Africa. In a predictive mode, the model could be used for exposure estimates in epidemiological studies on the effect of polluted groundwater on human health. Also, an application of the statistical model to other contaminants could be explored.


Introduction
Nitrate contamination of groundwater is a common problem in many parts of the world. Elevated nitrate concentrations in drinking water can cause methemoglobinemia in infants and stomach cancer in adults (Yang et al., 1998; Knobeloch et al., 2000; Hall et al., 2001). As such, the World Health Organization (WHO) has established a maximum contaminant level (MCL) of 50 mg L⁻¹ NO₃ (WHO, 2004). Nitrate in groundwater is generally of anthropogenic origin and associated with leaching of nitrogen from agricultural plots or from waste and sewage sanitation systems. The heavy use of nitrogenous fertilizers in cropping systems is the largest contributor to anthropogenic nitrogen in groundwater worldwide (Suthar et al., 2009). In particular, shallow aquifers in agricultural fields are highly vulnerable to nitrate contamination (Böhlke, 2002; Kyoung-Ho et al., 2009). According to Spalding and Exner (1993), nitrate may be the most widespread contaminant of groundwater.
In Africa, groundwater is recognized as playing a very important role in the development agenda. According to Xu and Usher (2006), degradation of groundwater is the most serious water resource problem in Africa. The two main threats are overexploitation and contamination (MacDonald et al., 2013). Indeed, based on a review of 29 papers from 16 countries, Xu and Usher (2006) identified the major groundwater pollution issues in Africa, in order of importance, as follows: (1) nitrate pollution, (2) pathogenic agents, (3) organic pollution, (4) salinization, and (5) acid mine drainage. These authors have shown that the major sources of groundwater contamination are related to on-site sanitation, to the presence of solid waste dumpsites, including household waste pits, to infiltration of surface water, to agricultural activities, to the presence of petrol service stations (underground storage tanks), and to the mismanagement of well fields. Nitrate contamination of groundwater is a problem that commonly occurs in Africa, as illustrated in the studies for Algeria (Rouabhia et al., 2010; Messameh et al., 2014), Tunisia (Hamza et al., 2007; Anane et al., 2014), Morocco (Bricha et al., 2007; Fetouani et al., 2008; Benabbou et al., 2014), Senegal (Sall and Vanclooster, 2009; Diédhiou et al., 2012), Ivory Coast (Loko et al., 2013a; Eblin et al., 2014), Ghana (Tay and Kortatsi, 2008; Fianko et al., 2009), Nigeria (Wakida and Lerner, 2005; Akoteyon and Soladoye, 2011; Obinna et al., 2014), South Africa (Maherry et al., 2009; Musekiwa and Majola, 2013), Ethiopia (BGS, 2001a; Bonetto et al., 2005), and Zambia (Wakida and Lerner, 2005). Several of these studies showed that pollution from anthropogenic activities is the main source of high and variable nitrate levels. For example, Comte et al. (2012) illustrate that the groundwater situated in the Quaternary sandy aquifer of the peninsula of Dakar is under strong anthropogenic pressure from the city of Dakar, resulting in important nitrate loadings. Such contamination problems are found in many large cities in Africa.

Notwithstanding the availability of all these studies at the local, regional or country level, no comprehensive synthesis of nitrate contamination of groundwater at the scale of the African continent has been presented in the literature. Assessing large-scale groundwater contamination with nitrates is important for planning large-scale groundwater exploitation programmes and for designing transboundary water management policies. It also yields important baseline information for monitoring progress in the implementation of the United Nations Sustainable Development Goals (UN SDGs) for water. According to Saruchera and Lautze (2015), transboundary water cooperation has emerged as an important issue in the post-2015 UN SDGs. This study will increase awareness of citizens, international agencies and authorities (e.g. FAO, UNEP, and OECD; Water Sanitation for Africa - WSA) of the environmental factors likely to be significant in groundwater contamination.

However, making an appropriate African-scale synthesis of nitrate contamination of groundwater remains a scientific and technical challenge, given the heterogeneity of the nitrate monitoring programmes and the absence of administrative and institutional capacity to collect and disseminate the data at the African scale. A concept that partially helps to solve this urgent data management problem is that of groundwater vulnerability. Groundwater vulnerability to nitrate contamination is an expression of the likelihood that a given groundwater body will be negatively affected by nitrate contamination. Given that vulnerability is a likelihood, it is only an expression of the potential degradation of groundwater and hence a proxy of groundwater contamination by nitrates. Groundwater vulnerability can be assessed based on available generic data. It therefore does not depend on a strong and operational African groundwater quality monitoring capacity. In this paper, we propose and implement a methodology for assessing the vulnerability of groundwater to contamination by nitrates at the African scale. We further consider nitrate in this study as a proxy for overall groundwater pollution, which is consistent with the view of the US EPA (EPA, 1996).
In general, there are three categories of models for the assessment of groundwater vulnerability: (1) index methods or subjective rating methods, (2) statistical methods and (3) process-based modelling methods. Index-and-overlay methods are one set of subjective rating methods that utilize the intersection of regional attributes with the qualitative interpretation of data by indexing parameters and assigning a weighting scheme. The most widely used index method is DRASTIC (Aller et al., 1985). Unfortunately, index methods are based on subjective ratings (Focazio et al., 2002) and should preferably be calibrated using measured proxies of vulnerability (Kihumba et al., 2015; Ouedraogo et al., 2016). When a groundwater monitoring data set is available, formal statistical methods can be used to integrate groundwater contamination data directly into the vulnerability assessment. Finally, process-based methods refer to approaches that explicitly simulate the physical, chemical and biological processes that affect contaminant behaviour in the environment. They comprise the use of deterministic or stochastic process-simulation models, possibly linked to physically based field observations (e.g. Coplen et al., 2000). Process-based methods are typically applied at small scales, mostly to define well protection zones, rather than to assess groundwater vulnerability at broader scales (Frind et al., 2006). A well-known example is the use of a physically based groundwater model (e.g. MODFLOW; Harbaugh et al., 2000) that solves the governing equations of groundwater flow and solute transport. Such models have explicit time steps and are often used to determine the timescales of contaminant transport to wells and streams, in addition to the effects of pumping. However, they also have many parameters that require estimation. In this paper, we use statistical models to assess the vulnerability of groundwater systems to nitrate pollution.
Formal statistical methods have often been employed to assess the vulnerability of groundwater at national and regional scales. They are also often used to discriminate contaminant sources and to identify factors contributing to contamination (Kolpin, 1997; Nolan and Hitt, 2006). Many authors used multiple linear regression (MLR) techniques. For example, Bauder et al. (1993) investigated the major controlling factors for nitrate contamination of groundwater in agricultural areas using MLR of land uses, climate, soil characteristics, and cultivation types. MLR was also used to relate pesticide concentrations in groundwater to the age of the well, land use around the well, and the distance to the closest possible source of pesticide contamination (Steichen et al., 1988). Boy-Roura et al. (2013) used MLR to assess nitrate pollution in the Osona region (north-eastern Spain). Amini et al. (2008a, b) used MLR and an adaptive neuro-fuzzy inference system (ANFIS), a general non-linear regression technique, to study the global geogenic fluoride contamination in groundwater and the global geogenic arsenic contamination in groundwater respectively. MLR has the strong advantage that regression coefficients can directly be interpreted in terms of the importance of explanatory factors. Many studies linking nitrate occurrence in groundwater to spatial variables have employed logistic regression (Hosmer and Lemeshow, 1989; Eckhardt and Stackelberg, 1995; Tesoriero and Voss, 1997; Gardner and Vogel, 2005; Winkel et al., 2008; Mair and El-kadi, 2013). According to Kleinbaum (1994), MLR is conceptually similar to logistic regression. Other authors have used more sophisticated approaches such as Bayesian methods (Worrall and Besien, 2005; Mattern et al., 2012) and, more recently, classification and regression tree modelling approaches (Burow et al., 2010; Mattern et al., 2012). However, to our knowledge, a statistical model of groundwater nitrate contamination at the African scale does not exist yet.
In the present study, we used MLR techniques to assess the vulnerability of groundwater to nitrate pollution at the African scale. To this end, we compiled an African-scale groundwater pollution database from the literature and combined it with environmental attributes inferred from a generic database. The generic database was developed in a former study to assess vulnerability using the DRASTIC index method (Ouedraogo et al., 2016). MLR models were subsequently identified to explain quantitatively the log transformed observed nitrate contamination in terms of generic environmental attributes and, finally, the regression models were interpreted in terms of characteristics of contaminant sources and hydrogeology of the African continent.

Study area
We studied the vulnerability of groundwater systems to nitrate contamination at the scale of the African continent. Groundwater is Africa's most precious natural resource, providing reliable water supplies to at least a third of the continent's population (MacDonald, 2010). However, the African continent is not blessed with a large quantity of groundwater resources, because it is the world's second-driest continent after Australia and water resources are limited. Africa has a vast array of drainage networks, of which the most important are the following. The Nile River drains north-east and empties into the Mediterranean Sea. The Congo River drains much of central Africa and empties into the Atlantic Ocean. The Niger River is the principal river of western Africa; it is the third-longest river after the Nile and the Congo River, and empties into the Atlantic Ocean. Southern Africa is drained by the Zambezi River. Lake Chad constitutes one of the largest inland drainage areas of the continent. Other major lakes located in the east of Africa include Lake Tanganyika and Lake Victoria.
3 Data and methods

Nitrate contamination data
For a large part of Africa there is very little or no systematic monitoring of groundwater. In the absence of a systematic data monitoring programme, we compiled nitrate pollution data at the African scale from different literature sources.
We considered approximately 250 published papers on nitrate contamination of groundwater in Africa. We consulted the web of sciences (Scopus™, ScienceDirect™, Google™, and Google Scholar™) and available books. Figure 2 shows the spatial distribution of the considered field studies. Table 1 outlines the criteria used in the web search.

Data quality evaluation
We used the following additional criteria to select the studies: the publication should explicitly report nitrate concentrations in groundwater, and it should have been published after 1999. Also, when several articles had been published on the same field site, we used only the most recent study. We excluded studies from before 1999 since the intensity of human activities is expected to be significantly different after 1999. We eliminated 37 articles because no quantitative data on nitrate concentration were reported. For the considered data set, 206 studies report the maximum concentration of nitrate, 187 studies the minimum concentration, and 94 studies the mean concentration. Out of the 94 data sets for which mean values were reported, 12 field sites have a nitrate concentration smaller than 1 mg L⁻¹. We present the locations and references of the considered field studies in Table 2. Where spatial coordinates were not reported in the selected paper, we allocated the coordinates of the field study in Google Earth using the www.gps-coordintes.net and www.mapcoordinates.net applications. As an example, we present in Fig. 3 the identified locations and reported maximum nitrate values of the selected studies. The absence of exact spatial coordinates in many studies will, therefore, generate a positioning error in the analysis. However, given the extent of the study, i.e. the African continent, we consider that this positioning error will not have significant effects on the overall results.

The groundwater pollution risk in Fig. 3 corresponds to the potential of a groundwater body for undergoing groundwater contamination (Farjad et al., 2012). The risk of pollution is determined both by the intrinsic vulnerability of the aquifer, which is relatively static, and by the existence of potentially polluting activities at the soil surface. These latter activities are dynamic in time and can be controlled (Saidi et al., 2010). We generated the groundwater pollution risk map by combining the intrinsic groundwater vulnerability map with the land use map, using the additive model of Secunda et al. (1998). Details of these procedures are given by Ouedraogo et al. (2016).

… (15 km × 15 km) at the 1 : 60 000 000 scale. This spatial resolution was chosen because we considered it a reasonable compromise between the resolutions of the different data sets, computing constraints and regional extent. Indeed, this grid cell dimension has been used to map vulnerability and pollution risk at the African scale (Ouedraogo et al., 2016). Generic variables at the grid scale were extracted to build our explanatory variables in this study.
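As a rough illustration of how grid-scale values can be attached to study locations, the sketch below maps a projected coordinate to a cell of a 15 km × 15 km grid and reads the value of that cell. The function names, the grid origin and the toy raster are hypothetical and are not taken from the study, which carried out this step in a GIS environment.

```python
CELL_SIZE_M = 15_000  # 15 km x 15 km grid cells, the resolution stated in the text


def cell_index(x_m, y_m, x_min_m, y_max_m, cell_size_m=CELL_SIZE_M):
    """Return (row, col) of the grid cell containing a projected point.

    Rows count downwards from the top edge (y_max_m) and columns count
    rightwards from the left edge (x_min_m); all values are in metres.
    """
    col = int((x_m - x_min_m) // cell_size_m)
    row = int((y_max_m - y_m) // cell_size_m)
    return row, col


def extract_value(raster, x_m, y_m, x_min_m, y_max_m):
    """Look up the raster value of the cell that contains the point."""
    row, col = cell_index(x_m, y_m, x_min_m, y_max_m)
    if not (0 <= row < len(raster) and 0 <= col < len(raster[0])):
        raise ValueError("point lies outside the grid")
    return raster[row][col]


# Hypothetical 2 x 3 raster of class codes (illustrative values only).
toy_raster = [
    [1, 2, 3],
    [4, 5, 6],
]
# Hypothetical study site, 16 km east and 20 km south of the top-left corner.
site_value = extract_value(toy_raster, 16_000, 10_000, 0, 30_000)
```

A study whose coordinates are off by a few kilometres will usually fall in the same or a neighbouring cell, which is in line with the argument above that positioning errors matter little at this resolution.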

Determination of spatially explanatory variables
Most of these variables were categorical, but some were continuous.
Groundwater recharge is considered a primary explanatory variable because recharge is the primary vehicle by which a contaminant is transported from the ground surface to groundwater. Groundwater recharge to an unconfined aquifer is a function of precipitation, runoff, and evapotranspiration. The latter is related to vegetation and/or soil type. Groundwater recharge to a confined aquifer is generally more complex, as consideration must be given to the location of the recharge zone and the influence of any confining layers, vertical gradients, and groundwater pumping (Todd Engineers and Kennedy/Jenks Consultants, 2010). In this study, we derived the African recharge map from the global-scale groundwater recharge model of Döll et al. (2008). We also considered independent climate data as alternative proxies of recharge. Hence, we considered the climate and region type data class as defined by Trambauer et al. (2014). We also considered the rainfall map as generated from the UNEP/FAO World and Africa GIS database. The spatial resolution of this latter data set is approximately 3.7 km.
Subsequently, we selected a set of environmental attributes related to aquifer type, groundwater position, and the substrate that protects the aquifer. The depth to groundwater represents the distance that a contaminant must travel through the unsaturated zone before reaching the water table or the first screen. We mapped the depth to water based on the data presented by Bonsor et al. (2011). The slope of the land surface is important with respect to groundwater vulnerability because it determines the potential of a contaminant to infiltrate into the groundwater or be transported horizontally as runoff. We inferred the slope from the 90 m Shuttle Radar Topography Mission (SRTM90) topographic map, using the Spatial Analyst software of ArcGIS 10.2™. We derived the aquifer type and the impact of vadose zone material from the high-resolution global lithological database (GliM) of Hartmann and Moosdorf (2012). We determined aquifer type and unsaturated lithological zone for each of the five hydro-lithological and lithological categories as defined by Gleeson et al. (2014). These categories are unconsolidated sediments, siliciclastic sediments, carbonate rocks, crystalline rocks, and volcanic rocks (Gleeson et al., 2014). We constructed the soil type map from the 1 km resolution soil grid database developed by Hengl et al. (2012). We determined the hydraulic conductivity of aquifers from the Global Hydrogeology MaPS (GHYMPS) data set (Gleeson et al., 2014).

For the determination of the land use at the African scale, we used the high-resolution land cover/land use map from the GlobCover data set (Defourny et al., 2014). There are twenty-two (22) classes of land cover that represent Africa in this data set. We aggregated these 22 classes into 6 similar classes (water bodies, bare area, grassland/shrubland, forest, urban, croplands) as represented in Fig. 4 and then regrouped them into 5 groups (water bodies, forest/bare area, grassland/shrubland, croplands, urban area).

Finally, we considered a set of variables related to possible pollution pressure. We considered the application of fertilizer in the agricultural sector as a possible explanatory variable. We generated the nitrogen fertilizer application map from the Potter et al. (2010) data set. The values shown on this map represent an average application rate for all crops over a 0.5° resolution grid cell. According to this data set, the highest N fertilizer application rate (i.e. 220 kg ha⁻¹) is found in Egypt's Nile Delta. We further considered population density as a proxy of pollution source. We considered the population density map for the year 2000, as produced by Nelson (2004).
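The second land cover aggregation step described above (6 classes regrouped into 5 groups) amounts to a simple lookup. The sketch below is only an illustration: the class and group names come from the text, while the dictionary, the function names and the example site list are hypothetical. The first step, from the 22 GlobCover classes to the 6 classes, is not reproduced because the text does not list that mapping.

```python
from collections import Counter

# Six aggregated land cover classes mapped to the five groups named in the text.
LAND_COVER_GROUPS = {
    "water bodies": "water bodies",
    "bare area": "forest/bare area",
    "forest": "forest/bare area",
    "grassland/shrubland": "grassland/shrubland",
    "croplands": "croplands",
    "urban": "urban area",
}


def regroup(land_cover_class):
    """Return the group of one of the six aggregated land cover classes."""
    return LAND_COVER_GROUPS[land_cover_class]


def count_by_group(land_cover_classes):
    """Count how many sites fall into each of the five groups."""
    return Counter(regroup(c) for c in land_cover_classes)


# Hypothetical land cover classes at five study sites.
site_classes = ["urban", "croplands", "forest", "bare area", "urban"]
site_counts = count_by_group(site_classes)
```

Counting sites per group in this way makes it easy to spot sparsely populated categories before they enter a regression as categorical levels.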

Statistical model description
We used MLR as the statistical method for identifying the relationship between the observed nitrate concentrations in groundwater and the set of independent variables given in Table 3. MLR is based on least squares, which means that the model is fitted such that the sum of squares of differences between predicted and measured values is minimized (Koklu et al., 2009; Helsel and Hirsch, 1992). The MLR model is denoted by Eq. (1):

$$ y_i = \beta_0 + \sum_{j=1}^{n} \beta_j x_{ij} + \varepsilon_i, \qquad i = 1, \dots, m \qquad (1) $$

where y_i is the response variable at location i, β_0 is the intercept, β_j are the slope coefficients of the explanatory categorical or continuous variables x_ij, n is the number of variables, m is the number of locations or wells (number of studies here), and ε_i is the regression residual. In this study, the response variable is the log transformed nitrate concentration in groundwater. The log transformation was needed to stabilize the variance and to comply with the basic hypothesis of MLR. The log transformation is a continuous, monotonically increasing function; it is, therefore, reasonable to accept that factors that contribute to the log transformed nitrate load will also contribute to the nitrate load. The explanatory variables were selected using a stepwise procedure, with the Akaike information criterion (AIC) as the selection criterion (Helsel and Hirsch, 1992).

4 Results

Normality of the dependent variable
Prior to analysis, we carefully checked the data using descriptive statistics such as boxplots and correlation analysis. The nitrate concentrations collected in the meta-analysis range from 0 mg L⁻¹ to 4625 mg L⁻¹ for all categories, i.e. mean, maximum, and minimum values of nitrate groundwater contamination. Descriptive statistics are summarized in Table 4. The average mean nitrate concentration is 27.85 mg L⁻¹. The positive skewness of the mean nitrate concentration data and the kurtosis suggest that the mean nitrate concentration is not normally distributed. In contrast, the log transformed mean nitrate concentration obeys normality, as demonstrated by means of the Shapiro-Wilk test (p value = 0.1432 > 0.05).
The histograms of the mean and log transformed mean concentrations are shown in Fig. 5. We also checked the minimum and maximum nitrate concentrations for normality (results can be obtained from the authors upon request).
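The effect of the log transformation on skewness can be illustrated with a few lines of code. This is a minimal sketch on hypothetical concentrations, not the study data; it computes moment-based skewness and excess kurtosis only and does not reproduce the Shapiro-Wilk test. The natural logarithm is an assumption, since the text does not state the base (the skewness is the same for any base).

```python
import math


def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness (0 for a symmetric distribution)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3 (0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Hypothetical mean nitrate concentrations in mg/L (illustrative only).
nitrate_mg_per_l = [1.2, 3.5, 8.0, 12.0, 20.0, 27.0, 45.0, 80.0, 150.0, 400.0]
log_nitrate = [math.log(v) for v in nitrate_mg_per_l]

raw_skew = skewness(nitrate_mg_per_l)   # strongly positive for these values
log_skew = skewness(log_nitrate)        # close to zero for these values
```

Concentrations of exactly 0 mg L⁻¹ cannot be log transformed, so a real workflow would need an explicit rule for such values; the text does not say how they were handled.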

Correlation between nitrate in groundwater and explanatory variables
Land cover/land use is a principal factor controlling groundwater contamination. The boxplot distribution of log transformed mean nitrate concentration for different land use classes is presented in Fig. 6. Groundwater in agricultural and urban areas is clearly more susceptible to nitrate pollution than groundwater under forest/bare area land use. Water bodies also appear susceptible to nitrate contamination, but this result is likely spurious since only two studies support this category. We performed a similar analysis of the log transformed maximum and minimum nitrate concentrations. The corresponding boxplot results can be obtained from the authors upon request. High values for the log transformed maximum nitrate concentration are also found in urban and cropland areas. High values for the log transformed minimum nitrate concentration are detected in cropland fields. All analyses confirm that the highest nitrate pollution is found in urban areas, immediately followed by agricultural areas.
In this study, the aquifer systems for Africa are divided into five categories based on the lithological formations. … log transformed nitrate concentrations found in the deeper groundwater systems (> 250 m b.g.l.). The relationship between the log transformed mean nitrate concentration and groundwater recharge can also be observed in Fig. 8b. This figure shows that the nitrate concentration in groundwater decreases with recharge. This may be due to dilution of the nitrate load. We observe in this figure high nitrate concentrations in the very low recharge class (0-45 mm yr⁻¹). This may be due to irrigation return flow that feeds the groundwater and that is not integrated into the recharge calculations. Pearson's correlation between recharge and log transformed mean nitrate is r = −0.292.
The relation between the log transformed mean nitrate concentration and the population density is given in Fig. 8c. We observe increasing nitrate in groundwater with increasing population. This relationship between population density and nitrate concentration has a Pearson's correlation of 0.632. This confirms the importance of considering population as a potential pollution pressure and its correlation with nitrate occurrence in groundwater at the African scale.
Nitrogen fertilizer contributes significantly to an increase in crop yields, but excess nitrogen fertilizer generally pollutes groundwater (Green et al., 2004; Nolan et al., 2002). In the case of Africa, the impact of the nitrogen fertilizer application rate on log transformed mean nitrate concentration is illustrated in Fig. 8d. Pearson's correlation is low (r = 0.09). The analysis in this figure confirms that no clear relationship exists between fertilizer load and groundwater nitrate contamination. This can be linked to the relatively low fertilizer use in Africa as compared to other continents. Indeed, most studies have nitrogen fertilizer dressings that are below 50 kg ha⁻¹. According to the FAO (2012), Africa accounted for only about 2.9 % of world fertilizer consumption in 2011.
We performed similar correlation analyses on the log transformed maximum concentration and log transformed minimum concentration respectively. Details can be obtained from the authors upon request. Results of these analyses are consistent with the results for log transformed mean nitrate concentration.
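The Pearson correlations reported in this section follow the standard definition, sketched below in plain Python. The data are hypothetical and only show the mechanics; they do not reproduce the reported values (r = −0.292, 0.632 and 0.09). The natural logarithm is again an assumption.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical population densities (inhabitants per km2) and mean nitrate
# concentrations (mg/L) at six sites; illustrative values only.
population_density = [5, 20, 60, 150, 400, 900]
mean_nitrate_mg_per_l = [2.0, 6.0, 9.0, 30.0, 45.0, 120.0]
log_nitrate = [math.log(v) for v in mean_nitrate_mg_per_l]

r_population = pearson_r(population_density, log_nitrate)
```

Because Pearson's r measures linear association only, a weak value such as the one found for fertilizer does not by itself rule out a non-linear relationship.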

Development of the multi-variate statistical model
We developed a set of multiple variable regression models for the log transformed mean and maximum nitrate concentrations in terms of the above-mentioned explanatory variables. A positive regression coefficient indicates a positive correlation between a significant explanatory variable and a target contaminant, while a negative coefficient suggests an inverse or negative correlation. We retained only explanatory variables with p values ≤ 0.1.
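The model identification described in the methods (least-squares MLR on a log transformed response, with AIC-based stepwise selection) can be sketched as follows. This is a simplified forward-selection illustration in plain Python, not the authors' implementation: the synthetic data, the variable names and the continuous treatment of depth are assumptions, and the p value screening mentioned above is not included. In practice a statistics package would normally be used.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(columns, y):
    """Least-squares fit of y = b0 + sum_j b_j x_j; returns (coefficients, RSS)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * yi for r, yi in zip(rows, y)) for p in range(k)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    return beta, rss


def aic(rss, n_obs, n_params):
    """Gaussian AIC up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params


def forward_stepwise(candidates, y):
    """Add one variable at a time for as long as the AIC keeps decreasing."""
    selected = []
    _, rss = fit_ols([], y)
    best = aic(rss, len(y), 1)
    while True:
        trials = []
        for name in candidates:
            if name in selected:
                continue
            cols = [candidates[v] for v in selected + [name]]
            _, rss = fit_ols(cols, y)
            trials.append((aic(rss, len(y), len(cols) + 1), name))
        if not trials or min(trials)[0] >= best:
            return selected
        best, name = min(trials)
        selected.append(name)


# Synthetic, hypothetical data for ten sites (not the study data).
log_population = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 1.2, 2.2, 0.8, 2.8]
depth_m = [5, 40, 10, 60, 20, 80, 30, 15, 70, 45]
slope_percent = [1, 3, 2, 5, 4, 2, 6, 1, 3, 5]
noise = [0.05, -0.04, 0.03, -0.02, 0.04, -0.05, 0.02, -0.03, 0.01, -0.01]
log_nitrate = [0.3 + 0.6 * p - 0.005 * d + e
               for p, d, e in zip(log_population, depth_m, noise)]

candidates = {"log_population": log_population, "depth_m": depth_m,
              "slope_percent": slope_percent}
selected = forward_stepwise(candidates, log_nitrate)
```

With these synthetic values the two variables that actually generate the response are picked up first; whether the unrelated slope variable is also added depends on chance correlation with the noise, which is one reason to inspect p values as well as the AIC.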
The best final model that explains the log transformed mean nitrate concentration includes only four explanatory variables: (1) depth to groundwater, (2) recharge, (3) aquifer type, and (4) population density. … observations. The sign of the parameter coefficient indicates the direction of the relationship between independent and dependent variables (Boy-Roura et al., 2013). The lower the p value, the more significant the model parameter.
The regression analysis confirms the strong relationship between population density and log transformed mean nitrate concentration. As the p value is far below 0.05, we are more than 95 % confident that the population density strongly affects the nitrate occurrences in groundwater.
The aquifer medium is another important explanatory variable for log transformed mean nitrate concentration. Three categories of aquifer media significantly explain the dependent variable: carbonate rocks, crystalline rocks, and unconsolidated sediments. Indeed, the analysis of regression coefficients shows that the likelihood of nitrate contamination decreases with the presence of unconsolidated sediments and crystalline rocks. The other aquifer types tested, siliciclastic sedimentary rocks and volcanic rocks, were found to be statistically insignificant in the model. However, the aquifer media type is an important variable to assess groundwater vulnerability and to bring information about the hydrogeological system into the assessment. It allows one to differentiate the vulnerability in terms of aquifer lithology. Variables such as hydraulic conductivity could be surrogates for aquifer media because hydraulic conductivity data were developed based on the lithological formation. Nevertheless, they were not statistically significant in the final model.
The third variable represents the depth to groundwater. The first three classes (0-7, 7-25, and 25-50 m b.g.l.) of groundwater depth are all statistically significant. The 0-7 m water table class has the strongest statistical significance. The positive parameter coefficient indicates greater contamination at shallow groundwater depths. The table of coefficients shows that the largest groundwater depth class (100-250 m b.g.l.) is not statistically significant (p value > 0.05). We conclude that shallow groundwater systems at the African scale are the most vulnerable to nitrate pollution.
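The class-based treatment of depth to groundwater can be sketched as follows. This is a minimal Python illustration, assuming the class limits quoted in the text (0-7, 7-25, 25-50, 50-100, and 100-250 m b.g.l.), upper bounds that are inclusive, an open-ended ">250" label, and the 100-250 class as the reference level. These choices, and the function names `depth_class` and `dummy_code`, are assumptions made for the example and are not taken from the study.

```python
from bisect import bisect_left

# Upper class bounds in m b.g.l. The ">250" label is a hypothetical
# catch-all for values beyond the classes quoted in the text.
BOUNDS = [7, 25, 50, 100, 250]
LABELS = ["0-7", "7-25", "25-50", "50-100", "100-250", ">250"]

def depth_class(depth_m):
    """Return the depth class label, treating upper bounds as inclusive."""
    if depth_m < 0:
        raise ValueError("depth to groundwater cannot be negative")
    return LABELS[bisect_left(BOUNDS, depth_m)]

def dummy_code(label, reference="100-250"):
    """One 0/1 indicator per class, leaving out the reference class."""
    return [1 if label == lab else 0 for lab in LABELS if lab != reference]

shallow = dummy_code(depth_class(3.5))     # falls in the 0-7 class
deep = dummy_code(depth_class(180.0))      # reference class: all zeros
```

With this coding, each class coefficient in a regression is interpreted relative to the reference class, so a positive coefficient for the 0-7 class means higher expected log nitrate than in the reference class.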
The fourth variable included in the final model is the recharge. The recharge rates in the 45-123 and 123-224 mm yr⁻¹ classes are statistically significant. In general, these rates correspond to semi-arid and dry sub-humid regions. The high concentrations in these areas can be due to intensive agricultural activities.
Other explanatory variables such as rainfall or land cover/land use were not considered in the final model. Although land cover/land use strongly influences the observed log transformed mean nitrate concentration (Fig. 6), it is related to other variables such as population density. Hence, to avoid multicollinearity, the land cover/land use variable was excluded from the final model.
The final MLR model using the four variables yields an R² of 0.65, indicating that 65 % of the variation in observed log transformed mean nitrate concentration at the African scale is explained by the model. The model is globally significant at the 95 % confidence level (p value = 2.422 × 10⁻¹⁰). The observed vs. predicted log transformed mean nitrate concentrations are shown in Fig. 9 and indicate that the MLR fits the data well. The probability plot of model residuals indicates that the distribution is close to normal (Fig. 10). We performed the Shapiro-Wilk test as an additional check on the distribution of the residuals. Because the probability associated with the test statistic is larger than 0.05, we cannot reject the null hypothesis that the residuals follow a normal distribution. Although a few points have a higher Cook's distance than the rest of the observations, they were kept in the MLR to represent the whole range of nitrate concentration data.
To check the regression assumption of homoscedasticity, the residuals of the log transformed mean nitrate concentration are plotted against the predicted log transformed mean values in Fig. 11. The majority of residuals lie in the range of −2 to 2, except for two outliers in the bottom left part of the graph. The residual standard error of the log transformed mean nitrate is 0.91116 (ln (mg L⁻¹)). The residuals decrease with increasing predicted nitrate concentrations. The Breusch-Pagan test was used to assess heteroscedasticity in the model residuals (BP = 24.2773 and p value = 0.042). With a p value of 0.042, we reject the null hypothesis that the variance of the residuals is constant, and infer that heteroscedasticity is indeed present. As a result, we may expect some bias in the MLR model.
Similarly to the log transformed mean nitrate concentration modelling, we developed another model for the log transformed maximum nitrate concentration. This model yielded an R² of only 0.42. The explanatory variables which influence the log transformed maximum nitrate concentration in groundwater are depth to groundwater, soil media, topography, rainfall, climate class, and type of region. For the log transformed minimum concentration, the normality assumption was not met, which did not allow us to develop an MLR model.
Figure 10. Normal probability distribution of model residuals for the predicted log transformed mean nitrate concentration.
Hydrol. Earth Syst. Sci., 20, 2353-2381, 2016, www.hydrol-earth-syst-sci.net/20/2353/2016/

Discussion
We present in this study a database method to assess the vulnerability of groundwater systems to water quality degradation. We used the log transform of reported nitrate concentration as a proxy for groundwater vulnerability. We present a statistical model to explain this proxy in terms of generic data at the African scale. In a previous study, we evaluated the groundwater vulnerability to pollution at the African scale using the generic DRASTIC approach (Ouedraogo et al., 2016). However, the uncalibrated DRASTIC model predictions are subject to considerable uncertainty, in particular due to the subjectivity in assigning the generic DRASTIC model parameters. In contrast to this previous study, we focus in this paper on nitrate pollution, which is a parameter that is strongly related to vulnerability and which is often measured in ongoing monitoring programmes. We integrate published data on nitrate in groundwater explicitly in the assessment, thereby reducing the subjectivity of the DRASTIC approach.
We assessed in this study the quality of the data (Sect. 3.2). Nevertheless, some caution is needed in the interpretation of the results, in particular as bias may be present in the meta-analysis. For instance, there may be bias towards studies on aquifers which are productive and used for drinking water supply, irrigation, or mining activities. Another possible bias is that some studies mainly focussed on nitrates, while others are oriented to more general groundwater quality studies. Furthermore, the data were collected from different sources (peer-reviewed journal articles, book chapters, or other grey literature). With such an approach, sampling and analytical methods are not standardized, which is an additional source of possible bias. Data availability is a major issue when developing a continental-scale groundwater nitrate statistical model. Unsurprisingly, there are no consistent and standardized monitoring data sets at the continental scale. The available data sets are also patchy, both spatially and temporally. A meta-analysis of literature data is so far the only method for getting the picture at the continental scale. Results from this meta-analysis should not be over-interpreted. Whilst the data provide a useful preliminary assessment of the nitrate contamination in groundwater at the African scale, there are clear limitations.
In this study, we used MLR to explain nitrate in groundwater in terms of other generic spatially distributed environmental parameters. MLR is an approach to model the relationship between a response variable and multiple sets of explanatory variables (Rawlings et al., 1998). MLR analysis is capable of both predicting and explaining a response variable using explanatory variables without compromise (Kleinbaum et al., 1988). Previous studies of MLR using spatial variables for nitrate concentration in groundwater showed R² values of 0.52 and 0.64 in shallow alluvial aquifers (Gardner and Vogel, 2005; Kaown et al., 2007) and an R² of 0.82 in deep sandy tertiary aquifers (Mattern et al., 2009). For the application in this study, we selected the parameters using stepwise MLR, allowing us to select only those parameters which have a significant impact on the log transformed concentration values of nitrate.
The explanatory variables with the strongest influence on the mean log transformed nitrate concentration at the African scale are the population density and groundwater depth, which is in agreement with results from other studies such as Nolan (2001), Nolan et al. (2002), Nolan and Hitt (2006), Liu et al. (2013), Bonsor et al. (2011), and Sorichetta et al. (2013). Both explanatory variables are directly related to the probability of having high nitrate concentrations in groundwater. The strong influence of the population density variable can be explained by the serious problem of sanitation in African townships. This is consistent with the conclusions of the UNEP/UNESCO project "Assessment of Pollution Status and Vulnerability of Water Supply Aquifers Cities", stating that the major pollution pressures on African water bodies are related to poor on-site sanitation, solid waste dumpsites including household waste pits, and surface water influences (Xu and Usher, 2006). This is also consistent with other studies stating that leaking septic tanks and sewer systems contribute considerably to nitrate contamination of groundwater in urban areas (Böhlke, 2002; Showers et al., 2008). The magnitude of contamination is affected not only by the population density, but also by the socio-economic setting (UNEP/DEWA, 2014). A high population density is therefore often associated with the lack of adequate sanitation in many slums/shanty towns in Africa. The strong influence of population density in our model suggests that high concentrations in groundwater are mainly from subsurface leakage of municipal sewage systems, petrol service stations (underground storage tanks), and agricultural chemicals in small-scale farming. Hence, sanitation programmes in Africa must not be delinked from groundwater protection and from controlling the use of fertilizer products in agriculture.
Nitrate concentrations were generally higher for shallower wells than for deeper groundwater systems. For deep groundwater, predicted nitrate concentration was lower as compared to shallow groundwater (Nolan et al., 2014). Alluvial and shallow aquifers are thus particularly vulnerable to nitrate pollution, while deep confined aquifers are generally better protected. The inverse relation between depth and nitrate is consistent with previous groundwater studies that considered well depth or depth of the screened interval as explanatory variables (Nolan and Hitt, 2006; Nolan et al., 2014; Wheeler et al., 2015; Ouedraogo and Vanclooster, 2016). Nitrate generally moves relatively slowly in soil and groundwater, and therefore there is a significant time lag between the polluting activity and detection of the pollutant in groundwater (typically between 1 and 20 years, depending on the situation) (Boy-Roura, 2013; Mattern and Vanclooster, 2009). Deeper groundwater may, therefore, predate periods of intensive fertilizer application (1950-present).
The rate at which nitrate moves through the subsurface depends on the permeability and extent of fissuring of soil and aquifer, which control flow, diffusion, and dispersion processes. According to Close (2010), nitrate is negatively charged and thus electrostatically repelled by media in the unsaturated zone that usually have a negative charge, such as clay minerals. This means that nitrate sorption within the unsaturated zone is unlikely and that the large residence times are related to the slow physical transport process. Foster and Crease (1974) and Young et al. (1976) were the first authors to mention a "storage of nitrate" in porewater and consequent slow vertical migration through the unsaturated zone towards groundwater systems. More recently, other investigators showed the process of nitrate accumulation in the unsaturated zone (Ascott et al., 2016; Wang et al., 2016; Worall et al., 2015). The long travel distances towards deep aquifer systems increase the probability that nutrients will react, for instance through denitrification (Stevenson and Cole, 1999; Thayalakumaran et al., 2004; Aljazzar, 2010; Wheeler et al., 2015). Denitrification is facilitated by the absence of oxygen. Denitrification was found to be relatively limited in the unsaturated zone (Kinniburgh et al., 1994; Rivett et al., 2008), while it is the principal process responsible for reduction of nitrate in groundwater (Aljazzar, 2010; Stevenson and Cole, 1999; Thayalakumaran et al., 2004), in particular in reduced groundwater (Burow et al., 2013). Boy-Roura et al. (2013), for instance, found low nitrate concentrations (below 50 mg L⁻¹) in those areas where denitrification processes have been identified. An indicator of the presence of denitrification processes thus helped to explain nitrate contamination in the Osona region (north-eastern Spain) (Boy-Roura et al., 2013). In our study, an indicator of the presence of denitrification processes in the groundwater system was not available and could not be included in the model.
Another remark concerns the presence of nitrate in some specific geological formations. According to Tredoux and Talma (cited in Xu and Usher, 2006), an apparent correlation may exist between the occurrence of high nitrate levels and certain geological formations. This apparent correlation, however, is mainly due to secondary effects. Only in exceptional cases can geological formations serve as a primary source of nitrogen. This happens when contamination ions are incorporated into rock minerals to be released by weathering and oxidized to nitrate. These authors further concluded that in most cases, the occurrence of high levels of nitrate is due to contamination related to anthropogenic activities.
The strong relation between nitrate contamination and both groundwater depth and population density is a particular point of concern given the fact that the majority (85 %) of Africa's population lives in regions where depth to groundwater is shallow (0-50 m b.g.l.) and where hand pumps may be used to abstract water. Eight percent of these people (i.e. nearly 66 million people) are likely to live in areas where depth to groundwater is 0-7 m b.g.l. A significant minority (8 %) of Africa's population lives in regions where the depth to groundwater is between 50 and 100 m b.g.l., and common hand pump technologies (e.g. India Mark) are inoperable in these cases. These areas are mainly within southern Africa and to a lesser extent situated in the Sahel.
A third important explanatory variable that was included in the model was the groundwater recharge rate. The recharge rate of an aquifer is indeed another factor that controls groundwater flow regime and hence the movement of nitrate. Nitrate can easily be transported to shallow groundwater in well-drained areas with rapid infiltration and highly permeable subsurface materials. However, according to a recent study in the shallow unconfined aquifer of the Piemonte plain, dilution can be considered the main cause of nitrate attenuation in groundwater (Debernardi et al., 2007). The variable recharge in our model is consistent with studies like Hanson (2002) and Saffigna and Keeney (1997). According to UNEP/DEWA (2014), recharge from multiple sources influences groundwater microbial and chemical water quality. Groundwater recharge rate is interlinked with many other environmental variables, including, but not limited to, soil type, aquifer type, antecedent soil water content, land use/land cover type, and rainfall (Sophocleous, 2004; Ladekarl et al., 2005; Anuraga et al., 2006). Hence, to avoid multicollinearity, variables like land use/land cover type, rainfall, and soil type were not considered in the final model.
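The multicollinearity screening can be illustrated with the tolerance statistic, which for a pair of predictors reduces to one minus their squared Pearson correlation. The Python sketch below is a minimal example under that two-predictor assumption. The rainfall and recharge values are invented, and the tolerance threshold of about 0.1 mentioned in the comment is a common rule of thumb rather than the criterion applied in this study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def tolerance_and_vif(x, y):
    """Tolerance (1 - r^2) and variance inflation factor for two predictors."""
    tol = 1.0 - pearson_r(x, y) ** 2
    return tol, 1.0 / tol

# Hypothetical values: annual rainfall (mm) and recharge (mm per year).
rainfall = [200, 400, 600, 800, 1000, 1200]
recharge = [10, 30, 60, 110, 150, 220]

tol, vif = tolerance_and_vif(rainfall, recharge)
# A tolerance below about 0.1 (VIF above 10) is often read as a sign that
# the two predictors should not enter the same model.
```

With more than two predictors, the tolerance of each variable is instead one minus the R² obtained by regressing it on all the others.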
Despite land cover/land use type not being explicitly included in the final model, the exploratory analysis clearly shows a strong relationship between nitrate concentration and land use/land cover type. Indeed, nitrate concentrations are generally higher in urban areas. This is consistent with many other studies such as Showers et al. (2008). The high contamination in urban areas jeopardizes groundwater exploitation in these areas. Urbanization is a pervasive phenomenon around the world, and groundwater demands in urban areas are growing. The degradation of groundwater bodies in urban areas is, therefore, a particular point of concern. Also, agricultural land exhibits an impact on groundwater nitrate concentrations compared to the grassland/shrubland, water bodies, and forest/bare area, but this effect is less important as compared to agricultural land effects in other parts of the world (e.g. Europe).
The influence of aquifer type on the nitrate contamination was demonstrated by Boy-Roura et al. (2013) and the influence of soil type by Liu et al. (2013). As with land cover/land use type, these variables were not retained in the final model to avoid collinearity with recharge.
The advantage of the MLR technique is that it can be easily implemented and that model parameters can be easily interpreted if the possible interaction between variables is ignored. However, MLR cannot represent well the many non-linear dynamics that are associated with the contamination of groundwater systems. The violation of the homoscedasticity hypothesis, for instance, indicates that some bias will be present in our MLR model. Standard statistical models employed in distribution modelling, such as MLR, work under the assumption of independence in the residuals and homoscedasticity. When heteroscedasticity is present, residuals may be autocorrelated. This will lead to inflated estimates in degrees of freedom, an underestimation of the residual variances, and an overestimation of the significance of effects (Legendre and Fortin, 1989; Legendre, 1993; Dale and Fortin, 2002; Keitt et al., 2002). This may show that other variables should be included in the model or that the system may be highly non-linear.
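The homoscedasticity check discussed above can be sketched in a few lines. The Python example below implements the studentized (Koenker) form of the Breusch-Pagan statistic for a single predictor: the squared residuals of the fitted line are regressed on the predictor, and n times the R² of that auxiliary regression is compared with a chi-square distribution with one degree of freedom. The single-predictor restriction, the function names, and the synthetic data are assumptions for illustration, and the sketch does not reproduce the statistic reported in this study.

```python
import math

def simple_ols(x, y):
    """Intercept and slope of the least-squares line y = a + b x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def r_squared(x, y):
    """Coefficient of determination of the least-squares line."""
    a, b = simple_ols(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def breusch_pagan(x, y):
    """Studentized Breusch-Pagan statistic and p value for one predictor."""
    a, b = simple_ols(x, y)
    sq_resid = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    lm = len(x) * r_squared(x, sq_resid)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value

# Synthetic data: deterministic "errors" with alternating sign.
x = list(range(1, 21))
sign = [1 if xi % 2 else -1 for xi in x]
y_growing = [2 * xi + s * 0.2 * xi for xi, s in zip(x, sign)]   # spread grows with x
y_constant = [2 * xi + s * 0.5 for xi, s in zip(x, sign)]       # constant spread

lm_het, p_het = breusch_pagan(x, y_growing)
lm_hom, p_hom = breusch_pagan(x, y_constant)
```

A small p value, as for `y_growing`, leads to rejecting constant residual variance, which is the same reading applied to the test result in this study.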
We could avoid heteroscedasticity and improve the modelling performance by introducing non-linear regression techniques (Prasad et al., 2006) or by introducing additional variables in the model. Indeed, many studies showed that non-linear statistical models of groundwater contamination outperform the linear models (e.g. Pineros-Garcet et al., 2006; Mattern et al., 2009; Oliveira et al., 2012; Wheeler et al., 2015). To uncover non-linear relationships, non-parametric data mining approaches provide obvious advantages (Olden et al., 2008; Wiens, 1989; Dungan et al., 2002). Machine learning provides a framework for identifying other explanatory variables, building accurate predictions, and exploring other non-linear mechanistic relationships in the system. We may, therefore, expect that non-linear statistical models will improve the explanatory capacity of the model and remove heteroscedasticity from the model.
However, we believe that this theoretical constraint of heteroscedasticity does not undermine the overall results. The observed heteroscedasticity can be considered modest in view of the large extent of the study and the violation of statistical design criteria when collecting data through a meta-analysis. Also, the interpretation of the factors and coefficients associated with non-linear regression techniques becomes more complicated. We, therefore, prefer to maintain in this paper the MLR techniques as a first approach to screen the factors that contribute to log transformed mean nitrate concentration risk. We suggest, however, that future studies should address the added value that can be generated with non-linear modelling techniques. Such non-linear modelling techniques are particularly needed for the maximum concentration, for which the R² of simple MLR currently remains too low, and also for the minimum concentration, for which the normality assumption is not met.
Also, in this study, we only identified an MLR model based on a meta-analysis spanning the African continent. Since the data collected through the meta-analysis are very heterogeneous, the quality of the data set remains rather poor. Therefore, future studies should critically address the validity of the identified model and explore how the model can be improved and used in a predictive mode. Such model improvement and validation steps should, however, be based on a more homogeneous data set. We, therefore, suggest performing this future model validation and model improvement step using data collected at the regional scale using more homogeneous data collection protocols.

Conclusion
Contamination of groundwater by nitrate is an indicator of groundwater quality degradation and remains a point of concern for groundwater development programmes all over the world. It is also a good proxy of overall groundwater vulnerability to water quality degradation. We address in this paper the issue of nitrate contamination of groundwater at the African scale. We inferred the spatial distribution of nitrate contamination of groundwater from a meta-analysis of published field studies of groundwater contamination. We analysed the literature for reported mean, minimum and maximum concentrations of nitrate contamination. We subsequently analysed, using boxplots, the reported contamination in terms of spatially distributed environmental attributes related to pollution pressure and attenuation capacity. We extracted the explanatory variables from a geographic information system with the ArcGIS 10.3™ tool.
We finally developed an MLR statistical model allowing us to explain quantitatively the log transformed observed contamination, which is a proxy of vulnerability, in terms of spatially distributed attributes. We selected the explanatory variables using a stepwise regression method.
Groundwater contamination by nitrates is reported throughout the African continent, except for a large part of the Sahara. The observed nitrate concentrations range from 0 to 4625 mg L⁻¹. The mean nitrate concentration varies between 1.26 and 648 mg L⁻¹. The sample mean of this mean nitrate concentration is 54.85 mg L⁻¹, its standard deviation is 89.91 mg L⁻¹, and its median is 27.58 mg L⁻¹. The minimum nitrate concentration varies between 0 and 185 mg L⁻¹, while the maximum concentration varies from 0.08 to 4625 mg L⁻¹. The sample means of the minimum and maximum concentrations are 8.91 and 190.05 mg L⁻¹; the sample standard deviations are 23.17 and 428.69 mg L⁻¹; and the sample medians are 0.55 and 73.64 mg L⁻¹ respectively. The distribution of the reported nitrate contamination data is strongly skewed. We, therefore, built statistical models for the log transformed mean and maximum concentrations.
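The effect of the log transform on a skewed sample can be shown with a short Python sketch. The concentrations below are invented for illustration, and dropping non-positive values before taking the natural logarithm is an assumption of this example, since the handling of zero concentrations is not described here.

```python
import math
import statistics

def summarize(values):
    """Sample mean, median, and standard deviation."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
    }

def log_transform(values):
    """Natural log of positive values; non-positive values are dropped
    because ln is undefined there (an assumption of this sketch)."""
    return [math.log(v) for v in values if v > 0]

# Hypothetical mean nitrate concentrations in mg per litre.
nitrate = [1.3, 5.0, 12.0, 27.5, 40.0, 90.0, 648.0]

raw = summarize(nitrate)
logged = summarize(log_transform(nitrate))
# In a right-skewed sample the mean sits well above the median; after the
# ln transform the two are much closer.
```

The same pattern appears in the reported statistics, where the sample mean of the mean concentration (54.85 mg L⁻¹) is about twice its median (27.58 mg L⁻¹).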
The graphical boxplot analysis shows that nitrate contamination is important in shallow groundwater systems and strongly influenced by population density and recharge rate. Nitrate contamination is, therefore, a particular point of concern for groundwater systems in urban sectors.
The MLR model for the log transformed mean nitrate concentration uses "the depth to groundwater", "groundwater recharge rate", "aquifer type" and "population density" as explanatory variables. The total variability explained by the model is 65 %. This suggests that other variables may be needed to explain the reported nitrate concentrations. These findings highlight the challenges in developing appropriate regional databases to predict groundwater degradation. The MLR shows that the population density parameter is the most statistically significant variable. This confirms that leaking cesspits and sewer systems are causing significant nitrate contamination of groundwater, predominantly in urban areas. We identified similar MLR models for the log transformed maximum nitrate concentrations. However, for this latter attribute, the explained variation using the simple MLR techniques (i.e. 42 %) remains small.
One of the main strengths of our study is that it is based on a large database of groundwater contamination reports from different countries, spanning the African continent and linked to environmental attributes that are available in a spatially distributed high-resolution format. In addition, the development of a continental-scale model of nitrate contamination in groundwater of Africa allowed us to determine which explanatory variables mainly influence the presence of nitrate. This represents an important step in managing and protecting both water resources and human health at the African scale. The main weakness of the modelling approach lies in the lack of detailed information available at the African scale, particularly the lack and uneven distribution of measured nitrate points. In spite of weaknesses and uncertainties caused by a moderate heteroscedasticity of the model residuals, the modelling approach presented here has great potential. Although the meta-analysis should not replace systematic nitrate monitoring, it gives a first indication of possible contamination. It can also be applied to the preliminary assessment of nitrate using spatial variables. This may support the water resource development programme for transboundary aquifer managers and regional basin organizations. This is particularly important as the demand for drinking water is increasing rapidly at the African scale.
We suggest that further development include the use of non-linear modelling techniques such as random forest techniques. Such techniques have the potential to improve the quality of explanation and eventually prediction by incorporating spatial autocorrelation. We also suggest that the models should be further validated using more homogeneous data sets. In a predictive mode, statistical models like those developed in the present paper can be used for exposure estimates in epidemiological studies on the effect of polluted groundwater on human health. Similar models can also be developed for other contaminants and could be explored.

Figure 3. The locations and the maximum values of nitrate in Africa superimposed on a risk pollution map as generated in the previous generic vulnerability study of Ouedraogo et al. (2016).
We evaluated model performance based on the significance level of estimated coefficients, the coefficient of determination (R²), the mean square error (MSE), the probability plots of model residuals (PRES), the plots of predicted vs. observed values, and the Akaike information criterion (AIC). High values of R² and low values of MSE, PRES, and AIC indicate a better performance of the model. To validate the model obtained by the stepwise procedure, the standard regression diagnostics were assessed. To test for heteroscedasticity in the model residuals, we used the Breusch-Pagan (BP) test as implemented in the "lmtest" package. A Student's t test was finally used to check the statistical significance (with p values < 0.10) of variables in the final model. We assessed tolerance to examine whether multicollinearity exists between variables. In this study, we performed the statistical analyses using R version 3.1.1 (R Development Core Team, 2015).
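Three of the performance measures listed above can be computed directly from observed and predicted values, as in the following Python sketch. It assumes that MSE is the residual sum of squares divided by the number of observations and that AIC is based on the Gaussian log-likelihood with the error variance counted as one extra parameter. The function name `fit_metrics` and the numbers are hypothetical.

```python
import math

def fit_metrics(observed, predicted, n_coef):
    """Return (R2, MSE, AIC) for a fitted linear model.

    n_coef is the number of regression coefficients, intercept included.
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    tss = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - rss / tss
    mse = rss / n
    # Gaussian log-likelihood evaluated at the ML variance estimate rss / n.
    loglik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    aic = -2 * loglik + 2 * (n_coef + 1)   # +1 for the error variance
    return r2, mse, aic

# Hypothetical observed and predicted ln concentrations.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

r2, mse, aic = fit_metrics(observed, predicted, n_coef=2)
```

For the same residuals, each additional coefficient raises this AIC by 2, which is how the criterion penalizes variables that do not improve the fit.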

Figure 7. Log transformed mean nitrate concentration for different aquifer system classes.

Figure 8. Log transformed mean nitrate concentration for different groundwater depth classes (a), recharge classes (b), population density classes (c) and nitrogen application rate classes (d).

Figure 11. Relation between residuals and predicted log transformed mean nitrate concentration.

Table 1. Criteria used to identify nitrate data studies within web databases.

Table 2. Localization of study sites considered in the meta-analysis.

Table 3. Explanatory variables used in the MLR analysis.
1 Université Catholique de Louvain/Earth and Life Institute/Environmental sciences; 2 Socioeconomic Data and Applications Center (SEDAC); 3 The new global lithological map database - GLiM: a representative of rock properties at the earth's surface; 4 Consultative Group for International Agricultural Research (CGIAR)/Consortium for Spatial Information (CSI); 5 A glimpse beneath the earth's surface: Global Hydrogeology MaPS (GLHYMPS) of permeability and porosity.

Table 4. Summary statistics of original and log (ln) transformed nitrate data.
Table 5 summarizes the results of this linear regression model. This model can explain 65 % of the log transformed mean nitrate concentration observations.