Evaluating performance of simplified physically based models for shallow landslide susceptibility

Rainfall-induced shallow landslides can lead to loss of life and significant damage to private and public properties, transportation systems, etc. Predicting locations that might be susceptible to shallow landslides is a complex task and involves many disciplines: hydrology, geotechnical science, geology, hydrogeology, geomorphology, and statistics. Two main approaches are commonly used: statistical or physically based models. Reliable model applications involve automatic parameter calibration, objective quantification of the quality of susceptibility maps, and model sensitivity analyses. This paper presents a methodology to systemically and objectively calibrate, verify, and compare different models and model performance indicators in order to identify and select the models whose behavior is the most reliable for particular case studies. The procedure was implemented in a package of models for landslide susceptibility analysis and integrated in the NewAge-JGrass hydrological model. The package includes three simplified physically based models for landslide susceptibility analysis (M1, M2, and M3) and a component for model verification. It computes eight goodness-of-fit indices by comparing pixel-by-pixel model results and measurement data. The integration of the package in NewAge-JGrass uses other components, such as geographic information system tools, to manage input–output processes, and automatic calibration algorithms to estimate model parameters. The system was applied for a case study in Calabria (Italy) along the Salerno–Reggio Calabria highway, between Cosenza and Altilia. The area is extensively subject to rainfall-induced shallow landslides mainly because of its complex geology and climatology. The analysis was carried out considering all the combinations of the eight optimized indices and the three models. Parameter calibration, verification, and model performance assessment were performed by a comparison with a detailed landslide inventory map for the area. The results showed that the index distance to perfect classification in the receiver operating characteristic plane (D2PC) coupled with the model M3 is the best modeling solution for our test case.


Introduction
Landslides are one of the most dangerous geohazards worldwide and constitute a serious menace for public safety leading to human and economic losses (Park, 2011).Geoenvironmental factors such as geology, land use, vegetation, climate, and increasing population sizes may increase the occurrence of landslides (Sidle and Ochiai, 2006).Landslide susceptibility assessments, i.e., the likelihood of a landslide occurring in an area on the basis of local terrain conditions (Brabb, 1984), are not only crucial for an accurate landslide hazard quantification but also a fundamental tool for the environmental preservation and responsible urban planning (Cascini et al., 2005).
Many methods for landslide susceptibility mapping have been developed and can be grouped in two main branches: qualitative and quantitative methods (Glade and Crozier, 2005;Corominas et al., 2014 and references therein).
Qualitative methods, based on field campaigns and expert knowledge and experience, are subjective but necessary to validate quantitative method results.Quantitative meth-Published by Copernicus Publications on behalf of the European Geosciences Union.
ods include statistical and physically based methods.Statistical methods (e.g., Naranjo et al., 1994;Chung et al., 1995;Guzzetti et al., 1999;Catani et al., 2005) use different approaches such as bivariate statistics, multivariate analysis, discriminant analysis, and the random forests approach to link instability factors (such as geology, soil, slope, curvature, and aspect) with past and present landslides.Bivariate statistical methods ignore the interdependence of instability factors, whereas multivariate analysis is able to statistically consider their interactions.Other data-driven methods for landslide susceptibility analysis include the use of neural networks (Pradhan, 2011;Conforti et al., 2014), support vector machines (Pradhan, 2013 and citations therein), and Bayesian networks (Lee et al., 2002).Deterministic models (e.g., Montgomery and Dietrich, 1994;Lu andGodt, 2008, 2013;Borga et al., 2002;Simoni et al., 2008;Capparelli and Versace, 2011) synthesize the interaction between hydrology, geomorphology, and soil mechanics in order to physically understand and predict the location and timing that trigger landslides.These models generally include a hydrological and a slope stability component.The hydrological component simulates infiltration and groundwater flow processes with different degrees of simplification, from steady state (e.g., Montgomery and Dietrich, 1994) to transient analyses (Simoni et al., 2008).The soil stability component simulates the slope safety factor (FS) defined as the ratio of stabilizing to destabilizing forces.One of the main advantages of datadriven methods for landslide susceptibility is that they can be easily applied in wide areas while deterministic models are in general applied in local analyses.The latter are more computationally expensive and require detailed input data and parameters, which often involve high uncertainty.On the other hand, data-driven methods assume that landslides are caused by the same combination of instability factors over the entire study area, whereas deterministic models enable different triggering mechanisms to be understood and investigated.
The results of a landslide susceptibility analysis strongly depend on the model hypothesis, parameter values, and parameter estimation method.Questions regarding the performance evaluation of the landslide susceptibility model, the choice of the most accurate model, and the selection of the best-performing method for parameter estimation are still open.Thus, a procedure is needed that facilitates reproducible comparisons between different models and evaluation criteria aimed at the selection of the most accurate models.
Much effort has been devoted to the crucial problem of evaluating landslide susceptibility model performance (e.g., Dietrich et al., 2001;Frattini et al., 2010;Guzzetti et al., 2006).Accurate discussions about the most common quantitative measures of goodness of fit (GOF) between measured and modeled data are discussed in Bennet et al. (2013), Jolliffe and Stephenson (2012), Beguería (2006), and Brenning (2005 and references therein).We have summarized them in Appendix A. Usually one of these indices is selected and used as an objective function (OF) in combination with a calibration algorithm in order to obtain the optimal set of model parameters.However, in most cases, the selection of the OF is not justified or compared with other options.
The wrong classifications in landslide susceptibility analysis not only risk a loss of life but also have economic consequences.For example, locations classified as stable increase their economical value because no construction restrictions will be applied, while the reverse is true for locations classified as unstable.
In this work, we propose an objective methodology for environmental model analysis which selects the bestperforming model based on a quantitative comparison and assessment of model prediction skills.In this paper, the methodology is applied to assess the performance of simplified landslide susceptibility models.As the procedure is model independent, it can be used to assess the ability of any type of environmental model to simulate natural phenomena.
Unlike previous applications, our methodology aims to objectively i. select a set of the most appropriate OFs in order to determine the best model parameters; ii. compare the performance of a model using the parameter sets selected in the previous step in order to identify the OFs that provide particular and not redundant information; and iii. perform a model parameter sensitivity analysis in order to understand the relative importance of each parameter and its influence on the model performance.
The methodology enables the user to i. identify the most appropriate OFs for estimating the model parameters and ii. compare different models in order to select the best one that estimates the landslide susceptibility of the study area.
The procedure is implemented in the open-source and GIS-based hydrological model, denoted as NewAge-JGrass (Formetta et al., 2014a) which uses the Object Modeling System (OMS, David et al., 2013) modeling framework.OMS is a Java-based modeling framework which promotes the idea of programming by components.It provides the model developers with many features such as multithreading, implicit parallelism, model interconnection, and a GIS-based system.
The NewAge-JGrass system, Fig. 1, contains models, automatic calibration algorithms for model parameter estimation, and methods for estimating the goodness of the model prediction.The open-source GIS uDig (http://udig.refractions.net/)and the uDig Spatial Toolbox (Abera et al., 2014, https://code.google.com/p/jgrasstools/wiki/JGrassTools4udig) are used as a visualization and input-output data management system.The OMS framework has been previously used as the core for landslide modeling (Formetta et al., 2015(Formetta et al., , 2016)).These studies deal with realtime early warning systems for landslide risks and involve 3-D physically based hydrological modeling of very small catchments (up to around 20 km 2 ).In contrast, the current application focuses on wider areas of landslide susceptibility assessments using completely different physically based models which are presented in the next section.
The methodology presented in this paper for landslide susceptibility analysis (LSA) represents one model configuration within the more general NewAge-JGrass system.It includes two new models specifically developed for this paper: mathematical components for landslide susceptibility mapping and procedures for landslide susceptibility model verification and selection.The LSA configuration also uses two models that have already been implemented in NewAge-JGrass: the geomorphological model setup and the automatic calibration algorithms for model parameter estimation.All the models used in the LSA configuration are presented in Fig. 1, encircled with a dashed red line.
The methodology is presented in Sect. 2. It was set up considering three different landslide susceptibility models, eight GOF metrics, and one automatic calibration algorithm.The flexibility of the system enables more models and GOF metrics to be added, and different calibration algorithms can be used.Thus, different LSA configurations can be created depending on the landslide susceptibility model, the calibra-tion algorithm, and the GOFs selected by the user.Finally, Sect. 3 presents a case study of landslide susceptibility mapping along the A3 Salerno-Reggio Calabria highway in Calabria, which illustrates the capability of the system.

Modeling framework
The LSA is implemented in the context of NewAge-JGrass (Formetta et al., 2014a), an open-source, large-scale hydrological modeling system.It models the whole hydrological cycle: water balance, energy balance, snow melting, etc. (Fig. 1).The system implements hydrological models, automatic calibration algorithms for model parameter optimization and evaluation, and a GIS for input-output visualization (Formetta et al., 2011(Formetta et al., , 2014a)).NewAge-JGrass is a component-based model: each hydrological process is described by a model (energy balance, evapotranspiration, runoff production in Fig. 1).Each model implements one or more components (considering, for example, the model evapotranspiration in Fig. 1, the user can select between three different components: Penman-Monteith, Priestly-Taylor, and Fao).In addition, each component can be linked to the others and executed at runtime, building a model configuration.Figure 1 offers a complete picture of the system and the integration of the new LSA configuration encircled with dashed red lines.More precisely, the LSA in the current configu-G.Formetta et al.: Performance of simplified physically based models for shallow landslide susceptibility ration includes two new models: a landslide susceptibility model and a verification and selection model.The first includes three components proposed in Montgomery and Dietrich (1994), Park et al. (2013), andRosso et al. (2006); the latter includes the three-step verification procedure (3SVP), presented in Sect. 2. The LSA configuration also includes another two models previously implemented in the NewAge-JGrass system: i. the Horton Machine for geomorphological model setup, which computes input maps such as slope and total contributing area and displays the model's results, and ii. the particle swarm for automatic calibration.
Section 2.1 presents the landslide susceptibility model and Sect.2.2 presents the model selection procedure (3SVP).

Landslide susceptibility models
The landslide susceptibility models implemented in NewAge-JGrass and presented in a preliminary application in  2006) model (M3).The three models are derived from simplifications of the infinite slope equation (Grahm, 1984;Rosso et al., 2006;Formetta et al., 2014) for the factor of safety: where FS (-) is the factor of safety, C = C +C root is the sum of C root , the root strength (kN m −2 ), and C the effective soil cohesion (kN m −2 ), ϕ (-) is the internal soil friction angle, H is the soil depth (m), α (-) is the slope angle, γ w (kN m −3 ) is the specific weight of water, and w = h/H (-) where h (m) is the water table height above the failure surface (m), G s (-) is the specific gravity of soil, e (-) is the average void ratio, and S r (-) is the average degree of saturation.
The model M1 assumes a hydrological steady state, with flow occurring in the direction parallel to the slope, and neglects cohesion, degree of soil saturation, and void ratio.It computes w as where T (L 2 T −1 ) is the soil transmissivity defined as the product of the soil depth and the saturated hydraulic conductivity and b (L) is the length of the contour line.Substituting Eq. ( 2) in Eq. ( 1), the model is solved for Q / T assuming FS = 1 and stable and unstable sites are defined using threshold values on log(Q / T ) (Montgomery and Dietrich, 1994).
Unlike M1, the model M2 considers (i) the effect of the degree of soil saturation (S r (-)) and void ratio (e (-)) above the groundwater table and (ii) the stabilizing contribution of the soil cohesion.The model output is a map of safety factors (FS) for each pixel of the analyzed area.
The component M3 considers both the effects of rainfall intensity and duration on the landslide triggering process.The term w depends on rainfall duration and is obtained by coupling the conservation of mass of soil water with Darcy's law (Rosso et al., 2006), providing (3) These models are suitable for shallow translational landslides controlled by groundwater flow convergence.Shallow landslides usually have a very low ratio between the maximum depth (D) and the length (L) of scar (D/L < 0.1; Casadei et al., 2003), involve a small volume of the colluvial soil mantle, and present a generally translational failure mechanism (Milledge et al., 2014).
Each component has a user interface which specifies the input and output.Model inputs are computed in the GIS uDig integrated in the NewAge-JGrass system by using the Horton Machine package for terrain analysis (Abera et al., 2014).Model output maps are directly imported in the GIS and are available for the user's visualization.
The models that we implemented present an increasing degree of complexity in terms of the theoretical assumptions for modeling landslide susceptibility.Moving from M1 to M2, the soil cohesion and soil properties were considered, and moving from M2 to M3, rainfall of finite duration was used.

Automatic calibration and model verification procedure
In order to assess the models' performance, we developed a model that computes the most common indices for assessing the quality of a landslide susceptibility map.These indices are based on a pixel-by-pixel comparison between the observed landslide (OL) and predicted landslide (PL) maps.They are binary maps with positive pixels corresponding to "unstable" ones, and negative pixels that correspond to "stable" ones.Therefore, four types of outcomes are possible for each cell.A pixel is a true positive (tp) if it is mapped as "unstable" both in OLs and in PLs, which is a correct alarm with well-predicted landslides.A pixel is a true negative (tn) if it is mapped as "stable" both in OLs and in PLs, which corresponds to a well-predicted, stable area.A pixel is a false positive (fp) if it is mapped as "unstable" in PLs, but is "stable" in OLs; that is a false alarm.A pixel is a false negative (fn) if it is mapped as "stable" in PLs, but is "unstable" in OLs; that is a missed alarm.The concept of the receiver operator characteristic (ROC; Goodenough et Hydrol.Earth Syst. Sci., 20, 4585-4603, 2016 www.hydrol-earth-syst-sci.net/20/4585/2016/  , 1974) graph is based on the values assumed by tp, fp, and tn.ROCs are used to assess the performance of models which provides results assigned to one of two classes.
The ROC graph is widely used in many scientific fields, such as medicine (Goodenough et al., 1974), biometrics (Pepe, 2003), and machine learning (Provost and Fawcett, 2001).
The ROC graph is a Cartesian plane with the FPR on the x axis and TPR on the y axis.FPR is the ratio between false positives and the sum of false positives and true negatives, and TPR is the ratio between true positives and the sum of true positives and false negatives.They are defined in Table 1 and commented on in Appendix A. The performance of a perfect model corresponds to the point P(0,1) on the ROC plane.Points that fall on the bisector (black solid line on the plots) are associated with models that are considered random: they predict stable or unstable cells with the same rate.
Eight GOF indices for the quantification of model performance were implemented in the system.Table 1 shows their definition, range, and optimal values.A more comprehensive description of the indices is provided in Appendix A.
Automatic calibration algorithms implemented in NewAge-JGrass as OMS components can be used in order to tune the model parameters to reproduce the actual landslides.This is possible because each model is an OMS component and can be linked to the calibration algorithms as it is, without rewriting or modifying its code.Three calibration algorithms are embedded in the system core: Luca (Hay et al., 2006), a step-wise algorithm based on shuffled complex evolution (Duan et al., 1992); particle swarm optimization (PSO), a genetic model presented in Kennedy and Eberhart (1995); and DREAM (Vrugt et al., 2008), an acronym for differential evolution adaptive metropolis.In the actual configuration, we used a PSO algorithm to estimate optimal values of the model parameters.
During the calibration procedure, the selected algorithm compares the model output in terms of a binary map (stable or unstable pixel) with the actual landslide, thus optimizing a selected objective function (OF).The model parameter set for which the OF assumes its best value is the optimization procedure output.The eight GOF indices presented in Table 1 were used in turn as OFs and, consequently, eight optimal parameters sets were provided as the calibration output (one for each optimized OF).This means that a GOF index selected in Table 1 becomes an OF when it is used as an objective function of the automatic calibration algorithm.
In order to quantitatively analyze the model performance, we implemented a three-step verification procedure (3SVP).Firstly, we evaluated the performance of each OF index for each model.We presented the results in the ROC plane in order to assess what the OF index(es) was (were), and whose optimization provided the best model performance.Secondly, we verified whether each OF metric had its own information content or whether it provided information analogous to other metrics (and thus not essential).
Lastly, for each model, the sensitivity of each optimal parameter set was tested by perturbing optimal parameters and by evaluating their effects on the GOF.

Site description
The test site was located in Calabria, Italy, along the Salerno-Reggio Calabria highway between Cosenza and Altilia municipalities, in the southern part of the Crati basin (Fig. 2).The mean annual precipitation is about 1200 mm, distributed over approximately 100 rainy days, with a mean annual temperature of 16 • C. Rainfall peaks occur from October to March, when mass wasting and severe water erosion processes are triggered (Capparelli et al., 2012;Conforti et al., 2011;Iovine et al., 2010).
In the study area, the topographic elevation has an average value of around 450 m a.s.l., with a maximum value of 730 m a.s.l.Slopes, computed from the 10 m resolution digital elevation model, range from 0 to 55 • , while the average is about 26 • .
The Crati Basin is a Pleistocene-Holocene extensional basin filled with clastic marine and fluvial deposits (Vezzani, 1968;Colella et al., 1987;Fabbricatore et al., 2014).The stratigraphic succession of the Crati Basin can be simply di-vided into two sedimentary units as suggested by Lanzafame and Tortorici (1984).The first unit is a Lower Pliocene succession of conglomerates and sandstones passing upward into a silty clay (Lanzafame and Tortorici, 1986) second unit.This is a series of clayey deposits grading upward into sandstones and conglomerates which refer to Emilian and Sicilian, respectively (Lanzafame and Tortorici, 1986), as also suggested by data provided by Young and Colella (1988).
In the study area, the second unit outcrops.A topsoil of about 1.5-2.0m lies on sandy-gravelly and sandy deposits, which are generally well-stratified.Soils range from Alfisols (i.e., highly mature soils) to Inceptisols and Entisols (i.e., poorly developed soils).Due to the combination of such climatic, geostructural, and geomorphological features, the test site is one of the most landslide-prone areas in Calabria (Conforti et al., 2014;Carrara and Merenda, 1976;Iovine et al., 2006).
Mass movements were analyzed from 2006 to 2013 by integrating aerial photography interpretation acquired in 2006, 1 : 5000 scale topographic map analysis, and an extensive field survey.
All the data were digitized and stored in a GIS database (Conforti et al., 2014) and the result was the map of occurred landslides, presented in Fig. 2d.Digital elevation model, slope, and total contributing area (TCA) maps are presented in Fig. 2a, b, and c, respectively.In order to perform model calibration and verification, the data set of occurred landslides was divided in two parts one used for calibration (located at bottom of Fig. 2d) and one for validation (located in the upper part of Fig. 2d).The landslide inventory map refers only to the initiation area of the landslides.This leads to a fair comparison with the landslide models that provide only the triggering point and does not include a run-out model for landslide propagation.

Results and discussion
The LSA presented in the paper was applied to the Salerno-Reggio Calabria highway, between Cosenza and Altilia (southern Italy).Section 3.1 describes the model parameter calibration and the model verification procedure, Sect.3.2 presents the model performance correlation assessment, Sect.3.3 presents the robustness analysis of the GOF indices used, and lastly, Sect.3.4 presents the computation of the susceptibility map.
The component PSO provides eight best parameter sets, one for each optimized GOF indices.Values for each model (M1, M2, and M3) are presented in Table 3. Optimal parameter sets differ slightly among the models and among the optimized GOF indices for a given model.In addition, a compensation effect between the parameter values is evident.High values of friction angle are related to low cohesion values; high values of critical rainfall are related to high values of soil resistance parameters.For the model M1, the transmissivity value (74 m 2 d −1 ) optimizing ACC is much lower than the transmissivity values obtained by optimizing the other indices (around 140 m 2 d −1 ).Similar behavior was observed for the optimal rainfall value of 148 mm d −1 optimizing ACC, and around 70 mm d −1 optimizing the other indices.For the model M2, the optimal transmissivity and rainfall values optimizing CSI (10 m 2 d 1 and 95 mm d −1 ) are much lower than the values obtained by optimizing the other indices (around 50 and 250 mm d −1 on average).For the model M3, on the other hand, optimal parameters present the same order of magnitude for all the optimized indices.This suggests that the variability of the optimal parameter values for models M1 and M2 could be due to the compensation of the effects of important physical processes neglected by those models.
Executing the models using the set of eight optimal parameters, true positive rates and false positive rates are computed by comparing the model output and actual landslides for both the calibration and verification data sets.The results are presented in Table 4 for all three models (M1, M2, and M3).These points were reported in the ROC plane to visualize the effects of the optimized objective function on model performance in a unique graph.This procedure was repeated for the three models.ROC planes, considering all the GOF indices and all three models, are included in Appendix B both for the calibration and verification period.For models M2 and M3, it is clear that ACC, HSS, and CSI performed the worst.This is also true for model M1, although, unlike M2 and M3, there is no clear separation between the performance provided by ACC, HSS, and CSI and the remaining indices.
Among the results provided in Table 4, we focused on the GOF indices whose optimization satisfies the following condition: FPR < 0.4 and TPR > 0.7.This choice was made in order to focus comments on the results exclusively for the GOF indices which provide acceptable model results and in order to heighten the readability of graphs.
Figure 3 presents three ROC planes, one for each model, with the optimized GOF indices that provide FPR < 0.4 and TPR > 0.7.The results presented in Fig. 3 and Table 4 show that i. the optimization of AI, D2PC, SI, and TSS achieves the best model performance in the ROC plane, which is verified for all three models; ii. performance increase as model complexity increases: moving from M1 to M3, points in the ROC plane approach the perfect point (TPR = 1, FPR = 0); iii. by increasing the model complexity, good model results are achieved, not only in the calibration but also in the validation data set.In fact, moving from M1 to M2 soil cohesion and soil properties were considered, and moving from M2 to M3 rainfall of a finite duration was used.
The first step of the 3SVP procedure highlights that the optimization of AI, D2PC, SI, and TSS provides the best performance irrespectively of the model used.
Finally, it is important to consider the limitations of the models used for the current applications.Models M1 and M2 are not able to mimic the transient nature of precipitation and infiltration processes, and only M3 is able to account for the combined effect of storm duration and intensity in the triggering mechanism.In addition, in this study we neglected effects such as spatial rainfall variability, roads, and other engineering works.

Correlation assessment of the model performance
The second step in the procedure is to verify the information content of each optimized OF, checking whether it is the same as other metrics or it is a particular feature of the optimized OF.
Executing a model using one of the eight parameter sets (assuming, for example, the one obtained by optimizing CSI) enables all the remaining GOF indices to be computed, which we indicate as CSI CSI , ACC CSI , HSS CSI , TSS CSI , AI CSI , SI CSI , D2PC CSI , and ESI CSI for both the calibration and the verification data sets.Let us denote this vector with the name MP CSI : the model performance (MP) vector computed using the parameter set that optimizes CSI.MP CSI has 16 elements: 8 for the calibration and 8 for the validation data set.Repeating the same procedure for all eight GOF indices gives MP ACC , MP ESI , MP SI , MP D2PC , MP TSS , MP AI , and MP HS .Figure 4 presents the correlation plots (Murdoch and Chow, 1996) between all MP vectors, for each model M1, M2 or M3.The matrix is symmetric with an ellipse at the intersection of row i and column j .The color is the absolute value of the correlation coefficient between the MP i and MP j vectors.The eccentricity of the ellipse is scaled according to the correlation value: the more prominent it is, the less correlated are the vectors.If the ellipse leans towards the right, the correlation is positive; if it leans to the left, it is negative.
All indices present a positive correlation with each other, irrespectively of the model used.In addition, strong correlations between the MP vectors of AI, D2PC, SI, and TSS are evident in Fig. 4.This confirms that an optimization of AI, D2PC, SI, and TSS provides similar model performance, irrespectively of the model used.On the other hand, the remaining GOF indices give quite different information from the previous four indices; however, their performance was worse in the first step of the analysis.Thus, in the case study, using one of the four best GOFs is sufficient for the parameter estimation.

Model sensitivity assessment
In this step, we focused on models M2 and M3 and performed a parameter sensitivity analysis.Let us consider model M2 and the optimal parameter set computed by optimizing the critical success index (CSI).Also, considering the cohesion model parameter, the procedure evolves according to the following steps: -The starting parameter values are the optimal values derived from the optimization of the CSI index.
-All the parameters except the analyzed parameter (cohesion) were kept constant and equal to the optimal parameter set.
-A total of 1000 random values of the analyzed parameter (cohesion) were selected from a uniform distribution with the lower and upper bound defined in Table 1.With this procedure, 1000 model parameter sets were defined and used to execute the model.
-A total of 1000 values of the selected GOF index (CSI), computed by comparing model outputs with the measured data, were used to compute a boxplot of the parameter C and optimized index CSI.
The procedure was repeated for each parameter and for each optimized index.Results are presented in Figs. 5 and 6 for models M2 and M3, respectively.Each column in the figures represents one optimized index and has a number of boxplots equal to the number of model parameters (five for M2 and six for M3).Each boxplot represents the range of variation of the optimized index Hydrol.Earth Syst. Sci., 20, 4585-4603, 2016 www.hydrol-earth-syst-sci.net/20/4585/2016/The rows for which the condition FPR < 0.4 and TPR > 0.7 is verified are shown in bold.
due to a particular change in the model parameters.The narrower the boxplot for a given optimized index, the less sensitive the model is to that parameter.For both M2 and M3, the parameter set obtained by optimizing AI and SI shows the least sensitive behavior for almost all the parameters.In this case, a model parameter perturbation has little impact on the model's performance.However, the models with parameters obtained by optimizing ACC, TSS, and D2PC are the most sensitive to the parameter variations and this is reflected in much more evident changes in model performance.
Finally, it is important to consider that the methodology used for evaluating the parameter sensitivity is based on changing the parameters one at a time.Although this procedure facilitates an intercomparison of the results (because the parameter sensitivity is computed with reference to the optimal parameter set), it is does not take into account simultaneous variations or interactions between parameters.

Model selections and susceptibility maps
The selection of the most appropriate model for computing landslide susceptibility maps is based on what we learn from the previous steps.In the first step, we learn that (i) the optimization of AI, D2PC, SI, and TSS outperforms the remaining indices and (ii) models M2 and M3 provide more accu-  rate results than M1.The second step suggests that overall the model results obtained by optimizing AI, D2PC, SI, and TSS are similar each other.Lastly, the third step shows that the model performance derived from the optimization of AI and SI is less sensitive to input variations than D2PC and TSS.This could be due to the formulation of AI and SI which gives much more weight to the true negative compared to D2PC and TSS.
For our application, the model M3 with parameters obtained by optimizing D2PC was the most sensitive to the parameter variation, avoiding an "insensitive" or flat response by changing the parameter values.A more sensitive coupled model optimal parameter set will in fact accommodate any parameters, input data, or measured data variations responding to these changes with a variation in model performance.

Conclusions
We have presented a procedure to quantitatively calibrate, evaluate, and compare the performance of environmental models.The procedure was applied for the analysis of three landslide susceptibility models.It is made up of three steps: i. model parameter calibration, optimizing different GOF indices, and model evaluation in the ROC plane; ii. computation of the degree of similarities between different model performance obtained by optimizing all the considered GOF indices; iii. evaluation of model sensitivity to parameter variations.
The first step identifies the more appropriate OFs for the model parameter optimization.The second step verifies the information content of each optimized OF, checking whether it is analogous to other metrics or peculiar to the optimized OF.Finally, the last step quantifies the relative influence of each model parameter on the model performance.
The procedure was conceived as a model configuration of the hydrological system NewAge-JGrass.It integrates i. three simplified, physically based landslide susceptibility models; ii. a package for model evaluations based on pixel-by-pixel comparison of modeled and actual landslide maps; iii. model parameter calibration algorithms, and iv. the integration with the uDig open-source geographic information system for model input-output map management.
The system is open-source and available at (https://github.com/formeppe).It is integrated according to the OMS standards, which enables the user to easily integrate a generic landslide susceptibility model and use the complete framework presented in the paper, thus avoiding having to rewrite programming code.
The procedure was applied in a test case on the Salerno-Reggio Calabria highway and led to the following conclusions: 1. the OFs AI, D2PC, SI, and TSS coupled with the models M2 and M3 provided the best performance among the eights metrics used in the calibration; 2. the four selected OFs provided quite similar model performance in terms of MP vectors, i.e., one of them would be sufficient for the model application; 3. M3 showed the best performance by optimizing the D2PC index.In fact, M3 responded to parameter variations with changes in model performance.
In our application, effective precipitation was calibrated because we were performing a landslide susceptibility analysis and it was useful for demonstrating the method.However, we are aware that for operational landslide early warning systems, rainfall constitutes a fundamental input of the predictive process.In addition, the analysis would profit from data on the rainfall that triggered the landslides; however, such data are currently not available for the study area.
We believe that our system would be useful for decision makers who deal with risk management assessments.It could be improved by adding new landslide susceptibility models or different types of model selection procedures.

Data availability
Data for this paper are available from the corresponding author upon request. is large, the false alarm value is relatively overwhelmed.If tn is large, as in landslide maps, FPR tends to zero and TSS tends to TPR.A problem of TSS is that it treats the hit rate and the false alarm rate equally, irrespective of their likely differing consequences.TSS is similar to Heidke, except the constraint on the reference forecasts is that they are constrained to be unbiased.

Figure 1 .
Figure 1.Integration of the landslide susceptibility analysis system in NewAge-JGrass hydrological model.

Figure 2 .
Figure 2. Test site.(a) Digital elevation model (DEM) (m), (b) slope (-) expressed as tangent of the angle, (c) total contributing area (TCA) expressed as number of draining cells, and (d) map of actual landslides.

Figure 3 .
Figure 3. Models' performance results in the ROC plane for M1, M2, and M3.Only GOF indices whose optimization provides FPR < 0.4 and TPR > 0.7 were reported.

Figure 4 .
Figure 4. Correlation plot between models' performance (MP) vectors computed by optimizing all GOF indices in turn.Results are reported for each model: M1, M2, and M3.

Figure 7 .
Figure 7. Landslide susceptibility maps using model M3 and the parameter set obtained by optimizing D2PC.

Table 1 .
Indices of goodness of fit for comparison between actual and predicted landslide.

Table 3 .
Optimal parameter sets output of the optimization procedure of each GOF index in turn.Results are presented for each model (M1, M2, and M3).