Heterogeneity measures in hydrological studies : review and new developments

Regional frequency analysis is needed to estimate hydrological quantiles at ungauged sites or to improve estimates at sites with short record lengths, by transferring information from gauged sites. Some regional procedures, such as the index-flood method, require the delineation of homogeneous regions as a basic step for their application. The 10 homogeneity of these delineated regions is usually tested providing a yes/no decision. However, complementary measures that are able to quantify the degree of heterogeneity of a region are needed to compare regions, evaluate the impact of particular sites and rank the performance of different delineating methods. Well-known existing heterogeneity measures are not well-defined for ranking regions, as they entail drawbacks such as assuming a given probability distribution, providing negative values and being affected by the region size. Therefore, a framework for defining and assessing desirable properties 15 of a heterogeneity measure in the regional hydrological context is needed. In the present study, such a framework is proposed through a four-step procedure based on Monte Carlo simulations. Several heterogeneity measures, some of which commonly known, others derived from recent approaches or adapted from other fields are presented and developed to be assessed. The assumption-free Gini Index applied on the at-site L-variation coefficient (L-CV) over a region led to the best results. The measure of the percentage of sites for which the regional L-CV is outside the confidence interval of the at-site L20 CV is also found to be relevant, as it leads to more stable results regardless of the regional L-CV value. Thus, the application of both measures is recommended in practice.


Introduction
Regional hydrological frequency analysis (RHFA) is needed to estimate extreme hydrological events when no hydrological data are available at a target site or to improve at-site estimates especially for short data records (e.g.Burn and Goel, 2000;Requena et al., 2016).Delineation of homogeneous regions is a basic step for the application of a number of regional procedures such as the well-known index-flood method (Dalrymple, 1960;Chebana and Ouarda, 2009).Such a method employs information from sites within a given homogeneous region to estimate required quantiles at a given target site.The heterogeneity concept has been considered in a number of fields, including ecology, geology and information sciences (e.g. Hydrol.Earth Syst.Sci. Discuss., doi:10.5194/hess-2016-136, 2016 Manuscript under review for journal Hydrol.Earth Syst.Sci.Published: 5 April 2016 c Author(s) 2016.CC-BY 3.0 License.Li and Reynolds, 1995;Mays et al., 2002;Wu et al., 2008).However, the present paper focuses on the heterogeneity concept in hydrology derived from 'regional homogeneity'.In this regard, regional homogeneity is often defined as the condition that floods at all the sites in a region have the same probability distribution except for a scale factor (e.g.Cunnane, 1988).
In order to delineate homogeneous regions, numerous studies have proposed and compared similarity measures entailing climatic (e.g.mean annual rainfall), hydrologic (e.g.mean daily flow), physiographic (e.g.drainage area) and combined descriptors (see Ali et al., 2012) to be used as input to statistical tools for grouping sites.The selection of these descriptors is carried out by stepwise regression, principal components or canonical correlation, among others (e.g.Brath et al., 2001;Ouarda et al., 2001;Ilorme and Griffis, 2013).Known statistical tools, such as cluster analysis, or new approaches, such as the affinity propagation algorithm, are considered to form homogeneous regions based on the previously identified similarity measures (e.g.Burn, 1989;Ouarda and Shu, 2009;Ali et al., 2012;Wazneh et al., 2015).Moreover, many tests have been introduced and compared throughout the literature to decide whether a given delineated region can be considered as homogenous (e.g.Dalrymple, 1960;Wiltshire, 1986;Scholz and Stephens, 1987;Chowdhury et al., 1991;Fill and Stedinger, 1995;Viglione et al., 2007).The homogeneity test proposed by Hosking and Wallis (1993) is usually utilised.In this test the statistic H is related to the variability of the at-site L-variation coefficient (L-CV) over a region (e.g.Alila, 1999;Burn and Goel, 2000;Castellarin et al., 2001;Shu and Burn, 2004;Smith et al., 2015).
In practice, apart from determining if a region can be considered as homogeneous by making a yes/no binary decision (e.g.Warner, 2008) generally based on a significance test, the quantification of the degree of heterogeneity is also necessary.
Heterogeneity measures are required for such a task.Two approaches can be considered in this regard: (i) the use of heterogeneity measures for determining the effect of the departure from the homogeneous region assumption on the quantile estimate; and (ii) the use of heterogeneity measures for ranking regions according to their degree of heterogeneity.Regarding the former, quantifying the degree of heterogeneity provides a notion of the inaccuracy incurred through the estimation of quantiles by a regional method, for which homogeneous regions are assumed but a 'non-perfect' homogeneous region is used.This approach has already been studied, being closely related to the homogeneity test notion (e.g.Hosking and Wallis, 1997;Wright et al., 2014), which is further explained below.
The second approach corresponds to the focus of the present paper.Through this second approach, different regional delineation methods can be properly compared to identify the best one.This will be the method delineating the 'most homogeneous region'.Also, heterogeneity measures can be helpful in ranking potential homogeneous regions formed by removing discordant sites.By analogy with distribution selection (e.g.Laio et al., 2009), the concept of heterogeneity measure considered here plays the role of a 'model selection criterion', such as the Akaike Information Criterion (Akaike, 1973); whereas the homogeneity test plays the role of a 'goodness-of-fit test'.The former ranks delineated regions by providing unambiguous results to identify the best one in terms of heterogeneity; whereas the latter indicates if the given region can be considered as homogeneous or not.
In relation to the use of heterogeneity measures as a proxy for quantile error (approach (i)), the test statistic H is indeed considered by Hosking and Wallis (1993) as a heterogeneity measure for which given thresholds are established.These Hydrol. Earth Syst. Sci. Discuss., doi:10.5194/hess-2016-136, 2016 Manuscript under review for journal Hydrol.Earth Syst.Sci.Published: 5 April 2016 c Author(s) 2016.CC-BY 3.0 License.thresholds are obtained as a trade-off between quantile error due to regional heterogeneity and gain obtained by considering the whole regional information instead of that of a sub-region or at-site data.Therefore, instead of providing a binary decision based on a given significance level (α), e.g.reject the region as homogenous when H > 1.64 for α = 5%; as a more general guideline the region is considered as 'acceptably homogeneous' if Wright et al. (2015) compared the performance of five statistics in this regard: the three L-moment-based statistics of Hosking and Wallis (1993) and two non-parametric statistics, the Anderson-Darling and the Durbin-Knott test statistic.
A number of studies have proposed and compared methods in which different combinations of similarity measures and/or statistical tools are considered for delineating regions (references below).As a consequence of the non-availability of a welljustified heterogeneity measure for comparison purposes (approach (ii)), studies usually consider measures based either on H or on errors from the quantile estimate step.Shu and Burn (2004) utilised the percentage of (initially) homogeneous regions and the mean of H over regions obtained by each considered method for distinguishing the best one.Farsadnia et al. (2014) identified the best grouping method among those analysed as that leading to the lowest number of 'possibly homogeneous' and 'heterogeneous' regions according to H. Ilorme and Griffis (2013) used an H weighted average regarding the data length of each region to compare regions obtained by removing discordant sites based on different criteria.However, H is not well-defined for ranking regions according to their heterogeneity degree, as it possesses several drawbacks.First, it is originally built as a significance test.Thus, its value depends on specific assumptions that may not be fulfilled in practice, such as assuming a regional kappa distribution that even though flexible may not characterise the data.
Second, it may entail negative values for particular situations, which may distort results making difficult the suitable ranking of regions.Third, it is affected by the number of sites in the region, tending to obtain small heterogeneity values for small regions even if they are not homogeneous (Hosking and Wallis, 1997, page 66-67).This tends to complicate comparison among regions with different sizes.
Instead of using measures based on H, other studies quantified the performance of different delineating methods by comparing quantile errors (e.g.Castellarin et al., 2001;Ouali et al., 2015).However, the latter approach implies performing the last step of a regional analysis (i.e.quantile estimation) when dealing with an initial step (i.e.region delineation); which involves additional calculations, uncertainty due to the assumption of a given parent distribution for the data and a non direct assessment of the delineation method.A different approach was recently proposed by Viglione (2010) and Das and Cunnane (2011) regarding the use of the confidence intervals for L-CV to assess heterogeneity, for which details are given in Sect.3. Therefore, a general framework is needed to allow defining and assessing desirable properties of a heterogeneity measure in the regional hydrological context in order to properly identify a suitable measure.Such a measure should overcome the aforementioned drawbacks: it should be free of assumptions, positive, not affected by region size and focused on the delineation step.Also, it should allow ranking the heterogeneity degree of several regions to identify 'the most homogeneous region' or to assess the effect of some sites on the 'heterogeneity degree' of the region.In the present paper, such a framework is proposed under an evaluation of the heterogeneity measures based on Monte Carlo simulations.Several Hydrol.Earth Syst.Sci. Discuss., doi:10.5194/hess-2016-136, 2016 Manuscript under review for journal Hydrol.Earth Syst.Sci.Published: 5 April 2016 c Author(s) 2016.CC-BY 3.0 License.measures extracted from literature in hydrology and other fields are presented and/or adapted to be assessed as well-justified heterogeneity measures.The present paper is organised as follows.The procedure for the assessment of a heterogeneity measure is presented in Sect. 2. The heterogeneity measures considered to be checked by the proposed procedure are introduced in Sect.3. Results of the assessment are illustrated in Sect. 4. Discussion of results is presented in Sect. 5 and conclusions are summarised in Sect.6.

Assessment of a heterogeneity measure
A simulation-based procedure consisting of four steps is proposed to study the behaviour of a given heterogeneity measure (generically denoted Z) regarding its desirable properties in the regional hydrological context.The steps of the procedure are: (i) sensitivity analyses of varying factors involved in the definition of a region; (ii) success rate in identifying the most heterogeneous region; (iii) evolution of the variability for the Z average with respect to the degree of regional heterogeneity; and (iv) effect of discordant sites.The first step is applied to all the considered heterogeneity measures (presented in Sect.3), while the remaining steps are applied to those not entailing unacceptable results from the first step.Some elements of the procedure are inspired and adapted from studies where different aims were sought (e.g.Hosking and Wallis, 1997;Viglione et al., 2007;Chebana and Ouarda, 2007;Castellarin et al., 2008;Wright et al., 2015).
Before further describing the aforementioned steps and desirable properties, elements of the framework needed for performing the assessment procedure are presented.The procedure is based on synthetic regions generated through Monte Carlo simulations from a representative parent distribution commonly used in frequency analysis, the Generalised Extreme Value (GEV) distribution.A region is defined by its number of sites (N), at-site data length (n), regional average L-CV (  ), regional average L-skewness coefficient ( 3  ) and a unit regional sample mean.The heterogeneity of a given region may be due to differences in any feature of the at-site frequency distribution among sites.In particular the L-CV, which is a dimensionless measure of the dispersion of the distribution that is also related to the slope of the associated flood frequency curve, has been considered as representative of such differences (e.g.Stedinger and Lu, 1995;Viglione, 2010).In the present study, heterogeneous regions are simulated using the heterogeneity rate , defined as  = (max  (  ) − min  (  ))/  (e.g.Hosking and Wallis, 1997;Das and Cunnane, 2012) ., where   is the L-CV at site i with i = 1, …, N. Since in practice large values of the L-skewness coefficient ( 3 ) are related to large values of the L-CV  (Hosking and Wallis, 1997, page 68), the same heterogeneity rate of  is considered for  3 .A region is defined as homogeneous for  = 0%, implying that   and  3  are the same for all the sites in the region (i.e.  =   and  3  =  3  ).The heterogeneity of a given region increases as  increases from 0% to 100%.This implies that  i and  3  vary linearly.We then have for the first site  1 =   −    2 ⁄ and for the last site   =   +    2 ⁄ .The same can be written for  3  .
Finally, a given region consists of at-site data generated from a GEV distribution with parameters obtained through at-site Lmoments.At-site data are standardised by their sample mean to frame them in the regional context (e.g.Bocchiola et al., 2003;Requena et al., 2016).Note that heterogeneity measures directly based on L-moments lead to the same results for standardised or non-standardised data.A region with N = 15, n = 30,   = 0.2 and  3  = 0.2 is considered as a reference for the simulation study.Hereafter, the value of  3 is (usually) omitted, as  3 is considered to have the same value as .The number of simulations N S of a given region is taken to be equal to 500, which is considered large enough to obtain robust results.
These fixed values of the factors, as well as their varying values used below, are selected according to the literature and with the aim of providing a general view of the behaviour of the measures without excessively complicating the simulation study.

Sensitivity analyses
The first step of the assessment of a heterogeneity measure Z is the analysis of the effect of varying factors involved in the definition of a region.This step is performed through sensitivity analysis to identify if the behaviour of Z is acceptable in relation to what is ideally expected from a heterogeneity measure.

Effect of the heterogeneity rate:
The degree of heterogeneity of a region is the aimed value to be quantified by Z.A surrogate of such a degree of regional heterogeneity is the heterogeneity rate , which is used to initially define the heterogeneity of the simulated region to be evaluated by Z. Hence, Z should increase with .This analysis is performed by obtaining Z for  = 0%, 10% ,…, 90%, 100% over N S = 500, keeping the remaining values of the reference region (i.e.N = 15; n = 30;   = 0.2).

Effect of the number of sites:
The size of a region, represented by the number of sites N, is a relevant factor to the degree of its heterogeneity.A large N is required to properly estimate quantiles associated with high return periods, as more data are available; yet homogeneous regions are more difficult to obtain for large N due to more potential dissimilarities between sites (Ouarda et al., 2001;Chebana and Ouarda, 2007).Nevertheless, by definition Z should not be affected by N, as it should provide the same results for regions with a different size but the same degree of heterogeneity.Therefore, the smaller is the influence of N on Z the better Z is.This analysis is performed by obtaining Z for N = 5, 10, 15, 20, 25, 30, 40, 50, 60, 70 over N S = 500, keeping the remaining values of the reference region (i.e.n = 30;   = 0.2).Two different values of the heterogeneity rate ( = 0% and 50%) are also considered to identify if the behaviour of Z changes depending on the degree of heterogeneity.
Effect of the regional average L-moment ratios: Z should ideally provide similar results for regions entailing the same degree of heterogeneity, regardless of the values of   and  3  , in order to provide an appropriate comparison and ranking of the regions.For instance, two regions with sites generated from a different   value but considering the same value  = 0% should entail similar Z values, as both are 'perfectly' homogeneous.However, such an output may not be easy to obtain due to the fact that   is associated with a measure of dispersion.Thus, the smaller the influence of   and  3  on Z the better Z will be.This analysis is performed by comparing the results of   = 0.2, which is related to the reference region, with those obtained by   = 0.4.It is done by varying the heterogeneity rate  and by varying the number of sites N. Recall that  3  is considered to have the same value as   .

Effect of the record length:
The amount of available at-site information, represented by the data length n, is associated with the accuracy of the value of Z.The longer n is the better will be the information provided by each site to determine the regional degree of heterogeneity.Therefore, the analysis of the effect of n should be focused on identifying the minimum n required to obtain reliable values of Z.This analysis is performed by obtaining Z for n = 10, 20,…, 90, 100 over N S = 500, keeping the remaining values of the reference region (i.e.N = 15;   = 0.2).Two different values of the heterogeneity rate ( = 0% and 50%) are also considered to identify if the behaviour of Z changes depending on the degree of heterogeneity.

Success rate
The second step in the assessment of Z is the evaluation of its success rate (SR) for identifying the most heterogeneous region.Note that the SR notion is commonly used in a number of fields such as biology (e.g.Canaves et al., 2004).Without loss of generality, such an evaluation is performed on two regions A and B. For  A <  B , SR is defined as the ratio of the number of samples simulated from a given region A and a given region B, for which Z correctly identifies  B as the most heterogeneous region, to the total number of simulated samples.Thus, the larger SR is the better Z will be.The aim is to verify the ability of Z to compare regions with different degrees of heterogeneity, when entailing or not different characteristics (i.e.,  A ≠  B or  A =  B , and A large set of 48 cases is considered to obtain a wide view of the behaviour of Z, as combination of the following factor values:  A = 0%, 30%, 50%, 70% with  B =  A +10%, the remaining values of the reference region (i.e.n = 30).

Evolution of the variability for the Z average with respect to the degree of regional heterogeneity
The third step of the assessment of Z is the analysis of the evolution of the variability of the average value of Z as a function of the degree of regional heterogeneity.The aim is to determine the capability of Z to accurately rank regions according to their degree of heterogeneity when it is summarised as an average of the Z values obtained for several (sub)regions that are obtained by a given delineation method.This provides an assessment of its capability to compare results from several delineation methods.This is a twofold analysis.Firstly, a monotonic relation should exist between the average Z and the degree of heterogeneity, as explained in Sect.2.1.Secondly, the variability of the average Z along such a monotonic relation should be small enough to not affect a proper ranking of the regions.
We consider two regions A and B, without loss of generality.The idea is that (sub)regions delineated by a given method

Effect of discordant sites
The fourth step of the assessment of Z is the analysis of the effect of discordant sites in a region.The aim is to check the capability of Z to show a progressive variation of its value as a consequence of a progressive change in the degree of regional heterogeneity, induced here by replacing given 'homogeneous' sites by given 'discordant' sites in a region.Both the effect of the 'nature' of the discordant sites, characterised by the L-CV   and L-skewness coefficient  3  of their parent distribution, and the effect of the number of such discordant sites (k) are considered.
The procedure is described below.Note that the values of the factors used in this section are selected to facilitate the graphical representation.Thus, a homogeneous region (i.e. = 0%) with N = 20,   = 0.25 and n = 30 is considered as the initial region.Then, k of its sites (with k = 1,…, N/2) are replaced by k discordant sites belonging to a parent distribution characterised by   , with  d = 0% within the group of discordant sites.The analysis is performed for   = 0.1, 0.2, 0.25, 0.3, 0.4.Remark that   = 0.25 is considered for the homogeneous region so that the discordant sites are not 'discordant' at the midpoint of the range used for   (i.e. at   =   = 0.25).The procedure is repeated for N S = 500 simulations of the initial homogeneous region, estimating a mean value of Z over Ns for each (  , k) pair.For the region formed by 'homogenous' and 'discordant' sites, named as mixed region, Z is expected to be larger for larger k values.Indeed, a larger number of discordant sites in the region should increase the degree of regional heterogeneity.Also, Z is expected to be larger as the difference between   and   gets larger, since the addition of sites with a 'larger discordance' should increase the degree of regional heterogeneity.On the other hand, for the sub-region formed by the sites belonging to the initial homogeneous region, Z is expected to keep the same values regardless of the value of k, which in this case is the number of initial sites removed.The degree of regional heterogeneity should be relatively constant in this case, since all the sites belong to the same initial homogeneous region.Note that a mixed region can be seen as a sort of bimodal region used in other studies (e.g.Chebana and Ouarda, 2007).

Heterogeneity measures
The aim of this section is to present and develop heterogeneity measures based on different approaches to be assessed by the procedure proposed in Sect. 2. Heterogeneity measures are selected as a result of a general and comprehensive literature review in a number of fields including hydrology.We can distinguish three types of measures: (a) known in RHFA; (b) derived from recent approaches in RHFA; and (c) used in other fields and adapted here to the regional hydrological context.
Therefore, a total of eight measures are considered.

Measures known in RHFA
The first group consists of the well-known statistics H, V, H 2 and V 2 (Hosking, 2015), as well as the k-sample Anderson-Darling (AD) statistic (Scholz and Stephens, 1987;Scholz and Zhu, 2015).
Even though H is not properly defined as a heterogeneity measure for ranking the degree of heterogeneity of several regions (see Sect. 1), it is considered in this study because it is commonly adopted in regional analysis.As the aim of this study is to provide a general heterogeneity measure, its associated distribution-free statistic V is also considered.Specifically, V is a statistic of the dispersion of the sample L-CV t in a region, expressed as: with where   is the sample L-CV at site i and   is its associated regional average.H is a measure of the variability of t in the region compared with that expected for simulated homogeneous regions.It is built by normalising V by its mean   and standard deviation   : where   and   are obtained from N H = 500 simulated homogeneous regions with the same n and N as the given region, generated from a kappa distribution fitted to the regional average L-moment ratios.
The extensions of V and H by considering not only t but also the sample L-skewness coefficient  3 , traditionally known as  2 and  2 , are also included in this study.Their inclusion is motivated by recent results regarding the usefulness of H 2 for testing homogeneity when considering different thresholds from those of H (Wright et al., 2014): where  3  is the sample L-skewness coefficient at site i and  3  is its associated regional average. 3  is defined analogous to   in Eq. ( 2).In order to avoid results conditioned on the given value of   and  3  ,  and  2 are standardised here by their regional values, defining  ′ and  2 ′ respectively as: The AD statistic, which is a rank-based statistic based on comparing the at-site empirical distributions with the pooled empirical distribution of the data, is also included in this first group: where  = ∑    =1 and   is the number of observations in the i th sample not greater than   , where  1 < ⋯ <   is the pooled ordered sample of the data, which in the regional context entails considering the data of each site first divided by its corresponding mean and then ordered.The AD statistic has already been considered in several studies.Viglione et al. (2007) assessed its behaviour as a homogeneity test statistic, recommending its use when  3  > 0.23.Wright et al. (2015) evaluated its performance as a heterogeneity measure regarding its ability to be a surrogate of the quantile error, yet obtaining a weak performance partially attributed to a possible influence of the procedure used for estimating errors.

Measures derived from recent approaches in RHFA
The second group is represented by a measure derived from a relatively novel approach in which the confidence interval for the at-site L-CV   (with i: 1,…N) is estimated and compared with   .The focus is to evaluate how often the latter is included in such confidence intervals in order to assess if differences between   and   can be attributed to sample variability or to regional heterogeneity.
Viglione (2010) proposed a procedure for obtaining the confidence interval for L-CV without considering a given parent distribution of the data, applying it to a didactic illustration for comparing several regional approaches.The procedure is summarised below: the variance of the sample L-CV t, var(t), is estimated according to Elamir and Seheult (2004) which is implemented in Viglione (2014); simple empirical corrections are applied on t and var(t) based on the values of  3 and n; and the confidence interval for t is then obtained from a log-Student's distribution considering corrected values of t and var(t).For instance, for a 90% confidence interval, a region is considered as heterogeneous if 100 -(P 05 + P 95 ) ≪ 90%, where P 05 (P 95 ) is the percentage of sites for which   is below (above) the confidence interval for   .The larger (P 05 + P 95 ) is, the larger the regional heterogeneity will be.Das and Cunnane (2011) obtained such a confidence interval based on simulations from a GEV distribution, with the aim of evaluating if a usual method to select catchment descriptors for delineating regions in Ireland provided homogeneous regions.The number of sites for which   is outside the   confidence intervals is considered as a measure of heterogeneity, also expressed as a percentage of sites.
In the present study, the heterogeneity measure considered regarding this approach is named as P CI and defined as the total percentage of sites in the region for which   is outside the 90% confidence interval for   .As the parent distribution of the data is unknown in practice, such a confidence interval is estimated following the aforementioned distribution-free approach.

Measures used in other fields and adapted here to the regional hydrological context
The last group consists of the Gini index (GI) (Gini, 1912;Ceriani and Verme, 2012), which is a measure of inequality of incomes in a population commonly used in economics; and of a measure based on the entropy-based Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951), which estimates the distance between two probability distributions and is used for different purposes in a number of fields including hydrology (e.g.Weijs et al., 2010).
The definition of the GI is usually given according to the Lorenz curve (Gastwirth, 1972), but it can be expressed in other ways.Specifically, the sample GI: corrected for short sample sizes can be defined as (Glasser, 1962;Zeileis, 2014): where  : are the sample order statistics and  is their mean.Theoretically, GI ranges from zero to one.The former is obtained when all the   values are equal, and the latter is given when all but one value equals zero (in an infinite population).Note that GI is connected with the L-moments as both are based on sample order statistics.Indeed,  = /2 (for  > 0), where GMD is the Gini's mean difference statistic (Yitzhaki and Schechtman, 2012); and  = 2 2 , where l 2 is the second sample L-moment (Hosking and Wallis, 1997).Hence, GI corresponds to the sample L-CV t (Hosking, 1990), which implies that if GI is applied on the observations at site i, the result is   .In order to adapt GI to the regional hydrological context, in this study GI is applied on   over sites, providing a measure of the inequality, or variability, of   in the region.Therefore, the measure considered in this study is (  ,  = 1, … , ): where  : are the sample order statistics,  ̅ is their mean, and the number of sites N corresponds to the data length of .Note that (  ,  = 1, … , ) is equivalent to (  ,  = 1, … , ).Also, note that this is somehow analogous to the approach based on moments used in early studies (e.g.Stedinger and Lu, 1995), where the coefficient of variation ( =   ⁄ ) of the coefficient of variation of the data (i.e.(  ,  = 1, … , )) was used for building simulated regions; defining homogeneous regions for (  ) = 0 and extremely heterogeneous regions for (  ) ≥ 0.4.
The KL divergence (so-called relative entropy) of the probability distribution P with respect to Q is defined as: where p and q are the density functions.The expression related to the discrete case is the following (e.g.Hausser and Strimmer, 2009) for which nonparametric versions of the probabilities P and Q may be considered, such as a kernel density function, in order to avoid subjectivity in selecting a given parametric probability distribution.  can then be defined as the KL divergence of the probability distribution at site i with respect to the probability distribution at site j, where   ≠   .The dissimilarity matrix of the region is obtained by computing the KL divergence between sites as: The degree of regional heterogeneity is then evaluated by ‖  ‖, which in this study is considered as the absolute column sum normalized norm:

Results
Simulation results obtained by the application of the proposed assessment procedure (Sect.2) to the considered heterogeneity measures (Sect.3) are presented in this section.Note that a summary of the results obtained from each step is presented in Table 1.

Sensitivity analyses
Results of the effect of varying factors defining a region (Sect.2.1) are presented through boxplots and mean values of the heterogeneity measure over Ns = 500 simulations of the corresponding region, in order to show complete information.Results for  R = 0.2 refer to those related to the reference region.Figure 1 shows that all the considered measures show a good behaviour regarding the heterogeneity rate , as their values increase with .This dependence is less pronounced for  2 and  2 ′ , which are the measures that depend on both t and  3 ; and for AD and ‖  ‖, which are based on the whole information.
The effect of N on the considered measures is shown for  R = 0.2 when  = 0% (i.e.'perfect' homogeneous regions) and  = 50% in Figs. 2 and 3, respectively.In both cases, it is found that  ′ ,  2 ′ , P CI and GI are not affected by N, although they show some departure from their constant Z mean value and a larger variability (i.e.larger box) when N ≤ 10.In this regard, Das and Cunnane (2012) also found an effect for N < 10 on quantile error measures (considering n = 35).In general this effect is less marked for GI when  = 0% (Fig. 2c,d) and for  ′ and  2 ′ when  = 50% (Fig. 3a,b).
It is also found that results for H, and to a lower degree for  2 , change depending on the value of .Both measures behave correctly for  = 0% (Fig. 2a,b); yet they depend on N for  = 50% (Fig. 3a,b).This is likely due to the nature of H and  2 as homogeneity test statistics.Note that this undesirable effect increases as  increases (e.g.Fig. 4).‖  ‖ does not behave correctly neither for  = 0% nor  = 50%, as it depends on N. The same holds for AD, for which such dependence is higher.
The influence of varying regional average L-moments is shown by comparing the Z mean values for  R = 0.4 with those previously obtained for  R = 0.2.Z mean values varying  are displayed in Fig. 1b,d.In this regard,  2 ′ and AD are shown not to be suitable, as results for  R = 0.2 and  R = 0.4 varying  are far from each other.H and  2 work worse for higher degrees of regional heterogeneity than for smaller ones; whereas  ′ , GI and ‖  ‖ show the opposite behaviour with an overall better performance of  ′ and GI.P CI presents a favorable similar behaviour for both small and high .Results for Z mean values varying N are displayed for  = 0% in Fig. 2b,d; and for  = 50% in Fig. 3b,d.In both cases  2 ′ and AD present a similar bad behaviour to the one shown in Fig. 1b,d.A suitable behaviour is found for  ′ , P CI and GI for  = 50%, while a worse behaviour is found for H,  2 and ‖  ‖ (Fig. 3b,d).Such a behaviour of H,  2 and ‖  ‖ is also shown for  = 0% (Fig. 2b,d), for which the remaining measures also present similar bad results.In this regard, it is important to remark that no 'perfect' homogeneous regions exist in reality (Stedinger and Lu, 1995).And that according to the practical threshold H < 2, commonly used for considering a region as homogeneous enough to perform a regional analysis, even regions with  = 50% may be taken as homogenous in practice (see values of H for  in Fig. 1a).Hence, for the purpose of the assessment of the regional heterogeneity degree, the behaviour of the measures for  = 50% is more relevant than for  = 0%.
Finally, the effect of varying the record length n for  = 0% and  = 50% is shown in Fig. 5. Recall that it is expected that increasing n affects Z, as more information of the at-site distributions is available in such a case.In this regard, it is found that the measures H,  2 , AD and P CI are not (or slightly) affected by n when  = 0%, but they highly increase their values as n increases when  = 50%.Whereas  ′ ,  2 ′ , GI and ‖  ‖ are affected by n when  = 0%; becoming less affected when  = 50%, by decreasing less their values as n increases.As a result,  ′ and GI are the only measures that become relatively stable for a given data length.Such a data length is around n = 30, which is a value usually considered in practice (e.g.Hosking  and Wallis, 1997, page 134;Chebana and Ouarda, 2009).It can be mentioned that for a very small data length (n = 10), the approximation used in P CI for estimating var(t) was not reliable.Nevertheless, this issue is not relevant since such a data length is too short to be considered in practice, and such values do not affect the overall interpretation of the results.
As a result of the aforementioned sensitivity analyses (see Table 1 for a summary),  ′ , P CI and GI are considered as potentially suitable heterogeneity measures.Thus, the following steps of the assessment procedure are only applied to these measures.Results of H are also included for comparison purposes.

Success rate
The ability of the measures to identify the most heterogeneous region between two regions A and B is shown via the success rate SR (Sect.2.2).A summary of the results obtained for  A =  B and  A ≠  B (with  A <  B ), when considering several values of N and  for each region is displayed in Table 2 to facilitate their interpretation.Note that each combination  A vs.
 B corresponds to a total of 48 cases obtained by varying N and .Results for a small difference between   values, characterised by  A = 0.2 ≠  B = 0.3 and vice versa; and for a large difference, characterised by  A = 0.1 ≠  B = 0.4 and vice versa, are displayed as representative of the behaviour of the measures.Note that the summarised information reflects the main conclusions extracted from the partial results.
The SR average is shown as a notion of the overall behaviour of the measures.Recall that the larger SR is, the better Z will be.When  A =  B the SR average of H,  ′ and GI are comparable, with  ′ and GI leading to the largest values; while P CI leads to the lowest ones.When  A <  B the largest SR average is obtained for  ′ and is very closely followed by GI.Yet, in this case H presents a worse behaviour, which is similar to that of P CI .When  A >  B the situation changes, with H leading to the largest values.Yet, the difference between the values obtained by  ′ (or GI) and H is less marked than when Note that the larger the difference between  A and  B is, the larger the difference between the SR average of H and  ′ (or GI) is; whereas the value of P CI remains almost constant.Therefore, although P CI does not obtain the greatest values in any case, it outperforms H or GI (and  ′ ) when  A ≪  B or when  A ≫  B , respectively, i.e. for high differences between  A and  B .
The best results for the total SR average are obtained by GI, followed by  ′ .
The SR minimum and SR maximum are displayed as a notion of the variability of the SR results (Table 2).Results related to the SR minimum are analogous to those obtained by the SR average; giving H an overall worse behaviour.This highlights the low ability of H to identify the most heterogeneous region in certain circumstances.Note that the overall behaviour of H regarding SR is partially due to existing trends regarding N and   .H obtains larger heterogeneity values as N increases and as   decreases (as shown in Fig. 3b), entailing an 'unfounded' better behaviour when  A >  B and N A < N B , and vice versa.
Also note that all measures have difficulties obtaining a large SR minimum when  A >  B .This includes H also, even though it obtained a good SR average in such a situation.This arises from the fact that, in such a case, the region with the lowest degree of heterogeneity (region A) is associated with a larger   entailing a larger sample variability, and complicating its identification as the less heterogeneous region.SR maximum values show that even though the maximum difference between  A and  B considered in the analysis is 30%, all measures obtain (in certain circumstances) a SR equal or close to 100%.In summary, GI obtains the best results for the SR analysis followed by  ′ .

Evolution of the variability of the Z average with respect to the degree of regional heterogeneity
The variability of the heterogeneity measures as a function of the degree of regional heterogeneity, represented by , is shown in Fig. 6.The boxplot of the 22 representative (mean over Ns = 500) values of Z obtained from cases in which a given region A and a given region B with the same  but different characteristics are considered is shown for varying values of  in the x-axis (see Sect. 2.3).As expected from the results of Fig. 1, heterogeneity measures in Fig. 6 increase with , showing a monotonic positive dependence.Regarding their variability along such a monotonic relation, H presents a different behaviour from the rest of the measures.It shows a strong increasing variability as  increases.Then, in this case, H overlaps its interquantile ranges from  = 70% to 100%.This behaviour may imply an unappropriated ranking of the regions with these high values of the heterogeneity rate .Indeed, overlapped values cannot be considered significantly different, whereas they correspond to two different  values.Such behaviour is not seen for the other considered measures.In this regard, an overall favorable larger distance between interquantile ranges is found for  ′ , followed by P CI and then GI.However, the four considered measures present an overlapping for  = 0% and 10%.This may imply an unappropriated ranking of the regions related to these very small values of , yet those regions are less common in practice.In summary,  ′ obtains the best performance for the variability evolution analysis.It presents a small variability for a given  value; and it almost presents no overlapping between interquantile ranges for varying .

Effect of discordant sites
The effect of discordant sites (Sect.2.4) is shown in Fig. 7.The mean values of the heterogeneity measures over Ns = 500 are obtained when replacing k sites (with k = 1,…, 10) in an initially homogeneous region (with N = 20) by k discordant sites belonging to a given parent distribution defined by   .For the mixed region formed by sites from both   and   , the overall results confirm that the considered measures involve larger values of Z for larger k values, as a result of replacing a larger number of discordant sites in the region; and larger values of Z as the difference between   and   increases, as a result of replacing sites with a larger discordance (Fig. 7a).
However, when   >   (Fig. 7b) the measures face some difficulties in ranking the degree of heterogeneity for high values of k.This is due to the larger sample variability entailed by the discordant sites in such a case, which makes the whole mixed region seem less heterogeneous.Note that this is also the reason of the lack of asymmetry of the results regarding the vertical line at the midpoint of the x-axis (i.e.  = 0.25 =   ).Nevertheless, not all measures are equally affected by this issue.GI obtains the best results, as for instance it is able to differentiate the degree of heterogeneity for k ≤ 8 when   = 0.35 and 0.4.
It is followed by P CI , which behaves properly for k ≤ 8 when   = 0.35 and for k ≤ 7 when   = 0.4; and by  ′ , which obtains of the initial homogeneous region (Fig. 7a) support the results in Fig. 2, as H and GI are practically not affected by the number of sites of the homogeneous region, while  ′ and P CI present a slight decrease in their heterogeneity values as the number of sites (N -k) decreases.In summary, GI presents the best results for the analysis of discordant sites.

Discussion
Overall, GI can be considered as the best heterogeneity measure among all the evaluated measures, closely followed by  ′ (see a summary in Table 1).However, as expected, none of the measures are perfect, due to their inability to perfectly fulfill all the desirable properties in practice.GI presents the advantage of being computed as a measure of the standardised mean distance between pairs of   values.Hence, it does not depend on any assumptions concerning parameters or parent distributions. ′ is similar but it specifically depends on the estimate of the regional average   , as it compares it to each   value.Thus, due to the similar but slightly better results obtained by GI and its widely accepted use in other fields, the use of GI would be preferable in practice.
H is by nature the statistic of a homogeneity test.Hence, it is defined to identify whether a given region can be considered as homogeneous or not, not to compare the heterogeneity degree of several regions.Note that this is also valid for other test statistics (e.g.AD).As a consequence of the intrinsic disadvantages of H (see Sect. 1) and the obtained results, the use of H as a heterogeneity measure for ranking regions is not recommended.The unsatisfactory results obtained for  2 ′ and  2 could be related to the way in which t and  3 are combined (see Sect. 3), which may not be appropriate for assessing the degree of regional heterogeneity.The unsuitable results associated with ‖  ‖ could be related to considering the whole information of the data, which may mask the effect of factors favouring heterogeneity.It should be noted that other norms aside from the one in Eq. ( 15) were considered, but they did not lead to better performances.Further research should focus on the development of a better adaptation of the entropy-based measures to estimate the degree of regional heterogeneity.
The P CI measure is obtained without assuming a given parent distribution of the data; although it considers a log-Student distribution for estimating the L-CV confidence interval.Also, even though it depends partially on the selected confidence level, mean P CI values over Ns = 500 for different confidence levels (90% and 95%) were found to be highly correlated (not shown).This fact removes subjectivity from the use of P CI as a heterogeneity measure, as for such a purpose only the ranking of values is needed.It is also important to highlight the stable performance of P CI regardless of the value of   .This makes P CI outperform GI and  ′ for identifying the most heterogeneous region when such a region has a much lower   than others to be compared with (see Table 2).As a consequence, P CI and GI could be used together in practice as two different and complementary criteria.This is common in other applications; for instance several criteria are commonly applied when ranking candidate distributions (e.g. the Akaike information criterion and the Bayesian information criterion).It is important to mention that the use of P CI as a homogeneity test in practice may lead to the false rejection of homogeneous regions.Indeed, even when a region is 'perfectly' homogeneous ( = 0%) the mean value of P CI may indicate slight heterogeneity (e.g. it is slightly larger than 10% in Fig. 1).
As indicated in Sect. 1, the heterogeneity measures selected in this study may be used for the assessment of the degree of heterogeneity of regions obtained through the use of different delineation methods.When a region is divided into several sub-regions by a given delineation method, the GI (or P CI ) value can be evaluated at each sub-region.Then, the average value can be used to compare several delineation methods applied on the given region.The best delineation method will be the one with the lowest GI (or P CI ) value for the region of study.It is important to note that a heterogeneity measure should not be used as a decision variable for the delineation of regions, as it would imply using redundant information at different steps of the regional analysis.The heterogeneity measure can also be used for evaluating the heterogeneity of a given region when particular sites are removed, with the aim of helping in the identification of homogeneous regions.For instance, if a region is found as heterogeneous by using a given test and by entailing a number of discordant sites, the heterogeneity measure can help in the identification of the 'most homogeneous region' as a result of removing different combinations of sites.However, it is important to highlight that physical reasoning has to be provided for removing a given (discordant) site.
Thus the heterogeneity measure serves only as a facilitator for the identification of the site(s) to be further analysed (e.g.Viglione, 2010;Ilorme and Griffis, 2013).

Conclusions
Delineation of homogeneous regions is required for the application of regional frequency analysis methods such as the index flood procedure.The availability of an estimate of the degree of heterogeneity of these delineated regions is necessary in order to compare the performances of different delineation methods or to evaluate the impact of including particular sites.
Due to the unavailability of a well-justified and generally recognised measure for performing such comparisons, a number of studies have relied on measures that are not well-defined or approaches that involve additional steps during the delineation stage of regional frequency analysis.
In the present paper, a general framework is presented for assessing the performance of potential heterogeneity measures in the field of regional hydrological frequency analysis (RHFA), according to a number of desirable properties.The proposed four-step assessment procedure consists of: sensitivity analysis by varying the factors of a region; evaluation of the success rate for identification of the most heterogeneous region; estimation of the evolution of the variability for the heterogeneity measure average with respect to the degree of regional heterogeneity; and study of the effect of discordant sites.The procedure is applied on a set of measures including commonly used ones, measures that are derived from recent approaches, and measures that are adapted from other fields to the regional hydrological context.The assumption-free Gini Index (GI) frequently considered in economics and applied here on the L-variation coefficient (L-CV) of the regional sites obtained the best results.A lower performance was obtained for the measure of the percentage of sites (P CI ) for which the regional L-CV is outside the confidence interval for the at-site L-CV.However, this measure was considered relevant because of its stable Hydrol.EarthSyst.Sci.Discuss., doi:10.5194/hess-2016-136,2016   Manuscript under review for journal Hydrol.Earth Syst.Sci.Published: 5 April 2016 c Author(s) 2016.CC-BY 3.0 License.
should theoretically entail different   values ( A ≠  B ), having similar or different values of other characteristics (i.e.N A ≠ N B or N A = N B ).In order to be able to evaluate the behaviour of the Z average, the same degree of heterogeneity is considered for both regions ( A =  B = ), as under this assumption Z values should be similar.The procedure is the following: N S = 500 simulated regions A and B with  A =  B =  and given values N A ,  A and N B ,  B are generated, obtaining for each simulation the average of Z over the two regions.These averages are aggregated into their mean value over Ns as representative value.Hydrol.Earth Syst.Sci.Discuss., doi:10.5194/hess-2016-136,2016 Manuscript under review for journal Hydrol.Earth Syst.Sci.Published: 5 April 2016 c Author(s) 2016.CC-BY 3.0 License.The representative value is obtained for 22 cases as a result of combining: N A = 10, 25; N B = 10, 25; and  R = 0.1, 0.2, 0.3, 0.4 with  A ≠  B , keeping the remaining values of the reference region (i.e.n = 30).Then, the variability of the set of representative values of the Z average is analysed through a boxplot for the given .The aforementioned procedure is performed for each  = 0%, 10%,…, 90%, 100%, obtaining a boxplot for each  value.For a given , Z is better as the variability of the corresponding set of representative values is smaller, since similar values of Z should be expected due to  A =  B .Then, Z is better as the interquantile range is shorter, where the interquantile range is the box of the boxplot.For varying , Z is better as it does not imply overlapping of the interquantile ranges for different  values, which leads to a more appropriate ranking of the regions.
Hydrol.Earth Syst.Sci.Discuss., doi:10.5194/hess-2016-136,2016   Manuscript under review for journal Hydrol.Earth Syst.Sci.Published: 5 April 2016 c Author(s) 2016.CC-BY 3.0 License.behaviour regardless of the regional value of L-CV.The application of both measures is recommended in practice.The use of different criteria to determine the degree of regional heterogeneity can help adequately identify the sites to be further analysed for obtaining homogeneous regions.Further research efforts are necessary to develop robust and general heterogeneity measures in the field of RHFA.Table 2. Summary of the success rate (SR) minimum, average and maximum of the considered measures (H,  ′ , P CI and GI), expressed in percentage, when comparing the heterogeneity of two regions A and B. For a given   and   , such values are computed as the minimum, average and maximum of SR over 48 cases, respectively.For each case, SR is obtained as the mean over Ns = 500 simulations of two regions with n = 30 and given N A, N B,  A and  B .Values in bold indicate the measure obtaining the largest SR minimum, SR average and SR maximum, respectively.