In this study, we propose a data-driven approach for automatically identifying rainfall-runoff events in discharge time series. The core of the concept is to construct and apply discrete multivariate probability distributions to obtain, for each time step, a probabilistic prediction of whether it is part of an event. The approach permits any data to serve as predictors, and it is non-parametric in the sense that it can handle any kind of relation between the predictor(s) and the target. Each choice of a particular predictor data set is equivalent to formulating a model hypothesis. Among competing models, the best is found by comparing their predictive power in a training data set with user-classified events. For evaluation, we use measures from information theory such as Shannon entropy and conditional entropy to select the best predictors and models and, additionally, measure the risk of overfitting via cross entropy and Kullback–Leibler divergence. As all these measures are expressed in “bit”, we can combine them to identify models with the best tradeoff between predictive power and robustness given the available data.

We applied the method to data from the Dornbirner Ach catchment in Austria, distinguishing three different model types: models relying on discharge data, models using both discharge and precipitation data, and recursive models, i.e., models using their own predictions of a previous time step as an additional predictor. In the case study, the additional use of precipitation reduced predictive uncertainty only by a small amount, likely because the information provided by precipitation is already contained in the discharge data. More generally, we found that the robustness of a model quickly dropped with the increase in the number of predictors used (an effect well known as the curse of dimensionality) such that, in the end, the best model was a recursive one applying four predictors (three standard and one recursive): discharge from two distinct time steps, the relative magnitude of discharge compared with all discharge values in a surrounding 65 h time window, and event predictions from the previous time step. Applying the model reduced the uncertainty in event classification by 77.8 %, from an unconditional entropy of the target of 0.516 bits to a conditional entropy of 0.114 bits. To assess the quality of the proposed method, its results were binarized and validated through a holdout method and then compared to a physically based approach. The comparison showed similar behavior of both models (both with accuracy near 90 %), and the cross-validation reinforced the quality of the proposed model.

Given enough data to build data-driven models, their potential lies in the way they learn and exploit relations between data unconstrained by functional or parametric assumptions and choices. And, beyond that, the use of these models to reproduce a hydrologist's way of identifying rainfall-runoff events is just one of many potential applications.

Discharge time series are essential for various activities in hydrology and water resources management. In the words of Chow et al. (1988), “[…] the hydrograph is an integral expression of the physiographic and climatic characteristics that govern the relations between rainfall and runoff of a particular drainage basin.” Discharge time series are a fundamental component of hydrological learning and prediction, since they (i) are relatively easy to obtain, being available in high quality and from widespread and long-existing observation networks; (ii) carry robust and integral information about the catchment state; and (iii) are an important target quantity for hydrological prediction and decision-making.

Beyond their value in providing long-term averages aiding water balance considerations, the information they contain about limited periods of elevated discharge can be exploited for baseflow separation; water power planning; sizing of reservoirs and retention ponds; design of hydraulic structures such as bridges, dams or urban storm drainage systems; risk assessment of floods; and soil erosion. These periods, essentially characterized by rising (start), peak and recession (ending) points (Mei and Anagnostou, 2015), will hereafter simply be referred to as “events”. They can have many causes (rainfall, snowmelt, upstream reservoir operation, etc.) and equally as many characteristic durations, magnitudes and shapes. Interestingly, while for a trained hydrologist with a particular purpose in mind, it is usually straightforward to identify such events in a time series, it is hard to identify them automatically based on a set of rigid criteria. One reason for this is that the set of criteria for discerning events from non-events typically comprises both global and local aspects, i.e., some aspects relate to properties of the entire time series and some to properties in time windows. And to make things worse, the relative importance of these criteria can vary over time, and they strongly depend on user requirements, hydroclimate and catchment properties.

So why not stick to manual event detection? Its obvious drawbacks are that it is cumbersome, subject to handling errors and hard to reproduce, especially when working with long-term data. As a consequence, many methods for objective and automated event detection have been suggested. Baseflow separation, and consequently event identification (since the separation allows the start and end times of events to be identified), has a long history of development. Theoretical and empirical methods for determining baseflow have been discussed since 1893, as presented in Hoyt et al. (1936). One of the oldest techniques according to Chow et al. (1988) dates back to the early 1930s, with the normal depletion curve from Horton (1933). As stated by Hall (1968), fairly complete discussions of baseflow equations, mathematical derivations and applications were already present in the 1960s. In the last two decades, more recent techniques embracing a multitude of approaches (graphical, theoretical, mathematical, empirical, physical and data-based) have aimed to automate the separation.

Ehret and Zehe (2011) and Seibert et al. (2016) applied a simple discharge
threshold approach with partly unsatisfactory results; Merz et al. (2006)
introduced an iterative approach for event identification based on the
comparison of direct runoff and a threshold. Merz and Blöschl (2009)
expanded the concept to analyze runoff coefficients and applied it to a large
set of catchments. Blume et al. (2007) developed the “constant

While all of these methods have the advantage of being objective and automatable, they suffer from limited generality. The reason is that each of them contains some kind of conceptualized, fixed relation between input and output. Even though this relation can be customized to a particular application by adapting parameters, it remains to a certain degree invariant. In particular, each method requires an invariant set of input data, and sometimes it is constrained to a specific scale, which limits its application to specific cases and to where these data are available.

With the rapidly increasing availability of observation data, computer storage and processing power, data-based models have become increasingly popular as an addition or alternative to established modeling approaches in hydrology and hydraulics (Solomatine and Ostfeld, 2008). According to Solomatine and Ostfeld (2008) and Solomatine et al. (2009), they have the advantage of not requiring detailed consideration of physical processes (or any kind of a priori known relation between model input and output); instead, they infer these relations from data, which however requires that there are enough data to learn from. Of course, including a priori known relations among data into models is an advantage as long as we can assure that they really apply. However, when facing undetermined problems, i.e., for cases where system configuration, initial and boundary conditions are not well known, applying these relations may be over-constraining, which may lead to biased and/or overconfident predictions. Predictions based on probabilistic models that learn relations among data directly from the data, with few or no prior assumptions about the nature of these relations, are less bias-prone (because there are no prior assumptions potentially obstructing convergence towards observed mean behavior) and are less likely to be overconfident compared to established models (because applying deterministic models is still standard hydrological practice, and they are overconfident in all but the very few cases of perfect models). This applies if there are at least sufficient data to learn from, appropriate binning choices are made (see the related discussion in Sect. 2.2) and the application remains within the domain of the data that was used for learning.

In the context of data-based modeling in hydrology, concepts and measures from information theory are becoming increasingly popular for describing and inferring relations among data (Liu et al., 2016), quantifying uncertainty and evaluating model performance (Chapman, 1986; Liu et al., 2016), estimating information flows (Weijs, 2011; Darscheid, 2017), analyzing spatio-temporal variability in precipitation data (Mishra et al., 2009; Brunsell, 2010), describing catchment flow (Pechlivanidis et al., 2016), and measuring the quantity and quality of information in hydrological models (Nearing and Gupta, 2017).

In this study, we describe and test a data-driven approach for event detection formulated in terms of information theory. We show that its potential goes beyond event classification, since it enables the identification of the drivers of the classification, the choice of the most suitable model for an available data set, the quantification of minimal data requirements, the automatic reproduction of classifications for database generation and the handling of any kind of relation between the data. The method is presented in Sect. 2. In Sect. 3, we describe two test applications with data from the Dornbirner Ach catchment in Austria. We present the results in Sect. 4 and draw conclusions in Sect. 5.

The core of the information theory method (ITM) is straightforward and generally applicable; its main steps are shown in Fig. 1 and will be explained in the following.

Figure 1. Main steps of the ITM.

The process starts by selecting the target (what we want to predict) and the
predictor data (that potentially contain information about the target).
Choosing the predictors constitutes the first and most important model
hypothesis, and there are almost no restrictions to this choice. They can be
any kind of observational or other data, transformed by the user or not;
they can be part of the target data set themselves, e.g., time lagged or
space shifted; and they can even be the output of another model. The second
choice and model hypothesis is the mapping between items in the target and
the predictor data set, i.e., the relation hypothesis. It is important for
the later construction of conditional histograms that a 1 : 1 mapping exists
between target and predictor data, i.e., one particular value of the target is
related to one particular value of predictor (in contrast to 1 :

The next step is the first part of model building. It consists of choosing the value range and binning strategy for target and predictor data. These choices are important, as they will frame the estimated multivariate probability mass functions (PMFs) constituting the model and directly influence the statistics we compute from them for evaluation. Generally, these choices are subjective and reflect user-specific requirements and should be made while taking into consideration data precision and distribution, the size of the available data sets, and required resolution of the output. According to Gong et al. (2014), when constructing probability density functions (PDFs) from data via the simple bin-counting method, “[...] too small a bin width may lead to a histogram that is too rough an approximation of the underlying distribution, while an overly large bin width may result in a histogram that is overly smooth compared to the true PDF.” Gong et al. (2014) also discussed the selection of an optimal bin width by balancing bias and variance of the estimated PDF. Pechlivanidis et al. (2016) investigated the effect of bin resolution on the calculation of Shannon entropy and recommended that bin width should not be less than the precision of the data. Also, while equidistant bins have the advantage of being simple and computationally efficient (Ruddell and Kumar, 2009), hybrid alternatives can overcome weaknesses of conventional binning methods to achieve a better representation of the full range of data (Pechlivanidis et al., 2016).

With the binning strategy fixed, the last part of the model building is to construct a multivariate PMF from all predictors and related target data. The PMF dimension equals the number of predictors plus one (the target), and the way probability mass is distributed within it is a direct representation of the nature and strength of the relationship between the predictors and the target as contained in the data. Application of this kind of model for a given set of predictor values is straightforward; we simply extract the related conditional PMF (or PDF) of the target, which, under the assumption of system stationarity, is a probabilistic prediction of the target value.
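As an illustration only, the bin-counting model and its application can be sketched in Python as follows. The function names, the toy discharge values and the bin edges are assumptions made for this sketch and are not taken from the study; `numpy.histogramdd` is used simply as a convenient way to count data in multivariate bins.

```python
import numpy as np

def build_joint_pmf(predictors, target, predictor_edges, target_edges):
    """Joint PMF of (predictor_1, ..., predictor_n, target) by bin counting."""
    sample = np.column_stack(list(predictors) + [target])
    counts, _ = np.histogramdd(sample, bins=list(predictor_edges) + [target_edges])
    return counts / counts.sum()

def conditional_target_pmf(joint_pmf, predictor_values, predictor_edges):
    """Conditional PMF of the target for one set of predictor values.

    Returns None if a value lies outside the binned range or if the
    predictor combination was never observed (empty conditional PMF).
    """
    index = []
    for value, edges in zip(predictor_values, predictor_edges):
        if value < edges[0] or value > edges[-1]:
            return None
        i = np.searchsorted(edges, value, side="right") - 1
        index.append(min(i, len(edges) - 2))  # the last bin includes its right edge
    conditional = joint_pmf[tuple(index)]
    total = conditional.sum()
    return conditional / total if total > 0 else None

# Hypothetical toy data: one predictor (discharge) and a binary target (event yes/no)
discharge = np.array([1.0, 1.0, 2.0, 8.0, 9.0, 8.0, 2.0, 1.0])
event = np.array([0, 0, 0, 1, 1, 1, 0, 0])
q_edges = np.array([0.0, 5.0, 10.0])
e_edges = np.array([-0.5, 0.5, 1.5])

pmf = build_joint_pmf([discharge], event, [q_edges], e_edges)
prediction = conditional_target_pmf(pmf, [8.5], [q_edges])
```

Here `pmf` has one dimension per predictor plus one for the target, and `prediction` is the probabilistic statement "event no / event yes" for a discharge value of 8.5 in the toy units.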

If the system is non-stationary, e.g., when system properties change with time, the inconsistency between the learning and the prediction situation will result in additional predictive uncertainty. The problems associated with predictions of non-stationary systems apply to all modeling approaches. If a stable trend can be identified, a possible countermeasure is to learn and predict detrended data and then reimpose the trend in a post-processing step.

In order to evaluate the usefulness of a model, we apply concepts from information theory to select the best predictors (the drivers of the classification) and validate the model. With this in mind, this section provides a brief description of the information theory concepts and measures applied in this study. The section is based on Cover and Thomas (2006), which we recommend for a more detailed introduction to the concepts of information theory. Complementarily, for specific applications to investigate hydrological data series, we refer the reader to Darscheid (2017).

Entropy can be seen as a measure of the uncertainty of a random variable; it
is a measure of the amount of information required on average to describe a
random variable (Cover and Thomas, 2006). Let

We can describe the conditional entropy as the Shannon entropy of a random
variable conditional on the (prior) knowledge of another random variable.
The conditional entropy

It is also possible to compare two probability distributions

Note that the uncertainty measured by Eqs. (1) to (3) depends only on event probabilities, not on their values. This is convenient, as it allows joint treatment of many different sources and types of data in a single framework.
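For readers who prefer code to formulas, the measures used in this section can be sketched as below. The sketch follows the standard definitions in Cover and Thomas (2006) with base-2 logarithms, so that all results are in bit; the function names and the array layout (target on the last axis of the joint PMF) are assumptions made for this illustration.

```python
import numpy as np

def shannon_entropy(p):
    """H(X) = -sum p log2 p, with empty bins contributing zero."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(joint):
    """H(X|Y) = H(X, Y) - H(Y) for a joint PMF with the target X on the last axis."""
    joint = np.asarray(joint, dtype=float)
    p_predictors = joint.sum(axis=-1)
    return shannon_entropy(joint) - shannon_entropy(p_predictors)

def kl_divergence(p, q):
    """D_KL(p||q); infinite if q assigns zero probability where p does not."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0
    if np.any(q[mask] == 0):
        return float("inf")
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def cross_entropy(p, q):
    """H_pq = H(p) + D_KL(p||q)."""
    return shannon_entropy(p) + kl_divergence(p, q)
```

Because only probabilities enter these functions, the same code applies to discharge, precipitation or model-derived predictors once they are binned.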

As a benchmark, we can start with the case where no predictor is available,
but only the unconditional probability distribution of the target is known.
As seen in Eq. (1), the associated
predictive uncertainty can be measured by the Shannon entropy

Obvious advantages of setting up data-driven models in the described way are that it involves very few assumptions and that it makes formulating a large number of alternative model hypotheses straightforward. However, there is an important aspect we need to consider: from the information inequality, we know that conditional entropy is always less than or equal to the Shannon entropy of the target (Cover and Thomas, 2006). In other words, information never hurts, and consequently adding more predictors will always either improve or at the least not worsen results. In the extreme, given enough predictors and applying a very refined binning scheme, a model can potentially yield perfect predictions if applied to the learning data set. However, besides the higher computational effort, in this situation, the curse of dimensionality (Bellman, 1957) occurs, which “covers various effects and difficulties arising from the increasing number of dimensions in a mathematical space for which only a limited number of data points are available” (Darscheid, 2017). This means that with each predictor added to the model, the dimension of the conditional target–predictor PMF will increase by 1, but its volume will increase exponentially. For example, if the target PMF is covered by two bins and each predictor by 100, then a single, double and triple predictor model will consist of 200, 20 000 and 2 000 000 bins, respectively. Clearly, we will need a much larger data set to populate the last of these PMFs than the first. This also means that increasing the number of predictors for a fixed number of available data increases the risk of creating an overfitted or non-robust model in the sense that it will become more and more sensitive to the absence or presence of each particular data point. Models overfitted to a particular data set are less likely to produce good results when applied to other data sets than robust models, which capture the essentials of the data relation without getting lost in detail.
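The growth of the PMF volume in the example above can be reproduced with a few lines of Python. The function name is hypothetical; the bin numbers are those of the example (two target bins, 100 bins per predictor).

```python
def pmf_bin_count(target_bins, predictor_bins):
    """Total number of bins of the joint target-predictor PMF."""
    total = target_bins
    for bins in predictor_bins:
        total *= bins
    return total

# One, two and three predictors with 100 bins each, binary target
sizes = [pmf_bin_count(2, [100] * n) for n in (1, 2, 3)]
# sizes == [200, 20000, 2000000]
```

With a fixed number of observations, the average number of data points per bin therefore falls by a factor of 100 with each added predictor in this example.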

Figure 2. Investigating the effect of sample size through cross entropy and Kullback–Leibler divergence.

We consider this effect with a resampling approach: from the available data
set, we take samples of various sizes and construct the model from each sample
(see repetition statement regarding

The curve represents the mean of several repetitions, whose samples were
drawn randomly, with replacement among repetitions. Note that, comparable to
Monte Carlo cross-validation, the analysis presented in Fig. 2 summarizes a
large number of training and testing splits performed repeatedly and, in
addition, in different split proportions (subsets of
various sizes). The difference here is that, in contrast to a standard split
where data sets for training and testing are mutually exclusive, we build the
model on the training set and apply it to the full data set, of which one part
has been seen during training and another part has not. In other words, we
use the training subsets for building the model (a supervised learning
approach), and the resulting model is then applied to and evaluated on the
full data set. Although the full data set used for the
application includes the data of the training set, the
procedure has the advantage that results are always compared with the same model.
Thus, the stated procedure allows a robust and holistic analysis, in the
sense that it works with the mean of

Particularly, Fig. 2 shows that for small sample sizes,

Another use of plots such as Fig. 2 is to select the best among competing models with different numbers of predictors. Typically, for small sample sizes, simple models will outperform multi-predictor models, as the latter will be hit harder by the curse of dimensionality; but with increasing data availability, this effect will vanish, and models incorporating more sources of information will be rewarded.

In order to reduce the effect of chance when taking random samples, we
repeat the described resampling and evaluation procedure many times for each
sample size (see repetition statement

The proposed cross entropy curve provides a joint visualization of model analysis and model evaluation. At the same time, it allows models with different numbers of predictors to be compared and thus supports the decision, for a given amount of data, on which number of predictors is optimal in the sense of neither ignoring available information (by choosing too few predictors) nor overfitting (by choosing too many). Since it incorporates a form of cross-validation in its construction, a further advantage of this approach is that it avoids splitting the available data into a training and a testing set. Instead, it makes use of all available data for learning and provides measures of model performance across a range of sample sizes.
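A minimal sketch of how such a curve could be computed is given below. It is not the implementation used in the study: the function names are hypothetical, the data are assumed to be already converted to integer bin indices of a single (possibly flattened) predictor and the target, each sample is assumed to be drawn without replacement from the full data set, and empty bins are handled by adding one counter to every bin, a simple smoothing swapped in here for illustration.

```python
import numpy as np

def joint_counts(pred_idx, target_idx, n_pred_bins, n_target_bins):
    """Histogram counts of (predictor bin, target bin) pairs."""
    counts = np.zeros((n_pred_bins, n_target_bins))
    np.add.at(counts, (pred_idx, target_idx), 1)
    return counts

def conditional_cross_entropy(p_joint, q_conditional):
    """Cross entropy in bit between the full-data joint PMF and a model's conditional PMF."""
    mask = p_joint > 0
    return float(-np.sum(p_joint[mask] * np.log2(q_conditional[mask])))

def cross_entropy_curve(pred_idx, target_idx, n_pred_bins, n_target_bins,
                        sample_sizes, repetitions, seed=0):
    """Mean cross entropy for each sample size, averaged over repetitions."""
    pred_idx = np.asarray(pred_idx)
    target_idx = np.asarray(target_idx)
    rng = np.random.default_rng(seed)
    full = joint_counts(pred_idx, target_idx, n_pred_bins, n_target_bins)
    p_joint = full / full.sum()
    n = len(target_idx)
    curve = []
    for size in sample_sizes:
        values = []
        for _ in range(repetitions):
            # Assumption: sampling without replacement within one sample
            pick = rng.choice(n, size=size, replace=False)
            counts = joint_counts(pred_idx[pick], target_idx[pick],
                                  n_pred_bins, n_target_bins)
            counts += 1.0  # assumption: add-one smoothing so that no bin is empty
            q_conditional = counts / counts.sum(axis=1, keepdims=True)
            values.append(conditional_cross_entropy(p_joint, q_conditional))
        curve.append(np.mean(values))
    return np.array(curve)
```

Every value on the resulting curve is bounded from below by the conditional entropy of the full data set; the gap between the two corresponds to the Kullback–Leibler divergence attributable to the limited sample.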

Once a model has been selected, the ITM application is straightforward; from the multivariate PMF that represents the model, we simply extract the conditional PMF of the target for a given set of predictor values. The model returns a probabilistic representation of the target value. If the model was trained on all available data, and is applied within the domain of these data, the predictions will be unbiased and will be neither overconfident nor underconfident. If instead a model using deterministic functions is trained and applied in the same manner, the resulting single-value predictions may also be unbiased, but due to their single-value nature they will surely be overconfident.

For application to a new time series, if its conditions are outside of the range of the empirical PMF or if they are within the range but have never been observed in the training data set, the predictive distribution of the target (event yes or no) will be empty and the model will not provide a prediction. Several methods exist to guarantee a model answer; however, they come at the cost of reduced precision. The solutions range from (i) coarse graining, where the PMF is rebuilt with fewer, wider bins and an extended range until the model provides an answer for the predictive setting, as has been proposed by Darbellay and Vajda (1999), Knuth (2013) and Pechlivanidis et al. (2016), to (ii) gap filling, where the binning is maintained and the empty bins are filled with non-zero values based on a reasonable assumption. Gap-filling approaches comprise adding one counter to each zero-probability bin of the sample histogram, adding a small probability to the sample PDF, smoothing methods such as kernel density smoothing (Blower and Kelsall, 2002; Simonoff, 1996), Bayesian approaches based on the Dirichlet and multinomial distribution, and a maximum-entropy method recently suggested by Darscheid et al. (2018), the latter being applied in the present study.
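The simplest of the gap-filling options listed above, adding one counter to each zero-probability bin, can be sketched as follows. This is shown only to make the idea concrete; it is not the maximum-entropy method applied in the study, and the function name is hypothetical.

```python
import numpy as np

def fill_empty_bins(counts):
    """Add one counter to each empty bin, then normalize to a PMF."""
    counts = np.asarray(counts, dtype=float)
    filled = np.where(counts == 0, 1.0, counts)
    return filled / filled.sum()

# Conditional histogram of the target with one empty bin
pmf_seen = fill_empty_bins([3, 0, 1])   # counts become 3, 1, 1
# Predictor combination never observed: both target bins are empty
pmf_unseen = fill_empty_bins([0, 0])    # falls back to a uniform answer
```

The model then always returns an answer, but probability mass is shifted away from the observed bins, which is the loss of precision mentioned above.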

In this section, we describe the hydroclimatic properties of the data and the two applications performed. For demonstration purposes, the first test application was developed following Sect. 2 in order to explain which additional predictors we derived from the raw data and their related binning and to present our strategy for the model setup, classification and evaluation. For benchmarking purposes, the second application compares the proposed data-driven approach (ITM) with the physically based approach proposed by Mei and Anagnostou (2015), the characteristic point method (CPM), and applies the holdout method (splitting the data set into a training and a testing set) for the cross-validation analysis.

We used quality-controlled hourly discharge and precipitation observations
from a 9-year period (31 October 1996–1 November 2005, 78 912 time steps).
Discharge data are from the gauge Hoher Steg, which is located at the
outlet of the 113 km

Figure 3. Input data of discharge, precipitation and user-based event
classification. Overview of the time series

For the available period, we manually identified hydrological events by visual
inspection of the discharge time series. To guide this process, we used a
broad event definition, which can be summarized as follows: “an event is a
coherent period of elevated discharge compared to the discharge immediately
before and after and/or a coherent period of high discharge compared to the
data of the entire time series.” We suggest that this is a typical
definition if the goal is to identify events for hydrological process studies
such as analysis of rainfall-runoff coefficients, baseflow separation or
recession analysis. Based on this definition, we classified each time step of
the time series as either being part of an event (value 1) or not (value
0). Altogether, we identified 177 individual events covering 9092 time
steps, which is 11.5 % of the time series. For the available 9-year
period, the maximum precipitation is 28.5 mm h

Both the input data and the event classification are shown in Fig. 3.
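The event statistics quoted above already fix the benchmark uncertainty of the binary target. The short sketch below (variable and function names are assumptions made for this illustration) computes the Shannon entropy of the target from the 9092 event time steps out of 78 912; the result, about 0.515 bit, agrees with the 0.516 bit quoted in the abstract to within rounding of the inputs.

```python
import math

def binary_entropy(p):
    """Shannon entropy in bit of a yes/no variable with P(yes) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def relative_uncertainty_reduction(h_target, h_conditional):
    """Fraction of the target entropy removed by the predictors."""
    return 1.0 - h_conditional / h_target

n_steps = 78912         # hourly time steps in the 9-year series
n_event_steps = 9092    # time steps classified as part of an event
p_event = n_event_steps / n_steps
h_target = binary_entropy(p_event)
```

The same helper applied to the rounded values 0.516 and 0.114 bit gives a reduction of about 78 %, consistent with the figure reported for the best model.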

Since we wanted to build and test a large range of models, we not only
applied the raw observations of discharge and precipitation but also derived
new data sets. The target and all predictor data sets with the related
binning choices are listed in Table 1; additionally, the predictors are
explained in the text below. For reasons of comparability, we applied uniform
binning (fixed-width interval partitions) to all data
used in the study, except for discharge; here we grouped all values exceeding
15.2 m

Table 1. Target and predictors – characterization and binning strategy.

This is the discharge as measured at
Hoher Steg. In order to predict an event at time step

We also used a log transformation of discharge to evaluate whether this non-linear
conversion preserved more information in

This is a local identifier of discharge magnitude at time

This is the local inclination of the hydrograph. This predictor was created
to take into consideration the rate and direction of discharge changes. We
calculated both the slope from the previous to the current time step
applying Eq. (5) and the slope from the current to the next time step
applying Eq. (6), where positive values always indicate rising discharge:
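A minimal sketch of the two slope predictors is given below. It assumes that both slopes are simple finite differences of discharge per time step, which matches the verbal description (positive values indicate rising discharge) but is an assumption about the exact form of Eqs. (5) and (6); the function name and the `dt` parameter are hypothetical.

```python
import numpy as np

def hydrograph_slopes(q, dt=1.0):
    """Slope before (t-1 to t) and slope after (t to t+1) for each time step.

    Time steps without a neighbor on the required side are set to NaN.
    """
    q = np.asarray(q, dtype=float)
    diff = (q[1:] - q[:-1]) / dt
    slope_before = np.full(q.shape, np.nan)
    slope_after = np.full(q.shape, np.nan)
    slope_before[1:] = diff    # previous to current time step
    slope_after[:-1] = diff    # current to next time step
    return slope_before, slope_after

before, after = hydrograph_slopes([1.0, 3.0, 2.0])
# before: [nan, 2, -1], after: [2, -1, nan]
```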

This is the precipitation as measured at Ebnit.

In general, information about a target of interest can be encoded in related
data such as the predictors introduced above, but it can also be encoded in
the ordering of data. This is the case if the processes that are shaping the
target exhibit some kind of temporal memory or spatial coherence. For
example, the chance of a particular time step to be classified as being part
of an event increases if the discharge is on the rise, and it declines if the
discharge declines. We can incorporate this information by adding to the
predictors discharge from increasingly distant time steps, but this comes at
the price of a rapidly increasing impact of the curse of dimensionality. To
mitigate this effect, we can use sequential or recursive modeling approaches;
in a first step, we build a model using a set of predictors and apply it to
predict the target. In a next step, we use this prediction as a new,
model-derived predictor, combine it with other predictors in a second model,
use it to make a second prediction of the target and so forth. Each time we
map information from the multi-dimensional set of predictors onto the
one-dimensional model output, we compress data and reduce dimensionality while
hoping to preserve most of the information contained in the predictors. Of
course, if we apply such a recursive scheme and want to avoid iterations, we
need to avoid circular references, i.e., the output of the first model must
not depend on the output of the second. In our application, we assured this by
using the output from the first model at time step

To select the most informative window size when using relative magnitude of
discharge as a predictor, we calculated conditional entropy of the target
given discharge and the

Window size definitions for window types.

The best (lowest) value of conditional entropy was obtained for a
time-centered window (

All the models we set up and tested in this study can be assigned to one of
three distinct groups. The groups distinguish both typical situations of data
availability and the use of recursive and non-recursive modeling approaches.
Models in the

In order to streamline the model evaluation process, we applied an approach
of supervised model selection and gradually increasing model complexity. We
started by setting up and testing all possible one-predictor models in the

Among models with the same number of predictors, we compared model performance via the conditional entropy (target given the predictors), calculated from the full data set. However, when comparing models with different numbers of predictors, the influence of the curse of dimensionality needs to be taken into account. To this end, we calculated sample-based cross entropy and Kullback–Leibler divergence as described in Sect. 2.3.2 for samples ranging in size from 50 up to the size of the full data set, using the following sizes: 50, 100, 500, 1000, 1500, 2000, 2500, 5000, 7500, 10 000, 15 000, 20 000, 30 000, 40 000, 50 000, 60 000, 70 000 and 78 912. To eliminate effects of chance, we repeated the resampling 500 times for each sample size and averaged the results. In Appendix A, the resampling strategy and the choice of repetitions are discussed in more detail.

The second application aims to compare the performance of the ITM with that of another automatic event identification method from a more familiar perspective. The predictions were made on a separate data set, and, as a diagnostic measure, concepts from receiver operating characteristic (ROC) analysis were used to quantify the hits and misses of the predictions of both models against a time series of user-classified events (considered the true value). More about ROC analysis can be found in Fawcett (2005).

For the comparison, the characteristic point method (CPM) was chosen,
because, in contrast with the data-driven ITM, it is a physically based
approach for event identification, which is applicable to and recommended for
the characteristics of the available data set (hourly timescale data on
catchment precipitation and discharge) and open source. The essence of the
method is to characterize flow events with three points (start, peak(s) and
end of the event) and then associate the event to a corresponding rainfall
event (Mei and Anagnostou, 2015). For the event identification, a baseflow
separation is previously needed and proposed by coupling the revised
constant

Since the outcome of the CPM is dichotomous, classified as either event or
non-event, the probabilistic outcome of
the ITM must be converted into a binary outcome. In this study, the binarization was
achieved by choosing an optimum threshold of the probabilistic
prediction (

A detailed discussion about the cut-off values of the ROC curve can be found in Habibzadeh et al. (2016).
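The binarization and the ROC-based quality measures can be sketched as follows. The function names and toy values are hypothetical, and the criterion used here to pick the threshold (maximizing the difference between true positive rate and false positive rate, known as Youden's index) is only one of the common cut-off criteria; it is an assumption of this sketch and not necessarily the criterion applied in the study.

```python
import numpy as np

def confusion_rates(predicted, truth):
    """True positive rate, false positive rate and accuracy of a binary prediction."""
    predicted = np.asarray(predicted, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(predicted & truth)
    tn = np.sum(~predicted & ~truth)
    fp = np.sum(predicted & ~truth)
    fn = np.sum(~predicted & truth)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    accuracy = (tp + tn) / truth.size
    return float(tpr), float(fpr), float(accuracy)

def best_threshold(probabilities, truth, thresholds):
    """Threshold maximizing TPR - FPR (assumed criterion, see lead-in)."""
    probabilities = np.asarray(probabilities, dtype=float)
    scores = []
    for threshold in thresholds:
        tpr, fpr, _ = confusion_rates(probabilities >= threshold, truth)
        scores.append(tpr - fpr)
    return thresholds[int(np.argmax(scores))]

# Hypothetical probabilistic event predictions and user classification
prob_event = [0.1, 0.2, 0.8, 0.9, 0.6, 0.4]
user_class = [0, 0, 1, 1, 1, 0]
threshold = best_threshold(prob_event, user_class, [0.3, 0.5, 0.7])
rates = confusion_rates(np.asarray(prob_event) >= threshold, user_class)
```

In a holdout setting, the threshold would be chosen on the training data only and then kept fixed when the rates are computed on the testing data.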

Due to the

After that, the calibrated models (ITM and CPM) were applied to a new data
set (testing data set), and measures of quality based on the ROC curve were
computed in order to evaluate and compare their performance, such as
(i) the true positive rate (

Here we present and discuss the model results when the models are constructed from and applied to
the complete data set. Because the complete data set is used for both steps, Kullback–Leibler
divergence is always zero, and model performance can be fully expressed
by conditional entropy (see Sect. 3.2.3; Model Evaluation), with the (unconditional) Shannon entropy of the
target data

One-predictor models based on

Conditional entropy and relative uncertainty reduction of one-predictor models.

Based on these considerations and the model selection strategy described in Sect. 3.2.3, we built and evaluated all possible two-predictor models. The models and results are shown in Table 3.

As could be expected from the information inequality, adding a predictor
improved the results, and for some models (no. 16 and no. 20), the

In the

Table 3. Conditional entropy and relative uncertainty reduction of two-predictor models.

Finally, from both the

Again, for both models, the added predictor improved results considerably,
and we used both of them to build a recursive four-predictor model as
described in Sect. 3.2.3. The new predictor,

Conditional entropy and relative uncertainty reduction of three-predictor models.

Conditional entropy and relative uncertainty reduction of recursive four-predictor models.

Again, model performance improved, and model no. 29 was the best among all tested models, though so far the effect of sample size was not considered, which might have a strong impact on the model rankings. This is investigated in the next section.

Models selected for sample-based tests.

The sample-based model analysis is computationally expensive, so we restricted these tests to a subset of the models from the previous section. Our selection criteria were to (i) include at least one model from each predictor group, (ii) include at least one model from each dimension of predictors and (iii) choose the best-performing model. Altogether we selected the seven models shown in Table 6. Please note that despite our selection criteria, we ignored the one-predictor model using precipitation due to its poor performance.

Cross entropy for models in Table 6 as a function of sample size.

For these models, we computed the cross entropies between the full data set
and each sample size

In Fig. 5, the cross entropies at the right end of the

As mentioned in Sect. 2.3.2, Fig. 5 not only indicates the amount of data
needed to obtain a robust model (implying that the sample size is sufficient to
represent the full data set) but also allows competing models of different
dimensions to be compared and the optimal number of predictors to be selected
(taking advantage of the available information while avoiding overfitting). In
this sense, in the

Interestingly, the best-performing model for large sample sizes (no. 29)
includes predictors which reflect the definition criteria that guided manual
event detection (Sect. 3.1):

Application I – curse of dimensionality and data size validation for models in Table 6.

We also investigated the contribution of sample size effects to total
uncertainty by analyzing the ratio of

In Table 7 (fifth column), we show the minimum sample size to keep the

As expected, the models with few predictors require only small samples to
meet the 5 % requirement (starting from a subset of 12.6 % of the
full data set for the one-predictor model to 37.3 % for the two-predictor
model), but for multi-predictor models such as models no. 29 and no. 30, more
than 60 000 data points are required (87.6 % and 79.4 % of the full
data set, respectively). This happens because the greater the number of
predictors, the greater the number of bins in the model. This means that we
need a much larger data set to populate the PMF with the largest number of
bins; for example, model no. 29 has 279 752 bins and requires 7.9 years of
data. Considering that the amount of data available in the study is limited,
this also means that increasing the number of predictors and/or bins also increases
the risk of creating an overfitted or non-robust model. Thus, the ratio

In the previous sections, we developed, compared and validated a range of
models to reproduce subjective, manual identification of events in a
discharge time series. Given the available data, the best model was a four-predictor
recursive model built with the full data set and

Application I – probabilistic prediction of four-predictor model no. 29 (Table 5) for a subset of the training data.

In the period from 1 to 21 June, four distinct rainfall-runoff events occurred which were also classified as such by the user. During these events, the model-based predictions for event probability remained consistently high, except for some times at the beginning and end of events or in times of low flow during an event. Obviously, the model here agrees with the user classification, and if we wished to obtain a binary classification from the model, we could get it by introducing an appropriate probability threshold (as further described in Sect. 4.2).

Things look different, though, in the period of 26 April to 10 May, when snowmelt induced diurnal discharge patterns. During this time, the model identified several periods with reasonable (above 50 %) event probability, but the user classified the entire period as a non-event. Arguably, this is a difficult case for both manual and automated classification, as the overall discharge is elevated, but it is not elevated by much, and diurnal events can be distinguished but are not pronounced. In such cases, both the user-based and the model-based classifications are uncertain and may disagree.

To identify snowmelt events or potentially improve the information contained in the precipitation set, other predictors could have been used in the analysis (such as aggregated precipitation, snow depth, air temperature, nitrate concentrations, moving average of discharge, etc.), or the target could have been classified according to its type (rainfall, snowmelt, upstream reservoir operation, etc.) instead of having a dichotomous outcome, i.e., event and non-event. The choice of target and potential predictors depends on user interest and data availability.

Another point that may be of interest to the user is the improvement of the
consistency of the event duration. This can be reached by selection of
predictors or through a post-processing step. As previously discussed in
Sect. 3.2.1, by applying a recursive predictor

Finally, in contrast to the evaluation approach presented, where the subsets are compared to the full data set (subset data plus data not seen during training), the next section will present the evaluation of the ITM and CPM applied for mutually exclusive training and testing sets.

Section 4.1 showed that, for the full data set, the best model was the
recursive one with

Cross-validation data set – characteristics of the user event classification set.

For model training, the data of both models, the ITM and CPM, were smoothed. First, a 24 h moving average was applied to the discharge of the CPM (this was recommended by the first author of the method, Yiwen Mei, during personal communications in 2018), and to avoid a misleading comparison, it was then applied to the probabilities of the ITM right before the binarization. The smoothing improved the results of both models and worked as a post-processing filter which removed some noise (events with a very short duration) and attenuated effects from snowmelt. Note that this is a feature of our training data set; it is therefore not necessarily applicable to other similar problems, nor is it a required step.
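The smoothing and subsequent binarization can be illustrated with a minimal Python sketch. It is not the published implementation: the function names, the centered window and the treatment of the series edges are assumptions made for illustration, and hourly time steps are assumed so that a 24 h window corresponds to `window=24`.

```python
import numpy as np

def centered_moving_average(values, window):
    """Moving average over a centered window (assumption: centered, not trailing).

    Near the series edges, the mean is taken over the available samples only.
    """
    values = np.asarray(values, dtype=float)
    if values.size < window:
        raise ValueError("series must be at least as long as the window")
    kernel = np.ones(window)
    sums = np.convolve(values, kernel, mode="same")
    counts = np.convolve(np.ones_like(values), kernel, mode="same")
    return sums / counts

def binarize(probabilities, threshold):
    """Classify a time step as an event when its probability reaches the threshold."""
    return np.asarray(probabilities, dtype=float) >= threshold

# Synthetic example: an isolated one-step probability spike is removed by smoothing.
spike = [0.0, 0.0, 1.0, 0.0, 0.0]
events = binarize(centered_moving_average(spike, window=3), threshold=0.5)
```

In this toy series, the smoothed probability never reaches the threshold, which mirrors how the filter suppresses events of very short duration.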

Application II – ITM and CPM performance.

Following the data smoothing, we proceeded with the optimization of the
following parameters: the threshold for the probability output of the ITM and
rate of no change for the CPM (Sect. 3.3). The results of the two models also
improved with the optimization performed. The optimum parameters obtained
were

After the model training, the calibrated models were applied to the testing data set to predict binary events. The event predictions were then compared to the true classification (Table 8, testing row), and their hits and misses were calculated in order to evaluate and compare their performance. The results are compiled in Table 9.

The quality parameters presented in Table 9 show that the ITM true
positive rate equals 97.5 %, i.e., it is 13.0 % higher than the CPM

Application II – binary prediction of ITM and CPM for a subset of the testing data set.

Considering only the hits of the models, both methods performed similarly, reaching almost 90 % accuracy, with the CPM being slightly better than the ITM. However, although accuracy gives a good notion of the model hits, it was not used as a criterion for success because it is short-sighted with respect to false event classifications. False positives are essential in the context of event prediction, since most of the data are non-events (88.2 % of the training data set; Table 8): a blind classification of all time steps as non-events, for example, would exceed the accuracy obtained by both models (90.4 % of the testing data set; Table 8), even though it is not a useful model.
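The limitation of accuracy on imbalanced data can be made concrete with a small Python sketch. The data below are synthetic and only mirror the 90.4 % share of non-events quoted for the testing data set; the function name `binary_scores` is hypothetical.

```python
import numpy as np

def binary_scores(truth, predicted):
    """Accuracy, true positive rate and false positive rate of a binary prediction."""
    truth = np.asarray(truth, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    tp = np.sum(truth & predicted)    # events predicted as events
    tn = np.sum(~truth & ~predicted)  # non-events predicted as non-events
    fp = np.sum(~truth & predicted)   # non-events predicted as events
    fn = np.sum(truth & ~predicted)   # events predicted as non-events
    return {
        "accuracy": (tp + tn) / truth.size,
        "true_positive_rate": tp / (tp + fn) if tp + fn else float("nan"),
        "false_positive_rate": fp / (fp + tn) if fp + tn else float("nan"),
    }

# Synthetic series: 1000 time steps, 96 of which are events (90.4 % non-events).
truth = np.zeros(1000, dtype=bool)
truth[:96] = True
blind = np.zeros(1000, dtype=bool)  # classifies every time step as a non-event
scores = binary_scores(truth, blind)
```

For this synthetic case, the blind classifier reaches an accuracy of 0.904 while its true positive rate is zero, which is why rates derived from the ROC curve are more informative here than accuracy alone.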

As an illustration, in the context of the binary analysis, the observed discharge, the true event classification (manually made by an expert), the ITM-predicted events and CPM-predicted events are shown in Fig. 7 for a subset of the testing data, from 29 June to 19 August 2005.

For the analyzed subset, nine distinct rainfall-runoff events occurred and were identified as such by the ITM and CPM. However, unlike the true identification, both models merged some of these events (20 July, 7 and 16 August) with events of longer duration. False events were also observed in both models: three false events were identified by the ITM (5, 7 and 26 July), and two (covering the same period as those of the ITM) were identified by the CPM. It should be noted that they are false in relation to the user classification; however, we cannot exclude the possibility of false classification by the visual inspection process. A further criticism is that the holdout cross-validation involves a single run, which is not as robust as multiple runs. Nevertheless, the way that the split was proposed respects the chronological order in which the data were obtained. Thus, despite the subjectivity of event selection by a user and the application of a simplified method of cross-validation, it is possible to conclude that, overall, the ITM and CPM behaved similarly and provided reasonable predictions, as seen numerically in Table 9 and qualitatively in Fig. 7.

An interesting conclusion is that the ITM was able to outperform the CPM while requiring only discharge data and a training data set of classified events (also based on the discharge set), whereas the CPM demanded precipitation, catchment area and discharge as inputs. It is important to note that the CPM can be modified to be used without precipitation data; however, in our case this resulted in a considerably higher false positive rate, since the rainfall event-related filters cannot be applied. In contrast, since the CPM is a physically based approach, it does not require a training data set with identified events (although the optimization in the calibration step considerably improved its results), and there are no limitations in terms of data set size, which makes the robustness analysis unnecessary and the method easier to implement for binary classification. The binarization of the ITM predictions and the parameter optimization in the CPM are not included in the original methods; however, they were essential adaptations to allow a fair comparison of the models. Finally, the suitability of the existing event detection techniques depends mainly on the user's interest and the data available for application.

Typically, it is easy to manually identify rainfall-runoff events due to the high discriminative and integrative power of the brain–eye system. However, this is (i) cumbersome for long time series; (ii) subject to handling errors; and (iii) hard to reproduce, since it depends on the acuity and knowledge of the event identifier. To mitigate these issues, this study has proposed an information theory approach to learn from data and to choose the best predictors, via uncertainty reduction, for creating predictive models that automatically identify rainfall-runoff events in discharge time series.

The method was established in four main steps: the model hypothesis, building, evaluation and application. Each association of predictor(s) to the target is equivalent to formulating a model hypothesis. For the model building, non-parametric models constructed discrete distributions via bin-counting, requiring at least a discharge time series and a training data set containing a yes or no event identification as target. In the evaluation step, we used Shannon entropy and conditional entropy to select the more informative predictors and Kullback–Leibler divergence and cross entropy to analyze the model in terms of overfitting and curse of dimensionality. Finally, the best model was applied to its original data set to compare the predictability of the events. For the purpose of benchmarking, a holdout cross-validation and a comparison of the proposed data-driven method with an alternative physically based approach were performed.
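The bin-counting evaluation summarized above can be sketched in a few lines of Python. This is an illustrative sketch rather than the published program: the function names are hypothetical, and the predictor is assumed to be already discretized into a single integer bin label per time step (for several predictors, the individual bin indices would first have to be combined into one joint label).

```python
import numpy as np

def shannon_entropy(counts):
    """Shannon entropy in bits of the distribution given by bin counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()  # empty bins contribute nothing
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(predictor_bins, target):
    """H(target | predictor) in bits, estimated from paired samples by bin counting."""
    predictor_bins = np.asarray(predictor_bins)
    target = np.asarray(target)
    h = 0.0
    for b in np.unique(predictor_bins):
        in_bin = target[predictor_bins == b]
        _, counts = np.unique(in_bin, return_counts=True)
        # weight the entropy of the target within this bin by the bin probability
        h += in_bin.size / target.size * shannon_entropy(counts)
    return h

def relative_uncertainty_reduction(predictor_bins, target):
    """Share of the target's Shannon entropy removed by knowing the predictor."""
    _, counts = np.unique(np.asarray(target), return_counts=True)
    return 1.0 - conditional_entropy(predictor_bins, target) / shannon_entropy(counts)
```

A predictor that determines the target completely yields a conditional entropy of 0 bits and a relative uncertainty reduction of 1, whereas a predictor that is independent of the target leaves the entropy unchanged.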

The approach was applied to discharge and precipitation data from the Dornbirner Ach catchment in Austria. In this case study, 30 models based on 16 predictors were built and tested. Among these, seven predictive models with a number of predictors varying from one to four were selected. Interestingly, across these models, the three best-performing ones were obtained using only discharge-based predictors. The overall best model was a recursive one applying four predictors: discharge from two different time steps, the relative magnitude of discharge compared to all discharge values in a surrounding 65 h time window and event predictions from the previous time step. When applying the best model, the uncertainty of event classification was reduced by 77.8 %, decreasing conditional entropy from 0.516 to 0.114 bits. Since the conditional entropy reduction of the models with precipitation was not higher than the ones exclusively based on discharge information, it was possible to infer that (i) the information coming from precipitation was likely already contained in the discharge data series and (ii) the event classification is not so much dependent on precipitation at a particular time step but rather on the accumulated rainfall in the period preceding it. Furthermore, precipitation data are often not available for analysis, which makes the model exclusively based on discharge data even more attractive.

Further analysis using cross entropy and Kullback–Leibler divergence showed that the robustness of a model quickly dropped with the number of predictors used (an effect known as the curse of dimensionality) and that the relation between the number of predictors and the sample size was crucial to avoid overfitting. Thus, the model choice is a tradeoff between predictive power and robustness, given the available data. For our case, the minimum amount of data to build a robust model varied from 9952 data points (one-predictor model with 0.260 bits of uncertainty) to 69 102 data points (four-predictor model with 0.114 bits of uncertainty). Complementarily, the quality of the model was verified in a more traditional way, by a cross-validation analysis (where the model was built on a training data set and validated on a testing data set) and by a comparative investigation between our data-driven approach and a physically based model. As a result, both models generally presented reasonable predictions and reached similar quality parameters, with almost 90 % accuracy. In the end, the comparative analysis and cross-validation reinforced the quality of the method, previously validated in terms of robustness using measures from information theory.

In the end, the data-driven approach based on information theory is a consolidation of descriptive and experimental investigations, since it allows one to describe the drivers of the model through predictors and investigates the similarity of the model hypothesis with respect to the true classification. In summary, it presents advantages such as the following: (i) it is a general method that involves a minimum of additional assumptions or parameterizations; (ii) due to its non-parametric approach, it preserves the full information of the data as much as possible, which might get lost when expressing the data relations with functional relationships; (iii) it obtains data relations from the data itself; (iv) it is flexible in terms of data requirement and model building; (v) it allows one to measure the amount of uncertainty reduction via predictors; (vi) it is a direct way to account for uncertainty; (vii) it permits explicitly comparing information from various sources in a single currency, the bit; (viii) it allows one to quantify minimal data requirements; (ix) it enables one to investigate the curse of dimensionality; (x) it is a way of understanding the drivers (predictors) of the model (also useful in machine learning, for example); (xi) it permits one to choose the most suitable model for an available data set; and (xii) the predictions are probabilistic, which, compared to a binary classification, additionally provides a measure of the confidence of the classification.

Although the procedure was employed to identify events from a discharge time series, which for our case were mainly triggered by rainfall and snowmelt, the method can be applied to reproduce user classification of any kind of event (rainfall, snowmelt, upstream reservoir operation, etc.) and even identify them separately. Moreover, one of the strengths of the data-based approach is that it potentially accepts any data to serve as predictors, and it can handle any kind of relation between the predictor(s) and the target. Thus, the proposed approach can be conveniently adapted to another practical application.

The event detection program, containing the functions
to develop multivariate histograms and calculate information theory measures,
is published alongside this manuscript via GitHub:

In the study, samples of size

Thus, in order to find the value of

Dispersion analysis of the cross entropy. The effect of the number of repetitions in the target model (no. 0 in Table 7).

Considering the graph in Fig. A1, in general, the behavior of the Shannon entropy among the repetitions is similar for each sample size analyzed, indicating that the dispersion of the results according to the number of repetitions does not vary too much, i.e., the bins are similarly filled. However, it is possible to see that, as the sample size increases, the Shannon entropy for the different number of repetitions approaches that for the 50 000 repetitions. For sample sizes up to 7500, the bars from 50, 100 and 300 repetitions present some peaks and troughs, indicating some dispersion in filling the bins. Thus, in this case study, the minimum of 500 repetitions was assumed as a reasonable number of repetitions for computing the mean of the cross entropy in the sample size investigation. This number of repetitions was also validated considering the smoothness and logical behavior of the curves obtained during the data size validation and curse of dimensionality analyses (Fig. 5 in Sect. 4.1.2).
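The repetition-averaged cross entropy discussed above can be sketched in Python as follows. This is a simplified sketch under stated assumptions, not the published program: the function names are hypothetical, each data point is assumed to be represented by a single integer bin index, samples are assumed to be drawn without replacement, and a uniform `pseudo_count` is used as a stand-in for avoiding infinitely large values caused by empty bins (it is not the authors' method).

```python
import numpy as np

def cross_entropy(p, q):
    """Cross entropy in bits between reference distribution p and model distribution q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # bins that never occur in p contribute nothing
    return float(-np.sum(p[mask] * np.log2(q[mask])))

def mean_cross_entropy(labels, n_bins, sample_size, repetitions, rng, pseudo_count=1.0):
    """Mean cross entropy between the full data set and random samples of a given size.

    labels: non-negative integer bin index of every data point in the full data set.
    pseudo_count: value added to every sample bin so that empty bins do not produce
    infinite cross entropy (an assumption made for this sketch).
    """
    labels = np.asarray(labels)
    full_counts = np.bincount(labels, minlength=n_bins)
    p = full_counts / full_counts.sum()
    total = 0.0
    for _ in range(repetitions):
        sample = rng.choice(labels, size=sample_size, replace=False)
        counts = np.bincount(sample, minlength=n_bins) + pseudo_count
        total += cross_entropy(p, counts / counts.sum())
    return total / repetitions

# Synthetic example with three bins and the 500 repetitions adopted in the study.
labels = np.array([0] * 50 + [1] * 30 + [2] * 20)
rng = np.random.default_rng(0)
h_small_sample = mean_cross_entropy(labels, n_bins=3, sample_size=10, repetitions=500, rng=rng)
```

Because the cross entropy equals the Shannon entropy of the full data set plus the Kullback–Leibler divergence, the averaged value can never fall below the entropy of the full data set, and the gap between the two measures the cost of the limited sample size.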

UE and PD developed the model program (calculation of information theory
measures, multivariate histograms operations and event detection) and developed
a method for avoiding infinitely large values of

The authors declare that they have no conflict of interest.

Stephanie Thiesen and Uwe Ehret acknowledge support from the Deutsche Forschungsgemeinschaft (DFG) and Open Access Publishing Fund of the Karlsruhe Institute of Technology (KIT). We thank Clemens Mathis from Wasserwirtschaft Vorarlberg, Austria, for providing the case study data. The article processing charges for this open-access publication were covered by a research center of the Helmholtz Association.

Edited by: Bettina Schaefli

Reviewed by: Yiwen Mei and one anonymous referee