Dissolved oxygen prediction using a possibility-theory based fuzzy neural network

Introduction Conclusions References


Introduction
The City of Calgary is a major economic hub in western Canada.With a rapidly growing population, currently estimated in excess of 1 million, the city is undergoing expansion and urbanisation to accommodate the changes.The Bow River is a relatively small river that flows through the city and provides approximately 60 % of the residents with potable water (Khan andValeo, 2015, 2016).In addition to this, water is diverted from within the city for irrigation, is used as a source for commercial and recreational fisheries, and is the source of drinking water for communities downstream of Calgary (Robinson et al., 2009;Bow River Basin Coun-cil, 2015).This highlights the importance of the Bow River, not just as a source of potable water, but also as a major economic resource.
However, urbanisation has the potential to reduce the health of the Bow River, which is fast approaching its assimilative capacity and is one of the most regulated rivers in Alberta (Bow River Basin Council, 2015).Three wastewater treatment plants (shown in Fig. 1) and numerous stormwater outfalls discharge their effluent into the river and are considered to be a major cause of water quality degradation in the river (He et al., 2015).This highlights some of the major impacts on the Bow River from the surrounding urban area.A number of municipal and provincial programmes are in place to reduce the loading of nutrients and sediments into the river such as the Total Loadings Management Plan and the Bow River Phosphorus Management Plan (Neupane et al., 2014) as well as modelling efforts -namely the Bow River Water Quality Model (Neupane et al., 2014;Golder Associates Ltd., 2004) -to predict the impact of different water management programmes on the water quality.
One of the major concerns is that low dissolved oxygen (DO) concentration has occurred on a number of occasions over the last decade in the Bow River within the city limits.DO is an indicator of overall health of the aquatic ecosystem (Dorfman and Jacoby, 1972;Hall, 1984; Canadian Council of Ministers of the Environment, 1999;Kannel et al., 2007;Khan andValeo, 2014a, 2015), and low DOwhich can be caused by a number of different factors (Pogue and Anderson, 1995;Hauer and Hill, 2007;He et al., 2011;Wen et al., 2013) -can impact various organisms in the water body.While the impact of long-term effects of low DO are largely unknown, acute events can have devastating effects on aquatic ecosystems (Adams et al., 2013).Thus, main- taining a suitably high DO concentration, and water quality in general, is of utmost importance to the City of Calgary and downstream stakeholders, particularly as the city is being challenged to meet its water quality targets (Robinson et al., 2009).A number of recent studies have examined the DO in the Bow River and the factors that impact its concentration.Iwanyshyn et al. (2008) found the diurnal variation in DO and nutrient (nitrate and phosphate) concentration were highly correlated, suggesting that biogeochemical processes (photosynthesis and respiration of aquatic vegetation) had a dominant impact on nutrient concentration rather than wastewater treatment effluent.Further, Robinson et al. (2009) found that the DO fluctuations in the river were primarily due to periphyton rather than macrophyte biogeochemical processes.In both studies, the seasonality of DO, nutrients, and biological concentration, and external factors (e.g.flood events) were demonstrative of the complexity in understanding river processes in an urban area, and that consideration of various inputs, outputs and their interaction is important to fully understand the system.He et al. (2011) found that seasonal variations in DO in the Bow River could be explained by a combination of abiotic factors (such as climatic and hydrometric conditions), as well as by biotic factors.The study found that while photosynthesis and respiration of biota are the main drivers of DO fluctuation, the role of nutrients was ambiguous.Neupane et al. (2014) found that organic materials and nutrients from point and non-point sources influence DO concentration in the river.The likelihood of low DO was highest downstream of wastewater treatment plants and that non-point sources have a significant impact in the open-water season.Using a physically based model, Neupane et al. (2014) predicted low DO concentration more frequently in the future in the Bow River owing to higher phosphorus concentration in the water, as well as climate change impacts.
A major issue of modelling DO in the Bow River is that rapid urbanisation within the watershed has resulted in substantial changes to land-use characteristics, sediment and nutrient loads, and to other factors that govern DO.Major flood events (like those in 2005 and 2013) completely alter the aquatic ecosystem, while new wastewater treatment plants (e.g. the Pine Creek wastewater treatment plant) added in response to the growing population further increases the stress downstream.These types of changes in a watershed increase the complexity of the system: the interaction of numerous factors over a relatively small area and across different temporal scales means that DO trends and variability in urban areas are more difficult to model and evaluating water quality in urban riverine environments is a difficult task (Hall, 1984;Niemczynowicz, 1999).
The implication of this is that the simplistic representation described in conceptual, physically based models is not suitable for complex systems, i.e.where the underlying physical mechanisms behind the factors that govern DO are still not clearly understood, in a rapidly changing urban environment.Physically based models require the parameterisation of several different variables which may be unavailable, expensive and time consuming (Antanasijević et al., 2014;Wen et al., 2013;Khan et al., 2013).In addition to this, the increase in complexity in an urban system proportionally increases the uncertainty in the system.This uncertainty can arise as a result of vaguely known relationships among all the factors that influence DO, in addition to the inherent randomness in the system (Deng et al., 2011).The rapid changes in an urban area render the system dynamic as opposed to stationary, which is what is typically assumed for many probability based uncertainty quantification methods.Thus, not only is DO prediction difficult, but it is also beset with uncertainty, hindering water resource managers from making objective decisions.
In this research, we propose a new method to predict DO concentration in the Bow River using a data-driven approach, as opposed to a physically based method that uses possibility theory and fuzzy numbers to represent the uncertainty rather than the more commonly used probability theory.Data-driven models are a class of numerical models based on generalised relationships, links or connections between input and output data sets (Solomatine and Ostfeld, 2008).These models can characterise a system with limited assumptions and are useful in solving practical problems, es-pecially when there is lack of understanding of the underlying physical process, the time series are of insufficient length, or when existing models are inadequate (Solomatine et al., 2008;Napolitano et al., 2011).

Fuzzy numbers and data-driven modelling
Possibility theory is an information theory that is an extension of fuzzy set theory for representing uncertain, vague or imprecise information (Zadeh, 1978).Fuzzy numbers are an extension of fuzzy set theory, and express an uncertain or imprecise quantity.These types of numbers are particularly useful for dealing with uncertainties when data are limited or imprecise (Bárdossy et al., 1990;Guyonnet et al., 2003;Huang et al., 2010;Zhang and Achari, 2010) -in other words when epistemic uncertainty exists.This type of uncertainty is in contrast to aleatory uncertainty that is typically handled using probability theory.Possibility theory and fuzzy numbers are thus useful when a probabilistic representation of parameters may not be possible, since the exact values of parameters may be unknown or only partial information available (Zhang, 2009).Thus, the choice of using a datadriven approach in combination with possibility theory lends itself well to the constraints posed by the problem in the Bow River: the difficulty in correctly defining a physically based model for a complex urban system and the use of possibility theory to model the uncertainty in the system when probability theory based methods may be inadequate.
Data-driven models, such as neural networks, regression based techniques, fuzzy rule based systems, and genetic programming, have seen widespread use in hydrology, including DO prediction in rivers (Shrestha and Solomatine, 2008;Solomatine et al., 2008;Elshorbagy et al., 2010).Wen et al. (2013) used artificial neural networks (ANNs) to predict DO in a river in China using ion concentration as the predictors.Antanasijević et al. (2014) used ANNs to predict DO in a river in Serbia using a Monte Carlo approach to quantify the uncertainty in model predictions and temperature as a predictor.Chang et al. (2015) also used ANNs coupled with hydrological factors (such as precipitation and discharge) to predict DO in a river in Taiwan.Singh et al. (2009) used water quality parameters to predict DO and BOD in a river in India.Other studies (e.g.Heddam, 2014;Ay and Kisi, 2011) have used regression to predict DO in rivers using water temperature or electrical conductivity, amongst others, as inputs.In general, these studies have demonstrated that there is a need and demand for less complex DO models, have led to an increase in the popularity of data-driven models (Antanasijević et al., 2014), and shown that the performance of these types of models is suitable.Recent research into predicting DO concentration in the Bow River in Calgary using abiotic factors (these are non-living, physical and chemical attributes) as inputs have shown promising results (He et al., 2011;Khan et al., 2013;Khan and Valeo, 2015).The advantage of using readily available data (i.e. the abiotic inputs) in these studies is that if a suitable relationship between these factors and DO can be found, changing the factors (e.g.increasing the discharge rate downstream of a treatment plant) can potentially reduce the risk of low DO.
While fuzzy set theory based applications, particularly applications using fuzzy logic in neural networks, have been widely used in many fields including hydrology (Bárdossy et al., 2006;Abrahart et al., 2012), the use of fuzzy numbers and possibility theory based applications has been limited in comparison (Bárdossy et al., 2006;Jacquin, 2010).Some examples include maps of soil hydrological properties (Martin-Clouaire et al., 2000), remotely sensed soil moisture data (Verhoest et al., 2007), climate modelling (Mujumdar and Ghosh, 2008), subsurface contaminant transport (Zhang et al., 2009), and streamflow forecasting (Alvisi and Franchini, 2011).Khan et al. (2013) and Khan and Valeo (2015) have introduced a fuzzy number based regression technique to model daily DO in the Bow River using abiotic factors with promising results.Similarly, Khan and Valeo (2014a) used an autoregressive time series based approach combined with fuzzy numbers to predict DO in the Bow River.In these studies, the use of fuzzy numbers meant that the uncertainty in the system could be quantified and propagated through the model.However, due to the highly non-linear nature of DO modelling, the use of an ANN based method is of interest since these types of models are effective for modelling complex, non-linear relationships without the explicit understanding of the physical phenomenon governing the system (Alvisi and Franchini, 2011;Antanasijević et al., 2014).A fuzzy neural network method proposed by Alvisi and Franchini (2011) for streamflow prediction that uses fuzzy weights and biases in the network is further refined in this research for predicting DO concentration.

Objectives
Given the importance of DO concentration as an indicator of overall aquatic ecosystem health, there is a need to accurately model and predict DO in urban riverine environments, like that in Calgary, Canada.In this research a new datadriven method is proposed that attempts to address the issues that plague numerical modelling of DO concentration in the Bow River.The FNN method proposed by Alvisi and Franchini ( 2011) is adapted and extended in two critical ways.The existing method uses crisp (i.e.non-fuzzy) inputs and outputs to train the network, producing a set of fuzzy number weights and biases, and fuzzy outputs.The method is adapted to be able to handle fuzzy number inputs to produce fuzzy weights and biases, and fuzzy outputs.The advantage is that the uncertainties in the input observations are also captured within the model structure.To do this, a new method of creating fuzzy numbers from observations is presented based on a probability-possibility transformation.Second, the existing training algorithm is based on capturing a predetermined set of observations (e.g.100, 95 or 90 %) within the fuzzy outputs.The selection of the predetermined set of observations in the original study was an arbitrary selection.A new method that exploits the relationship between possibility theory and probability theory is defined to create a more objective method of training the FNN.A consequence of this is that the resulting fuzzy number outputs from the model can then be directly used for risk analysis, specifically to quantify the risk of low DO concentration.This information is extremely valuable for managing water resources in the face of uncertainty.The impact of using fuzzy inputs and the new training criteria is evaluated by comparing results to the existing FNN method (by Alvisi and Franchini, 2011) as well as with a traditional, crisp ANN.
Following previous research for this river, two abiotic inputs (daily mean water temperature, T , and daily mean flow rate, Q) will be used to predict daily minimum DO.An advantage of using these factors is that they are routinely collected by the City of Calgary, and thus a large data set is available.Also, their use in previous studies has shown that they are good predictors of daily DO concentration in this river basin (He et al., 2011;Khan et al., 2013;Khan and Valeo, 2015).The following sections outline the background of fuzzy numbers and existing probability-possibility transformations.This is followed by the development of the new method to create fuzzy numbers from observations.Then, the new FNN method using fuzzy inputs is developed mathematically using new criteria for training, also based on possibility theory.Lastly, a method to measure the risk of low DO is described.

Data collection
The Bow River is 645 km long and averages a 0.4 % slope over its length (Bow River Basin Council, 2015) from its headwaters at Bow Lake in the Rocky Mountains to its confluence with the Oldman River in southern Alberta, Canada (Robinson et al., 2009;Environment Canada, 2015).The river is supplied by snowmelt from the Rocky Mountains, rainfall and discharge from groundwater.The City of Calgary is located within the Bow River Basin and the river has an average annual discharge of 90 m 3 s −1 , and an average width and depth of 100 and 1.5 m, respectively (Khan andValeo, 2014b, 2016).
The City of Calgary routinely samples a variety of water quality parameters along the Bow River to measure the impacts of urbanisation, particularly from three wastewater treatment plants and numerous stormwater runoff outfalls that discharge into the river.DO concentration measured upstream of the city is generally high throughout the year, with little diurnal variation (He et al., 2011;Khan et al., 2013;Khan and Valeo, 2015).The DO concentration downstream of the city is lower and experiences much higher diurnal fluc- tuation.The three wastewater treatment plants are located upstream of this monitoring site, and are thought to be responsible, along with other impacts of urbanisation, for the degradation of water quality (He et al., 2015).For this research, 9 years of DO concentration data were collected from one of the downstream stations from 2004 to 2012.The monitoring station was located at Pine Creek and sampled water quality data every 30 min (from 2004 to 2005) and every 15 min (from 2006 to 2007).The station was then moved to Stier's Ranch and sampled data every hour (in 2008) and every 15 min (2009 to 2011).The monitoring site was moved further downstream to its current location (at Highwood) in 2012 where it samples every 15 min.During this period a number of low DO events have been observed in the river and are summarised below in Table 1 corresponding to different water quality guidelines.
Note that even though daily minimum DO was observed to be below 5 mg L −1 on several occasions in 2004 and 2006 (in Table 1), the minimum DO was below 9.5 mg L −1 only 107 and 164 days, respectively, for those 2 years.In contrast, in 2007 and 2010, no observations below 5 mg L −1 were seen, yet 182 and 180 days, respectively, below the 9.5 mg L −1 guideline were seen for those years.The total number of days below 9.5 mg L −1 constitute approximately 90 % of all observations for those years.This shows that despite no DO events below 5 mg L −1 , generally speaking minimum DO on a daily basis was quite low in these 2 years.The implication of this is that only using one guideline for DO might not be a good indicator of overall aquatic ecosystem health.
A YSI sonde is used to monitor DO and T , and the sonde is not accurate in freezing water; thus, only data from the ice-Hydrol.Earth Syst.Sci., 20, 2267Sci., 20, -2293Sci., 20, , 2016 www.hydrol-earth-syst-sci.net/20/2267/2016/ free period were considered, which is approximately from April to October for most years (YSI Inc., 2015).Since low DO events usually occur in the summer (corresponding to high water temperature and lower discharge), the ice-free period data set contains the dates that are of interest for low DO modelling.
Daily mean flow rate, Q, was collected from Water Survey of Canada site "Bow River at Calgary" (ID: 05BH004) for the same period.These data are collected hourly throughout the year; thus, data where considerable shift corrections were applied (usually due to ice conditions) were removed from the analysis.The mean annual water temperature ranged between 9.23 and 13.2 • C, the annual mean flow rate was between 75 and 146 m 3 s −1 , and the mean annual minimum daily DO was between 6.89 and 9.54 mg L −1 , for the selected period.

Probability-possibility transformations
Fuzzy sets were proposed by Zadeh (1965) in order to express imprecision in complex systems, and can be described as a generalisation of classical set theory (Khan and Valeo, 2015).In classical set theory, an element x either belongs or does not belong to a set A. In contrast, using fuzzy set theory, the elements x have a degree of membership, µ, between 0 and 1 in the fuzzy set A. If µ equals 0, then x does not belong in A, and µ = 1 means that it completely belongs in A, while a value µ = 0.5 means that it is only a partial member of A.
Fuzzy numbers express uncertain or imprecise quantities, and represent the set of all possible values that define a quantity rather than a single value.A fuzzy number is defined as a specific type of fuzzy set: a normal and convex fuzzy set.Normal implies that there is at least one element in the fuzzy set with a membership level equal to 1, while convex means that the membership function increases monotonically from the lower support (i.e.µ = 0 L ) to the modal element (i.e. the element(s) with µ = 1) and then monotonically decreases to the upper support (i.e.µ = 0 R ) (Kaufmann and Gupta, 1985).
Traditional representation of a fuzzy numbers has been using symmetrical, linear membership functions, typically denoted as triangular fuzzy numbers.The reason for selecting this type of membership function has to do with its simplicity: given that a fuzzy number must, by definition, be convex and normal, a minimum of three elements are needed to define a fuzzy number (two elements at µ = 0 and one element at µ = 1).For example, if the most credible value for DO concentration is 10 mg L −1 (µ = 1), with a support about the modal value between (µ = 0 L ) and 12 mg L −1 (µ = 0 R ).This implies that the simplest membership function is triangular, though not necessarily symmetrical.Also, as we demonstrate below, in some probability-possibility frameworks, a triangular membership function corresponds to a uniform probability distribution -the least specific distribution in that any value is equally probable, and hence represents the most uncertainty (Dubois and Prade, 2015;Dubois et al., 2004).
However, recent research (Khan et al., 2013;Khan and Valeo, 2014a, b, 2015, 2016) has shown that such a simplistic representation may not be appropriate for hydrological data, which are often skewed and non-linear.This issue is further highlighted if the probability-possibility framework mentioned above is used: it implies that for a triangular membership function, the fuzzy number bounded by the support [8 12] mg L −1 has a uniform probability distribution bounded between 8 and 12 mg L −1 with a mean value of 10 mg L −1 , suggesting that values between the support are equally likely to occur.It not difficult to see that this an oversimplification of hydrological data.In many cases enough information (i.e. from observations) is available to define the membership function with more specificity, and this information should be used to define the membership function.
Multiple frameworks exist to transform a probability distribution to a possibility distribution and vice versa; a comparison of different conceptual approaches is provided in Klir and Parvais (1992), Oussalah (2000), Jaquin (2010), Mauris (2013) and Dubois and Prade (2015).However, a major issue of implementing fuzzy number based methods in hydrology is that there is no consistent, transparent and objective method to convert observations (e.g.time series data) into fuzzy numbers, or generally speaking to construct the membership function associated with fuzzy values (Abrahart et al., 2012;Dubois and Prade, 1993;Civanlar and Trussel, 1986).
A popular method (Dubois et al., 1993(Dubois et al., , 2004) converts a probability distribution to a possibility distribution by relating the area under a probability density function to the membership level (Zhang, 2009).In this framework, the possibility is viewed as the upper envelope of the family of probability measures (Jacquin, 2010;Ferrero et al., 2013;Betrie et al., 2014).There are two important considerations for this transformation, first it guarantees that something must be possible before it is probable; hence, the degree of possibility cannot be less than the degree or probability -this is known as the consistency principle (Zadeh, 1965).Second is order preservation, which means if the possibility of x i is greater than the possibility of x j then the probability of x i must be greater than the probability of x j (Dubois et al., 2004).For a discrete system, this can be represented as follows: then the possibility distribution of x(π(x)) follows the same order, that is, The transformation is given by For p(x 1 ) > p(x 2 ) > . . .> p(x n ): where the x i are elements of a fuzzy number A, π(x i ) is the possibility of element x i , and p(x i ) is the probability of element x i .The concept of this transformation may be more illustrative when viewed in the continuous case: for any interval [a, b], the membership level µ (where π(a) = π(b) = µ) is equal to the sum of the areas under the probability density function curve between (−∞, a) and (b, ∞) (Zhang et al., 2009).It is important to highlight that this particular transformation has an inverse transformation associated with, where a probability distribution can be estimated from the possibility distribution.However, a major drawback of this transformation is that it theoretically requires a full description of the probability density function, or in the finite case, the probability associated with each element of the fuzzy number, the probability mass function.For many hydrological applications this might not be possible because the hourly time series data may not adequately fit the mould of a known class of probability density functions, or one distribution amongst many alternatives may have to be selected based on best-fit.This best-fit function may not be universal, e.g.data from one 24 h period may be best described by one class or family of probability density function, while the next day by a completely different class of density function.This means working with multiple classes of distribution functions for one application, which can be cumbersome.Also, given that each day may only have 24 data points (or fewer on days with missed samples), it is difficult to select one particular function.
In previous research by Khan and Valeo (2015), a new approach to create a fuzzy number based on observations was developed.This process used a histogram based approach to estimate the probability mass function of the observations, and then Eq. (1) was used to estimate the membership function of the fuzzy number.To create the histogram, the bin size was selected based on the extrema observations for a given day and the number of the observations.A linear interpolation scheme was then used to calculate the fuzzy number at five predefined membership levels.This method has a few shortcomings, namely that the bin-size selection was arbitrarily selected based on the magnitude and number of observations, which does not necessarily result in the optimum bin size.This lack of optimality means that the resulting histogram may either be too smooth so as not to capture the variability between membership levels, or too rough and uneven so that the underlying shape of the membership function is difficult to discern.This is a common issue with histogram selection in many applications (Shimazaki and Shinomoto, 2007).Secondly, the aforementioned transformation used by Khan and Valeo (2015) only allows one element to have µ = 1 when p(x) is maximum.However, there are a number of cases (e.g.bimodal distributions, or arrays when all elements are equal) where multiple elements have jointequal maximum p(x), and hence multiple elements with µ = 1.This means that all elements within the α-cut interval [a b] µ=1 (where a and b are the minimum and maximum elements with µ = 1) must by definition also have a membership level equal to 1. Thus, a method is necessary to be flexible enough to accommodate these types of issues.
In this research, a two-step procedure is proposed to create fuzzy numbers on the inputs (i.e.Q and T ) using hourly (or sub-hourly) observations.First, a bin-size optimisation method is used (an extension of an algorithm proposed by Shimazaki and Shinomoto, 2007) to create histograms to represent the estimate of discretised probability density functions of the observations.This estimate of the probability distribution is then transformed to the membership function of the fuzzy number using a new numerical procedure and the transformation principles described in Eq. ( 1).This updated method requires no assumptions regarding the distribution of the underlying data or selection of an arbitrary bin size, has the flexibility to create different shapes of fuzzy numbers depending on the distribution of the underlying data, and allows multiple elements to have equal µ = 1.The proposed algorithm is described in the proceeding section.
A new algorithm to create fuzzy numbers Shimazaki and Shinomoto (2007) proposed a method to find the optimum bin size of a histogram when the underlying distribution of the data is unknown.The basic premise of the method is that the optimum bin size (D opt ) is one that minimises the error between the theoretical (but unknown) probability density and the histogram generated using the D opt .The error metric used by Shimazaki and Shinomoto (2007) is the mean integrated squared error (E MISE ) which is frequently used for density estimation problems.It is defined as where f (t) is the unknown density function, f n (t) is the histogram estimate of the density function, t denotes time and P is the observation period, and E[ q ] is the expectation.In practice, E MISE cannot be directly calculated since the underlying distribution is unknown, and thus an estimate of the E MISE is used in its place (see C D below).Thus, f n (t) can be found without any assumptions of the type of distribution (e.g.class, unimodality); the only assumption is that the number of events (i.e. the counts k i ) in the ith bin of the histogram follow a Poisson point process.This means that the events in two disjoint bins (e.g. the ith and i + 1th bins) are independent, and that mean (k) and variance (v) of the k i in each bin are equal, due to the assumption of a Poisson process (Shimazaki and Shinomoto, 2007).Using this property, the optimum bin size can be found as follows.Let X be the input data vector for the observation period (P ), e.g. a [24 × 1] vector corresponding to hourly samples for a given day.The elements in X are binned into N bins of equal bin size D. The number of events k i in each ith bin are then counted and the mean (k) and variance (v) of the k i are calculated as follows: The k and v are then used to compute the cost function C D , which is defined as This cost function is a variant of the original E MISE listed in Eq. ( 2) and is derived by removing the terms from E MISE that are independent of the bin size D, and by replacing the unobservable quantities (i.e.E[f (t)]) with their unbiased estimators (details of this derivation can be found in the original paper by Shimazaki and Shinomoto, 2007).The objective then is to search for D opt : the value of D that minimises C D .
To do this two systematic modification are made: first, C D is recalculated at different partitioning positions, and secondly, the entire process is repeated for different values of N and D, until a "reliable" estimate of minimum C D and thus D opt is found.Using different partitioning positions means that the variability in k i resulting from the position of the bin (rather than the size of the bin) can be quantified.Repeating the analysis at different N and D accounts for the variability due to different bin sizes.Both these techniques are ways of accounting for the uncertainty associated with estimating the histogram.
Partitioning positions are defined as the first and last points that define a bin.The most common way of defining a partitioning position is to centre it on some value a; for example, the bin defined at [a − D/2, a + D/2] is centred on a and has a bin size D. Variations of this partitioning position can be found by using a moving-window technique, where the bin size D is kept constant, but the first and last points are perturbed by a small value δ: where δ ranges incrementally between 0 and D. Using these different values of δ whilst keeping D constant will result in different values of k i and hence unique values of C D .Thus, for a single value of D, multiple values of C D are possible.
For this research this bin-size optimisation algorithm is implemented to determine the optimum histogram for the two input variables, Q and T .The array of daily data X (at hourly or higher frequency; see Sect.2.1 for details regarding the sampling frequency of both inputs) for each variable was collected for the 9-year period.The bin size was calculated for each day as follows: where x max and x min are the maximum and minimum sampled values for X, respectively, and N is the number of bins.
As described above, a number of different D were considered to find the optimum C D .This was done by selecting a number of different values of N, ranging from N min to N max .The minimum value N min was set equal to 3 for all days; this is the necessary number of bins to define a fuzzy number (two elements for µ = 0, and one element for µ = 1).The highest value, N max , was calculated as where 2r is the measurement resolution of the device used to measure either Q or T , set at twice the accuracy (r) of the device.The rationale for this decision is that as N increases, D necessarily decreases (as per Eq. 6).However, D cannot be less than the measurement resolution; this constraint (i.e.N ≤ N max ) ensures that the optimum bin size is never less than what the measurement devices can physically measure.For this research, the accuracy for T is listed as ±0.1 • C, and thus, the resolution (2r 2015).For Q all measurements below 99 m 3 s −1 have an accuracy of ±0.1 m 3 s −1 and thus, a resolution of 0.2 m 3 s −1 , while measurements above 99 m 3 s −1 have an accuracy of ±1 m 3 s −1 , and thus a resolution of 2 m 3 s −1 .This is based on the fact that all data provided by the Water Survey of Canada is accurate to three significant figures (Lane, 1999).
Note that for the case where x min equals x max (i.e.no variance in the daily observed data) then D = 2r, which means that the only uncertainty considered is due to the measurement.
Once the N max is determined, the bin size D was calculated for each N between N min and N max .Then, starting at the largest D (i.e.D = (x max − x min )/N min ), the cost function C D is calculated at the first partitioning positing, where the first bin is centred at and the Nth bin is centred on The value of δ ranged between 0 and D at (D/100) intervals.Thus, for this value of D, 100 values of C D were calculated since 100 different partitioning positions were used.The mean value of these C D was used to define the final cost-function value for the given D.
This process is then repeated for the next N between N min and N max , using the corresponding D at 100 different partitioning positions, and so on until the smallest D (at N max ).This results in [N max − N min ] values of mean C D : the value of D corresponding to the minimum value of C D is considered to be the optimum bin size D opt .This D opt is then used to construct the optimum histogram of each daily observation.This histogram can be used to calculate a discretised probability density function (p(x)), where for each x (an element of X), the p(x) is calculated by dividing the number of events in each bin by the total number of elements in X. x and p(x) can then be used to calculate the possibility distribution using the transformation described in Eq. ( 1).
First, the p(x) are ranked from highest to lowest, and the x corresponding to the highest p(x) has a membership level of 1. Then the π(x) values for the remaining x are calculated using Eq. ( 1).For cases where multiple elements have equal p(x), the highest π(x) is assigned to each x.For example, if p(x i ) = p(x j ), and x i > x j , then π(x j ) = π(x i ).This means that in some cases, for each calculated membership level, π(x), there exists an α-cut interval [a, b] µ=π(x) where all the elements between a and b have equal p(x) and hence equal π(x).By definition of α-cut intervals, all values of x within the interval [a, b] have at least a possibility of π(x).A special case of this occurs when multiple x have jointequal maximum p(x), meaning that multiple elements have a membership level of µ = 1.Thus, an α-cut interval is created for the µ = 1 case, creating a trapezoidal membership function, where the modal value of the fuzzy number is defined by an interval rather than a single element.
Once all the π(x) are calculated for each element x in X, a discretised empirical membership function of the fuzzy number X can be constructed using the calculated α-cut intervals.That is, the fuzzy number is defined by a number of intervals at different membership levels.The upper and lower limits of the intervals at higher membership levels define the extent of the limits of the intervals at lower membership levels.This way the constructed fuzzy numbers maintain convexity (similar to a procedure used by Alvisi and Franchini, 2011), where the widest intervals have the lowest membership level.For example, the interval at µ = 0.2 will contain the interval µ = 0.4, and this interval will contain the interval at µ = 0.8.
In creating this discretised empirical membership function this way (rather than assuming a shape of the function) means that this function best reflects the possibility distribution of the observed data.However, it also means that all fuzzy numbers created using this method are not guaranteed to be defined at the same π(x), nor have an equal number of π(x) intervals used to define the fuzzy number.Thus, direct fuzzy arithmetic between multiple fuzzy numbers using the extension principle is not possible since it requires each fuzzy number to be defined at the same α-cut intervals (Kaufmann and Gupta, 1985).Thus, linear interpolation is used to define each fuzzy number at a pre-set α-cut interval using the empirical π(x) calculated using the transformation.To select the pre-set α-cut intervals it is illustrative to see the impact of selecting two extreme cases: (i) if only two levels are selected (specifically µ = 0 and 1) the constructed fuzzy number will reduce to a triangular fuzzy number.As discussed above there are important implications of using triangular membership functions that make it undesirable for hydrological data; (ii) if a large number of intervals (e.g. 100 intervals between µ = 0 and 1) are selected, there is a risk that the number of pre-set intervals is much larger than the empirical π(x), which means not enough data (empirical α-cut levels intervals) to conduct interpolation, leading to equal interpolated values at multiple α-cut levels.For this research, results (discussed in the following section) of the bin-size optimisation showed that most daily observations for T and Q resulted in 2 to 10 unique p(x) values.Based on this, six preset α-cut intervals were selected: 0, 0.2, 0.4, 0.6, 0.8 and 1.The empirical π(x) can then be converted to a standardised function at pre-defined membership levels using linear interpolation.

Background on artificial neural networks
Artificial neural networks (ANNs) are a type of data-driven model that are defined as a massively parallel distributed information processing system (Elshorbagy et al., 2010;Wen et al., 2013).ANN models have been widely used in hydrology when the complexity of the physical systems is high owing partially to an incomplete understanding of the underlying process and the lack of availability of necessary data (He et al., 2011;Kasiviswanathan et al., 2013).Further, ANNs arguably require fewer data and do not require an explicit mathematical description of the underlying physical process (Antanasijević et al., 2014), making them a simpler and practical alternative to traditional modelling techniques.
A multilayer perceptron (MLP) is a type of feedforward ANN and is one of the most commonly used in hydrology (Maier et al., 2010).A trained MLP network can be used as a universal approximator with only one hidden layer (Hornik et al., 1989).This means that models are relatively simple to develop and theoretically have the capacity to approximate any linear or non-linear mapping (ASCE Task Committee on Application of Artificial Neural Networks in Hydrology, 2000; Elshorbagy et al., 2010;Napolitano et al., 2011;Kasiviswanathan et al., 2013).Further, the popularity of MLPs has meant that subsequent research has continued to use MLPs (He and Valeo, 2009;Napolitano et al., 2011) and thus form a reference for the basis of comparing ANN performance (Alvisi and Franchini, 2011).
In the simplest case, an MLP consists of an input layer, a hidden layer, and an output layer as shown in Fig. 2. Each layer consists of a number of neurons (or nodes) that each receive a signal, and on the basis of the strength of the signal, emit an output.Thus, the final output layer is the synthesis and transformation of all the input signals from both the input and the hidden layer (He and Valeo, 2009).
The number of neurons in the input (n I ) and output (n O ) layers corresponds to the number of variables used as the input and the output, respectively, and the number of neurons in the hidden layer (n H ) are selected based on the relative complexity of the system (Elshorbagy et al., 2010).A typical MLP is expressed mathematically as follows: where x i is the ith observation (an n I × 1 vector) from a total of n observations, W IH is an n H × n I matrix of weights between the input and hidden layers, B H is a vector (n H × 1) of biases in the hidden layer, and y i is the ith output (an n H × 1 vector) of the input signal through the hidden-layer transfer function, f HID .Similarly, W HO is an n O × n H matrix of weights between the hidden and output layers, B O is an n O × 1 vector of biases in the output layer, and f OUT the final transfer function to generate the ith modelled output z i (an n O × 1 vector).
The values of all the weights and biases in the MLP are calculated by training the network by minimising the errortypically mean squared error (E MSE ) (He and Valeo, 2009) -between the modelled output and the target data (i.e.observations).The Levenberg-Marquardt algorithm (LMA) is one of the most common training algorithms (Alvisi et al., 2006).In LMA, the error between the output and target is back-propagated through the model using a gradient method where the weights and biases are adjusted in the direction of maximum error reduction.The LMA is well suited for problems that have a relatively small number of neurons.To counteract potential over-fitting issues, an early-stopping procedure is used (Alvisi et al., 2006;Maier et al., 2010), which is a form of regularisation where the data are split into three subsets (for training, validation and testing).The training is terminated when the error on the validation subset increases from the previous iteration.
Most ANNs have a deterministic structure without a quantification of the uncertainty corresponding to the predictions (Alvisi and Franchini, 2012;Kasiviswanathan and Sudheer, 2013).This means that users of these models may have excessive confidence in the forecasted values and misinterpret the applicability of the results (Alvisi and Franchini, 2011).This lack of uncertainty quantification is one reason for the limited appeal of ANN among water resource managers (Abrahart et al., 2012;Maier et al., 2010).Without this characterisation, the results produced by these models have limited value (Kasiviswanathan and Sudheer, 2013).
In this research, two methods are proposed to quantify the uncertainty in MLP modelling to predict DO in the Bow River.First, the uncertainty in the input data (daily mean water temperature and daily mean flow rate) is represented through the use of fuzzy numbers.These fuzzy numbers are created using the probability-possibility transformation discussed in the previous section.Second, the total uncertainty (as defined by Alvisi and Franchini, 2011) in the weights and biases of an MLP are quantified using a new possibility theory based FNN.The total uncertainty represents the overall uncertainty in the modelling process, and not of the individual components (e.g.randomness in observed data).The following section describes the proposed FNN method.

FNN with fuzzy inputs and possibility based intervals
Alvisi and Franchini (2011) proposed a method to create a FNN, where the weights and biases, and by extension the output, of the neural network are fuzzy numbers rather than crisp (non-fuzzy) numbers.These fuzzy numbers quantify the total uncertainty of the calibrated parameters.Most fuzzy set theory based applications of ANN in hydrology have used fuzzy logic, e.g. the widely used Adaptive Neuro-Fuzzy Inference System, where automated IF-THEN rules are used to create crisp outputs (Abrahart et al., 2010;Alvisi and Franchini, 2011).Thus, the advantage of fuzzy outputs (as developed by Alvisi and Franchini, 2011) is that it provides the uncertainty of the predictions in addition to the uncertainty of the parameters.This uncertainty quantification can be used to by end users to assess the value of the model output.
In their FNN, the MLP model presented in Eqs. ( 5) and ( 6) is modified to predict an interval rather than a single value for the weights, biases and output, corresponding to an α-cut interval (at a defined membership level µ).This is repeated for several α-cut levels, thus building a discretised fuzzy number at a number of membership levels.This is done by using a stepwise, constrained optimisation approach: where all the variables are as described as before, and the superscripts U and L represent the upper and lower limits of the α-cut interval, respectively.The constraints are defined so that the upper and lower limits of each weight and bias (in both layers) minimise the width of the predicted interval: where t is that target (observed data) and P CI is a predefined percentage of data.Alvisi and Franchini (2011) defined P CI to be 100 % at µ = 0, 99 % at µ = 0.25, 95 % at µ = 0.5 and 90 % at µ = 0.75.This algorithm was built starting at µ = 0 and moving to higher membership levels to maintain convex membership functions of the generated fuzzy numbers by using the results of the previous optimisation as the upper and lower limit constraints for the proceeding optimisation.Lastly, at µ = 1, the interval collapses to a singleton, represent the crisp results from non-fuzzy ANN.Therefore, these α-cut intervals of the FNN output quantify the uncertainty around the crisp prediction, within which is expected to contain P CI percentage of data.
In this research, this method is modified in two ways.First, the inputs x are also fuzzy numbers, which means that Eqs. ( 10) and ( 11) are revised as follows: (14) Note that now the input vector is represented by its upper and lower limits.The major impact on this is that the training algorithm for the FNN needs to accommodate this fuzzy α-cut interval, which requires the implementation of fuzzy arithmetic principles (Kaufmann and Gupta, 1985).The cost function for the optimisation remains unchanged.
The second modification of the original algorithm is related to the selection of the percent of data included in the predicted interval (P CI ).In the original, the selection is arbitrary and end-users of this method may be interested in the events that are not included in the selected P CI .Thus, a full spectrum of possible values for a given prediction is required.Thus, the Alvisi and Franchini (2011) approach is further refined by utilising the same relationship between probability and possibility that was used to define the input fuzzy numbers, giving a more objective means of designing FNNs with fuzzy weights, biases and output.
In the adopted possibility-probability framework, the interval [a b] α created by the α-cut at µ = α implies that This can be used to calculate the probability This means that there is a probability of (1 − α) that the random variable x will fall within the interval [a b] α .In other words, the α-cuts of a possibility distribution (at any µ) correspond to the (1 − α) confidence interval of the probability distribution of the same variable (Serrurier and Prade, 2013).This principle is used to select the different P CI for the optimisation constraints rather than the predetermined P CI selected by Alvisi and Franchini (2011).These are shown in Table 2.
Note that for practical purposes, P CI was selected as 99.50 % at µ = 0 to prevent over-fitting.The implication of this selection is that at µ = 0, nearly all the observed data should fall within this predicted FNN interval, reflecting the highest uncertainty in the prediction.The uncertainty decreases as µ increases.For the µ = 1 case the values of the weights and biases were determined to be the mid-point of the interval at µ = 0.8 to maintain convexity of the produced fuzzy numbers, and the difficulty in finding an interval containing 0 % of the data.

Network architecture and implementation
For this research a three layer, feedforward MLP architecture was selected to model minimum daily DO (the output) using fuzzified daily flowrate (Q) and fuzzified daily water temperature (T ) as the inputs.The three layers consist of an input layer, an output layer, and a hidden layer (with 5 neurons based on a trial-and-error search procedure).This architecture was selected for three reasons: it is one of the most commonly used in hydrology (Maier et al., 2010), it can be used as a universal approximator (Hornik et al., 1989), and as reference for comparing performance with previous research (He and Valeo, 2009;Napolitano et al., 2011).In particular, a previous study modelling minimum DO in the Bow River used a three-layer MLP feedforward network (see He et al., 2011).Two transfer functions are required for FNN implementation: the hyperbolic tangent sigmoid function was selected for f HID , and a pure linear function for f OUT .Both function selections follow Alvisi and Franchini (2011), Wen et al. (2013) and Elshorbagy et al. (2010), and are described as follows: The LMA method was used to train the network, minimising E MSE .The input and output data were pre-processed before training, validating and testing: the data were normalised so that all input and output data fell within the interval [−1 1].Furthermore, the data were randomly divided into training, validation and testing subsets, following a 50-25-25 % split.This FNN optimisation algorithm was implemented in MATLAB (version 2015a).First, the built-in MATLAB Neural Network Toolbox was used to estimate the value of weights and biases using the midpoint of the interval at µ = 1.The results from this were used as the constraints to solve the FNN optimisation (Eqs.12 to 14) at subsequent lower membership levels.The Shuffled Complex Evolution algorithm (commonly known as SCE-UA, Duan et al., 1992) was used to find the optimisation solution.The optimisation is run such that the intervals at higher membership levels govern the upper and lower bounds of the predicted interval in order to preserve the convexity of fuzzy numbers.The same process and network architecture was used to run the original FNN method (proposed by Alvisi and Franchini, 2011) using crisp inputs for comparison purposes.For this case, further refinement of the optimised solution was conducted using the built-in MATLAB minimisation function, f mincon .Note that for the crisp inputs, values of fuzzified daily flowrate (Q) and fuzzified daily water temperature (T ) at µ = 1 were used to enable direct comparison.This option allows for the closest comparison between the two approaches that have completely distinct applications.Other options for the crisp inputs (e.g.mean daily value, or maximum daily value) may also be selected for the existing FNN case.

Risk analysis using defuzzification
Risk analyses for complex systems is challenging for a number of reasons, including an insufficient understanding of the failure mechanisms (Deng et al., 2011).The use of imprecise information (e.g.fuzzy numbers) is an effective method of conducting a risk analysis (Deng et al., 2011).However, communicating uncertainty is an important, yet difficult task, and many different frameworks exist to do so; water quality indices (Sadiq et al., 2007;Van Steenbergen et al., 2012) are one example.Since water resource managers often prefer to use probabilistic measures (rather than possibilistic ones), it is important to convert the possibility of low DO to a comparable probability for effective communication of risk analysis.Note that the linguistic parameters (e.g."most likely") that are often used to convey risk or uncertainty (Van Steenbergen et al., 2012) have a probability based meaning -in this case "most likely" is a measure of likelihood.
In this research, a defuzzification procedure is used to convert the possibility of low DO to a probability measure, to represent the risk of observing low DO (below a given threshold) in the Bow River.This method uses the inverse of the transformation described in Eq. (1); however, instead of calculating the probability of one element, p(x), which is of limited value in most applications, it is generalised to calculated P ({X < x}), as follows (from Khan andValeo, 2014a, 2016): for any x in the support (defined as the α-cut interval at µ = 0) of a fuzzy number [a b], we have the corresponding µ and the paired value x which shares the same membership level.The value µ is the sum of the cumulative probability between [a, x] and [x , b], labelled P L and P R , respectively: where P L represents the cumulative probability between a and x which is assumed to equal the probability P ({X < x}), since the fuzzy number defines any values to less than a to be impossible (i.e.µ = 0).Given the fact that the fuzzy number is not symmetrical, the lengths of the two intervals [a, x] and [x , b] can be used to establish a relationship between P L and P R .Then, P L can be estimated as Thus, Eq. ( 20) gives the probability that the predicted minimum DO for a given day is below the threshold value x.
For example, if the lowest acceptable DO concentration for the protection of aquatic life for cold water ecosystems (6.5 mg L −1 , Canadian Council of Ministers of the Environment, 1999) is selected, then this transformation can be used to calculate the probability that the predicted fuzzy DO will be below 6.5 mg L −1 .
3 Results and discussion

Probability-possibility transformation using bin-size optimisation
The bin-size optimisation and the probability-possibility transformation algorithms were applied to the collected Q and T data for the 9-year period.The constructed fuzzy numbers were then used to calibrate the FNN model.This section compares the results of constructing a discretised probability distribution with and without the bin-size optimisation algorithm and its impact on the resulting membership function of the fuzzy number.The comparison is illustrated through five examples each for Q and T as a means of illustrating the advantages of using the proposed approach.
Figure 3 shows sample results of converting hourly Q observations to fuzzy numbers for five cases.The left-most column in the figure shows the raw data, i.e. the observations sampled over the course of 24 h.The resulting histogram based probability functions are shown for both the optimised (D opt , illustrated with circles) and original (D orig = 2r; see Sect."A new algorithm to create fuzzy numbers" for the definition, illustrated with squares) bin sizes in the second column.The third and fourth columns in Fig. 3 show the resulting discretised empirical membership function using each of the histograms.The five examples selected here represent a full spectrum of results for the bin-size optimisation.The first row shows an example of when the optimum result was equal to the measurement resolution (D orig = D opt = 2), followed by cases where the D opt was 4, 4.5, 10 and 20 times greater than the initial bin size.
The example in the first row illustrates cases where the bin-size optimisation algorithm calculates an optimum bin size, corresponding to the minimum cost function C D , which is equal to the instrument measurement resolution.Thus, the resulting probability distributions for both cases are equivalent, as are the membership functions.In most cases, this occurred when the calculated minimum C D would result in a D opt smaller than D orig = 2r, and since this is not physically feasible (measurable), the algorithm did not consider any bin sizes below 2r.Of note in this example is that the transformation of the probability distribution results in five empirical membership levels.Only one element was found to have a membership level equal to 1 (at Q = 161 m 3 s −1 ).Thus, the α-level cut at this level is a simple singleton: [161] µ=1 .The next membership level was calculated as 0.58; again the resulting α-cut level only has one element at Q = 149 (which is less than the modal value).However, at this level the upper and lower limits of α-cuts at higher membership levels define the upper and lower limits of α-cuts at lower levels.Thus, using the information from the α-level cut at µ = 1, the level at µ = 0.58 was defined as [149 161] µ=0.58 .The next membership level calculated was 0.46, and four elements had equal membership levels, ranging between 147 and 165.The α-cut interval at this level was defined as [147 165] µ=0.46 .Note that in this case, this interval captures both the intervals at higher membership levels within its limits, i.e. the lower limit is less than the lower limits of higher intervals, and the upper limit is greater than then upper limits at higher levels.The next membership level was calculated to be 0.125, and three elements between 157 and 171 were assigned this value.However, the lower limit at µ = 0.46 (the next higher membership level) was 147, which is less than 157, and thus, for the α-cut level at this membership, the interval is then revised to [147 171] µ=0.125 rather than [157 171] µ=0.125 to maintain convexity.Again, the reason here is that if something is possible at µ = 0.46, it must be possible (by definition) at µ = 0.125.The last membership level found for this particular example was µ = 0, with six elements sharing this value, ranging from 145 to 173, resulting in an α-cut level of [145 173] µ=0 .Together, these five membership levels define a discretised membership function of the fuzzy number for Q on 27 July 2008.Following this, linear interpolation was conducted to find the elements corresponding to the six predefined membership levels of µ = 0, 0.2, 0.4, 0.6, 0.8 and 1.The results are not explicitly shown in the figure for clarity, but can are essentially located on the dashed line in the last column on the corresponding membership levels.
The second row in Fig. 3 shows the results for 20 August 2009, where the optimum bin size was found to be 4 times higher than the original bin size (D opt = 0.8 vs. D orig = 0.2).The impact of this change is clearly evident in the distribution functions in the second column.The original histogram is multi-modal, and with multiple candidates as the modal value (where µ = 1), whereas the post-optimisation histogram is considerably smoother, with a definitive modal value at Q = 91.4m 3 s −1 .The impact of this increase in bin size is that the resulting membership function is defined at four membership levels (0, 0.25, 0.54 and 1), whereas the original function was defined at six levels, including an interval (rather than singleton) at µ = 1.This decrease in membership levels in this case has a consequence of smoothing out the membership function, as can be seen by comparing the shapes of the functions in columns 3 and 4. The overall impact of this smoothing out of both the distribution and the membership functions is that the heightened specificity of the original function at µ = 0.54 and above is reduced to a more generalised shape.
Since the objective of the bin-size algorithm was to reduce the error between the histogram created using D opt and the unknown theoretical distribution, the density function plotted in Fig. 3 represents the closest distribution to the unknown distribution.Hence, the membership function generated using this optimum distribution better reflects the underlying phenomenon than the membership function generated using D orig .Thus, in comparing columns 3 and 4 for the second row, the smoother membership function representing D opt is preferred.Linear interpolation is then performed on this membership function to get values of Q at the six predefined membership levels.
Similar results can be seen in the third row in Fig. 3, where the optimised bin size is 4.5 times greater than the original bin size (D opt = 9 vs. D orig = 2).Again, the original histogram is extremely uneven, whereas the post-optimisation histogram is considerably smoother, with a definitive modal value at Q = 277 m 3 s −1 .The overall impact of this smoothing is that the specificity of the function at µ = 0.6 and higher of the original function is reduced to a more general shape in the optimised function.
The fourth row shows a different phenomenon, where instead of smoothing out the original membership function, the combined bin-size optimisation and transformation algorithm creates a membership function with more specificity.In this case D opt is 10 times higher than D orig , and the consequence of this increase is the smoother probability density function with one clear modal value (at Q = 70 m 3 s −1 ).In contrast, the original histogram had six elements with jointequal p(x), resulting in a membership function that is shaped similarly to a uniform distribution (column 3) and defined with only three membership levels.This means that all values are considered equally possible and represents maximum vagueness.However, using the optimised value, this is no longer the case and the modal value is assigned a membership level of 1, and the remaining elements are defined at three other membership levels.This suggests that this modal value is more possible (since it has a higher possibility), and this is reflected in the observations.This example illustrates that the method can not only generalise the data to smoother functions (as shown in the first three examples), but can also be more specific when the underlying data demonstrate this; however, this is not captured by the non-optimised bin-width distribution function.
The last example for Q in Fig. 3 is an example of a case where the number of membership levels for both the original and optimised membership functions are equal (four in this case); however, the bin size is 20 times greater for the optimised case.In this case, an optimum bin size was found that did not change the specificity of the membership function, i.e. it is still defined with the same number of intervals but at different membership levels.In this case, the probability for D orig is extremely uneven but smoothed out to a unimodal function with the D opt .The final membership function for D opt is defined more generally (smoothly), especially at higher membership levels compared to the one defined by D orig .This example again demonstrates the utility of the new coupled optimisation-transformation method to create fuzzy numbers for data where the underlying distribution is unknown.Figure 4 shows similar results for the five water temperature examples, where the D opt was equal to the D orig (the first example in the top row) or increased by a factor of 1.5, 2.5, 3 or 5.The first example shows a case with very little T variation over a given day and the water temperature falls between 5.2 and 6.2 • C for the entire day.This lack of variability is responsible for the minimal bin-size selection D opt : a unimodal distribution is best constructed using smaller bin sizes for these cases.The second example shows another case where D opt is only slightly greater than the original, resulting in a somewhat smoother probability function and a slightly smoother membership function.
A major difference between the T and Q data is that the former is strongly diurnal, increasing after sunrise in the morning, peaking in late afternoon, and then decreasing through the night.This temporal trend is seen for all examples in Fig. 4, but most significantly in the bottom three examples.A major implication for this in developing a probability density function for these data is that the resulting shape will have a tendency to be bimodal.This means that the resulting membership functions might be trapezoidal or Table 3.The E MSE and E NSE for each subset of the fuzzy neural network using the method proposed (using fuzzy inputs) and using the original method (using crisp inputs).Thus, without using the bin-size optimisation algorithm, there is a risk that the resulting membership functions will be too vague and will not represent the information that can be gained from the observations.It is worth nothing that for these three examples, if linear interpolation is used on the original membership function, the resulting interpolated fuzzy numbers will all have equal intervals (due to the trapezoidal shape), transferring no useful information to the final fuzzy number.
Overall, the above examples illustrate the advantages of using the coupled methods of bin-size optimisation and probability-possibility transformation to create fuzzy numbers for the FNN application.The applicability of this method is not necessarily restricted to this application and can be applied whenever there is a need to construct fuzzy numbers from observed data.The utility of the first component, bin-size optimisation to estimate the density function is that in cases where either not enough information is available to define a probability distribution, or if the data do not follow the mould of a known density function, or if assumptions about the class of distribution cannot be made, the optimum bin size can be calculated to define an empirical distribution for the probability-possibility transformation.The advantage of the second component, the algorithm to construct the possibility distribution (i.e. the membership function of the fuzzy number) is that it provides a consistent, transparent and objective method to convert observations (e.g.time series data) into fuzzy numbers -which has been cited as a major hurdle in implementing fuzzy number based applications in the literature (Abrahart et al., 2010;Dubois and Prade, 1993;Civanlar and Trussel, 1986).A noteworthy component of this algorithm is that the fuzzy numbers do not reduce to the simple, triangular shaped functions that are widely used, but rather the functions better represent the information from the observations.

Training the fuzzy neural network
Once the observations of the abiotic input parameters (Q and T ) were converted to fuzzy numbers, the FNN training algorithm was run using five neurons in the hidden layer, to predict daily minimum DO in the Bow River.First, the values of the fuzzy numbers at µ = 1 were used to train the crisp network.This was done in order to have initial estimates of the 10 W IH (5 for each input), 5 B IH , 5 W HO , and 1 B O .These initial estimates were used to provide the upper and lower limits of the constraints for the proceeding optimisation algorithm.Once these estimates were calculated, the optimisation algorithm was used to calculate the fuzzy weights and biases using fuzzy inputs, and was started from µ = 0, moving sequentially to higher membership levels until µ = 0.8.The final level (at µ = 1) was calculated using the midpoint of the intervals estimated at µ = 0.8.The total optimisation time for the proposed method was 13 h, whereas the existing method with crisp inputs was 8 h, using a 2.40 GHz Intel ® Xeon microprocessor (with 4 GB RAM).
The E MSE and the Nash-Sutcliffe model efficiency coefficient (E NSE ; Nash and Sutcliffe, 1970) for the training, validation and testing scenarios for µ = 1 for both methods are shown in Table 3.The E MSE for each data set are low, between 11 and 16 % of the mean annual minimum DO seen in the Bow River for the study period.The E NSE values are approximately equal to 0.5 for each subset, which is higher than E NSE values in the literature for water quality parameters when modelled daily (see Moriasi et al. (2007) for a survey of results) and is considered to be "satisfactory" by their standards.In comparing the two methods, it is obvious that including additional information (in the form of fuzzy inputs) does not decrease performance, as the metrics are nearly identical for both methods.This shows that the proposed method has successfully incorporated input data uncertainty in the model architecture.These model performance metrics highlight that in general, predicting minimum DO using abiotic inputs and a data-driven approach is an effective technique.
The results of the optimisation component of the algorithm are summarised in Table 4, which shows the percentage of data (P CI ) captured within the resulting α-cut intervals for each of the three data subsets.The performance for each of the data sets (i.e.train, validation and test) for both methods is nearly identical (on an interval-by-interval basis): the exact amount of data captured within the intervals, as required by the constraints, except for the µ = 0.8 interval.At this interval, the amount of coverage decreases (i.e.lower performance) as the membership level increases, which is unavoidable when the width of the uncertainty bands decrease.As required by Eq. ( 12), the amount of data within the interval has to be greater-than or equal-to the limit defined by P CI (as per Alvisi and Franchini, 2011) which is true for all training data.This means that a solution to satisfy the constraints with a lower amount of data (e.g.reducing the 29.91 % for the µ = 0.8 interval for the proposed method) would either result in non-minimal intervals (though this is unlikely) or that the constraints on the values of the intervals could not be maintained.This latter issue will be discussed in detail with Fig. 5 below.Lastly, as mentioned above, the performance of non-training data sets for both methods decrease as the interval get narrow: this can be seen best by the inability for both methods to capture the exact amount of data required at the µ = 0.8 interval for the validation and testing data sets.These results are similar to the testing data set in Alvisi and Franchini (2011).This comparison again demonstrates the ability of the proposed method with fuzzy inputs to function in similar manner to the original algorithm that used crisp inputs.
A sample of the fuzzy weights and biases produced through the optimisation are shown in Fig. 5.Note that the membership functions are assumed to be piecewise linear (following similar assumptions made in Alvisi and Franchini, 2011;Khan et al., 2013;Khan and Valeo, 2015), i.e. that the intervals at each membership levels can be joined to create a fuzzy number.This can be confirmed by the fact that each of the weights and biases are convex where intervals at lower levels are wider than intervals at higher levels, and are normal with at least one element with µ = 1.Note that each weight and bias has a non-linear membership function, i.e. none of the functions produced follow the typical triangular functions and are not necessarily symmetric about the modal value.The shapes of the fuzzy weights and biases for the proposed and existing methods are generally the same for the input-hidden layer; however, differences can be seen for the hidden-output layer plots.Since the existing method uses crisp inputs, it requires the produced weights and biases to represent the uncertainty in the data, to produce output intervals wide enough to capture the set amount of observations.This is reflected in hidden-output layer plots where the lower limit of the membership function for Weight #5 is highly skewed, which enables this method to capture the low DO events.Similarly, Bias #1 in the hidden-output layer has been translated to a lower value, to produce fuzzy DO outputs that capture the low DO observations.
The figure demonstrates that enough α-cut levels (i.e. six levels equally spaced between 0 and 1) have been selected  to completely define the shape of the membership functions.At a smaller number of levels, e.g. two levels, one at µ = 0 and one at µ = 1, the fuzzy number collapses to a triangular fuzzy number, which is not desirable for this research, as discussed in previous sections.When only two levels are selected, the figures demonstrate that significant differences exist between those simple functions and the ones generated using six membership levels: the decrease in the width of the intervals with an increase in membership level is not linear as is in triangular shaped function.Similarly, a higher number of intervals e.g. 100 intervals, equally spaced between 0 and 1, could be selected.The risk in selecting many intervals is that as the membership level increases (closer to 1) the intervals become narrower as a consequence of convexity.This will result in numerous closely spaced intervals, with essentially equal upper and lower bounds, making the extra information redundant.This is demonstrated in the sample membership functions in Fig. 5 for W IH number 5 and B O (for the proposed method) where the intervals at the higher membership levels collapse to a singleton, or are extremely narrow.Thus, defining more uncertainty bands between the existing levels would not add more detail but would merely replicate the information already calculated.
Connecting this back to the results in Table 4, these two particular weights and biases show why the percentage of data calculated at µ = 0.8 (for training) cannot be improved by further optimisation.At some point, if the intervals at µ = 0.8 for the various weights and biases collapse to a single element, no further refinement in the model is possible (since all the constraints are met) and the minimum interval width of the predicted DO whilst capturing at least the P CI amount of data has been reached.It is worth emphasizing here that the uncertainty represented by these fuzzy number weights and biases is not the uncertainty of the particular weight or bias, but is the total forecasting uncertainty defined by the quantifying bands around the crisp predicted value.Tables 3 and 4 and Fig. 5 demonstrate the overall success of the proposed approach to calibrate an FNN model as compared to a crisp ANN, as well as an FNN that uses crisp inputs.The optimisation algorithm is defined based on the principles of possibility theory (i.e.defining the amount of data to include in each interval) and is a transparent, repeatable and objective (not arbitrary) method to create the fuzzy numbers for the FNN model.
The observed versus crisp predictions (black dots) and fuzzy predictions at µ = 0 (grey lines) for daily minimum DO for the three different data subsets (training, validation and testing) are shown in Fig. 6.The figure shows that nearly all (specifically, 99.4,98.8 and 99.0 % of the training, validation, and testing subsets, respectively, for the proposed method, with similar results for the crisp input FNN method) of the observations fall within the µ = 0 interval, since the observed values (black dots) tend to fall inside the grey lines.This figure also highlights one of the major advantages of the FNN over a simple non-fuzzy ANN: almost all of the fuzzy results intersect the 1 : 1: line, whereas many of the crisp results are quite far from that line, especially at low DO values (which is marked at 6.5 mg L −1 in the figure).In other words, while the fuzzy number prediction may not predict the observed value exactly, it provides at least some possibility of the observed value within its various α-cut intervals, but the crisp results do not provide this additional information.This figure illustrates that E NSE (listed in Table 2 for the µ = 1 case only) is not representative of the entire fuzzy number predictions, since it does not capture the performance at different membership levels.Thus, there is a need to develop an equivalent performance metric when comparing crisp observations to fuzzy number predictions.
Figure 6 also demonstrates the benefit of the FNN approach as compared to the crisp ANN approach with respect to predicting low DO (i.e. when DO is less than 6.5 mg L −1 ).Both the FNN methods predict more of the low DO events within its intervals as compared to the crisp method.The figure demonstrates that both the crisp (µ = 1) and fuzzy predictions tend to overpredict the low DO events (since they fall above the 1 : 1: line), but the fuzzy intervals are closer to the observations (i.e. they intersect the 1 : 1 line for the majority of low DO events), and therefore predict some possibility (even if it is a low probability) of the low DO events occurring.Thus, generally speaking, the ability of the FNN to capture nearly all of the data within its predicted intervals guarantees that most of the low DO events will be successfully predicted.This is a major improvement over conventional methods used to predict low DO.In comparing the two FNN methods, both methods give similar results: the average width of predicted low DO intervals for the 9-year period (at µ = 0) is 8.84 mg L −1 for the proposed method, and 8.60 mg L −1 for the existing method.The impact of the width of the predicted intervals is discussed later.
Trend plots of observed minimum DO and predicted fuzzy minimum DO for the years 2004, 2006, 2007and 2010 are illustrated in Figs. 7 are illustrated in Figs. 7 and 8.These results are shown only for the proposed method for clarity; the difference between the existing method (using crisp inputs) and the proposed method (using fuzzy inputs) is discussed later.These years were se-U.T. Khan and C. Valeo: Dissolved oxygen    lected due to the high number of low DO occurrences in each year (as listed in Table 1), and highlight the utility of the proposed method to predict minimum DO using abiotic factors in the absence of a complete understanding of the physical mechanisms that govern DO in the Bow River.Note that for each year, 50 % of the data are training data, 25 % are validation data and 25 % are testing data.However, for clarity this difference is not individually highlighted for each data point in these figures.
In Figs. 7 and 8, the predicted minimum DO at equivalent membership levels (e.g.0 L or 0 R ) at different time steps are joined together, creating bands representative of the predicted fuzzy numbers calculated at each time step.In doing so, it is apparent that all the observed values fall within the µ = 0 interval for the years 2006, 2007 and 2010, and all but one observation in 2004.The width of each band represents the amount of uncertainty associated with each membership level.For example, the bands are the widest at µ = 0, meaning the results have the most vagueness associated with it.Narrower bands are seen as the membership level increases until µ = 1.This reflects a decrease in vagueness, increase in credibility, or less uncertainty of the predicted value, as the membership level increases.Note that the majority of the predictions at µ = 1 are single elements but some predictions are α-cut intervals (e.g.[a b] µ=1 ).This means that when not enough information is available, the fuzzy prediction collapse to trapezoidal membership functions.
In each of the years shown, the majority of the observations tend to fall within the µ = 0.2 interval or higher, with only the low DO (i.e.< 5 mg L −1 ) falling within the µ = 0 and µ = 0.2 bands.This suggests that the low DO events are predicted with less certainty compared to the occasions when DO concentration is high.Also note that the interval at µ = 0 is highly skewed towards the lower limit (µ = 0 L ), i.e. the modal value is not at the centre of the interval.This shows that the FNN has been trained to capture these low DO events, but predicts them with lower credibility.Compared to the crisp results (i.e.those at µ = 1), for these low DO events, the proposed method provides some possibility of low DO, whereas the crisp results do not predict a possibility of low DO.Thus, the ability to capture the full array of DO observations within different intervals is an advantage of the proposed method over existing methods.
The trend plot for 2004 shows that observed DO decreases rapidly from late June to late July, followed by a few days of missing data and near-zero observations, before increasing to higher concentrations.Details of this trend are shown in Fig. 9 which shows magnified versions of important periods for each year.The reason for this rapid decrease in 2004 is unclear and may be related to problems with the real-time monitoring device which was in its first year of operation that year.However, it demonstrates that the efficacy of data-driven methods is dependent on the quality of the data.Since the proposed method was calibrated to capture nearly all the observations (including outliers like those seen in 2004) within the least certain band at µ = 0, the resulting network predicts results to include these outliers, but at low credibility levels.As the data length increases (i.e. more data are added and the FNN is subsequently updated), the number of these types of outliers included within the µ = 0 band will decrease because the optimisation algorithm (Eqs.12 to 14) searches for the smallest width of the interval whilst including 99.5 % of the data.Thus, with more data, it is expected that these extreme events (i.e. the outliers seen in 2004) will no longer be captured within the µ = 0 band.
The time series plot for 2006 shows that all the observations fall within the predicted intervals, and that the predicted trend generally follows the observed trend.The majority of the 25 low DO events (< 5 mg L −1 ) occur from mid-July and continue occasionally until mid-September.Details of some of these low DO events are plotted in Fig. 9. Figure 7 demonstrates these low DO events are captured between µ = 0 and 0.2 intervals, similar to the 2004 case, meaning that the credibility of these predictions is the lowest.However, unlike the 2004 case, Fig. 9 demonstrates that in 2006 the predicted intervals tend to follow the same trend as the observations for these low DO events, even if it is predicting them at a low credibility.
In contrast to the results from 2004 and 2006, the majority of observations are captured at higher membership levels (i.e.greater than µ = 0.2) in 2007 as shown in Fig. 8.That is, only a limited number of observations are captured within the lowest credibility band.More importantly, 26 out of the 27 low DO (< 6.5 mg L −1 ) events are captured at a membership level greater than 0.2 L , meaning that the low DO predic-tions in 2007 for the 6.5 mg L −1 guideline are predicted with higher credibility than the 2004 and 2006 cases.Another difference for the results from this year is that many of the low DO observations for 2007 are more evenly scattered around the µ = 1 predictions (as seen in Fig. 9), in contrast to the 2004 and 2006 cases, where the low DO events were always predicted to be closer to lower bounds of the intervals.
The trend plot for 2010 is shown in Fig. 8, and it is clear that all observations fall within the µ = 0.2 interval or at higher α-level intervals, meaning that the predictions capture the observations with higher certainty.This is likely due to the lack of DO events below 5 or 6.5 mg L −1 in 2010.Also of note for this year is that all observations are less than 10 mg L −1 , and about 90 % of all observations are below the 9.5 mg L −1 guideline (as listed in Table 1).The trend plot again illustrates that the FNN generally reproduces the overall trend of observed minimum DO.This can be seen in a period in early May where DO falls from a high of 10 mg L −1 to a low of 7 mg L −1 , and all the predicted intervals replicate the trend.This is an indication that the two abiotic input parameters are suitable parameters for predicting minimum DO in this urbanised watershed.
Figure 9 shows details of a low DO (< 9.5 mg L −1 ) event in 2010 in late July through to late August.The bulk of low DO events are captured between the µ = 0.6 R and 0.2 R intervals -demonstrating that these values are predicted with higher credibility than the other low DO cases in 2004 and 2006, and are predicted closer to the upper end of the interval.All of the low DO (< 9.5 mg L −1 ) observations in this plot are underpredicted by the crisp method (though not with the FNN method, since they are captured within a fuzzy interval).This shows that the crisp ANN results tend to overpredict extremely low DO events (i.e.< 5 mg L −1 ) while underpredicting the DO < 9.5 mg L −1 events.
The analysis of the trend plots for these four sample years shows that the proposed FNN method is extremely versatile in capturing the observed daily minimum DO in the Bow River using Q and T as inputs.The crisp case (at µ = 1) cannot capture the low DO events (as shown in Figs.7  and 8); however, the FNN is able to capture these low DO events.Generally speaking, the training method selected for the FNN has been successful in creating nested intervals to represent the predicted fuzzy numbers.The widths of the predicted intervals correspond to the certainty of the predictions (i.e.larger intervals for more uncertainty).The utility of this method is further demonstrated in Sect.3.3, where the risk of low DO is estimated using a possibility-probability (i.e.defuzzification) technique.
Figure 10 shows a comparison of the predicted minimum DO trends from both the proposed FNN method (solid black line) and the existing FNN method (dashed black line), along with the observed data (circles) for each membership level for the 2009 data.These figures show that despite the use of more data for the inputs (i.e.fuzzy numbers versus crisp numbers), both methods are optimised to show similar re-sults (due to the optimisation algorithm requiring a specific amount of data being captured at each level).This shows that the optimisation algorithm developed in this manuscript for fuzzy inputs successfully mimics the original algorithm developed by Alvisi and Franchini (2011) that only used crisp inputs.Thus, when modelling a complex system, such as the minimum daily DO in the Bow River, the uncertainty in the inputs can also be quantified and propagated through the data-driven model, by using the proposed method.This is a major advantage over the original model (Alvisi and Franchini, 2011) that only allowed crisp inputs to be used.Note that as per Table 4, both methods approximately capture the same amount of data at each interval, however as Fig. 10 indicates this does not necessarily mean that the predicted intervals are exactly the same for both methods.Both methods predict unique intervals, with the overall result being that P CI amount of data is captured within each interval.Note that at the interval µ = 1, the existing method using crisp inputs by design predicts this interval as a singleton (thus the interval width will always be zero), whereas the proposed method has an additional feature of predicting an interval for the µ = 1 level.An instance of this can be seen near the end of September where an interval is predicted rather than a singleton.
Figure 11 compares with width of predicted intervals at four selected membership levels for both methods, generally showing mixed results.As discussed above the existing method does not predict an interval for µ = 1, thus, a comparison cannot be made and is not included.At the µ = 0.8 level, the average width of the intervals for the existing method is close to 0, whereas for the proposed method is 0.36 mg L −1 .This is consistent with the results shown in Table 4 and Fig. 5, which demonstrate the narrow interval at higher membership levels.Annual comparisons of the remaining intervals show that the intervals at µ = 0 are larger for the proposed method compared to the existing method for all but one year (2009).However, at µ = 0.2, 0.4 and 0.6, the width of the intervals is smaller for the proposed method for a majority of years, however the overall differences are not statistically significant (p > 0.05 using the two-sample Kolmogorov-Smirnov test).The results demonstrate that while both method can achieve the optimisation objectives whilst respecting the constraints (i.e.P CI ), it is reflected differently in the predicted fuzzy intervals.Generally, the predicted fuzzy numbers using the proposed method have a larger support (at µ = 0) signifying that the increased uncertainty due to fuzzy inputs into the model are propagated through the model.Whereas, the lack of inclusion of uncertain input data in the existing method results in a slightly narrower average support.In essence, the proposed FNN model is modelling a more complex system (because of the inclusion of input uncertainty) whereas the existing method models the system by assuming lower complexity (by ignoring the input uncertainty).

Risk analysis for low DO events
The utility of the FNN method is illustrated through an analysis of the ability of the proposed model to predict low DO events, and then a possibility-probability transformation is used to assess the risk of these low DO events.The number of occasions when observed DO was below any of the three guidelines used for this research are summarised in Table 1.The FNN model was cable to capture 100 % of all low DO events (i.e.below 5, or 6.5 or 9.5 mg L −1 ) within the predicted intervals.In comparison, the crisp ANN network (i.e. at µ = 1) did not predict DO to be less than 5 mg L −1 on any of the 51 occasions.Similarly, it predicted DO to be less than the more conservative limit of 6.5 mg L −1 in only 53 % of the 184 occurrences.For the last case, the 9.5 mg L −1 limit, the ANN method still trailed the FNN method, by predicting 96 % of these low DO events.This illustrates that not only can the FNN method capture more low DO events within its predicted intervals, it also performs exceptionally better for the highest risk case (DO < 5 mg L −1 ).In general, more days were correctly identified when there was a risk of low DO using FNN rather than the typical ANN approach.This is one of the major advantages of using a fuzzy number based uncertainty analysis component for low DO prediction.

07-Sep-2004
Predicted DO Observed DO Guideline Dissolved oxygen, mg L -1 Figure 12.Sample plots of low DO events and the corresponding risk of low DO calculated using a possibility-probability transformation for the 5 mg L −1 (top panels), 6.5 mg L −1 (middle panels), and 9.5 mg L −1 (bottom panels) guidelines.
Once all the low DO events were identified, the inverse transformation (defuzzification) described in Sect.2.4 was used to estimate the probability of low DO.The primary reason for converting from possibility to probability is to improve the communication of the risk of low DO.For each low DO event (i.e. at 5, or 6 or 9.5 mg L −1 ), the predicted membership function was used to determine the possibility of low DO, i.e. identify the membership level where the membership function intersects either of the low DO guidelines (some examples of low DO events are shown in Fig. 12).Once these were identified, the defuzzification technique was used to predict the probability of low DO (e.g.P ({DO < 5 mg L −1 })).For the first case, P {(DO < 5 mg L −1 }), the probability ranged between 11.5 and 16.6 % for the 51 events, with a median value of 14 %.This means that on days when DO was observed to be below 5 mg L −1 , the FNN results identified the possibility of low DO and the probability of DO to be below the 5 mg L −1 guideline of ∼ 14 %.Thus, the FNN method predicts a probability of low DO (even if it is relatively small) on days when the crisp ANN does not predict a low DO event.This value can be used as a threshold by water resource managers for estimating the risk of low DO.For ex-ample, if forecast water temperature and flow rate are used to predict minimum fuzzy DO using the calibrated model, and if the risk of low DO reaches 14 %, the event can be flagged.Appropriate defence mechanisms can then be implemented to prevent the occurrence of low DO.
For the 184 cases where DO was observed to be less than 6.5 mg L −1 , the probability-possibility transformation estimated the risk of low DO to be between 13.7 and 92.9 %, with a median value of 73.4 %.Compared to the first case, the probability of low DO for this threshold is higher and more variable.The low probabilities are associated with predictions of low DO at lower credibility levels at the lower limit of the intervals (i.e.L), whereas the higher probabilities are associated with predictions corresponding to the upper limits of the intervals (R).For 43 out of the 184 low DO events, the probability of low DO was less than 21 % -these events correspond to predictions of low DO at low credibility levels at the lower limits.For the majority of events (107 out of 184), the risk was high, more than 65 %.It is worth noting that the crisp network only predicted 53 % of these low DO events, and of those correctly identified, the majority were over-predicted.
Hydrol.Earth Syst. Sci., 20, 2267-2293, 2016 www.hydrol-earth-syst-sci.net/20/2267/2016/For the last, most conservative case, the probability of predicting DO to be less than 9.5 mg L −1 (1179 events) varied between 21.9 and 100 %, with a median value of 98.1 %.Only 46 out of the 1179 events had a probability of less than 70 %; the majority of events had a high risk of low DO: more than 80 % of the events had a risk of low DO of more than 90 %.This shows that the FNN can predict, with high probability, the events where minimum daily DO is observed to be below the 9.5 mg L −1 limit.
It is worth noting here that the proposed FNN model was designed to only include data from the April to October each year, corresponding to the ice-free period (as defined in Sect.2.1).This implies that the analysis has been conducted on the time period that is most critical or susceptible to low DO.Thus, as the proposed FNN model predicts, there is possibility of low DO on most days (as shown in the trend plots in Figs.7 and 8).However, the consistency principle (Zadeh, 1978) implies that an event must be possible before it is probable.Thus, a possibility to predict low DO does not imply that it will occur with a high probability.In fact, nearly all the possibility of low DO events occurs at low membership levels (i.e.µ < 0.2) implying a low possibility -and the skewed nature of the results deem the probability to be low as well.For example, for the DO < 5 mg L −1 case, the proposed FNN model predicted 1367 days where low DO was predicted but not observed, however, on average the probability of low DO for 98 % of these events was below the threshold criteria (14 %) mentioned above.Thus, the number of "false alarms" predicted by the proposed method is very low.Similarly, for the DO < 6.5 and 9.5 mg L −1 cases, each had ∼ 94 % of the low DO cases to fall below their respective threshold criteria.This shows that while FNN model correctly predicts a possibility of low DO for the majority of the days (corresponding to the typical low DO conditions), the risk of predicting a "false alarm" is low.Lastly, it should be noted that the wide intervals predicted at µ = 0 are a function of the rapidly decreasing low DO value seen in 2004 (discussed in Sect.3.2) that are likely due to instrument error.With the inclusion of new data as it becomes available, and as the model parameters are updated, it is expected that these outliers will be part of the 0.5 % of P CI not included in the predicted intervals, resulting in narrower bands of predictions.
The predicted membership functions of minimum DO for nine examples are shown Fig. 12, along with the observed minimum DO (the vertical dashed line).Three samples are shown for each low DO guideline: 5, 6.5, or 9.5 mg L −1 ; along with the associated risk of low DO calculated using the defuzzification technique.Note that the membership functions of the predicted fuzzy numbers show that each is uniquely shaped, convex and normal, highlighting the fact that the proposed optimisation algorithm successfully produces nested intervals at each membership levels (as it does for the weights and biases shown in Fig. 5).For the predicted fuzzy DO, the intervals are largest at µ = 0, which decrease in size as the membership level increases.The shape of the membership functions are not triangular shaped as assumed in many fuzzy number based applications.This is of significance because it shows that the amount of uncertainty (or credibility) of the model output does not change linearly with the magnitude of DO, which has important implications regarding the risk of low DO.
For the 5 mg L −1 guideline, the intersection of the membership function and the guideline occurs at low possibility levels (between µ = 0 L and 0.2 L ), meaning that the corresponding probability will be low as well, as illustrated by the probability values shown in the figure.This again highlights that the risk of low DO (< 5 mg L −1 ) is predicted to be low by the FNN mostly due to the fact that the observations are captured at low membership levels.Note that the crisp ANN results (at µ = 1) always overpredict low DO, as shown in these three examples.The observed value falls within the predicted interval for each case, also at low membership levels.
The examples for the 6.5 mg L −1 guideline (second row in Fig. 12) show that the intersection between the membership function and the guidelines occurs between µ = 0.4 and 0.6 on 26 July 2006, between µ = 0.6 and 0.8 on 8 August 2007, and at about µ = 0.6 on 29 September 2004.This illustrates the broader trend with the 6.5 mg L −1 guideline (which was discussed earlier and had a large range of risk predictions), which is that for the full data set, the possibility of low DO (< 6.5 mg L −1 ) occurs at every interval, with the majority occurring at higher intervals.This is in contrast to the 5 mg L −1 guideline where the possibility of low DO only occurs between µ = 0 and 0.2.
The last row in Fig. 12 show sample low DO results for the 9.5 mg L −1 guideline.As discussed above, more than 80 % of these events had a high (more than 90 %) risk of low DO.In the first example, on 23 September 2004, the guideline intersects the membership function at µ = ∼ 0.2 R , corresponding to a ∼ 97 % risk of low DO.The 6 August 2008 has a low DO prediction of 100 % -this is because the predicted fuzzy number is entirely below the guideline limit.A similar result can be seen for the last example.These examples also illustrate that had only a triangular membership function been used (i.e. the fuzzy numbers defined at two membership levels), the probability of low DO could not be quantified as specifically as it has been here.The slight changes in membership function shapes between intervals impact the final probability, and a linear function would have not captured these changes.
These examples are meant to illustrate the potential utility of the data-driven and abiotic input parameter DO model, which can be used to assess the risk of low DO.Given that it is a data-driven approach, the model can be continually updated as more data are available, further refining the predictions.Various combinations of input values can be used to predict fuzzy minimum DO and defuzzification technique can be used to quantify the risk of low DO given the input values.The utility of this method is that a water-resource U. T. Khan and C. Valeo: Dissolved oxygen prediction using a possibility theory based fuzzy neural network manager can use forecasted water temperature data and expected flow rates to quantify the risk of low DO events in the Bow River, and can plan accordingly.For example, if the risk of low DO reaches a specific numerical threshold or trigger, different actions or strategies (e.g.increasing flow rate in the river by controlled release from the upstream dams) can be implemented.The quantification of the risk to specific probabilities means that the severity of the response can be tuned to the severity of the calculated risk.

Conclusions
A new method to predict DO concentration in an urbanised watershed is proposed.Given the lack of understanding of the physical system that governs DO concentration in the Bow River (in Calgary, Canada), a data-driven approach using fuzzy numbers is proposed to account for the uncertainty.Further, the model uses abiotic (non-living, physical and chemical attributes) factors as inputs to the model.Specifically, water temperature and flow rate were selected, which are routinely monitored, and thus a large data set is available.
The data-driven approach proposed is a modification of an existing fuzzy neural network method that quantifies the total uncertainty in the model by using fuzzy number weights and biases.The proposed model refines the exiting model by (i) using possibility theory based intervals to calibrate the neural network (rather than arbitrarily selecting confidence intervals), and (ii) using fuzzy number inputs rather than crisp inputs.This research also proposes a new twostep method to construct these fuzzy number inputs using observations.First a bin-size optimising algorithm is used to find the optimum histogram (as an estimate of the underlying but unknown probably density function of the observations).Then a probability-possibility transformation is used to determine the shape of the fuzzy number membership function.
The results demonstrate that the network training algorithm proposed can be successfully implemented.Model results demonstrate that low DO events are better captured by the fuzzy network as compared to a non-fuzzy network.A defuzzification technique is then used to calculate the risk of low DO events.Generally speaking, the method demonstrates that a data-driven approach using abiotic inputs is a feasible method for predicting minimum daily DO.Results from this research can be implemented by water resource managers to assess conditions that lead to and quantify the risk of low DO.

Figure 1 .
Figure 1.An aerial view of the City of Calgary, Canada, showing the locations of (a) flow monitoring site Bow River at Calgary (Water Survey of Canada ID: 05BH004), three wastewater treatment plants at (b) Bonnybrook, (c) Fish Creek, and (d) Pine Creek, and two water quality sampling sites, (e) Stier's Ranch and (f) Highwood.

Figure 2 .
Figure 2.An example of a three-layer multilayer perceptron feedforward ANN, with two input neurons, the hidden-layer neurons, and one output neuron.WIH are the weights between the input and hidden layers, WHO are the weights between the hidden and output layers, BH are the biases in the hidden layer, and BO is the bias in the output layer.

Figure 3 .
Figure 3. Sample results of probability-possibility transformation for flow rate, Q.

Figure 4 .
Figure 4. Sample results of probability-possibility transformation for water temperature, T .

Figure 5 .
Figure5.Sample plots of the produced membership functions for the weights and biases of the fuzzy neural network for both the proposed and existing methods.

Figure 6 .
Figure6.A comparison of the predicted and observed minimum DO at the µ = 0 interval (grey line) and at µ = 1 (black dots) for the proposed (top row panels) and existing (bottom row panels) methods.

Figure 7 .
Figure 7.A comparison of the observed and predicted minimum DO trends for 2004 (top panel) and 2006 (bottom panel).

Figure 8 .
Figure 8.A comparison of the observed and predicted minimum DO trends for 3 sample years: 2007 (top panel) and 2010 (bottom panel).

Figure 9 .
Figure 9. Zoomed-in views of the trend plots for 4 sample years corresponding to important periods with low DO occurrences.

Figure 10 .Figure 11 .
Figure 10.Comparison of predicted trends of the proposed (solid black line) and existing (dashed black line) methods shown for 2009 for each membership level.Observations are shown as black circles.

U
. T.Khan and C.  Valeo: Dissolved oxygen prediction using a possibility theory based fuzzy neural network

Table 1 .
A summary of low DO events in the Bow River between 2004 and 2012 and the corresponding minimum acceptable DO concentration guidelines.
a For the protection of aquatic life for 1 day (AENV, 1997); b for the protection of aquatic life in cold freshwater for other life (i.e.not early) stages (CCME, 1999); c for the protection of aquatic life in cold freshwater for early life stages(CCME, 1999).

Table 2 .
Selected values for P CI for the FNN optimisation.

Table 4 .
Percentage of data captured within each α-cut interval for the three subsets of data.
prediction using a possibility theory based fuzzy neural network