Complex networks for streamflow dynamics


It is important to recognize that a fundamental idea in streamflow (and other hydrologic) studies is to establish connections that generally exist between the different elements or items (known or assumed) of the underlying system. Depending upon the situation (e.g. catchment, purpose, problem), these elements include hydroclimatic variables, catchment characteristics, model parameters, and others (and their combinations), and their connections are often different with respect to space, time, and space-time. Unraveling the nature and extent of these connections has always been a great challenge, not to mention the challenge of identifying all the relevant elements in the first place. Thus far, a plethora of concepts and methods has been proposed and applied for studying the connections associated with streamflow, including those based on time, distance, correlation, variability, scale, patterns, and many other properties/measures as well as their combinations and variants, in both single-variable and multi-variable perspectives; see, for example, Gupta et al. (1986), Salas et al. (1995), Grayson and Blöschl (2000), Yang et al. (2004), Archfield and Vogel (2010), and Li et al. (2012) for some details. Despite the progress made through these concepts and methods, our understanding of the connections in streamflow is still far from adequate.
In view of this, there is indeed a need to greatly advance our studies on streamflow connections. Some important current and foreseeable future problems, including our ever-increasing demands for water, the potential impacts of climate change on water security and hydroclimatic disasters, and the numerous issues associated with the management of our environment and ecosystems, further reflect the urgency of this need. A greater understanding of streamflow connections will also enhance our recent and current efforts in the estimation of data at ungaged locations (e.g. predictions in ungaged basins, PUB) (see Hrachowitz et al., 2013) and the development of a generalization framework for hydrologic modeling (e.g. catchment classification) (see Sivakumar et al., 2014), among others. The question, however, remains on the identification of a suitable theory that can help advance studies on streamflow connections. In this regard, recent developments in the field of complex systems science can offer some crucial clues. The present study introduces the theory of complex networks, or simply networks, for studying connections in streamflow. In particular, the study focuses on spatial connections in streamflow.
The origin of the concept of networks can be traced back to the works of Leonhard Euler, during the first half of the eighteenth century, on the Seven Bridges of Königsberg (Euler, 1741), which laid the foundations of what would become popularly known as graph theory. Graph theory witnessed several important theoretical developments in the nineteenth century, including topology (originally introduced as topologie in German) (Listing, 1848) and trees (Cayley, 1857). Further significant advances were made during the twentieth century, especially with the development of random graph theory by Erdös and Rényi (1960).
The concepts of graph theory, and random graph theory in particular, have found a wide variety of applications in numerous fields, including linguistics, physics, chemistry, biology, sociology, engineering, economics, and ecology; see, for example, Berge (1962), Bondy and Murty (1976), and Bollobás (1998) for extensive reviews.

Despite the above-mentioned developments and applications, studies on graph theory, including random graph theory, had some major deficiencies. First, the studies largely focused on networks that are regular, simple, small, and static. As a result, they are generally unsuitable for examining real networks, as such networks are often highly irregular, complex, large, and dynamically evolving in time. Second, even while examining complex and large-scale networks, they assumed that such networks are wired randomly together (Erdös and Rényi, 1960). Such an assumption, however, is not necessarily valid for real networks, since order and determinism are inherent in real systems and networks. Indeed, real networks are neither completely ordered nor completely random, but generally exhibit important properties of both. These observations led to new discoveries about complex networks, including small-world networks (Watts and Strogatz, 1998), scale-free networks (Barabási and Albert, 1999), network motifs (Milo et al., 2002), as well as other notable advances, such as a new method for identifying community structure (Girvan and Newman, 2002).
Since then, the science of networks has found applications in many different fields, including natural and physical sciences, social sciences, medical sciences, economics, and engineering and technology (e.g. Albert et al., 1999; Bouchaud and Mézard, 2000; Newman, 2001; Liljeros et al., 2001; Tsonis and Roebber, 2004; Davis et al., 2013). In hydrology, applications of networks are just starting to emerge, and so far include river networks, virtual water trade, precipitation, and agricultural pollution due to international trade, among others (Rinaldo et al., 2006; Suweis et al., 2011; Dalin et al., 2012; Boers et al., 2013; Scarsoglio et al., 2013). In a very recent study, Sivakumar (2014) has argued that networks can be useful for studying all types of connections in hydrology and, hence, can provide a generic theory for hydrology.
With the encouraging results reported by the above studies, the present study explores the usefulness of the theory of networks for studying connections in streamflow, especially the spatial connections. To this end, monthly streamflow data observed over a period of 52 years (October 1951 to September 2003) from each of 639 gaging stations in the contiguous United States are studied. The connections are examined using the concept of clustering coefficient. The clustering coefficient is a measure of local density and, hence, quantifies the tendency of a network to cluster. The implications of the clustering coefficient results for interpolation/extrapolation of streamflow data as well as for classification of catchments are also discussed.
The rest of this paper is organized as follows. Section 2 introduces the concept of networks and describes the procedure for calculation of the clustering coefficient in a network. Section 3 presents details of the study area and streamflow data considered. Section 4 reports the results, first from the traditional linear correlation analysis and then from the network-based clustering coefficient analysis. Section 5 highlights the implications of the results.

Network
A network or a graph is a set of points connected together by a set of lines, as shown in Fig. 1. The points are referred to as vertices or nodes and the lines are referred to as edges or links; here, the term nodes is used for points and the term links is used for lines. Mathematically, a network can be represented as $G = \{P, E\}$, where $P$ is a set of $N$ nodes ($P_1, P_2, \ldots, P_N$) and $E$ is a set of $n$ links. The network shown in Fig. 1 has $N = 7$ (nodes) and $n = 8$ (links), with $P = \{1, 2, 3, 4, 5, 6, 7\}$ and $E$ comprising the eight links drawn in the figure.
Figure 1 is perhaps the simplest form of network, i.e. one with a set of identical nodes connected by identical links. There are, however, many ways in which networks may be more complex. For instance, a network: (1) may have more than one type of node and/or link; (2) may contain nodes and links with a variety of properties, such as different weights for different nodes and links depending on the strength of nodes and connections; (3) may have links that can be directed (pointing in only one direction), with either cyclic (i.e. containing closed loops of links) or acyclic form; (4) may have multilinks (i.e. repeated links between the same pair of nodes), self-links (i.e. links connecting a node to itself), and hyperlinks (i.e. links connecting more than two nodes together); and (5) may be bipartite, i.e. containing nodes of two distinct types, with links running only between unlike types.
There are many different ways and measures to study the characteristics of networks. In the context of the modern theory of complex networks (which also include random graphs), three concepts are prominent: (1) clustering coefficient; (2) small-world networks; and (3) degree distribution. As the present study uses the concept of clustering coefficient for studying streamflow connections, it is described next.
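The node-and-link representation described above can be sketched in a few lines of Python. This is only an illustration: the link list below is invented so that the network has seven nodes and eight links, and it is not claimed to be the link set of Fig. 1. The helper name `build_adjacency` is likewise hypothetical.

```python
# Hypothetical 7-node, 8-link undirected network.
# The links are invented for illustration; they are NOT taken from Fig. 1.
nodes = {1, 2, 3, 4, 5, 6, 7}
links = {(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6), (6, 7)}


def build_adjacency(nodes, links):
    """Return a dict mapping each node to the set of nodes it is linked to."""
    adjacency = {node: set() for node in nodes}
    for a, b in links:
        adjacency[a].add(b)  # undirected: record the link in both directions
        adjacency[b].add(a)
    return adjacency


adjacency = build_adjacency(nodes, links)
# Number of links attached to each node (its degree)
degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
```

Storing the network as an adjacency mapping makes the neighbors of any node directly available, which is the quantity needed for neighbor-based measures.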

Clustering coefficient
The clustering coefficient quantifies the tendency of a network to cluster, which is one of the most fundamental properties of networks (Watts and Strogatz, 1998). The clustering coefficient of a network is basically a measure of local density. The concept of clustering has its origin in sociology, under the name "fraction of transitive triples" (Wasserman and Faust, 1994). The procedure for calculating the clustering coefficient is as follows.
Let us consider first a selected node $i$ in the network, having $k_i$ links which connect it to $k_i$ other nodes. For illustration, Fig. 2 presents a network consisting of eight nodes, with the node $i$ having four links (see Fig. 2, left). The four nodes corresponding to these four links are the neighbors of node $i$; the neighbors are identified based on some conditions (e.g. correlation between node $i$ and other nodes in the network). If the neighbors of the original node ($i$) were part of a cluster, there would be $k_i(k_i - 1)/2$ links between them. As shown in Fig. 2 (right), there are $4(4 - 1)/2 = 6$ links in the cluster of node $i$. The clustering coefficient of node $i$ is then given by the ratio between the number $E_i$ of links that actually exist between these $k_i$ nodes (shown as solid lines on Fig. 2, right) and the total number $k_i(k_i - 1)/2$ (i.e. all lines on Fig. 2, right),
$$C_i = \frac{2E_i}{k_i(k_i - 1)}$$
The clustering coefficient of the whole network $C$ is the average of the clustering coefficients $C_i$ of all the individual nodes.
The clustering coefficient of a random graph is $C = p$ (where $p$ is the probability of two nodes being connected), since the links in a random graph are distributed randomly. However, the clustering coefficient of real networks is generally much larger than that of a comparable random network (i.e. having the same number of nodes and links as the real network). Therefore, the clustering coefficient analysis offers useful information about the nature of the network and, hence, the appropriate model (e.g. level of complexity), among others.
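The calculation described above can be sketched as follows. The small network is hypothetical (it is not the network of Fig. 2): node "i" has four neighbors, three of the six possible links among them exist, and so its clustering coefficient is 3/6 = 0.5. Assigning a value of zero to nodes with fewer than two neighbors is an assumption made here, since the ratio is undefined in that case.

```python
from itertools import combinations


def local_clustering(adjacency, i):
    """Ratio of actual links E_i among the neighbors of i to k_i(k_i - 1)/2."""
    neighbors = adjacency[i]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumption: undefined ratio is treated as zero
    possible = k * (k - 1) / 2
    actual = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
    return actual / possible


def network_clustering(adjacency):
    """Average of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adjacency, i) for i in adjacency) / len(adjacency)


# Hypothetical network: node "i" is linked to a, b, c, d; among these
# neighbors only the links a-b, a-c, and b-c exist.
example = {
    "i": {"a", "b", "c", "d"},
    "a": {"i", "b", "c"},
    "b": {"i", "a", "c"},
    "c": {"i", "a", "b"},
    "d": {"i"},
}

c_i = local_clustering(example, "i")   # 3 actual links / 6 possible links
c_network = network_clustering(example)
```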

Study area and data
In the present study, streamflow data from the United States are studied to explore the usefulness of the theory of networks for identifying connections in streamflow, with a focus on spatial connections. Monthly data from an extensive network of 639 streamflow gaging stations in the contiguous US are studied. The locations of these 639 stations are shown in Fig. 3. The streamflow data are obtained from the US Geological Survey database (http://nwis.waterdata.usgs.gov/nwis). Streamflow data in the US are commonly expressed in "water years," which commence in October. The data used in this study are those observed over a period of 52 years (October 1951-September 2003), and are average monthly values.
During the past few decades, a large number of studies have investigated the above streamflow dataset (or a part or variant of it) in many different contexts (e.g. Slack and Landwehr, 1992; Kahya and Dracup, 1993; Tootle and Piechota, 2006; Sivakumar and Singh, 2012). Some of these studies have explicitly addressed the connections of streamflow, although with large-scale climatic patterns and relevant indices, including El Niño, La Niña, Southern Oscillation Index (SOI), Pacific North America (PNA) Index, and Pacific Decadal Oscillation (PDO). However, within the specific context of the network analysis for connections among streamflow stations presented here, as well as in the broader context of complex systems science for streamflow analysis, the study by Sivakumar and Singh (2012) is worth mentioning, as it has addressed the aspects of streamflow variability, nonlinearity, and dominant governing mechanisms, especially for studies on model simplification, data interpolation/extrapolation, and catchment classification framework.
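As a small check on the record length, the water-year convention can be sketched as below. The function name `water_year` is hypothetical; the code follows the USGS convention that a water year runs from 1 October to 30 September and is labeled by the calendar year in which it ends.

```python
from datetime import date


def water_year(d):
    """Water year containing date d, labeled by the calendar year in which it ends."""
    return d.year + 1 if d.month >= 10 else d.year


first_wy = water_year(date(1951, 10, 1))   # first month of the record
last_wy = water_year(date(2003, 9, 30))    # last month of the record
n_years = last_wy - first_wy + 1
n_months = 12 * n_years                    # monthly values per station
```

Under this convention the record spans water years 1952 to 2003, i.e. 52 water years, or 624 monthly values per station.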
The above 639 streamflow stations and the observed streamflow data exhibit tremendous variations in their characteristics, often by about four orders of magnitude.   (5) number of zero-flow months ranges from none to 424. Table 1 presents a summary of the minimum and maximum values of some important characteristics of the stations and flows, including the corresponding station numbers. Figure 3 presents the variations in the mean (Fig. 3a), standard deviation (Fig. 3b), and coefficient of variation (Fig. 3c)

The usefulness of the theory of networks for studying connections in streamflow is examined through the clustering coefficient analysis on the monthly streamflow data from the above 639 stations in the United States. To put the clustering coefficient analysis in a proper perspective, a preliminary linear correlation-based analysis is also performed.

Correlation analysis
A common approach to examine connections between streamflow observed at different stations is through a simple linear cross-correlation analysis, where the correlation for any given station is given by the average of its correlation with all the other stations. Several variants of this procedure are also usually considered. These include: nearest neighbors, for example a number of nearby stations based on distance, or stations within a pre-defined region of geographic proximity or neighborhood, with equal or unequal weighting (e.g. inverse distance); and similar stations, i.e. stations with similar properties (e.g. in terms of climate, rainfall, basin characteristics, land use), which may or may not include the nearest stations. These and many other correlation-based procedures (e.g. spline fitting) are routinely employed for interpolation and extrapolation of streamflow and other hydrologic data.
In this study, two of the above-mentioned procedures are employed for examining the monthly streamflow from the 639 stations: (1) for each station, the correlation is the average of its correlation with all the other 638 stations; and (2) for each station, the correlation is the average of correlations for a certain number of nearest neighbors (30, 15, and 5 neighbors). When all the 638 stations are considered, the correlation values are generally very low, as expected, with only 0.5 % of the stations exceeding a value of 0.4 (see Fig. 4a). This is mainly due to the consideration of a very large region, with the stations coming from different climatic, catchment, land use, and other characteristics. When the number of stations is reduced, the correlations are generally higher; see Fig. 4b (30 neighbors), Fig. 4c (15 neighbors), and Fig. 4d (5 neighbors). Among the three neighborhood cases, the best correlation results are obtained when the neighborhood is the smallest, i.e. 5 neighbors (Fig. 4d), with a large number of stations having correlations above 0.7.
While one can study a large number of combinations in terms of the neighborhood, what is evident from even the very few cases presented here is that there are obvious regional patterns in terms of correlations, regardless of the number of neighbors. These regional patterns are considered to have important implications for a wide range of studies in hydrology and water resources, as they are commonly used as a basis for interpolation and extrapolation of streamflow and, subsequently, for water resources assessment, planning, and management.
However, as Sivakumar and Singh (2012) point out, through their nonlinear dynamic study on streamflow data from the western United States, the use of regional patterns as a basis for streamflow studies may be misleading, as such patterns are not necessarily a true representation of the actual connections between the stations but may just be spurious. The obvious question, therefore, is: how to identify if the connections are actual or spurious? This is where the ideas from the theory of networks can be particularly useful, as presented next using the clustering coefficient analysis of the streamflow data from the 639 stations.
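The two averaging procedures used in this section (average correlation with all other stations, and average correlation with a fixed number of nearest neighbors) can be sketched as follows. Everything in the example is hypothetical: the four short flow series, the station labels, and the planar coordinates are invented, and straight-line distance between coordinates is an assumption that a real analysis of station locations would likely replace with a geographic distance.

```python
import math


def pearson(x, y):
    """Linear (Pearson) correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_correlation(flows, i, others):
    """Average correlation of station i with the stations listed in others."""
    return sum(pearson(flows[i], flows[j]) for j in others) / len(others)


def nearest_neighbors(coords, i, m):
    """The m stations closest to station i (assumption: planar distance)."""
    others = [j for j in coords if j != i]
    others.sort(key=lambda j: math.dist(coords[i], coords[j]))
    return others[:m]


# Hypothetical monthly flows and coordinates for four stations
flows = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1], "D": [1, 3, 2, 4]}
coords = {"A": (0, 0), "B": (1, 0), "C": (5, 0), "D": (2, 0)}

all_others = [j for j in flows if j != "A"]
r_all = mean_correlation(flows, "A", all_others)                        # procedure (1)
r_near = mean_correlation(flows, "A", nearest_neighbors(coords, "A", 2))  # procedure (2)
```

In this toy case the average over all stations is pulled down by the distant, negatively correlated station, whereas the nearest-neighbor average is high, which mirrors the pattern reported for the smaller neighborhoods.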

Network analysis -clustering coefficient
The clustering coefficient is calculated for the monthly streamflow data from the network of 639 stations in the United States, according to the procedure described in Sect. 2. The essence of the procedure for the streamflow data is as follows. For a given streamflow station or node $i$, the neighbors in the network of 639 stations (more specifically, among the remaining 638 stations) are identified based on a (pre-specified) threshold value ($T$). To define the threshold value, the correlations in streamflow data between different stations are considered as a reasonable measure. With this, if the correlation between station $i$ and any other station in the entire network of 639 stations exceeds the threshold value, then that station is considered a neighbor of station $i$; the number of such neighbors is $k_i$. The cluster of these $k_i$ neighbors then forms the basis for identifying the actual connections. Therefore, the actual connections are those links in the cluster of stations (not just nearest stations) having correlations among themselves exceeding the threshold value.
streamflow studies, especially spatial and temporal correlations, offers some useful clues. For instance, streamflow data generally exhibit high spatial correlations (when compared to rainfall values, for example), especially at the monthly scale. With this knowledge, and also with the condition that $-1 < T < 1$, closer intervals of values are considered at the higher end of correlations and vice-versa. In addition, very low val-
While the usefulness of the clustering coefficient values in assessing connections between streamflow stations and identifying regions having similarity/differences is abundantly clear, the actual links in the network would certainly offer more specific details as to where and how connections exist. To facilitate this, Fig. 6 shows the actual links for four selected streamflow stations (red circles) for threshold values of 0.75 (Fig. 6a), 0.80 (Fig. 6b), and 0.85 (Fig. 6c); the nodes and links for $T = 0.70$ are too many, and so do not offer a good visualization. In each of these plots, for the station of interest (red circle), a green circle indicates a station that has a correlation coefficient value exceeding the threshold, and a black circle indicates a station that has a correlation coefficient value smaller than the threshold.
characteristics, for example, for threshold level 0.85 (Fig. 6b), with one showing all the actual connections within a small neighborhood (see the enlarged plot on the top left) while the other shows no clear neighborhood for connectivity (see the enlarged plot on the bottom left). The latter station (see bottom left) is an even more curious case, as most of the neighbors of this station seem to be beyond its (perceived) circle of geographic influence. The actual links observed for the other threshold values also support the above observations.
These observations clearly suggest that our usual approach with consideration of geographic proximity, nearest neighbors, regional patterns, and linear correlation-based techniques for studying connections in streamflow may have serious limitations. The clustering coefficient, and other network-based techniques, offer a better means to examine streamflow connections. In what follows, we explore the clustering coefficient results even further.
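The threshold-based procedure described in this section can be sketched as below, assuming that a symmetric matrix of inter-station correlations is already available. The 4 x 4 matrix is invented for illustration and the function names are hypothetical; stations with fewer than two neighbors are assigned a clustering coefficient of zero, which is an assumption rather than something stated in the text.

```python
from itertools import combinations


def threshold_neighbors(corr, i, T):
    """Stations whose correlation with station i exceeds the threshold T."""
    return [j for j in range(len(corr)) if j != i and corr[i][j] > T]


def station_clustering(corr, i, T):
    """Return (k_i, actual links E_i, clustering coefficient C_i) for station i."""
    neighbors = threshold_neighbors(corr, i, T)
    k = len(neighbors)
    if k < 2:
        return k, 0, 0.0  # assumption: C_i set to zero when undefined
    # Actual links: neighbor pairs whose mutual correlation also exceeds T
    actual = sum(1 for a, b in combinations(neighbors, 2) if corr[a][b] > T)
    return k, actual, actual / (k * (k - 1) / 2)


# Hypothetical correlation matrix for four stations (symmetric, unit diagonal)
corr = [
    [1.00, 0.90, 0.82, 0.78],
    [0.90, 1.00, 0.88, 0.60],
    [0.82, 0.88, 1.00, 0.76],
    [0.78, 0.60, 0.76, 1.00],
]

# Results for station 0 at three of the thresholds discussed in the text
results = {T: station_clustering(corr, 0, T) for T in (0.75, 0.80, 0.85)}
```

In this toy matrix, raising the threshold shrinks the neighborhood of station 0 from three stations to one, so both the neighbors and the links change with $T$.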
As the clustering coefficient of a network is based on the actual links among all links in the cluster of neighbors of a node (rather than just the links between a node and its neighbors), it would be interesting to see how it changes with respect to all links and actual links. To this end, Fig. 7a-d shows the clustering coefficient values against the number of all links (red circles) and the number of actual links (blue circles) for threshold values of 0.70, 0.75, 0.80, and 0.85 for the monthly streamflow data from the United States. The results lead to the following major observations: (1) in general, regardless of the threshold value, there is an inverse relationship between the clustering coefficient and number of links (both for all links and actual links), i.e. higher clustering coefficient for smaller number of links and vice-versa; (2) the inverse relationship between the clustering coefficient and number of links is generally more evident for lower thresholds (see Fig. 7a); (3) the clustering coefficient is generally far more sensitive when the number of links is smaller (see the significantly larger spread of circles on the Y-axis), but has very little or almost no sensitivity for a larger number of links (see the very narrow spread followed by a tapering towards a fixed value, especially in Fig. 7a and b); further, larger numbers of links almost always give lower clustering coefficients; and (4) for a given number of links, the clustering coefficient for a lower threshold is generally higher than that for a higher threshold.
Another useful way to look at the clustering coefficient of a network is its relationship with the number of neighbors ($k_i$), which is defined by the threshold value and dictates the (number of) links and actual links. Figure 8a-d shows the relationship between the clustering coefficient values and the number of neighbors for threshold values of 0.70, 0.75, 0.80, and 0.85 for the monthly streamflow data. The results generally indicate an inverse relationship between the clustering coefficient and number of neighbors, but such a relationship is far more evident for lower threshold values (see Fig. 8a and b) than for higher threshold values (see Fig. 8c and d). Again, the clustering coefficient is generally far more sensitive when the number of neighbors is smaller (see the larger spread towards the left), but becomes less sensitive for a larger number of neighbors (see the narrow spread towards the right). These observations are broadly consistent with those made in regard to the number of links (Fig. 7). It is important to recall, however, that the neighbors are not necessarily geographic but defined by the threshold values (as shown in Fig. 6).
While these results and observations are still preliminary in nature, they seem to suggest that there is a particular threshold value or range beyond which the inverse relationship between the clustering coefficient and number of neighbors/links/actual links in the streamflow network may not hold well for monthly streamflow data from the United States, and streamflow data in general.
Finally, the question arises as to the type of network. As mentioned previously, the clustering coefficient of a whole network ($C$) is the average of the clustering coefficients $C_i$ of all the individual nodes. The clustering coefficients of the eight different networks of the above 639 streamflow stations corresponding to threshold values of 0.30, 0.40, 0.50, 0.60, 0.70, 0.75, 0.80, and 0.85 are 0.79, 0.77, 0.73, 0.71, 0.70, 0.70, 0.68, and 0.67, respectively (see Table 2). These generally high clustering coefficient values seem to suggest that the streamflow monitoring network of 639 stations is not a random graph, since a (comparable) random graph, where the links are distributed randomly, will typically have a very low clustering coefficient, i.e. $C = p$, where $p$ is the probability of two nodes being connected. As (natural) streamflow dynamics are neither completely random (there are inherent deterministic patterns) nor completely ordered (there are inherent stochastic components) (see Sivakumar, 2011; Sivakumar and Singh, 2012 for some details), it is also reasonable to assume that streamflow networks are not random graphs, but networks of some other nature. Whether they are small-world or scale-free or other types of networks remains to be seen. Studies in this direction are currently underway, details of which will be reported in the future.
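The comparison with a random graph can be sketched as follows. For a comparable random graph with the same number of nodes $N$ and links $n$, the connection probability is taken here as the link density $p = 2n/[N(N - 1)]$, and the expected clustering coefficient is then $C = p$. The six-node network (two triangles joined by a single link) is hypothetical and unrelated to the streamflow data; treating nodes with fewer than two neighbors as having zero clustering is an assumption.

```python
from itertools import combinations


def build_adjacency(links):
    """Map each node to the set of nodes it is linked to (undirected)."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def network_clustering(adjacency):
    """Average local clustering coefficient (zero for nodes with < 2 neighbors)."""
    total = 0.0
    for neighbors in adjacency.values():
        k = len(neighbors)
        if k >= 2:
            actual = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
            total += actual / (k * (k - 1) / 2)
    return total / len(adjacency)


def link_density(adjacency):
    """p = n / [N(N - 1)/2]: expected clustering of a comparable random graph."""
    n_nodes = len(adjacency)
    n_links = sum(len(neighbors) for neighbors in adjacency.values()) // 2
    return n_links / (n_nodes * (n_nodes - 1) / 2)


# Hypothetical network: two triangles (1-2-3 and 4-5-6) joined by the link 3-4
links = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
adjacency = build_adjacency(links)

C = network_clustering(adjacency)   # observed clustering coefficient
p = link_density(adjacency)         # random-graph expectation
```

Here the observed clustering coefficient is about 0.78 against a random-graph expectation of about 0.47, the same kind of gap that is used above to argue that the streamflow network is not a random graph.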

Study implications
One of the basic requirements in studying streamflow dynamics is to identify connections in space or time or space-time, depending upon the purpose. Although a wide variety of approaches have been developed and applied to identify connections in streamflow dynamics, there is no question that significant improvements are still needed. In this regard, modern developments in the field of network theory, especially complex networks, offer new avenues, both for their generality about systems and for their holistic perspective about connections.
The present study has made an initial attempt to apply the ideas developed in the field of complex networks to examine connections in streamflow dynamics, with particular focus on spatial connections. Application of the concept of clustering coefficient, which is a measure of local density and quantifies the tendency of a network to cluster, to monthly streamflow data from a large network of 639 monitoring stations in the contiguous United States has offered some very interesting results. The clustering coefficient values for the 639 stations suggest that: (1) even nearest stations can have significantly different connections and distant stations can have significantly similar connections; (2) connections can be significantly different for different threshold levels; (3) there is generally an inverse relationship between the clustering coefficient and number of neighbors, number of all links, and actual links (in the cluster of neighbors); (4) the clustering coefficient is far more sensitive when the number of neighbors/number of links is smaller, but has only little or no sensitivity when the latter is larger; and (5) the high clustering coefficient value obtained for the entire network is not consistent with the one expected for a random graph, suggesting that the streamflow network is likely to be small-world or scale-free or some other type.
Although the present results are preliminary, they offer important information about the connections that possibly exist in the streamflow network, and especially their extent. The clustering coefficient values, and the actual links, are particularly useful in the identification of the specific regions where interpolation and extrapolation of streamflow data may be more effective and also of the specific stations whose data can be more reliable for such purposes. For instance, regions consisting of stations with high clustering coefficient values would generally provide a more accurate estimation of streamflow when interpolation and extrapolation schemes are employed. It is also important to emphasize, however, that such a region is identified based on a cluster of actual connections, rather than based on our traditional way of geographic proximity, nearest neighbors, regional patterns, and linear correlations. The clustering coefficient values can also offer important clues and guidelines as to the setting up/removal of
Finally, the present study and the results obtained have important implications for a wide range of issues and associated efforts in streamflow modeling, and hydrologic modeling in general. Among these are: (1) predictions in ungaged basins (PUB), where approaches based on nearest neighbors, regionalization, similarity, and other concepts are commonly adopted; (2) formulation of a catchment classification framework, for simplification and generalization in our modeling paradigm and better communication among/between researchers and practitioners; and (3) development of an integrated framework for water planning and management, including in studies on climate change impacts on water resources, that involves proper consideration and inclusion of stakeholders and concepts from a vast number of disciplines, including climate, hydrology, engineering, environment, ecology, social sciences, political sciences, economics, and psychology. In view of these, ideas gained from the modern theory of complex networks, and network theory at large, seem to have immense potential in hydrology and water resources.