Data compression to define information content


This discussion paper is/has been under review for the journal Hydrology and Earth System Sciences (HESS). Please refer to the corresponding final paper in HESS if available.

Data compression to define information content of hydrological time series

1 Introduction
How much information is contained in hydrological time series? This question is not often asked explicitly, but it underlies many challenges in hydrological modeling. [...] surprisal, defined as −log P, where P is the probability assigned to the event before observing it. To keep this paper concise, we refer the reader to Shannon (1948) and Cover and Thomas (2006) for more background on information theory. See also Weijs et al. (2010a, b) for an introduction to, and interpretations of, information measures in the context of hydrological prediction and model calibration. We also refer the reader to Singh and Rajagopal (1987), Singh (1997) and Ruddell et al. (2013) for more references on applications of information theory in the geosciences. In the following, the interpretation of information content as description length is elaborated.

Information theory: entropy and code length
Data compression seeks to represent the most likely events (the most frequent characters in a file) with the shortest codes, yielding the shortest total code length. Like high probabilities, short codes are a limited resource that has to be allocated as efficiently as possible. When codes are required to be uniquely decodable, short codes come at the cost of longer codes elsewhere. This follows from the fact that such codes must be prefix free, i.e. no code can be the first part (prefix) of another one. This is formalized by the following theorem of McMillan (1956), who generalized the inequality (Eq. 1) of Kraft (1949) to all uniquely decodable codes:

$$\sum_i A^{-l_i} \le 1 \qquad (1)$$
in which A is the alphabet size (2 in the binary case) and l_i is the length of the code assigned to event i. In other words, one can see the analogy between prediction and data compression through the similarity between the scarcity of short codes and the scarcity of large predictive probabilities. Just as there are only 4 probabilities of 1/4 available, there are only 4 prefix-free binary codes as short as −log_2 (1/4) = 2 bits. [...] one code of length 1, one of length 2 and two of length 3, we can verify that it sharply satisfies Eq. (1); using A = 2, we find

$$1 \cdot 2^{-1} + 1 \cdot 2^{-2} + 2 \cdot 2^{-3} = 1 \le 1$$

In Fig. 1, it is shown how the total code length can be reduced by assigning codes of varying length depending on occurrence frequency. As shown by Shannon (1948), if every value could be represented with one code, allowing for non-integer code lengths, the optimal code length for an event i is l_i = log_2 (1/p_i). The minimum average code length is the expectation of this code length over all events, H bits per sample, where

$$H(p) = \sum_i p_i \log_2 \frac{1}{p_i} \qquad (2)$$

can be recognized as the entropy of the distribution (Shannon, 1948; Cover and Thomas, 2006), which is a lower bound for the average description length. However, because in practice the code lengths often have to be rounded to an integer number of bits, some overhead will occur. The rounded coding would be optimal for a probability distribution of events

$$q_i = 2^{-l_i} \qquad (3)$$

such as frequencies II in Fig. 1. In this equation, q_i is the i-th element of the probability mass function q for which the code would be optimal and l_i is the code length assigned to event i. The overhead in the case where p ≠ q is D_KL(p||q), yielding a total average code length of

$$\sum_i p_i l_i = H(p) + D_{KL}(p\|q) \qquad (4)$$

bits per sample. [...] Weijs et al. (2010b); see Appendix A for an elaboration of this connection. For probability distributions that do not coincide with integer ideal code lengths, the algorithm known as Huffman coding (Huffman, 1952) was proven to be optimal for value by value compression.
It finds codes of an expected average length closest to the entropy bound and is applied in popular compressed picture and music formats like jpg, tiff, mp3 and wma. For a good explanation of the workings of this algorithm, the reader is referred to Cover and Thomas (2006). In Fig. 1, code A is optimal for probability distribution I and code B is optimal for distribution II. Both these codes achieve the entropy bound. Code B is also an optimal Huffman code for distribution III (last column in Fig. 1). Although the expected code length is now more than the entropy, it is impossible to find a shorter code. The overhead is equal to the Kullback-Leibler divergence from the true distribution (III) to the distribution for which the code would be optimal:

$$D_{KL}\big((0.4, 0.05, 0.35, 0.2)\,\|\,(0.5, 0.25, 0.125, 0.125)\big) = 0.4106$$

If the requirement that the codes are value by value (one code for each observation) is relaxed, blocks of values can be grouped together to approach an ideal probability distribution. When the series are long enough, entropy coding methods like Shannon and Huffman coding using blocks can get arbitrarily close to the entropy bound (Cover and Thomas, 2006). This happens for example in arithmetic coding, where the entire time series is coded as one single number.
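As a numerical check of the relations in this section, the following minimal Python sketch evaluates the Kraft sum, the entropy, the Kullback-Leibler divergence and the expected code length for the code lengths (1, 2, 3, 3) and the distribution (0.4, 0.05, 0.35, 0.2) quoted above. The function names are illustrative choices, not taken from the paper.

```python
import math

def kraft_sum(lengths, alphabet_size=2):
    """Left-hand side of the Kraft-McMillan inequality."""
    return sum(alphabet_size ** -l for l in lengths)

def entropy(p):
    """Entropy in bits of a probability mass function."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p||q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expected_length(p, lengths):
    """Average code length in bits per sample."""
    return sum(pi * l for pi, l in zip(p, lengths))

lengths = [1, 2, 3, 3]
q = [2 ** -l for l in lengths]   # distribution for which these lengths are ideal
p = [0.4, 0.05, 0.35, 0.2]       # distribution as quoted in the text

print(kraft_sum(lengths))                          # 1.0
print(round(kl_divergence(p, q), 4))               # 0.4106
print(expected_length(p, lengths))                 # average bits per sample
print(entropy(p) + kl_divergence(p, q))            # same value, up to rounding
```

The last two printed values agree, which illustrates that the average code length equals the entropy plus the divergence from p to the distribution implied by the code lengths.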

Dependency
If the values in a time series are not independent, however, the dependencies can be used to achieve even better compression. This high compression results from the fact that for dependent values, the joint entropy is lower than the sum of entropies of [...] the other values in the series are known, because we can recognize patterns in the series, that therefore contain information about themselves. Hydrological time series often show strong internal depen encies, leading to better compression and better prediction. Consider, for example, the case where you are asked to assign probabilities (or code lengths) to possible streamflow values on 12 May 1973. In one case, the information offered is the dark-colored climatological histogram (Fig. 2 on the right), in the second case, the time series is available (the left of the same figure). Obviously, the e pected compression and expected return for the bets are better in the second case, which shows the value of exploiting dependencies in the data. The surprise (−log P_true value) upon hearing the true value is 3.72 bits in case the guessed distribution was assumed and 4.96 bits when using the climate as prior. These surprises are equivalent to the divergence scores proposed in Weijs et al. (2010b).

Another example is given by the omitted characters that the careful reader may (not) have found in the previous paragraph. There are 48 different characters used, but the entropy of the text is 4.3 bits, far less than log_2 (48) = 5.6, because of, for example, the relatively high frequencies of the space (16 %) and the letter "e" (13 %). Although the entropy is more than 4 bits, the actual uncertainty about the missing letters is far less for most readers, because the structure in the text is similar to the English language and that structure can be used to predict the missing characters. On the one hand this means that the English language is compressible and therefore fairly inefficient. On the other hand this redundancy leads to more robustness in the communication, because even with many typographical errors, the meaning is still clear. If English were 100 % efficient, any error would obfuscate the meaning.
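The character-level entropy mentioned for the text example uses occurrence frequencies only and ignores all structure between characters. A minimal Python sketch of that calculation is given below; the sample sentence is an arbitrary stand-in, not the paragraph analyzed in the paper, so the printed numbers will differ from the 4.3 and 5.6 bits quoted above.

```python
import math
from collections import Counter

def char_entropy(text):
    """Entropy in bits per character, based on character frequencies only."""
    counts = Counter(text)
    n = len(text)
    return sum(c / n * math.log2(n / c) for c in counts.values())

# Arbitrary sample text (an assumption for illustration).
sample = "hydrological time series often show strong internal dependencies"

h = char_entropy(sample)
h_max = math.log2(len(set(sample)))  # entropy if all used characters were equally frequent

print(f"entropy: {h:.2f} bits, maximum for this alphabet: {h_max:.2f} bits")
```

The gap between the two numbers is the compression that a value by value entropy coder could exploit; the additional redundancy from spelling and grammar is not captured by this measure.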
In general, better prediction, i.e. less surprise, gives better results in compression. In water resources management and hydrology we are generally concerned with predicting one series of values from other series of values, like predicting streamflow (Q) from precipitation (P) and potential evaporation (E_p). In terms of data compression, knowledge of P and E_p would help in compressing Q, but would also be needed for decompression. When P, E_p and Q are compressed together in one file, the gain compared to compressing the files individually is related to what a hydrological model learns from the relation between these variables (Cilibrasi, 2007). Similarly, we can try to compress hydrological time series to investigate how much information those compressible series really contain for hydrological modeling.

Algorithmic information theory (AIT) was founded as a field by the appearance of three independent publications (Solomonoff, 1964; Chaitin, 1966; Kolmogorov, 1968). The theory looks at data through the lens of algorithms that can produce those data. The basic idea is that the information content of an object, like a data set, is related to the shortest way to describe it. Although description length generally depends on the language used, AIT uses the construct of a universal computer introduced by Turing (1937), the Universal Turing Machine (UTM), to show that this dependence takes the form of an additive constant, which becomes relatively less important when more data are available. Chaitin (1975) offered some refinements in the definitions of programs and showed a very complete analogy with Shannon's information theory, including e.g. the relations between conditional entropy and conditional program lengths.
Using the thesis that any computable sequence can be computed by a UTM and that program lengths are universal up to an additive constant (the length of the program that tells one UTM how to simulate another), Kolmogorov (1968) gave very intuitive definitions of complexity and randomness; see also Li and Vitanyi (2008) for more background. Kolmogorov defined the complexity of a certain string (i.e. data set, series of numbers) as the length of the minimum computer program that can produce that output on a UTM and then halt. Complexity of data is thus related to how complicated it is to describe. If there are clear patterns in the data, then they can be described by a program that is shorter than the data itself. The majority of conceivable strings of data cannot be "compressed" in this way. Data that cannot be described in a shorter way than literally stating those data is defined as random. This is analogous to the fact that a "law" of nature cannot really be called a law if its statement is more elaborate than the phenomenon that it explains. A problem with Kolmogorov complexity is that it is incomputable and can only be approached from above. This is related to the unsolvability of the halting problem (Turing, 1937): it is always possible that there exists a shorter program which is still running (possibly in an infinite loop) that might eventually produce the output and then halt. A paradox that would arise if Kolmogorov complexity were computable is the following definition, known as the Berry paradox: "The smallest positive integer not definable in under eleven words".
A shortcut approximation to measuring information content and complexity is to use a language that is sufficiently flexible to describe any sequence, while still exploiting most commonly found patterns. While this approach cannot discover all patterns, as a Turing complete description language can, it will offer an upper bound estimate without the problems of incomputability. Compressed files are such a language, which uses a decompression algorithm to recreate the object in its original, less efficient language. The compressed files can also be seen as programs for a computer, which is simulated by the decompression algorithm on another computer. Since the language is not Turing complete, it is less powerful than the original computer. The constant additional description length for some recursive patterns is replaced by one that grows indefinitely with growing amounts of data. As an example, one can think of trying to compress an ultra high resolution image of a fractal generated by a simple program. Although the algorithmic complexity with respect to the Turing complete executable fractal program language is limited by the size of the fractal program executable and its settings, the losslessly compressed output image will continue to grow with increasing resolution.
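The difference between a Turing complete description and a general-purpose compressor can be illustrated with a small Python sketch. Here a pseudo-random number generator stands in for the "simple program" of the fractal example (an assumption made for brevity): its output is fully described by a short program and a seed, yet the size of the zlib-compressed output keeps growing with the amount of data.

```python
import random
import zlib

def generated_series(n, seed=42):
    """Output of a short deterministic program: n bytes from a seeded generator."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))

def compressed_size(data):
    """Size in bytes after lossless compression with zlib (upper bound on information)."""
    return len(zlib.compress(data, 9))

for n in (1_000, 10_000, 100_000):
    print(n, compressed_size(generated_series(n)), compressed_size(bytes(n)))
```

The first compressed size grows roughly in proportion to n, although the generating program does not grow at all; the second (a constant series of zeros) stays small because its pattern is one that the compressor can describe.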

Compression experiment set-up
In this experiment, a number of compression algorithms are applied to different data sets to obtain an indication of the amount of information they contain. Most compression algorithms use entropy-based coding methods such as those introduced in the previous section, often enhanced by methods that try to discover dependencies and patterns in the data, such as autocorrelation and periodicity. The data compression perspective indicates that formulating a rainfall-runoff model has an analogy with compressing rainfall-runoff data. A short description of the data will contain a good model of them, whose predictive power outweighs the description length of the model. However, not all patterns found in the data should be attributed to the rainfall-runoff process. For example, a series of rainfall values is highly compressible due to the many zeros (a far from uniform distribution), the autocorrelation, and the seasonality. These dependencies are in the rainfall alone and can tell us nothing about the relation between rainfall and runoff. The amount of information that the rainfall contains for the hydrological model is thus less than the number of data points multiplied by the number of bits to store rainfall at the desired precision. This amount is important because it determines the model complexity that is warranted by the data (Schoups et al., 2008). In fact, we are interested in the Kolmogorov complexity of the data, but this is incomputable. A crude practical approximation of the complexity is the file size after compression by some commonly available compression algorithms. This provides an upper bound for the information in the data.
If the data can be regenerated perfectly from the compressed (colloquially referred to as zipped) files, the compression algorithm is said to be lossless. In contrast to this, lossy compression introduces some small errors in the data. Lossy compression is mainly used for various media formats (pictures, video, audio), where these errors are often beyond our perceptive capabilities. This is analogous to a model that generates the observed values to within measurement precision, which could be a way to account for uncertainties in observation (Beven and Westerberg, 2011; Westerberg et al., 2011; Weijs and Van de Giesen, 2011). In this paper, we consider only lossless compression. Roughly speaking, the file size that remains after compression gives an upper bound for the amount of information in the time series. Strictly, the code length of the decompression algorithm should also be counted towards this file size (cf. a self-extracting archive). In the present exploratory example, the inclusion of the algorithmic complexity of the decompression algorithm is not so relevant, since the algorithm is general purpose and not biased towards hydrological data. This means that any specific pattern still needs to be stored in the compressed file. The compression algorithms will be mainly used to explore the difference in information content between different signals.

Due to the limited amount of data, quantization is necessary to make meaningful estimates of the distributions, which are needed to calculate the amount of information and compression. This is analogous to the maximum number of bins permitted to draw a representative histogram. As will be argued in the discussion, different quantizations imply different questions for which the information content of the answers is analyzed.
All series were first quantized to 8 bit precision, using a simple linear quantization scheme (Eq. 5). Using this scheme, the range of each series was split into 2^8 = 256 equal intervals and each value was converted into an 8 bit unsigned integer (an integer ranging from 0 to 255 that can be stored in 8 binary digits).
These can be converted back to real numbers using [...]. Because of the limited precision achievable with 8 bits, x_quantized ≠ x. This leads to rounding errors, which can be quantified as a signal to noise ratio (SNR). The SNR is the ratio of the variance of the original signal to the variance of the rounding errors.
Because the SNR can have a large range, it is usually expressed on a logarithmic scale, in the unit decibel: SNR_dB = 10 log_10 (SNR).
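The quantization and the signal to noise ratio can be sketched in a few lines of Python. The exact form of the quantization equation is not reproduced in the text above, so the min-max scaling with rounding used here is an assumption, as are the function names and the sine test signal.

```python
import math

def quantize(x, bits=8):
    """Linear quantization to integers 0 .. 2**bits - 1 (assumed min-max scaling)."""
    lo, hi = min(x), max(x)
    top = 2 ** bits - 1
    if hi == lo:                      # constant series: a single level suffices
        return [0] * len(x), lo, hi
    return [round((v - lo) / (hi - lo) * top) for v in x], lo, hi

def dequantize(codes, lo, hi, bits=8):
    """Map integer codes back to real numbers on the original range."""
    top = 2 ** bits - 1
    return [lo + c / top * (hi - lo) for c in codes]

def variance(x):
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)

def snr_db(x, x_back):
    """Signal to noise ratio in decibel: 10 log10(var(signal) / var(rounding errors))."""
    errors = [a - b for a, b in zip(x, x_back)]
    return 10 * math.log10(variance(x) / variance(errors))

signal = [math.sin(0.01 * t) for t in range(5000)]   # hypothetical test signal
codes, lo, hi = quantize(signal)
restored = dequantize(codes, lo, hi)
print(min(codes), max(codes), round(snr_db(signal, restored), 1))
```

For a series with a few extreme peaks, most values fall into a small number of the 256 levels, which lowers the SNR; this is one reason why the choice of quantization matters for the results.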

Compression algorithms
The algorithms that were used are a selection of commonly available compression programs and formats. Below are very short descriptions of the main principles and main features of each of the algorithms used and some references for more detailed descriptions. The descriptions are sufficient to understand the most significant patterns in the results. It is beyond the scope of this paper to describe the algorithms in detail.
-WAVPACK: A lossless compression algorithm for audio files.
-JPG: The Joint Photographic Experts Group created the JPEG standard, which includes a range of lossless and lossy compression techniques. Here the lossless coding is used, which uses a Fourier-like type of transform (discrete cosine transform) followed by Huffman coding of the errors.
-HDF RLE: HDF (hierarchical data format) is a data format for scientific data of any form, including pictures, time series and metadata. It can use several compression algorithms, including run length encoding (RLE). RLE replaces sequences of reoccurring data with the value and the number of repetitions. It would therefore be useful to compress pictures with large uniform surfaces and rainfall series with long dry periods.
[...] to the previous location where the sequence occurred. The method is followed by range coding. Range coding (Martin, 1979) is an entropy-coding method which is mathematically equivalent to arithmetic coding (Rissanen and Langdon, 1979); it has less overhead than Huffman coding.
-PNG: Portable Network Graphics (PNG) uses a filter based on prediction of one pixel from the preceding pixels. Afterward, the prediction errors are compressed by the algorithm "Deflate" which uses dictionary coding (matching repeating sequences) followed by Huffman coding.

-TIFF: A container image format that can use several compression algorithms. In this case PackBits compression was used, which is a form of run length encoding.
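To make the principle of run length encoding concrete, the following Python sketch encodes a short, made-up rainfall series with long dry periods as (value, repetitions) pairs. It illustrates the idea only; the actual byte layouts used by HDF RLE and PackBits differ and are not reproduced here.

```python
def rle_encode(data):
    """Replace each run of identical values by a (value, run length) pair."""
    runs = []
    for value in data:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs

def rle_decode(runs):
    """Expand (value, run length) pairs back into the original byte series."""
    return bytes(value for value, count in runs for _ in range(count))

# Hypothetical quantized rainfall: long dry spells with a few wet days.
rainfall = bytes([0] * 20 + [5, 7] + [0] * 30 + [3])

runs = rle_encode(rainfall)
print(runs)                       # [(0, 20), (5, 1), (7, 1), (0, 30), (3, 1)]
print(rle_decode(runs) == rainfall)
```

The 53 values are described by 5 pairs, whereas a series without repeated neighbors would need one pair per value and would therefore grow rather than shrink.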

Experiment A: comparison on generated and hydrological time series
In the first experiment, the algorithms are tested on a real world hydrological data set from Leaf River (MS, USA) consisting of rainfall, potential evaporation and streamflow. See e.g. Vrugt et al. (2003) for a description of this data set. As a reference, various artificially generated series were used. The generated series consist of 50 000 values, while the time series of the Leaf River data set contain 14 610 values (40 yr of daily values). The following series were used in this experiment. All are quantized directly with the linear scheme using Eq. (5).

Experiment B: Compression with a hydrological model
The second experiment is a first exploration of jointly compressing time series. In the previous experiment single time series were compressed to obtain an indication of their information content. Given the connection between modeling and data compression, a hydrological model should in principle be able to compress hydrological data. [...] can be useful to identify good models in information-theoretical terms, but can also be useful for actual compression of hydrological data. Although a more detailed analysis is left for future work, we perform a first test of estimating the performance of hydrological models using data compression tools. The hydrological model HYMOD was used to predict discharge from rainfall for the Leaf River data set; see e.g. Vrugt et al. (2009) for a description of model and data. Subsequently, the modeled discharges were quantized using the same quantization scheme as the observed discharges. An error signal was defined by subtracting the modeled (Q_mod) from the observed (Q) quantized discharge. This gives a signal that can range from −255 to +255, but because the errors are sufficiently small, it ranged from −55 to +128, which allows for 8 bit coding. Because the observed discharge signal (Q) can be reconstructed from the precipitation time series (P), the model, and the stored error signal (Q_err), the model could enable compression of the data set consisting of P and Q. In the experiment we test whether the error series is indeed more compressible than the original time series of Q.
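The mechanics of storing an error signal instead of the observations can be sketched as follows in Python. The series below are synthetic stand-ins (a sine as "model" and added integer noise as "observations"), not HYMOD output or Leaf River data, and the function names are hypothetical. The sketch only shows that the errors fit in one byte after an offset and that the observed series is recovered exactly; it makes no claim about which series compresses better for real data.

```python
import math
import random
import zlib

def encode_errors(observed, modeled):
    """Store observed minus modeled as unsigned bytes, shifted by an offset."""
    errors = [o - m for o, m in zip(observed, modeled)]
    offset = -min(errors)
    if max(errors) + offset > 255:
        raise ValueError("error range does not fit in 8 bits")
    return bytes(e + offset for e in errors), offset

def reconstruct(modeled, error_bytes, offset):
    """Recover the observed series from the model output and the stored errors."""
    return [m + b - offset for m, b in zip(modeled, error_bytes)]

rng = random.Random(1)
n = 2000
q_mod = [round(128 + 100 * math.sin(2 * math.pi * t / 365)) for t in range(n)]  # synthetic "model"
q_obs = [m + rng.randint(-10, 10) for m in q_mod]                               # synthetic "observations"

err_bytes, offset = encode_errors(q_obs, q_mod)
print(reconstruct(q_mod, err_bytes, offset) == q_obs)   # lossless reconstruction
print(len(zlib.compress(bytes(q_obs), 9)), len(zlib.compress(err_bytes, 9)))
```

In a complete accounting, the description length of the model and of its inputs would have to be added to the size of the compressed error file.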

Experiment C: Compression of hydrological time series from the MOPEX data set
In a third experiment, we looked at the spatial distribution of compressibility for daily streamflow and rainfall data in the 431 river basins across the continental USA, as contained in the MOPEX data set. This should give some indication about the information content or complexity of the time series. For these experiments, the streamflow values are log-transformed before quantization, to reflect the heteroscedastic uncertainty in the measurements. Missing values, which were infrequent, were removed from the series. Although this can have some impact on the ability to exploit autocorrelation and periodicity, the effect is deemed to be small and has a smaller influence than other strategies such as replacing the missing values by zero or a specific marker. Results of this compression experiment are presented in Sect. 4.3.
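The preprocessing described for this experiment can be sketched in Python as below. The sketch assumes strictly positive flows, represents missing values as `None` or NaN, and reuses an assumed min-max quantization; these choices and the function name are illustrative and not taken from the paper.

```python
import math

def preprocess_streamflow(values, bits=8):
    """Drop missing values, log-transform, and quantize linearly in log space."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    logs = [math.log(v) for v in present]          # assumes all flows are > 0
    lo, hi = min(logs), max(logs)
    top = 2 ** bits - 1
    return bytes(round((v - lo) / (hi - lo) * top) for v in logs)

flows = [1.0, 10.0, float("nan"), 100.0, 1000.0]   # hypothetical flows with one gap
print(list(preprocess_streamflow(flows)))           # [0, 85, 170, 255]
```

Each factor of ten in flow covers the same number of quantization levels, so low flows are resolved with more absolute precision than high flows. Under a linear quantization of the same four values, 1 and 10 would fall into almost the same level.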

Results of the compression experiments
This section shows results from the compression analysis for single time series. Also an example of compression of discharge, using a hydrological model in combination with knowledge of rainfall, is shown.
4.1 Results A: generated data

As expected, the file sizes after quantization are exactly equal to the number of values in the series, as each value is encoded by one byte (8 bits) and stored in raw binary format. From the occurrence frequencies of the 256 unique values, the entropy of their distribution was calculated. Normalized with the maximum entropy of 8 bits, the fractions in row 3 of Table 2 give an indication of the entropy bound for the ratio of compression achievable by value by value entropy encoding schemes such as Huffman coding, which do not use temporal dependencies. The signal to noise ratios in row 4 give an indication of the amount of data corruption that is caused by the quantization. As a reference, the uncompressed formats BMP (Bitmap), WAV (Waveform audio file format), and HDF (Hierarchical Data Format) are included, indicating that the file size of those formats, relative to the raw data, does not depend on what data are in them, but does depend on the amount of data, because they have a fixed overhead that is relatively smaller for larger files.
The results for the various lossless compression algorithms are shown in rows 7-17. The numbers are the percentage of the file size after compression, relative to the original file size (a lower percentage indicates better compression). The best compression ratios per time series are highlighted. From the results it becomes clear that the constant, linear and periodic signals can be compressed to a large extent. Most algorithms achieve this high compression, although some have more overhead than others. The uniform white noise is theoretically incompressible, and indeed none of the algorithms appears to know a clever way around this. In fact, the smallest file size is achieved by the WAV format, which does not even attempt to compress the data and has a relatively small file header (meta information about the file format). The Gaussian white noise is also completely random in time, but does not have a uniform distribution. Therefore the theoretical limit for compression is the entropy bound of 86.3 %. The WAVPACK algorithm gets closest to the theoretical limit, but also several file archiving algorithms (ARJ, PPMD, LZMA, BZIP2) approach that limit very closely. This is because they all use a form of entropy coding as a back-end (Huffman and range coding). Note that the compression of this non-uniform white noise signal is equivalent to the difference in uncertainty or information gain due to knowledge of the occurrence frequencies of all values (the climate), compared to a naive uniform probability estimate; cf. the first two bars in Fig. 1 of Weijs et al. (2010a).

The results for the hydrological series firstly show that the streamflow series is better compressible than the precipitation series. This is remarkable, because the rainfall series has the lower entropy.
Furthermore, it can be seen that for the rainfall series, the entropy bound is not achieved by any of the algorithms, presumably because of the overhead caused by the occurrence of 0 rainfall more than 50 percent of the time; see Eqs. (3) and (4). Further structure like autocorrelation and seasonality cannot be used sufficiently to compensate for this overhead. In contrast to this, the streamflow series can be compressed to well below the entropy bound (27.7 % vs. 42.1 %), because of the strong autocorrelation in the data. These dependencies are best exploited by the PPMD algorithm, which uses a local prediction model that apparently can predict the correlated values quite accurately. Many of the algorithms cross the entropy bound, indicating that they use at least part of the temporal dependencies in the data.
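The qualitative pattern of these results can be explored with compressors from the Python standard library. The sketch below is not a reproduction of Table 2: the series are generated with assumed parameters (for example a standard deviation of 20 levels for the Gaussian noise), and zlib, bz2 and lzma stand in for the programs listed in the text. It compares the compressed fraction with the entropy bound computed from the byte histogram.

```python
import bz2
import lzma
import math
import random
import zlib
from collections import Counter

def entropy_fraction(data):
    """Entropy of the byte histogram, normalized by the 8 bit maximum."""
    n = len(data)
    h = sum(c / n * math.log2(n / c) for c in Counter(data).values())
    return h / 8

def compressed_fractions(data):
    """Compressed size relative to raw size for three general-purpose compressors."""
    compressors = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}
    return {name: len(f(data)) / len(data) for name, f in compressors.items()}

rng = random.Random(0)
n = 50_000
series = {
    "constant": bytes([100]) * n,
    "periodic": bytes(round(128 + 100 * math.sin(2 * math.pi * t / 365)) for t in range(n)),
    "uniform noise": bytes(rng.randrange(256) for _ in range(n)),
    "gaussian noise": bytes(min(255, max(0, round(rng.gauss(128, 20)))) for _ in range(n)),
}

for name, data in series.items():
    fractions = {k: round(v, 3) for k, v in compressed_fractions(data).items()}
    print(f"{name:15s} entropy bound {entropy_fraction(data):.3f}  compressed {fractions}")
```

Series with temporal structure (constant, periodic) end up far below their entropy bound, whereas the two noise series cannot be compressed below it, in line with the interpretation given above.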

Results B: Compression with a hydrological model
We analyzed the time series of Q and P for Leaf River, along with the modeled Q (Q_mod) and its errors (Q_err). In Table 3, the entropies of the signals are shown. The second row shows the resulting file size as a percentage of the original file size for the best compression algorithm for each series (PPMD or LZMA). The table also shows the statistics for the series where the order of the values was randomly permuted (Q^perm and Q^perm_err). As expected, this does not change the entropy, because that depends only on the histograms of the series. In contrast, the compressibility of the signals is significantly affected, indicating that the compression algorithms made use of the temporal dependence for the non-permuted signals. The joint distribution of the modeled and observed discharges was also used to calculate the conditional entropy H(Q|Q_mod). It must be noted, however, that this conditional entropy is probably underestimated, as it is based on a joint distribution with 255^2 probabilities estimated from 14 610 value pairs. This is the cost of estimating dependency without limiting it to a specific functional form. The estimation of mutual information needs more data than Pearson correlation, because the latter is limited to a linear setting and looks at variance rather than uncertainty. In the description length, the underestimation of H(Q|Q_mod) is compensated by the fact that the dependency must be stored as the entire joint distribution. If representative for the dependence in longer data sets, the conditional entropy gives a theoretical limit for compressing Q with knowledge of P and the model, while not making use of temporal dependence. A somewhat unexpected result is that the errors seem more difficult to compress (31.5 %) than the observed discharge itself (27.7 %), even though the entropy is lower.
Apparently the reduced temporal dependence in the errors (lag-1 autocorrelation coefficient ρ = 0.60), compared to that of the discharge (ρ = 0.89), offsets the gain in compression due to the lower entropy of the errors. Possibly, the temporal dependence in the errors becomes too complex to be detected by the compression algorithms. Further research is needed to determine the exact cause of this result, which should be consistent with the theoretical idea that the information in P should reduce uncertainty in Q. The Nash-Sutcliffe Efficiency (NSE) of the model over the mean is 0.82, while the NSE over the persistence forecast (Q_mod(t) = Q(t−1)) is 0.18 (see Schaefli and Gupta, 2007), indicating a reasonable model performance. Furthermore, the difference between the conditional entropy and the entropy of the errors could indicate that an additive error model is not the most efficient way of coding and consequently not the most efficient tool for probabilistic prediction. The use of, for example, heteroscedastic probabilistic forecasting models (e.g. Pianosi and Soncini-Sessa, 2009) for compression is left for future work.
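The effect of random permutation can be demonstrated with a small Python sketch. A bounded random walk is used as a synthetic, strongly autocorrelated stand-in for a discharge series, and zlib stands in for the compressors used in the paper; both are assumptions for illustration.

```python
import math
import random
import zlib
from collections import Counter

def entropy_bits(data):
    """Entropy in bits per value, from the histogram only."""
    n = len(data)
    return sum(c / n * math.log2(n / c) for c in Counter(data).values())

def compressed_fraction(data):
    return len(zlib.compress(data, 9)) / len(data)

rng = random.Random(0)
values, level = [], 128
for _ in range(14_610):                      # same length as the Leaf River series
    level = min(255, max(0, level + rng.choice((-1, 0, 1))))
    values.append(level)

original = bytes(values)
shuffled = list(values)
rng.shuffle(shuffled)                        # destroys temporal order, keeps the histogram
permuted = bytes(shuffled)

print(round(entropy_bits(original), 3), round(entropy_bits(permuted), 3))
print(round(compressed_fraction(original), 3), round(compressed_fraction(permuted), 3))
```

The entropies are identical, while the permuted series compresses much less, which isolates the part of the compression that comes from temporal dependence.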

Results C: MOPEX data set
For the time series of the quantized scaled log streamflow and scaled quantized rainfall of the MOPEX basins, from now on simply referred to as streamflow (Q) and rainfall (P) for brevity, the compressibility and entropy show clear spatial patterns. For most of the streamflow time series, the entropy is close to 8 bits, indicating that the frequency distribution of the preprocessed streamflow does not diverge much from a uniform distribution. Exceptions are the basins in the central part of the USA, which show lower entropy time series due to high peaks and relatively long, low base flow periods. Also for the rainfall, entropy values are lower in this region due to longer dry spells; see Fig. 3. Compression beyond the entropy bound can be achieved by using temporal patterns. This is visible in Fig. 4, where the compression ratio of the best performing algorithm is visualized relative to the entropy of the signals. Different algorithms are specialized in describing different kinds of patterns, so the map of best performing algorithms (Fig. 5) can be used as an indication of which types of patterns are found in the data. In Fig. 6, two influences on compression rate are shown. Firstly, due to temporal dependencies in the streamflow, the conditional entropy given the previous value, H(Q_t|Q_{t−1}), known as the entropy rate H′(Q), is much lower than the entropy itself. This could theoretically lead to a compression describing the signal with H′(Q) bits per time step. However, because of the relatively short length of the time series compared to the complexity of the model that describes it (a two dimensional 256 bin histogram), this compression is not reached in practice, because the model needs to be stored too.
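The conditional entropy given the previous value can be estimated from the two dimensional histogram of consecutive values, using H(Q_t|Q_{t−1}) = H(Q_t, Q_{t−1}) − H(Q_{t−1}). The Python sketch below does this for a synthetic bounded random walk (an assumed stand-in for a quantized streamflow series) and also counts how many histogram cells are occupied, as a rough indication of the size of the model that would have to be stored.

```python
import math
import random
from collections import Counter

def entropy_bits(symbols):
    """Entropy in bits of the empirical distribution of a sequence of symbols."""
    n = len(symbols)
    return sum(c / n * math.log2(n / c) for c in Counter(symbols).values())

def conditional_entropy_bits(current, given):
    """H(X|Y) = H(X, Y) - H(Y), estimated from the joint histogram."""
    return entropy_bits(list(zip(current, given))) - entropy_bits(list(given))

rng = random.Random(0)
series, level = [], 128
for _ in range(14_610):
    level = min(255, max(0, level + rng.choice((-1, 0, 1))))
    series.append(level)

h = entropy_bits(series)
h_rate = conditional_entropy_bits(series[1:], series[:-1])
occupied_cells = len(set(zip(series[1:], series[:-1])))

print(f"entropy {h:.2f} bits, conditional entropy {h_rate:.2f} bits, "
      f"{occupied_cells} occupied cells in the 256 x 256 histogram")
```

For real series with many occupied cells and few samples per cell, this estimate is biased low, which is the underestimation discussed for H(Q|Q_mod) in the previous section.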

Discussion
The data compression results give an indication of the information content or complexity of the data. Eventually, these may be linked to climate and basin characteristics and become a tool for hydrological time series analysis and inference. Although information theory may eventually provide a solid foundation for hydrological modeling, it is also important to first consider the limitations of such approaches. In this paper, we discuss some inherent issues in quantifying the information content, which make the results subjective and not straightforward to analyze.

How much information is contained in this data?
From the presented theoretical background, results, and analysis it can be concluded that although information theory can quantify information content, the outcome depends on a number of subjective choices. These subjective choices include the quantization, auxiliary data, and prior knowledge used. The quantization can be linked to what question the requested information answers. When quantizing streamflow into 256 equally sized classes, the question that is implicitly posed is: "in which of these equally spaced intervals does the streamflow fall?". When the logarithm of the streamflow is used instead, the intervals change, and therefore also the questions change. The question requests more absolute precision on the lower flows than on the higher flows. The information contained in the answers given by the data, i.e. the information content of the time series, depends on the question that is asked.
The information content of a time series also depends on what prior knowledge one has about the answers to the question asked. If one knows the frequency distribution but has no knowledge of surrounding values, the prior knowledge takes the form of a probability distribution that matches the observed frequencies. In that case, the expected information content of each observation is the entropy of that distribution. The entropy in bits gives the limit of the minimum average space per observation needed to store a long i.i.d. time series of that distribution. In many situations in practice, however, prior knowledge does not include knowledge of the occurrence frequencies, or does include more knowledge than frequencies alone, e.g. temporal dependencies. In the first case, the information content of the data should also include the knowledge gained from observing the frequencies. Likewise, in compression, the optimal coding table, which depends on the frequencies, should be stored and adds to the file size. One could see the histogram as a simple form of a model that is inferred from the data. The model generally forms part of the information content.
In the second case, temporal dependencies reduce the average information content per observation. Even when the form of the temporal dependencies is not known a priori but inferred from the data, they can decrease the information content, if the gain in compression offsets the space needed to store the model describing the dependencies. In the theoretical framework of algorithmic information theory, model and data are unified in one algorithm (which one could see as a self-extracting archive), and the length of the shortest algorithm that reproduces the data is the information content, or Kolmogorov complexity (Kolmogorov, 1968).
Flexible data compression algorithms, such as those used in this paper, are able to give an upper bound for the information content of hydrological data, because they are not specifically tuned towards hydrological data. All patterns inferred from the data are stored in the compressed file and very little is considered as prior information. Theoretically, prior information can be explicitly fed to new compression algorithms in the form of auxiliary data files (e.g. rainfall to compress runoff) or function libraries (e.g. hydrological models), which should reduce the information content of the data due to the increase in prior knowledge.
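As a rough illustration of how general-purpose compressors yield an upper bound in bits per sample, the sketch below applies three compressors from the Python standard library to two synthetic 8-bit signals. The choice of `zlib`, `bz2` and `lzma` and the helper `bits_per_sample` are assumptions made for this example and are not necessarily the algorithms evaluated in the paper.

```python
import bz2
import lzma
import zlib

import numpy as np

def bits_per_sample(data: bytes) -> dict:
    """Compressed size in bits per 8-bit sample for each compressor."""
    sizes = {
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }
    return {name: 8.0 * size / len(data) for name, size in sizes.items()}

# Two hypothetical test signals of 50 000 samples each (assumptions):
# 100 sinusoidal waves, and uniformly distributed white noise.
rng = np.random.default_rng(2)
t = np.arange(50_000)
sine = np.round(127.5 * (1 + np.sin(2 * np.pi * t * 100 / t.size))).astype(np.uint8)
noise = rng.integers(0, 256, size=t.size, dtype=np.uint8)

sine_rates = bits_per_sample(sine.tobytes())
noise_rates = bits_per_sample(noise.tobytes())
print("sine :", sine_rates)
print("noise:", noise_rates)
```

The smallest value per signal is the tightest upper bound found by these compressors. The periodic signal is described in far fewer than 8 bits per sample, whereas the noise cannot be compressed below roughly 8 bits per sample, and the container overhead pushes it slightly above.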
Summarizing, we can state that information content of data depends on (1) what question we ask the data, and (2) how much is already known about the answer before seeing the data.

Aleatoric and epistemic uncertainty
In current hydrological literature, attempts are sometimes made to separate epistemic (due to incomplete knowledge of the process) from aleatoric (the "inherent" randomness in the system) uncertainty. The approach to answer this question is equivalent to trying to separate pattern from scatter (signal from noise) in high dimensional data spaces, to see how much of the variability can potentially be explained by any model. However, the inherent problem in answering this question is the subjectivity of what we call pattern and what we call scatter. Although model complexity control methods can give guidelines on how much pattern can be reasonably inferred from data, they usually do not account for prior knowledge. This prior knowledge may affect to a large degree what is considered a pattern, for example by constraining the model class that is used to search for patterns or by introducing knowledge of underlying physics. In the algorithmic information theory sense, this can be equivalently expressed either as prior knowledge favoring certain long (so otherwise unlikely) programs that describe the data, or prior knowledge favoring a certain reference computer or language, which offers a shorter description for that specific pattern.
As a somewhat extreme, unlikely, but illustrative example, consider that we encounter 100 consecutive digits of π as a streamflow time series. Our prior hydrological knowledge would lead us to regard those values as random and as containing a large amount of information (no internal dependence or predictability). With different prior knowledge, however, for example that the data is the output of a computer program authored by a student, we would consider the data as having a pattern, and could use this to make predictions or compress the data (by inferring one of the possible programs that enumerate digits of π as a probable source of the data). There would be little surprise in the second half of the data, given the first.

Conclusions
Determining the information content of data is a process similar to building a model of the data or compressing the data. These processes are subject to prior knowledge, and therefore this knowledge should be explicitly considered in determining information content. Quantization of the data can be seen as a formulation of the question the data is asked to give information about. Upper bounds for the information content for that question can then be found using compression algorithms on the quantized data. A hydrological model is in fact such a compression tool. It makes use of the dependencies between, for example, rainfall and streamflow. The patterns that are already present in the rainfall reduce the information that the hydrological model can learn from: a long dry period could, for example, be summarized by one parameter for an exponential recession curve in the streamflow. The information available for a rainfall-runoff model could theoretically be estimated by comparing the file size of compressed rainfall plus the file size of compressed streamflow with the size of a file where rainfall and streamflow are compressed together, exploiting their mutual dependencies. This could be denoted as $|ZIP(P)| + |ZIP(Q)| - |ZIP(P, Q)|$, where $|ZIP(X)|$ stands for the file size of a theoretically optimal compression of data $X$, which includes the size of the decompression algorithm. This brings us back to the ideas of algorithmic information theory, which uses the lengths of programs that reproduce data on computers (Turing machines). The shortening in description length when merging input and output data, i.e. the compression progress, could be seen as the amount of information learned by modeling. The hydrological model that is part of the decompression algorithm embodies the knowledge gained from the data. Further explorations of these ideas from algorithmic information theory are expected to put often-discussed issues in hydrological model inference in a wider perspective with more general and robust foundations.
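The comparison of separate and joint compressed file sizes can be sketched with a practical compressor standing in for the theoretically optimal one. Everything below is a toy under stated assumptions: `zip_size` uses `lzma` from the Python standard library, the rainfall `p` is synthetic, and the "catchment" simply delays rainfall by one time step, a dependence far simpler than any real rainfall-runoff relation.

```python
import lzma

import numpy as np

def zip_size(data: bytes) -> int:
    """Compressed size in bytes; a practical stand-in for |ZIP(X)|."""
    return len(lzma.compress(data))

# Synthetic 8-bit rainfall: dry on about 70% of the days (assumption).
rng = np.random.default_rng(3)
n = 5_000
wet = rng.random(n) < 0.3
p = np.where(wet, rng.integers(1, 256, size=n), 0).astype(np.uint8)

# Toy "catchment": streamflow equals the rainfall of the previous time step.
q = np.roll(p, 1)

size_p = zip_size(p.tobytes())
size_q = zip_size(q.tobytes())
size_pq = zip_size(p.tobytes() + q.tobytes())  # compressed together

shared_bytes = size_p + size_q - size_pq
print(f"|ZIP(P)|={size_p}, |ZIP(Q)|={size_q}, |ZIP(P,Q)|={size_pq}, "
      f"difference={shared_bytes} bytes")
```

Because the compressor can reuse the rainfall pattern when it encounters the delayed copy, the joint file is much smaller than the two separate files. For real rainfall and streamflow, a generic compressor would probably miss most of the dependence, so this difference would only be a lower estimate of what a hydrological model could exploit.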

Correspondence of resolution-reliability-uncertainty decomposition to compression and structure
In this appendix, we give a data-compression interpretation of Kullback-Leibler divergence as a forecast skill score and its decomposition into uncertainty, reliability and resolution, as proposed in Weijs et al. (2010b). As noted in Sect. 2.1, when observations have distribution $p$, but an optimal fixed coding is chosen assuming the distribution is $q$, the expected average code length per observation is given by $H(p) + D_{KL}(p \| q)$. The code length is related to the remaining uncertainty, i.e. the missing information: the amount of information that remains to be specified to reproduce the data. In terms of forecast evaluation and the decomposition presented in Weijs et al. (2010b), using the same notation, this remaining uncertainty is the divergence score associated with a forecast with zero resolution (forecasts do not change) and non-zero reliability (forecast distribution $f$ is not equal to the climatological distribution $\bar{o}$): $DS = H(\bar{o}) + D_{KL}(\bar{o} \| f) = UNC + REL$.
The resolution term, given by the Kullback-Leibler divergence from the marginal distribution $\bar{o}$ to the conditional distributions of observations $\bar{o}_k$, given forecast $f_k$, is zero since $\bar{o}_k = \bar{o}$ for an unconditioned, constant forecast (code for compression).
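The fixed-code case described above can be checked numerically. The sketch below takes the code lengths 1, 2, 3 and 3 bits from the Kraft inequality example in Sect. 2.1; the frequencies `p` are hypothetical values chosen only for illustration.

```python
import numpy as np

# Code lengths in bits; the code is optimal for q_i = 2 ** (-l_i).
code_lengths = np.array([1, 2, 3, 3])
q = 2.0 ** -code_lengths                 # (1/2, 1/4, 1/8, 1/8)

# Hypothetical true frequencies of the four events (assumption).
p = np.array([0.4, 0.3, 0.2, 0.1])

entropy = float(-(p * np.log2(p)).sum())           # H(p), the lower bound
kl = float((p * np.log2(p / q)).sum())             # D_KL(p || q), the overhead
expected_length = float((p * code_lengths).sum())  # average bits per event

print(f"H(p) = {entropy:.4f}, D_KL = {kl:.4f}, "
      f"H(p) + D_KL = {entropy + kl:.4f}, expected length = {expected_length:.4f}")
```

The expected code length, $0.4 \cdot 1 + 0.3 \cdot 2 + 0.2 \cdot 3 + 0.1 \cdot 3 = 1.9$ bits, equals $H(p) + D_{KL}(p \| q)$, so the Kullback-Leibler divergence is exactly the price paid for coding with the wrong distribution.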

When data with temporal dependencies is compressed, a lower average code length per observation can be achieved, since we can use a dynamically changing coding for the next observations, depending on the previous ones. In terms of forecast quality, this means that the individual probability estimates now have non-zero resolution. This resolution, which is equivalent to the mutual information between the forecast based on the past time series and the value to code, will reduce the average code length per observation.
Since the individual forecasts will also not be completely reliable, the average code length per observation will now have a contribution from each term in the decomposition of the divergence score,
$$DS = H(\bar{o}) - \sum_{k} \frac{n_k}{N} D_{KL}(\bar{o}_k \| \bar{o}) + \sum_{k} \frac{n_k}{N} D_{KL}(\bar{o}_k \| f_k) = UNC - RES + REL,$$
where $n_k$ is the number of observations for which unique forecast no. $k$ is given and $N$ is the total number of observations. When compressing data, however, the prediction model that describes the temporal dependence needs to be stored as well. Therefore, the average total code length per data point will become
$$UNC - RES + REL + \frac{L(\mathrm{model})}{N},$$
where $L(\mathrm{model})$ is the length of the model algorithm. Although this model length is language dependent, it is known from AIT that this dependence is just an additive constant, and can be interpreted as the prior knowledge encoded in the language. If the language is not specifically geared towards a certain type of data, the total code length will give a fairly objective estimate of the amount of new information in the data, which cannot be explained from the data itself. The number of bits per sample needed to store data can therefore be interpreted as a complexity-penalized version of the divergence score presented in Weijs et al. (2010a,b), applied to predictions of the data based on previous time steps. We can make the following observations. Firstly, data can only be compressed if there is a pattern, i.e. something that can be described by an algorithm where the resolution or gain in description efficiency or predictive power

Gaussian white: output from the Matlab® function "randn", normally distributed white noise
sine 1: single sinusoidal wave with a wavelength spanning all 50 000 values
sine 100: 100 sinusoidal waves, each with a wavelength spanning 1/100 of 50 000 values
Leaf P: daily rainfall series from the catchment of Leaf river
Leaf Q: corresponding daily series of observed streamflow in Leaf river

HESSD, 10, 2013. Data compression to define information content. S. V. Weijs et al.

Fig. 1. Assigning code lengths proportional to minus the log of their probabilities leads to optimal compression. Code B is optimal for distribution II, but not for the other distributions. Distribution III has no optimal code that achieves the entropy bound.
distribution II. Both these codes achieve the entropy bound. Code B is also an optimal Huffman code for distribution III (last column in Fig. 1). Although the expected code length is now more than the entropy, it is impossible to find a shorter code. The overhead is equal to the Kullback-Leibler divergence from the true distribution (III) to the distribution for which the code would be optimal. If the requirement that the codes are value by value (one code for each observation) is relaxed, blocks of values can be grouped together to approach an ideal probability distribution. When the series are long enough, entropy coding methods like Shannon and Huffman coding using blocks can get arbitrarily close to the entropy bound (Cover and Thomas, 2006). This happens

Fig. 6. Left: best compression of Q against entropy and against entropy rate. Temporal dependencies cause better compression than the entropy, but model complexity prevents achieving the entropy rate. Right: the best achieved compression of P depends strongly on the percentage of dry days, mostly through the entropy. Also the best performing algorithm changes with the climate.