Interactive comment on “ MSWEP : 3-hourly 0 . 25 global gridded precipitation ( 1979 – 2015 ) by merging gauge , satellite , and reanalysis data ”

Short Comments I concur with previous reviewers that the paper is well written and highly of interest for the hydrological community. I believe that the idea of merging satellite-based, gaugebased and reanalysis precipitation data for taking advantage of the benefits of each product is good, and surprisingly not largely investigated so far. I have only two short comments that, in my opinion, should be addressed. 1) The gauge-based evaluation of precipitation datasets is carried out by using the GHCN –D database. I am fully aware that it is difficult to determine the quality of precip-


Introduction
A quantitative appraisal of precipitation (P ) amount and of its spatiotemporal distribution is essential for many scientific and operational applications, including but not limited to increasing our understanding of the hydrological cycle, assessing the hydrological impacts of human activities, assessing water resources, irrigation planning, and forecasting of droughts and floods (Golding, 2009;Kucera et al., 2013; Published by Copernicus Publications on behalf of the European Geosciences Union.Pozzi et al., 2013;Serrat-Capdevila et al., 2013;Lettenmaier et al., 2015).However, P is also one of the meteorological variables that is most difficult to estimate, due to its high spatiotemporal heterogeneity (Daly et al., 1994;Herold et al., 2015).There are currently four principal measurement or modeling sources to determine P (Michaelides et al., 2009): (i) ground-based gauge observations, (ii) ground-based radar remote sensing, (iii) satellite remote sensing, and (iv) atmospheric retrospective-analysis models.
Due to the highly localized nature of gauge observations, the accuracy of interpolation-based spatial P estimates is highly dependent on gauge network density and the degree of spatial coherence, both of which are highly variable globally (Krajewski et al., 2003;Hijmans et al., 2005;Chen et al., 2008).In addition, gauge observations are typically affected by systematic biases in mountainous environments due to the elevation bias in gauge placement and in snowdominated regions due to wind-induced under-catch (Groisman and Legates, 1994;Rasmussen et al., 2012).Groundbased radars offer high temporal and spatial resolution but their coverage is limited to developed regions (Koistinen, 1991;Kitchen and Blackall, 1992;Kitchen et al., 1994;Krajewski and Smith, 2002;Martens et al., 2013).Moreover, very few merged radar data products are readily available.
Satellites are capable of observing large areas instantaneously at a high resolution.They are particularly suitable for rainfall estimation in the tropics, which exhibit highly heterogeneous rainfall patterns due to the importance of convective storms (Smith et al., 2005).However, satellite retrieval approaches are susceptible to systematic biases, relatively insensitive to light rainfall events, and tend to fail over snow-and ice-covered surfaces (Ferraro et al., 1998;Ebert et al., 2007;Kidd and Levizzani, 2011;Kidd et al., 2012;Laviola et al., 2013).Finally, atmospheric reanalysis models are ideally suited for simulating the evolution of largescale (synoptic) weather systems, but poorly represent the variability associated with convection due mainly to their relatively low resolution and deficiencies in the parameterizations of sub-grid processes (Roads, 2003;Ebert et al., 2007;Kidd et al., 2013).Cloud-resolving models have only recently started to tackle the convection-P link (e.g., Petch, 2004;Li and Gao, 2012).Although only available regionally, ground-based gauges and radars are generally regarded as the most reliable way of estimating P and thus have frequently been used for evaluation and improvement of satellite-and reanalysis-based P estimates (Maggioni et al., 2016).
The datasets have been designed for different applications and provide sometimes widely varying P estimates (Adler et al., 2001;Bosilovich et al., 2008;Kucera et al., 2013;Skok et al., 2015;Prein and Gobiet, 2016), even among gauge-adjusted datasets (Herold et al., 2015).Despite several studies that intercompared and evaluated these P datasets in different regions (for non-exhaustive overviews, see Serrat-Capdevila et al., 2013 andMaggioni et al., 2016), so far no clear consensus has emerged on which estimation approach is superior overall.Indeed, each approach comes with potential disadvantages that may restrict its usefulness for global hydrological modeling.Common issues include the following: 1.With the exception of CMAP, all datasets listed in Table 1 employ only one or two of the main data sources -either gauge and satellite, or gauge and reanalysisand thus do not take full advantage of the complementary nature of satellite and reanalysis data identified in previous studies (e.g., Janowiak, 1992;Huffman et al., 1995;Xie andArkin, 1996, 1997;Xie and Joyce, 2014;Adler et al., 2001;Ebert et al., 2007;Serrat-Capdevila et al., 2013;Peña Arancibia et al., 2013); 2. Many of the listed datasets do not explicitly and fully account for gauge under-catch and/or orographic effects, and consequently underestimate P in many regions around the globe (e.g., Zaitchik et al., 2010;Kauffeldt et al., 2013;Beck et al., 2015Beck et al., , 2016a;;Prein and Gobiet, 2016); 3. Many datasets (numbered 6-12 and 14-21 in Table 1) do not incorporate gauge observations or do so on a monthly basis, and hence may not make optimal use of valuable information on the daily P variability provided by gauges; Hydrol.Earth Syst.Sci., 21, 589-615, 2017 www.hydrol-earth-syst-sci.net/21/589/2017/This research was motivated by the opportunity to develop a global gridded P dataset with a 3-hourly temporal and 0.25 • spatial resolution for the period 1979-2015, characteristics that make it more suitable for hydrological applications for the above-mentioned reasons.The dataset has been named Multi-Source Weighted-Ensemble Precipitation (MSWEP) as it seeks to optimally combine data from various gauge, satellite, and reanalysis data sources.

MSWEP methodology
Figure 1 presents a flow chart summarizing the main steps carried out to produce MSWEP.The methodology described here corresponds to version 1.1 of the dataset.In brief, we first derived a long-term bias-corrected climatic mean (explained in Sect.2.1).Next, several gridded satellite and reanalysis P datasets were evaluated in terms of temporal variability to assess their potential inclusion in MSWEP (explained in Sect.2.2).The long-term climatic mean was subsequently temporally downscaled in a stepwise manner first to the monthly, then to the daily, and finally to the 3-hourly timescale using weighted averages of P anomalies derived from gauge, satellite, and reanalysis datasets to yield the final MSWEP dataset (explained in Sect.2.3).This method was used since it (i) is relatively easy to understand and implement; (ii) accommodates the inclusion of datasets with 3-hourly, daily, as well as monthly temporal resolutions; (iii) is largely data-driven (i.e., the weights are based on gauge observations); (iv) accounts for cross-correlation among datasets of the same type (satellite or reanalysis); (v) treats random (i.e., temporally variable) and systematic (i.e., long-term) errors separately; (vi) accounts for gauge under-catch and orographic bias; and (vii) yields reliable estimates (as the comparative performance evaluation described in Sect. 3 will demonstrate).

Bias correction of CHPclim
The long-term mean of MSWEP was based on the recently released Climate Hazards Group Precipitation Climatology (CHPclim) dataset (0.05 • resolution; version 1.0; Funk et al., 2015b), a global P climatology based on gauge observations and satellite data.The CHPclim data were replaced with more accurate Parameter-elevation Relationships on Independent Slopes Model (PRISM) data for the conterminous USA (1 km resolution; Daly et al., 1994) and Tait et al. (2006) data for New Zealand (0.05 • resolution).For brevity, however, we refer to this combination of gaugeanalysis products as CHPclim.Although CHPclim has been corrected for orographic effects it has not been adjusted for gauge under-catch, and is thus expected to underestimate the actual P , especially in snow-dominated regions.Two approaches were tested to correct for this.

Bias correction using catch-ratio equations
The first approach uses country-specific catch-ratio (CR) equations for snowfall from Goodison et al. (1998), as summarized by Adam and Lettenmaier (2003), and the rainfall CR equation for unshielded gauges from Yang et al. (1998).The mean wind speed data required as input were derived from a global 5 km map of mean wind speed at 80 m height based on atmospheric reanalysis model output (Vaisala, 2015).To estimate the wind speed value at 5 m height we used the wind profile power-law relationship defined as follows: where v (m s −1 ) represents the wind speed at height z (m), and v r (m s −1 ) represents the wind speed at reference height z r (m).The exponent a (unitless) is an empirical coefficient that depends on surface roughness and atmospheric stability and was set to 0.3 following Irwin (1979).A global map of the snowfall fraction of P was computed based on air temperature (T a ) data from the gauge-based WorldClim monthly climatic dataset (1 km resolution; version 1.4 release 3; Hijmans et al., 2005) using a T a threshold of 1 • C to distinguish between rain and snow.It is noted that the CR equations were not designed for application at monthly timescales.However, the necessary daily station observations of P , T a , and wind speed are unfortunately generally not available in mountainous regions where they are needed the most (Viviroli et al., 2007).

Bias correction based on Q observations
The second approach tested to bias-correct CHPclim involves the use of observation-based estimates of long-term streamflow (Q) and potential evaporation (E p ) to infer the "true" P (cf.Adam et al., 2006;Henn et al., 2015).For this purpose we employed the Zhang et al. (2001) relationship, which has been frequently used to estimate long-term actual evaporation (E) from long-term P and E p values.Here we use it in an inverse manner to estimate, only for regions with snowfall and/or complex topography, the underestima-tion of long-term P from long-term Q and E p data under the assumption that the long-term E equals the difference between long-term P and Q and that changes in water storage are negligible.The Zhang et al. (2001) relationship was cho-sen over several other similar relationships (Ol'dekop, 1911;Pike, 1964;Budyko, 1974;Porporato et al., 2004) because it includes an empirical parameter related to the plant-available water capacity termed w (unitless).This parameter was set to the "shortgrass and crops" value of 0.5 to produce a conservative correction of P .The Zhang et al. (2001) relationship is given by the following: where P , E, and E p are measured in millimeters per year (mm yr −1 ).Assuming that E = P − Q, the Zhang et al. (2001) relationship reformulated to yield P is given by the following: where Q is measured in millimeters per year (mm yr −1 ) and a and b are defined as respectively.
Following this method, we inferred long-term P from Q records for 13 762 regions (12 233 catchments and 1529 interstation catchment areas) across the globe.Long-term Q estimates for these catchments were computed from the same three sources as those used by Beck et al. (2015), namely (i) the US Geological Survey (USGS) Geospatial Attributes of Gages for Evaluating Streamflow (GAGES)-II database (Falcone et al., 2010), (ii) the Global Runoff Data Centre (GRDC; http://www.bafg.de/GRDC/),and (iii) the Australian Peel et al. (2000) database.We only used catchments < 10 000 km 2 with a Q record length > 3 years (not necessarily consecutive).The long-term Q estimates for the interstation catchment areas were obtained from a map of long-term Q based on observations from 1651 large catchments (10 003-4 691 000 km 2 ) worldwide (Beck, 2016;version 1.2).
To assess the sensitivity of the P values inferred using Eq.(3) to biases in E p , we used E p estimates derived using three formulations: (i) the temperature-based Hargreaves and Samani (1982) equation, (ii) the radiation-based Priestley and Taylor (1972) approach, and (iii) the FAO-56 Penman-Monteith combination equation (Allen et al., 1998).The E p data were produced by Sperna Weiland et al. (2015) using meteorological inputs from the WFDEI dataset.Since the resulting P values generally differed by < 10 % depending on the E p formulation, the mean of the three estimates was adopted.
Next, we computed bias correction values for each catchment area as the ratio of P inferred from Eq. (3) over the original, unadjusted P .The resulting values were subsequently interpolated to produce a continuous global map by calculating, for each 0.25 • grid cell with snowfall and/or complex topography, the median bias correction factor of the 10 "closest" catchments.Note that in calculating the median we also included catchments with bias correction factors < 1.An empirical distance measure was developed as follows: where x (km) is the longitudinal distance to the catchment centroid, y (km) is the latitudinal distance to the catchment centroid, and z (m) is the difference between the grid-cell mean elevation and the catchment-mean elevation (note the different horizontal and vertical length units).By including z, the bias correction factors obtained for mountainous catchments exert less influence on the adjacent flat terrain.By using the median, the impact of potential outlier errors in the Q observations, E p estimates, and catchment boundary data is limited.Bias correction factors were only calculated for regions with snowfall, defined by a snowfall fraction > 0.25, and/or complex topography, defined by an "effective terrain height" < 100 m (Daly et al., 2008).The snowfall fraction was computed using WorldClim T a data using a threshold of 1

Gauge-based evaluation of gridded P datasets
Eight non-gauge-adjusted satellite and reanalysis P datasets were evaluated in terms of temporal variability to assess their potential inclusion in MSWEP (see Table 2).These datasets were specifically chosen for their lack of gauge adjustments, allowing us to evaluate their performance using gauge observations.For the evaluation, we considered daily P data from 99 444 gauges included in the Global Historical Climatology Network-Daily (GHCN-D) database (Gleason, 2002) and 16 912 gauges included in the Global Summary of the Day (GSOD) database (amounting to 116 356 stations in total).All gridded P data were resampled to 0.5 • for consistency, and if there were multiple gauges located within a single 0.5   end of GSMaP-MVK version 5) were used, and if a particular station was included in both databases we only kept the one in the GHCN-D.
Previous P evaluation studies have used a large variety of evaluation metrics (see the overview by Gebremichael, 2010).Here we use Pearson correlation coefficients computed between 3 day mean rainfall and snowfall time series from the gauges and the gridded datasets.Correlation coefficients are highly sensitive to large P events, which is a desirable feature since large P events are more important in the context of hydrological modeling.We used 3-day averages instead of daily averages to account for mismatches in the 24 h measurement period between the datasets and the gauges and to reduce the influence of erroneous measurements.Snowfall and rainfall were evaluated separately, and were distinguished using a T a threshold of 1 • C based on daily mean T a data from the ERA-Interim atmospheric reanalysis dataset (Dee et al., 2011) resampled to 0.5 • using "nearest neighbor" and offset to match the long-term mean of WorldClim.Gauges with correlation coefficients < 0.4 for all datasets were deemed unreliable and excluded from the analysis (cf.Collischonn et al., 2008).The performance of each P dataset in terms of bias was not evaluated since MSWEP relies on CHPclim for its long-term average.

Merging procedure
The procedure to generate the MSWEP dataset involves four distinct stages of merging (see Fig. 1).In the first, we produced a 0.25 • global map of long-term mean P based on the CHPclim dataset, which was replaced with more accurate regional datasets for the USA and New Zealand, and corrected for bias (see Sect. 2.1).
In the second merging stage (see Fig. 1), for the monthly, daily, and 3-hourly timescales separately, and for the satellite and reanalysis datasets separately, we merged P anomalies from the different datasets by weighted averaging to yield one P anomaly estimate at each timescale for each data source.The weight maps for each dataset were derived for rainfall and snowfall separately from the 3-day correlation coefficients obtained in the evaluation using gauge observations (see Sect. 2.2).The resulting 3-day correlation coefficients were first squared to yield the explained variance and subsequently interpolated to produce continuous global weight maps by calculating, for each 0.25 • grid cell, the median of the 10 most nearby gauges.The rainfall interpolated weight was used for grid cells with T a > 3 • C, while the snowfall interpolated weight was used for grid cells with T a < 3 • C. The weights for satellite data were set to 0 for grid cells with T a < 3 • C. T a data were taken from ERA-Interim, resampled to 0.25 • using nearest neighbor, and offset to match the long-term means of WorldClim.
In the third merging stage (see Fig. 1), for the monthly, daily, and 3-hourly timescales separately, we merged the P anomalies from the different data sources by weighted averaging to yield one P anomaly estimate at each timescale.For the satellite and reanalysis data sources, the weights were set to the maximum of the weights assigned to the individual datasets in the second merging stage.For example, if for a particular grid cell the GSMaP-MVK dataset was assigned the highest weight in the second merging stage, we used the GSMaP-MVK weight for the satellite data source in the third merging stage.Weights for the gauge data were empirically calculated as a function of gauge density by means of the following expression: where w j (unitless) represents the weight for grid cell j , D i (km) the distance to grid cell i, D 0 (km) the range of influence, and s i the number of stations for grid cell i.The summation is over all grid cells i = 1, 2, . . ., n within a 1000 km radius of grid cell j .D 0 was set to 25 km using trial and error.Typical values for w range from 0 in completely ungauged regions to 1 for grid cells containing only one isolated station and approximately 2.5 in densely gauged regions (e.g., in southeast Australia).As the available gauge observations vary over time, the associated weight maps vary accordingly.
In the fourth and final merging stage (see Fig. 1), the longterm mean P was downscaled in a stepwise manner using the P anomalies first to the monthly timescale, then to the daily timescale, and finally to the 3-hourly timescale to yield the MSWEP dataset.

Evaluation using FLUXNET gauge observations
Observed P data from 125 tower sites included in the FLUXNET dataset (Baldocchi et al., 2001;Baldocchi, 2008) were used to evaluate eight P datasets: (i) the final MSWEP, (ii) the gauge-only MSWEP, (iii) the satellite-only MSWEP, (iv) the reanalysis-only MSWEP, (v) WFDEI (adjusted using CRU TS3.1), (vi) GPCP-1DD (version 1.2), (vii) TMPA 3B42 (version 7), and (viii) CPC Unified (version 1.0 and real time; see Table 1 for details).The FLUXNET data were used for this purpose because they are completely independent; they have not been used in the development of any of the P datasets.The different MSWEP variants were obtained by setting the weights of the other two sources to zero (e.g., for obtaining the gauge-only MSWEP, the satellite and reanalysis weights were set to zero).All P data with a resolution < 0.5 • were resampled to 0.5 • using averaging for consistency.Only stations with > 120 days of data (not necessarily consecutive) during the period 1998-2015 for all datasets were used.The list of stations is provided in the Appendix.
Three performance metrics were considered.The first is the Pearson correlation coefficient (R), which ranges from −1 to 1, with higher values corresponding to better performance.The second is the root mean square error (RMSE), which ranges from 0 to ∞, with lower values corresponding to better performance.The third is the absolute bias (B), defined as follows: where P e is the long-term mean of the gridded estimates and P r the long-term mean of the station record.Bias values range from 0 to 1, with lower values corresponding to better performance.The R and RMSE values were computed based on daily (rather than 3-day mean) P data.

Evaluation using hydrological modeling
The performance of the previously mentioned eight P datasets (see Sect. 3.1) was further evaluated using hydrological modeling for 9011 catchments (< 50 000 km 2 ), by calibrating the simple conceptual hydrological model HBV (Bergström, 1992;Seibert and Vis, 2012) using P from each of the different datasets.For the calibration, we used the common Nash and Sutcliffe (1970) efficiency (NSE) objective function computed between simulated and observed Q time series.All P data with a resolution < 0.5 • were resampled to 0.5 • using averaging for consistency.The observed daily Q and associated catchment boundary data were obtained from the same three sources as those used for the CHPclim bias correction (see Sect. 2.1).Catchments were required to have an area < 50 000 km 2 and a Q record length > 1 year between 1998-2015 (not necessarily consecutive).HBV operates at a daily time step, has two groundwater stores and one unsaturated-zone store, and uses a triangular weighting function to simulate channel routing delays.The model was chosen because of its flexibility, low computational cost, and successful application under a large variety of climatic and physiographic conditions (e.g., te Linde et al., 2008;Deelstra et al., 2010;Plesca et al., 2012;Beck et al., 2013;Vetter et al., 2015;Valéry et al., 2014;Beck et al., 2016a).Besides daily P time series, the model requires daily time series of E p and T a as inputs.E p was calculated using the Penman (1948) equation as given by Shuttleworth (1993) with daily net radiation, T a , atmospheric pressure, wind speed, and relative humidity derived from WFDEI, and surface albedo from monthly climatic data derived from the European Space Agency (ESA) GlobAlbedo dataset (Muller et al., 2011).T a data were also taken from WFDEI.Only a single "elevation/vegetation zone" (Seibert and Vis, 2012) was considered for each catchment.Table 3  parameters including their calibration range.The stores were initialized by running the model for the first 10 years of the record if the record length was ≥ 10 years, or by running the model multiple times for the entire record if the record length was < 10 years.

lists the model
The model was recalibrated for each P dataset for the period with simultaneous observed Q and input data in a lumped fashion to minimize the computational time.For the calibration we used the (µ + λ) evolutionary algorithm (Ashlock, 2010) implemented using the Distributed Evolutionary Algorithms in Python (DEAP) toolkit (Fortin et al., 2012).The calibration procedure was almost the same as in Beck et al. (2016b); however, to limit the computational time we reduced the population size (µ) to 12 (from 24), the recombination pool size (λ) to 24 (from 48), and the number of generations to 10 (from 25).

Bias correction of CHPclim
The CHPclim dataset is used to determine the long-term mean of MSWEP.Although CHPclim has been adjusted for orographic effects, it does not explicitly account for windinduced gauge under-catch and is thus likely to underestimate P , particularly in snow-dominated regions.We tested two approaches to bias-correct CHPclim. Figure 2a shows bias correction factors computed using country-specific CR equations and a global 5 km map of mean wind speed.The spatial patterns agree reasonably well with those from three previous studies based on interpolation of CR values computed from daily station observations of P , T a , and wind speed (Legates and Willmott, 1990;Adam and Lettenmaier, 2003;Yang et al., 2005).Figure 2b shows bias correction factors inferred for the 13 762 regions using the Zhang et al. (2001) relationship in combination with Q observations and E p estimates (see Sect. 2.1 and Eq. 3).These values were subsequently interpolated to obtain a continuous global map (see Fig. 2c).The resulting bias correction factors are highest in Alaska, northern Canada, the Andes, Scandinavia, the Central Asian mountain ranges, and northeastern Russia (Fig. 2c).
There is little agreement between Fig. 2a and c, particularly in western Russia where the use of P corrected using the country-specific equations to force a hydrological model would probably result in appreciable Q overestimation.Although further research is needed to ascertain the exact cause of this discrepancy, these results suggest firstly that the country-specific CR equations are unsuitable for global application, and secondly, that there is probably still a substantial amount of orographic P unaccounted for in CHPclim.Indeed, almost all regions with high bias correction values (> 1.25) in Fig. 2c exhibit a low density of P gauges predominantly located at low elevations in regions with complex topography which might have confounded the estimation of orographic P in CHPclim.The global patterns of Fig. 2c agree reasonably well with those from Adam et al. (2006), who estimated the bias in P due to orographic effects at global scale using a similar approach based on Q observations from 357 (mountainous) catchments (> 10 000 km 2 ).
Figure 2d presents the CHPclim P adjusted using the interpolated bias correction factors shown in Fig. 2c.The bias correction of CHPclim resulted in an increase in global terrestrial mean annual P (Antarctica excluded) from 817 to 877 mm yr −1 , amounting to an increase of 7.4 %.This number is lower than the 17.9 % increase reported by Adam et al. (2006), possibly because CHPclim has already been corrected for orographic effects.Our estimate is close to the 850 mm yr −1 cited by Schneider et al. (2014) based on the GPCC dataset, which includes orographic corrections and accounts for gauge under-catch using Legates and Willmott (1990).

Gauge-based evaluation of gridded P datasets
We evaluated eight non-gauge-adjusted gridded P datasets in terms of temporal variability using observations from thousands of gauges around the world, to assess their value for inclusion in MSWEP (see Table 2).All datasets with a resolution < 0.5 • were resampled to 0.5 • for this exercise for consistency.Figure 3 shows 3-day temporal correlation coefficients calculated between rainfall derived from the datasets and the gauges for 0.5 • grid cells containing at least one gauge.The satellite datasets CMORPH, GSMaP-MVK, and TMPA 3B42RT exhibit very similar performance at global scale (Fig. 3a-c).In agreement with previous regional studies (e.g., Brown, 2006;Hirpa et al., 2010;Thiemig et al., 2012;Peña Arancibia et al., 2013;Cattani et al., 2016), PERSIANN performed rather poorly (Fig. 3d), possibly because it is predominantly based on cloud-top IR observations.SM2RAIN-ASCAT performed similarly to PER-SIANN (Fig. 3e).Conversely, Brocca et al. (2014) found that SM2RAIN-ASCAT performed similarly to TMPA 3B42RT based on 5-day correlations.The lower performance obtained here for SM2RAIN-ASCAT could be due to the use of only 4 years of data for SM2RAIN-ASCAT (2007-2010) vs. 11 years for TMPA 3B42RT (20003B42RT ( -2010)).However, the performance of SM2RAIN-ASCAT appears to be somewhat complementary to the other satellite datasets in that it performs best in the western USA where the other satellite datasets perform rather poorly (Fig. 3).Better results could potentially be obtained by applying the SM2RAIN technique to a multi-satellite soil moisture product, for example the Essential Climate Variable Soil Moisture (ECV-SM) dataset (Liu et al., 2011).
The three reanalysis datasets exhibited very similar performance patterns overall, although ERA-Interim consistently performed slightly better than the other reanalysis datasets, in line with several earlier regional studies evaluating different atmospheric variables (e.g., Bromwich et al., 2011;Bracegirdle and Marshall, 2012;Jin-Huan et al., 2014).Although NCEP-CFSR performed reasonably well, it was not incorporated in MSWEP due to the presence of wave-like artifacts around mountain ranges (cf. Seguinot et al., 2014).Patterns of temporal correlations obtained for the USA for the satellite and reanalysis datasets (Fig. 3) correspond well with those obtained by Tian et al. (2007) and Ebert et al. (2007).Figure 4 presents the difference in 3-day rainfall correlation coefficients between TMPA 3B42RT and ERA-Interim, highlighting the complementary performance of satellite and reanalysis P datasets reported in previous large-scale studies (Janowiak, 1992;Huffman et al., 1995;Xie andArkin, 1996, 1997;Xie and Joyce, 2014;Adler et al., 2001;Ebert et al., 2007;Serrat-Capdevila et al., 2013;Peña Arancibia et al., 2013).Based on these results, we decided to use CMORPH, GSMaP-MVK, TMPA 3B42RT, ERA-Interim, and JRA-55 for determining the rainfall variability for ungauged regions in MSWEP.
Figure 5 shows the 3-day temporal correlation between snowfall derived from the P datasets and the gauges for 0.5 • grid cells containing at least one gauge.All five satellite datasets consistently exhibit very low correlation coefficients (Fig. 5a-e), due to the well-documented failure of retrieval techniques over snow-and ice-covered surfaces, and more generally their inability to detect snowfall in all conditions (e.g., Ferraro et al., 1998;Levizzani et al., 2011;Ebert et al., 2007;Yong et al., 2013;Peng et al., 2014).Conversely, the reanalysis datasets showed high correlation coefficients for most grid cells (Fig. 5f-h), even exceeding those obtained for rainfall estimation (Fig. 3f-h), possibly because almost all snow originates from (non-convective) large-scale synoptic weather systems which tend to be represented well by atmospheric models.Based on these results, we decided to use ERA-Interim and JRA-55 for determining the snowfall variability for ungauged regions in MSWEP.
Weight maps for MSWEP were calculated for each of the selected P datasets based on the 3-day correlation coefficients.The weight maps for the satellite and reanalysis datasets are static in time, although we used the rainfall weight for grid cells with T a > 3 • C and the snowfall weight for grid cells T a < 3 • C. Conversely, the weight maps for the gauges were computed based on gauge density and thus vary daily depending on the available gauge observations.Figure 6a shows the sum of the weights used to merge the gauge, satellite, and reanalysis components of MSWEP during merging stage 3 (see Fig. 1) at the daily timescale on an arbitrarily chosen day (25 April 2006).The relative contributions of the gauge, satellite, and reanalysis components to the total weight are shown in Fig. 6b-d, respectively.

Evaluation of MSWEP using FLUXNET gauge observations
Observed P data from 125 FLUXNET stations were used to evaluate the final MSWEP, the MSWEP gauge, satellite, and reanalysis components, and four existing state-of-theart gauge-adjusted P datasets (WFDEI-CRU, GPCP-1DD, TMPA 3B42, and CPC Unified).Figure 7 presents box plots of the distribution of R, RMSE, and B values obtained for each dataset.The merged MSWEP performed better than the individual MSWEP components in terms of both R and RMSE (Fig. 7a and b, respectively), confirming the effectiveness of the employed merging scheme.The merged MSWEP also appears to perform better overall than all four existing P datasets in terms of both R and RMSE (Fig. 7a and b, respectively), providing the best estimates in terms of R for 60.0 % of the stations and in terms of RMSE for 68.8 % of the stations.Interestingly, MSWEP performed average in terms of absolute bias (Fig. 7c), obtaining the best B value for only 9.6 % of the stations, which is somewhat unexpected given the sophisticated interpolation techniques employed in producing the CHPclim dataset (Funk et al., 2015b).A possible reason for this is that the FLUXNET P data are generally not corrected for gauge under-catch.Moreover, some of the P records may be subject to measurement bias -at the Chokurdakh station in northeastern Russia (site code RU-Cok; 70.82 • N, 147.49• E), for example, snowfall was not measured at all (A. Budishchev, personal communication, 2016).The relatively good performance of the reanalysisonly MSWEP and WFDEI-CRU datasets (Fig. 7a and b) confirms the usefulness of ERA-Interim for gap-filling of FLUXNET P records (Vuichard and Papale, 2015).

Evaluation of MSWEP using hydrological modeling
We further evaluated the performance of MSWEP based on hydrological modeling for 9011 catchments (< 50 000 km 2 ). Figure 8 presents daily NSE scores obtained after the calibration of HBV using daily P data from the final MSWEP, the gauge-only MSWEP, the satellite-only MSWEP, and the reanalysis-only MSWEP.The respective datasets attained the highest calibration NSE for 28.7, 53.1, 6.7, and 11.5 % of the catchments.The good results for the gauge-only MSWEP are, however, somewhat misleading since the large majority of the catchments are located in regions with very dense P measurement networks (the USA, Europe, and Australia; cf.Figs.6b and 8a). Figure 8b shows median calibration NSE values for 5 • latitude bands.These results once again confirm the complementary nature of satellite and reanalysis data clearly demonstrated in the previous evaluation of P datasets using gauge observations (see Sect. 2.2).For most latitudes, the final MSWEP performs either better or only slightly worse than the best-performing single-source MSWEP (Fig. 8b), confirming the efficacy of the merging scheme.The final MSWEP performs particularly good at latitudes < 17.5 • N/S and > 67.5 • N (Fig. 8b) where P gauge densities are relatively low.The reanalysis-only MSWEP exhibits especially low scores for the 17.5-22.5• N lati-tude band (Fig. 8b), mainly reflecting the low performance achieved for Puerto Rico, perhaps due to the coarse resolution of the atmospheric models used to produce the reanalysis datasets.
To examine in more detail how the final MSWEP, the gauge-only MSWEP, the satellite-only MSWEP, and the reanalysis-only MSWEP perform in regions with dense vs. parse P gauge networks, Fig. 9a plots the median calibration NSE for the P datasets as a function of catchmentmean distance to the closest gauge.For the 7953 well-gauged catchments, defined here by catchment-mean distance to the closest P gauge < 25 km, the final MSWEP performed slightly worse than the gauge-only MSWEP.However, only 16.1 % of the global land surface (excluding Antarctica) has a P gauge located at < 25 km distance.For the 1058 sparsely gauged catchments, defined here by catchment-mean distance to the closest P gauge > 25 km, MSWEP performs overall considerably better than the gauge-only MSWEP.Contrary to expectations, the median NSE scores increase with increasing distance to the closest P gauge for several of the P datasets, which is primarily because the more sparsely gauged groups contain less (semi-)arid catchments, which tend to exhibit lower NSE scores.The overall low performance of the satellite-only MSWEP reflects the lack of tropical catchments in our catchment set.
Figure 10 presents daily NSE scores obtained after calibration of HBV using daily P data from MSWEP and four existing state-of-the-art gauge-corrected datasets (WFDEI-CRU, GPCP-1DD, TMPA 3B42, and CPC Unified).MSWEP obtained the highest calibration NSE for 40.4 % of the catchments, while the other datasets obtained the highest calibration NSE for 12.2, 4.1, 11.3, and 32.1 % of the catchments, respectively (noting that the spatial coverage of TMPA 3B42 is limited to < 50 • N/S).The median calibration NSE score obtained by MSWEP is 0.49 vs. 0.30, 0.22, 0.35, and 0.45 for the other datasets, respectively.However, similar to the gauge-only MSWEP (see Fig. 8), the scores for CPC Unified are probably somewhat inflated since most catchments are located in regions with very high P gauge densities.Indeed, the picture changes somewhat when we examine median calibration NSE scores for 5 • latitude bands; MSWEP performs either similar to or substantially better than the other P datasets for all latitudes (Fig. 10b).The good performance of WFDEI-CRU for the central Interior Plains of North America (Fig. 10a) appears to be attributable to P overestimation by the other datasets including MSWEP.
Figure 9b plots the median calibration NSE for the five P datasets (MSWEP, WFDEI-CRU, GPCP-1DD, TMPA 3B42, and CPC Unified) as a function of catchmentmean distance to the closest gauge.For the 7953 well-gauged catchments (defined by catchment-mean distance to the closest P gauge < 25 km), MSWEP obtained a median NSE of 0.49 vs. 0.30, 0.21, 0.35, and 0.47 for the other P datasets, respectively, with MSWEP obtaining the best NSE in 39.0 % of these catchments.However, as mentioned previously, only 16.1 % of the global land surface (excluding Antarctica) has a P gauge located at < 25 km distance.For the 1058 sparsely gauged catchments (defined by catchment-mean distance to the closest P gauge > 25 km), MSWEP obtained a median NSE of 0.52 vs. 0.30, 0.32, 0.39, and 0.29 for the other P datasets, respectively, with MSWEP obtaining the best NSE in 51.6 % of these catchments.Thus, MSWEP performs slightly better than CPC Unified in well-monitored regions, whereas it performs considerably better than all other datasets in sparsely gauged regions which comprise most (83.9 %) of the global land surface.
Figure 11 presents daily NSE scores obtained after calibration of HBV forced with MSWEP.The patterns obtained for the USA exhibit good agreement with those obtained in six previous studies using different hydrological models and forcing data (Lohmann et al., 2004;Xia et al., 2012;Newman et al., 2015;Bock et al., 2016;Essou et al., 2016;Beck et al., 2016b).The NSE values are consistently lower in (semi-)arid regions in North America, Africa, and Australia, likely due to the highly non-linear rainfall-runoff response, the high transmission losses, and the flashy nature of the Q under these conditions (Pilgrim et al., 1988;Ye et al., 1997;Lidén and Harlin, 2000).The lower scores obtained for polar catchments in North America and Iceland are attributable to P underestimation, suggesting that there is still some P unaccounted for.Note that a low score for a particular catchment does not necessarily mean that MSWEP is unreliable as errors in the Q, E p , or catchment boundary data or flow regulation could also be responsible.
Hydrological modeling has been used in numerous previous studies to evaluate the quality of P datasets (e.g., Su et al., 2008;Collischonn et al., 2008;Voisin et al., 2008;Bitew et al., 2012;Li et al., 2013;Falck et al., 2015;Tang et al., 2016).However, these previous studies were generally regional in nature and used Q observations from a relatively small number of stations (7, 23, 9, 1, 22, 300, and 1, respectively, for the cited studies), which may have led to conclusions with limited global-scale applicability.Moreover, some of these studies did not recalibrate the hydrological model for each P dataset (Su et al., 2008;Voisin et al., 2008;Li et al., 2013), which may have led to additional uncertainty since hydrological models are inevitably imperfect representations of reality and thus produce uncertain estimates even with "perfect" meteorological forcing.Conversely, we used Q observations from a large set of 9011 stations around the globe, spanning all terrestrial latitudes and covering a broad range of physiographic and climatic conditions, which allowed us to examine how P -gauge proximity influences the calibration NSE.Furthermore, we recalibrated the model for each catchment and P dataset, thereby minimizing the uncertainty in model parameters.
One could argue that the evaluation of MSWEP using hydrological modeling presented here is not completely independent, since the Q observations used for the evaluation have also been used in the development of MSWEP for the bias correction of CHPclim (see Sect. 2.1).However, the evaluation is in fact nearly completely independent since the results for a particular catchment will not change much, if at all, if that catchment was excluded from the bias adjustment exercise, since the bias correction factors were computed for each grid cell as the median of the 10 closest catchments.
Moreover, the large majority of catchments are located in regions not affected by the bias correction (81.7 % of the catchments have an interpolated bias correction factor between 1 and 1.02; see Fig. 2c).Furthermore, it is noted that all P data were resampled to 0.5 • for consistency; both MSWEP and TMPA 3B42 may produce better calibration scores if the original 0.25 • resolution data are used.

Caveats and future work
Although MSWEP overall performed better than existing state-of-the-art gauge-adjusted P datasets, some caveats need mentioning: 1.As with many P datasets, spurious temporal discontinuities can occur due to (i) the introduction of the satellite data source in 1998, (ii) satellite sensor degradation and instrument changes, (iii) differences in time coverage among stations, and (iv) changes in the observations assimilated in the reanalysis datasets (e.g., Kang and Ahn, 2015).Thus, care should be exercised when using MSWEP for trend analysis.
2. Sub-daily temporal variability in tropical regions should be interpreted with caution prior to the launch of TRMM in 1998, since atmospheric models poorly represent the diurnal cycle in (sub-)tropical regions (Kidd et al., 2013).
3. The estimated bias correction factors are influenced by uncertainties in the Q, E p , and catchment boundary data and particularly the interpolation.The spatial bias correction factors are available along with the data, so the original, non-corrected long-term averages can be reconstructed.
4. The use of PRISM data for the conterminous USA and the restriction of the bias correction to regions with snowfall and/or complex topography may result in spatial discontinuities in the long-term average of MSWEP.
5. Although MSWEP has a 0.25 • spatial resolution (corresponding to ∼ 25 km at the equator), the effective spatial resolution reflects the gauge density in regions where the gauge component dominates (Fig. 6b), while it reflects the resolution of the atmospheric models (∼ 80-125 km) in regions where the reanalysis component dominates (Fig. 6d).In regions where the satellite component dominates, the effective resolution can be assumed to be 0.25 • (Fig. 6c).
Future updates and improvements of MSWEP are intended to take advantage of new and improved data sources.For example, MSWEP may be updated with ERA5, the planned successor to ERA-Interim, while the TMPA dataset may be replaced by the Integrated Multi-satellitE Retrievals for GPM (IMERG) dataset (Huffman et al., 2014) once it has been extended backwards to the start of the TRMM era.Given the promising complementary information from SM2RAIN-ASCAT (Fig. 3), any such dataset ingesting soil moisture retrievals from multiple satellites should also be considered for inclusion in MSWEP, if available.Moreover, the gauge component of MSWEP may be improved by accounting for the variable correlation length in the interpolation (cf.Funk et al., 2015a), while the biases at high northern latitudes may be improved using Gravity Recovery and Climate Experiment (GRACE) data (cf.Behrangi et al., 2016;Wang et al., 2016).Finally, we hope to explore the possibil- ity of producing a near-real time variant of MSWEP.Many of the datasets incorporated in MSWEP are available in nearreal time, suggesting this is feasible.

Conclusions
We developed a global gridded P dataset  with several novel features: (i) it takes advantage of a wide range of data sources, including gauge, satellite, and reanalysis data, to provide reliable P estimates at global scale; (ii) it accounts for gauge under-catch and orographic effects using Q observations from an unprecedentedly large set of 13 762 stations worldwide; and (iii) it has a high 3-hourly temporal and 0.25 • spatial resolution.We summarize our findings as follows: 1.The high-resolution global CHPclim climatic P dataset, which explicitly accounts for orographic effects, still appears to contain widespread P biases.These biases could not be satisfactorily corrected using countryspecific equations that account for gauge under-catch, suggesting that CHPclim still underestimates orographic P in some regions, probably mainly due to the topographic bias in gauge placement.We used the Zhang et al. (2001) relationship in combination with Q observations from 13 762 stations around the globe to produce a bias-adjusted variant of CHPclim for use in MSWEP.2. Eight non-gauge-adjusted P datasets were evaluated in terms of 3-day temporal variability using observed P data from thousands of gauges around the globe to assess their value for inclusion in MSWEP.For rainfall, CMORPH, GSMaP-MVK, and TMPA 3B42RT performed equally well, while PER-SIANN and SM2RAIN-ASCAT performed less well.The satellite datasets consistently performed better in the tropics, while the reanalysis datasets consistently performed better at mid-to high latitudes.For snowfall, all satellite datasets performed poorly, while the reanalysis datasets performed even better than for rainfall.The obtained correlation coefficients were interpolated to produce global weight maps for the merging scheme of MSWEP.MSWEP also performed best in terms of RMSE, but performed averagely in terms of absolute bias.
4. We further evaluated the performance of MSWEP using hydrological modeling for 9011 catchments (< 50 000 km 2 ) around the globe.Specifically, we calibrated the HBV hydrological model against daily Q observations using P from MSWEP and the four previously mentioned existing P datasets.

Figure 1 .
Figure 1.Flow chart summarizing the main steps carried out to produce MSWEP.The different colors represent different timescales.P n refers to normalized P (i.e., P anomalies).

a∼
80 km effective spatial resolution (i.e., the resolution of the employed atmospheric model).b 0.25 • spatial resolution for the conterminous USA.c 1 • spatial resolution for 2014-present.

Figure 2 .
Figure 2. Global maps of (a) bias correction factors estimated using country-specific equations, (b) bias correction factors inferred using the Zhang et al. (2001) relationship from Q observations and E p estimates, (c) bias correction factors computed by interpolating the Zhang et al. (2001)-based values, and (d) mean annual P (mm yr −1 ) from CHPclim bias-adjusted using the interpolated bias correction factors and used for MSWEP.The data points in (b) represent the centroids of the catchments (n = 12 493) and the interstation regions (n = 1651).Note that there are many values in (b) < 1, but these are not of interest here and thus are shown as 1.

Figure 3 .
Figure 3. Correlation coefficients calculated between 3-day mean rainfall derived from the datasets and the gauges.Each data point represents a 0.5 • grid cell containing at least one gauge.

Figure 4 .
Figure 4.The difference in 3-day rainfall correlation coefficients between TMPA 3B42RT and ERA-Interim.Each data point represents a 0.5 • grid cell containing at least one gauge (n = 6668).Red and yellow indicate TMPA 3B42RT performs better, whereas blue indicates ERA-Interim performs better.

Figure 5 .
Figure 5. Correlation coefficients calculated between 3-day mean snowfall derived from the datasets and the gauges.Each data point represents a 0.5 • grid cell containing at least one gauge.

Figure 6 .
Figure 6.Map (a) shows the sum of the gauge weight (see Eq. 7) and the satellite and reanalysis weights (derived from 3-day correlation coefficients at surrounding gauges) used during merging stage 3 (see Fig. 1) at the daily timescale on an arbitrarily chosen day (25 April 2006; note the non-linear color scale).The contributions of the gauge, satellite, and reanalysis components to the total weight on the same day are shown in (b)-(d), respectively.

Figure 7 .
Figure 7. Box plots of the distribution of (a) daily correlation (R), (b) root mean square error (RMSE), and (c) absolute bias (B) values computed between observed and estimated P at the 125 FLUXNET sites.Note that for the satellite-only MSWEP only the 113 stations located at < 60 • N/S were used, while for TMPA 3B42 only the 87 stations located at < 50 • N/S were used.The locations of the 125 FLUXNET stations and the R values obtained by MSWEP are shown in (d).The list of stations is provided in the Appendix.

Figure 8 .
Figure 8.The P dataset with the best daily NSE score obtained after calibration of HBV is shown in (a).Also shown are (b) median NSE scores for 5 • latitude bands and (c) the number of catchments (n) for each 5 • latitude band.Each data point in (a) represents a catchment centroid (N = 9011).The NSE scores have been computed between simulated and observed Q time series.

Figure 9 .
Figure 9.The change in median calibration NSE for the P datasets as a function of catchment-mean distance to the closest P gauge.The distance to the closest P gauge was computed using the GPCC and CPC Unified datasets.The NSE scores have been computed between simulated and observed Q time series.

3.
Independent observed P data from 125 FLUXNET tower stations around the globe were used to evaluate the performance of MSWEP as well as four existing state-of-the-art global P datasets (GPCP-1DD, WFDEI-CRU, TMPA 3B42, and CPC Unified).The median R value obtained by MSWEP was 0.67 vs. 0.44-0.59for the other P datasets, with MSWEP obtaining the best R score for 60.0 % of the stations.

Figure 10 .
Figure 10.The P dataset with the best daily NSE score obtained after calibration of HBV is shown in (a).Also shown are (b) median NSE scores for 5 • latitude bands and (c) the number of catchments (n) for each 5 • latitude band.Each data point in (a) represents a catchment centroid (N = 9011).The NSE scores have been computed between simulated and observed Q time series.

Figure 11 .
Figure 11.Daily NSE scores obtained after calibration of HBV forced by MSWEP.The NSE scores have been computed between simulated and observed Q time series.Each data point represents a catchment centroid (N = 9011).

Table 1 .
Non-exhaustive overview of freely available (quasi-)global gridded P datasets.If a particular dataset is available in different spatial resolutions or in different variants we only listed the "best" one (e.g., we list the gauge-adjusted 3B42 variant of TMPA rather than the non-gauge-adjusted 3B42RT variant).The datasets are sorted first by source and then alphabetically by short name.MSWEP has been added for the sake of completeness.

Table 2 .
Overview of the (quasi-)global P datasets considered for inclusion in MSWEP.
* Selected to be included in MSWEP.

Table 3 .
HBV model parameter descriptions and calibration ranges.