Hydrologic extremes – an intercomparison of multiple gridded statistical downscaling methods

Gridded statistical downscaling methods are the main means of preparing climate model data to drive distributed hydrological models. Past work on the validation of climate downscaling methods has focused on temperature and precipitation, with less attention paid to the ultimate outputs from hydrological models. Also, as attention shifts towards projections of extreme events, downscaling comparisons now commonly assess methods in terms of climate extremes, but hydrologic extremes are less well explored. Here, we test the ability of gridded downscaling models to replicate historical properties of climate and hydrologic extremes, as measured in terms of temporal sequencing (i.e. correlation tests) and distributional properties (i.e. tests for equality of probability distributions). Outputs from seven downscaling methods – bias correction constructed analogues (BCCA), double BCCA (DBCCA), BCCA with quantile mapping reordering (BCCAQ), bias correction spatial disaggregation (BCSD), BCSD using minimum/maximum temperature (BCSDX), the climate imprint delta method (CI), and bias corrected CI (BCCI) – are used to drive the Variable Infiltration Capacity (VIC) model over the snow-dominated Peace River basin, British Columbia. Outputs are tested using split-sample validation on 26 climate extremes indices (ClimDEX) and two hydrologic extremes indices (3-day peak flow and 7-day peak flow). To characterize observational uncertainty, four atmospheric reanalyses are used as climate model surrogates and two gridded observational data sets are used as downscaling target data. The skill of the downscaling methods generally depended on reanalysis and gridded observational data set. However, CI failed to reproduce the distribution and BCSD and BCSDX the timing of winter 7-day low-flow events, regardless of reanalysis or observational data set. Overall, DBCCA passed the greatest number of tests for the ClimDEX indices, while BCCAQ, which is designed to more accurately resolve eventscale spatial gradients, passed the greatest number of tests for hydrologic extremes. Non-stationarity in the observational/reanalysis data sets complicated the evaluation of downscaling performance. Comparing temporal homogeneity and trends in climate indices and hydrological model outputs calculated from downscaled reanalyses and gridded observations was useful for diagnosing the reliability of the various historical data sets. We recommend that such analyses be conducted before such data are used to construct future hydro-climatic change scenarios.


Introduction
Water resources infrastructure is designed to accommodate hydrologic extremes such as floods and droughts (Cunderlik and Ouarda, 2009;Cunderlik et al., 2004;Ouarda et al., 2006).The frequency and magnitude of extreme hydrologic events such as floods and droughts have changed with climate and there is broad agreement that changes will continue with projected increases in greenhouse gases (IPCC, 2013).The direction and magnitude of change is not uniform across the globe, but is regionally specific, distinguishable by hydrologic regime and by local changes to temperature and precipitation (Cunderlik and Ouarda, 2009;Monk et al., 2011;Sheffield et al., 2012;Stahl et al., 2010Stahl et al., , 2012)).For example, in Canada, floods in snowmelt-dominated regimes decreased in magnitude, while floods in rainfall-fed regimes had no significant trend over 1974 to 2003 (Cunderlik and Published by Copernicus Publications on behalf of the European Geosciences Union. A. T. Werner and A. J. Cannon: Hydrologic extremes Ouarda, 2009).Conversely, Canadian annual low-flow indices showed spatially uniform decreases over 1970 to 2005 (Monk et al., 2011).Thus, future changes in hydrologic extremes need to be estimated at regionally relevant resolutions ( ∼ 10 km) and consider both temperature and precipitation effects.
Global climate models (GCMs) are one of our only tools for projecting the future climate, but they operate at scales too coarse (∼ 100 km) for use in regional studies.Hence, before projecting changes in hydrologic extremes, some intervening steps are required.Approaches to converting coarsescale GCM simulations to project changes to peak flows and low flows vary.Some examples include direct downscaling of streamflow extremes by sparse Bayesian learning and multiple linear regression (Joshi et al., 2013), weather generators combined with hydrologic models (Cunderlik and Simonovic, 2007), regional frequency analysis of regional climate model (RCM) projections (Clavet-Gaumont et al., 2013), and, most commonly, statistical downscaling of GCM or RCM projections run through a physically based hydrologic model (Elsner et al., 2010a;Maurer et al., 2010;Schnorbus et al., 2014;Shrestha et al., 2012;Bürger et al., 2011).The uncertainty in hydrologic projections from GCMs is greater than that from emissions scenarios or model parameterizations (Bennett et al., 2012;Prudhomme and Davies, 2008) and all GCMs represent the climate imperfectly in different ways (Gleckler et al., 2008;Knutti et al., 2008); therefore, to fully characterize the uncertainty in projected hydrological extremes, an ensemble of GCMs is required.
Gridded statistical downscaling methods provide a computationally efficient and effective means of producing plausible hydro-climatology from a large ensemble of GCMs (Salathe et al., 2007;Salathé, 2005;Wood et al., 2004).A number of studies have compared multiple statistical downscaling methods for use in climatological or hydrological projections.Maurer and Hidalgo (2008) compared constructed analogues (CA) and bias correction spatial disaggregation (BCSD) using the National Centers for Environmental Prediction/National Center for Atmospheric Research Reanalysis I (NCEP1) (Kalnay et al., 1996) as a surrogate GCM.Methods were comparable in producing precipitation and temperature at a monthly and seasonal level, but skilfully downscaled daily data depended on the ability of the climate model to show daily skill.Bürger et al. (2012a) compared five methods for their ability to represent climatic extremes including BCSD and expanded downscaling (XDS).The fixed diurnal temperature range in BCSD was seen as a shortcoming in Bürger et al. (2012a).XDS performed best, passing 48 % of single tests on average for 27 Climate Indices of Extremes (ClimDEX), with BCSD close behind, passing 45 % (Bürger et al., 2012a).Pierce et al. (2013) found that projected increases in annual precipitation versus decreases in California were due to disagreements in the occurrence of the heaviest precipitation days (> 60 mm day −1 ) amongst three dynamical and two statistical downscaling methods (BCCA and BCSD).Maurer et al. (2010) compared BCSD, BCCA, and CA for their ability to reproduce hydrologic extremes.BCCA, when combined with the Variable Infiltration Capacity (VIC) model, consistently outperformed the other methods in simulating 3-day peak flow and 7-day low flow.BCCA is an improvement over CA because it includes bias correction and over BCSD because it includes daily GCM anomalies (Maurer et al., 2010).An additional method described as statistical downscaling and bias correction (Abatzoglou and Brown, 2012) and as asynchronous regression (Gutmann et al., 2014), both of which interpolate from the GCM to a fine scale and then apply quantile mapping bias correction (i.e.basically reversing the steps of BCSD), was found to reproduce extreme precipitation events at the grid scale but overestimate them on aggregate scales (Maraun, 2013).Studies to date have not assessed the strength of downscaling methods for use with climatic and hydrologic extremes concurrently.
The first generation National Centers for Environment Prediction/National Center for Atmospheric Research Reanalysis I (NCEP1) reanalysis (Kalnay et al., 1996) is often used as a surrogate GCM when testing downscaling techniques (Bürger et al., 2012a;Gutmann et al., 2014;Maurer et al., 2010), primarily because of its long record length.Recently new reanalysis products have come online, bringing to light possible issues with NCEP1, such as a spurious pattern in precipitation fields at high latitudes (Sheffield et al., 2012), and lack of skill in producing daily air temperature at high altitudes versus other reanalyses (Hofer et al., 2012).Reanalyses differ due to variations in assimilated observational data, assimilation methods, representations of surface and boundary layer processes, physics packages, and dynamical cores, and the resulting uncertainty in output fields can be considerable, especially for climatic extremes (Sillmann et al., 2013a).For instance, discrepancies between reanalyses for some climate extreme indices, such as frost days in some regions, are sometimes as large as the typical intermodel spread of the Coupled Model Intercomparison Project ensembles (Sillmann et al., 2013a).These differences arise because near surface temperature and precipitation extremes are calculated from variables that are relatively poorly constrained by observations in reanalyses.Additionally, nonstationarity exists in some reanalysis products because they amalgamate observational data sets from different sources over time (Donat et al., 2014).In the context of historical validation of downscaling methods, statistical downscaling methods may perform poorly simply because reanalysis outputs are not stationary over the calibration and validation periods (Maurer et al., 2013).All of these factors suggest that multiple reanalysis products should be used as GCM surrogates to ensure methods are not failing due to irreparable errors in reanalyses, and also to explore the variability in results due to reanalysis uncertainty.
Gridded climate observations underpin hydrologic projections.They are used to calibrate the downscaling technique and the hydrologic model, serving as targets and inputs, respectively.Gridded observations are commonly evaluated via comparison with station observations (Hutchinson et al., 2009;Werner et al., 2015), intercomparison with other gridded observations (Eum et al., 2014), or by using them to drive a hydrologic model and comparing outputs to observed water balance fluxes and streamflow over large basins (Livneh et al., 2013;Maurer et al., 2002).We know that statistical downscaling methods perform poorly when nonstationarity occurs between the calibration and validation periods (Maurer et al., 2013), but we have not evaluated how apparent non-stationarity caused by natural climate variability (Huang et al., 2014;Maraun, 2012) is amplified or diminished with methods used to create gridded observations, which could also affect the success of downscaling methods.Furthermore, stationarity in mean annual precipitation and temperature does not dictate stationarity in climatic or hydrologic extremes.Not all, but some, previous studies have included as many years as possible in the calibration, with the goal of maximizing the available historical record available for resampling in the temporal disaggregation step applied in BCSD (Bürger et al., 2012a;Salathé, 2005;Werner, 2011).This approach is also supported by other studies that found bias correction is more robust for larger samples from longer time series, especially for extremes such as flood events (Huang et al., 2014;Themeßl et al., 2011).The pros and cons of this extended calibration period have not been fully evaluated.This investigation will help the hydrologic modelling community build a better evaluation system for gridded observations to ensure their strength not only for projections of mean monthly changes over large basins ∼ 100 000 km 2 , but also for extremes in basins as small as 500 km 2 .
When used to make climate change projections, distributed hydrologic models such as VIC are best driven with gridded daily data, which is usually produced via gridded statistical downscaling techniques such as BCSD, CA, and BCCA, three gridded methods that have been tested to date.Applying BCSD using minimum and maximum monthly temperature instead of mean monthly temperature has not been tested and may correct some issues with diurnal temperature range (Bürger et al., 2012a).It is important to note that the effect of BCSD on daily temperature range (DTR) when used with daily data and ways to ensure minimum temperature is less than maximum temperature has been tested by Thrasher et al. (2012) and is not the focus of this study.A few other methods have been developed recently that warrant investigation.These include double bias corrected constructed analogue (DBCCA), which is similar to BCCA but applies a second quantile mapping bias correction as a post-processing step to correct drizzle and other residual biases (Maurer et al., 2010).Additionally, the climate imprint delta method (CI) (Hunter and Meentemeyer, 2005) and the "reverse" BCSD (similar to SDBC in Ahmed et al., 2013, andAR in Gutmann et al., 2014), which we refer to as bias corrected climate imprint (BCCI) due to its use of CI for interpolation, have not been explored for their applicability to hydrology.A recently developed hybrid of BCCA and BCCI, referred to as BCCAQ (Cannon et al., 2015;Murdock et al., 2014), has the potential to be an improvement versus other gridded statistical downscaling techniques and has not been tested with hydrologic extremes.This work will also help to inform use of the resulting BCSD and hydrologic model output provided by the Pacific Climate Impacts Consortium (PCIC; http://www.pacificclimate.org/data).Finally, PCIC also makes available Canada-wide downscaled climate change projections using both the BCSD and BCCAQ methods (http://www.pacificclimate.org/data).This study provides the first rigorous intercomparison of these two methods.
The ClimDEX indices are recommended by the World Meteorological Organization Expert Team on Climate Change Detection and Indices (ETCCDI) (Zhang et al., 2011) as a means of summarizing daily temperature and precipitation statistics, focusing particularly on aspects of climate extremes.They have been developed to allow seamless comparison of climate conditions on an international basis.There are many projects applying the ETCCDI indices to detect changes in extremes historically (e.g.Sillmann et al., 2013a), to project future changes (e.g.Sillmann et al., 2013b), and to provide future changes via data portals to allow local analysis (http://www.cccma.ec.gc.ca/ data/climdex/).Two commonly investigated hydrologic extremes include 3-day peak flow, which represents potential flood conditions, and 7-day low flow, which represents potential drought conditions (e.g.Maurer et al., 2010).Floods can be damaging to river and floodplain infrastructure, while droughts can be detrimental for human water use and aquatic habitat.We follow the framework developed by Bürger et al. (2012a), evaluating methods for their abilities in producing the temporal sequencing and distributional properties of climate indices and hydrologic extremes.
The objectives of this study are the following.
1. To compare several reanalyses in the study region against two gridded observation data sets.
2. To test the ability of the BCCA, DBCCA, BCCI, CI, BCSD (mean temperature), BCSDX (minimum and maximum temperature), and BCCAQ downscaling techniques to simulate 26 ClimDEX indices using four reanalyses and two gridded observations.3. To test the ability of the BCCA, DBCCA, BCCI, CI, BCSD (mean temperature), BCSDX (minimum and maximum temperature), and BCCAQ downscaling techniques when used to force the VIC hydrologic model, to simulate 3-day peak-flow and 7-day low-flow indices using four reanalyses and two gridded observations.
A. T. Werner and A. J. Cannon: Hydrologic extremes 4. To learn more about the strengths and weaknesses of two gridded observations for use with hydrologic modelling.
5. To see whether the strength of a method to downscale for climate extremes relates to abilities for use with hydrologic extremes.

Study area
The Peace River basin will be the focus of this work.The snow-dominated regime of this basin makes the findings of this work applicable to many mid-latitude areas.The Peace River is located in interior north-eastern BC and encompasses the 101 000 km 2 drainage area upstream of Taylor, BC (Fig. 1).Elevations range from 400 to 2800 m.The region is highly influenced by the Pacific Ocean and Arctic air masses.
The region has a continental climate (Demarchi, 1996), with monthly average temperatures ranging from −12.0 • C in January to 12.3 • C in July, averaging 0.2 • C. Precipitation follows a seasonal pattern of summer maximum and spring minimum.The Peace River has a nival regime, with approximately 54 % of the annual precipitation (440 mm) falling as snow (mostly during October-April) and 64 % of the natural streamflow occurring during the freshet months of May-July.Low flows occur during the winter and early spring in headwater (INGEN) and downstream (BCGMS) basins (Fig. 2).Due to the topographical complexity and strong climate gradients this region provides a stringent test of downscaling techniques.Additionally, the Peace River basin is the focus of two studies that explore uncertainty in hydrologic projections, one due to GCMs, emissions scenarios, and parameter sets (Bennett et al., 2012), the other due to statistically versus dynamically downscaled GCMs (Shrestha et al., 2014a).
This study provides a good complement to these by exploring new sources of uncertainty in the same basin.

Gridded observations
Two daily, gridded observational data sets were available over the study area.The first was generated for BC for application with the Variable Infiltration Capacity (VIC) macroscale distributed hydrologic model following the methods of Maurer et al. (2002) and Hamlet and Lettenmaier (2005).Daily gridded surfaces of minimum and maximum temperature and daily precipitation accumulation were produced at the spatial resolution of 1/16  gram, each with a varying range of quality control.Stations were interpolated to grids using the SYMAP inverse-distance weighting algorithm (Shepard et al., 1984).The raw gridded fields were temporally homogenized to remove interpolation artefacts introduced by using a temporally varying mix of stations and corrected for topographic effects using ClimateWNA, a 1961-1990 PRISM based high-resolution climatology for western North America (Daly et al., 1994;Wang et al., 2006).This data set is referred to as VIC Forcings.
The second data set was created for all of Canada using the Australian National University Spline (ANUSPLIN) implementation of trivariate thin plate smoothing splines (Hutchinson et al., 2009).The Canada-wide ANUSPLIN observational data set was created at a 1/12 • grid spacing (∼ 10 km) for daily minimum temperature, maximum temperature, and precipitation amounts for the period 1950-2010 by Hopkinson et al. (2011) andMcKenney et al. (2011).Station data from Environment Canada observing sites were interpolated onto the high-resolution grid using the ANUSPLIN smoothing splines with elevation, longitude, and latitude as interpolation predictors.Precipitation occurrence and square-root transformed precipitation amounts were interpolated separately on each day, combined, and transformed back to original units.Observed station data were quality controlled and corrected for station relocation, changes in the definition of the climate day, and trace precipitation amounts.

Reanalyses
Four atmospheric reanalysis products were selected to span a range of complexity and spatial resolution.Chosen methods include NCEP1, European Centre for Medium-Range Weather Forecasts (ECMWF) Re-Analysis 40 (ERA40), ECMWF Re-Analysis Interim (ERAInt), and the National Oceanic and Atmospheric Administration -Cooperative Institute for Research in Environmental Sciences 20th Century Reanalysis V2 (20CR).NCEP1 is a popular reanalysis product applied in the validation of statistical downscaling techniques (Bürger et al., 2012a;Maurer et al., 2010).It spans the period from 1948 to the present, is ∼ 1.9 • in resolution, and includes a wide range of observations assimilated from ship to satellite data (Kalnay et al., 1996).ERA40 is available from 1958 to 2002 and is archived at the coarsest resolution (2.5 • ) of the four products selected for this study.It was the first to assimilate satellite radiance data directly (Uppala et al., 2005).ERAInt covers the satellite era from 1979 through to the present.Data used here are archived at 1.5 • , although the underlying forecast model runs at 0.75 • .It has an improved atmospheric model and assimilation system over that used in ERA40 (Dee et al., 2011).20CR is one of the longest reanalysis records available, starting in 1871 and running to 2012.At 2 • resolution it assimilates only surface observations of synoptic pressure, monthly sea surface temperature and sea ice distribution (Compo et al., 2011).Hutchinson et al. (2009) marizes the availability of the gridded observations and reanalyses.

Downscaling techniques
Seven statistical approaches are selected based on their wide use and/or potential strength in downscaling coarse-scale models to gridded observations for representing extremes.BCSD has been applied across North America (Maurer and Hidalgo, 2008;Salathé, 2005;Schnorbus et al., 2014;Wood et al., 2002Wood et al., , 2004)).Monthly minimum temperature, maximum temperature, and precipitation from GCMs or reanalyses are bias corrected, using quantile mapping, against gridded observations aggregated to the large-scale model grid.
Bias corrected, spatially disaggregated monthly data are temporally disaggregated to a daily time step via random sampling of historical months.Days in the selected month are rescaled (multiplicative for precipitation and additive for temperature) to match the bias corrected monthly precipitation and average temperature (Fig. 3a).Two variations of BCSD are tested; one derives minimum and maximum temperature from mean temperature in the coarse-scale model by assuming a uniform monthly diurnal temperature range (BCSD); the other uses monthly minimum and maximum temperature directly from the large-scale model (BCSDX).Two constructed analogue downscaling approaches are tested: BCCA and DBCCA (Maurer et al., 2010).BCCA bias corrects the large-scale temperature and precipitation using quantile mapping, as in BCSD, except on daily rather than monthly large-scale data.In the constructed analogue (CA) component, a library of observed daily coarse-resolution and corresponding high-resolution climate patterns of the variable to be downscaled is built (Hidalgo et al., 2008).Daily data are downscaled by selecting 30 days from the coarsescale library that have the closest similarity to a given simulated day; optimal weights are determined via ridge regression and the 30 corresponding fine-scale library patterns are combined using the same weights (Maurer et al., 2010).In the DBCCA technique, a second quantile mapping bias correction is then applied at the fine scale to fix drizzle and other biases caused by the linear combination of daily fields in the CA step (Fig. 3a).Two climate imprint methods are tested, the CI delta method (Hunter and Meentemeyer, 2005) and bias corrected CI (BCCI), which applies quantile mapping to the interpolated series from CI (Fig. 3a).For the imprint methods, longterm averages (i.e. 30 years) from the fine-scale data provide a "spatial imprint" that is used to represent environmental gradients.The ratio of daily to average monthly values is multiplied by the fine-scale monthly values for a location to get the daily precipitation.This is similar for minimum and maximum temperature, except values are calculated as the difference between the monthly mean and the daily value (Hunter and Meentemeyer, 2005).
While BCCI applies quantile mapping as a postprocessing step to the interpolated fine-scale outputs from the CI method, BCCAQ is a post-processed version of BCCA where the final quantile mapping bias correction is based on BCCI.First, the BCCA and BCCI algorithms are run independently, and then BCCAQ corrects BCCA with BCCI.The daily BCCI outputs at each fine-scale grid point are reordered within a given month according to the daily BCCA ranks.Be-cause the optimal weights used to combine the analogues in BCCA are derived on a day-by-day basis, without reference to the full historical data set, the algorithm may be prone to Huth's paradox, wherein models that are calibrated based on short-term variability may be biased and fail to produce realistic long-term trends (Benestad et al., 2008;Huth, 2004).Reordering data for each fine-scale grid point within a month effectively breaks the overly smooth representation of sub reanalysis-grid-scale spatial variability inherited from BCCI (Maraun, 2013), thereby resulting in a more accurate representation of event-scale spatial gradients; this also prevents the downscaled outputs from drifting too far from the BCCI long-term trend.Over longer timescales, the spatial variability of BCCAQ converges to that of BCCI.
Statistical methods are calibrated from 1950 to 1990 for 20CR and NCEP1, and from 1958 to 1990 and from 1979 to 1990 for ERA40 and ERAInt, respectively (Table 2).Calibration periods were selected to include the longest overlapping record between the gridded observation and reanalyses to replicate the approach taken in Werner et al. (2011).Thus, the 20CR and NCEP1 reanalyses results will serve to evaluate the gridded observations and these two reanalyses, and also to validate the calibration-validation approach taken with BCSD for a series of studies conducted in this region (Bürger et al., 2012a, b;Schnorbus et al., 2014;Shrestha et al., 2012;Werner et al., 2013).The resulting modelling framework for these two gridded observations, four reanalysis products, and seven gridded statistical downscaling techniques is displayed in Fig. 3b.All statistical downscaling methods use precipitation and temperature as predictors and predictands.

ClimDEX
ClimDEX is a common climate indices package that computes values for 27 core indices based on daily precipitation and minimum and maximum temperature (Karl et al., 1999;Peterson, 2005; and http://etccdi.pacificclimate.orgor http://www.clivar.org/panels-and-working-groups/etccdi/etccdi.php/).These indices describe events such as the number of heavy precipitation days denoted as days where precipitation is greater than 10 mm or percentage of days when maximum temperature is greater than the 90th percentile.They do not usually represent the most extreme events conceivable, but instead represent "the more extreme aspects of climate", which are known to be relevant to a broad range of impact fields and are still statistically manageable, so that they can be reliably estimated from current data for the present and future.ClimDEX has been adopted as a standard for extremes by the World Climate Research Programme (http://www.clivar.org/organization/extremes).Indices were computed from downscaled temperature and precipitation from seven statistical downscaling methods used with four reanalyses and two gridded observations for a total of 56 estimates of each index.The index of the annual count  NCEP1 1950-199041 1991-200515 20CR 1950-199041 1991-200515 ERA40 1958-199033 1991-200111 ERAInt 1979-199012 1991-2005 15 when daily minimum temperature is > 20 • C, tropical nights (tr), was dropped for this analysis because this temperature threshold is not exceeded in the Peace River basin.See Table 1 in Bürger et al. (2012a) for a description of indices explored in this study.

Hydrologic modelling
Hydrologic projections for the Peace River basin are derived using the Variable Infiltration Capacity (VIC) model (Liang et al., 1994(Liang et al., , 1996)).The VIC model is a spatially distributed macro-scale hydrologic model that was originally developed as a soil-vegetation atmosphere transfer scheme for general circulation models.It has been used to evaluate climate change impacts on global river systems (Nijssen et al., 2001) and in the mountainous western United States and BC (Elsner et al., 2010b;Hamlet andLettenmaier, 2005, 2007;Schnorbus et al., 2014;Shrestha et al., 2012).Its spatially distributed nature makes it suitable for capturing regional variation in the hydrologic cycle due to topographic, physiographic, and climatic controls.The VIC model is also process based, allowing for a more plausible extrapolation of hydrologic processes into future climate regimes (Leavesley, 1994).The VIC model is applied at a resolution of 1/16 • (approximately 27-31 km 2 , depending upon latitude) and run at a daily time step (1 h time step for the snow model).Surface routing between grid cells is done using the linearized Saint-Venant equations (Lohmann et al., 1996).The Finlay River above Akie River, Ingenika River above Swannell River, Parsnip River above Misinchinka River, and Peace River above Pine River sub-basins of the Peace River were calibrated to observations from Water Survey of Canada (Fig. 1).Peace River at Bennett Dam was calibrated to naturalized flow provided by BC Hydro.The sub-basins range in drainage area from 4200 to 83 900 km 2 and from a minimum elevation of 392 m to a maximum of 2799 m (Table 3).All selected basins had strong calibration results over 1990-1995 for both the VIC Forcings and ANUSPLIN gridded observations based on the Nash-Sutcliffe efficiency score (Nash and Sutcliffe, 1970), the Nash-Sutcliffe efficiency score of the log-transformed discharge, and the percent volume bias error (Table 4).Nash-Sutcliffe efficiency score values improved, Nash-Sutcliffe efficiency score of the log-transformed discharge stayed roughly the same, and percent volume bias er- There are several daily streamflow metrics that are useful for water resources design and management, which are also ecologically relevant (Monk et al., 2011;Richter et al., 1996;Shrestha et al., 2014b).A recent intercomparison of statistical downscaling techniques for use with daily streamflow investigated the hydrologic extremes 3-day peak flow and 7day low flow (Maurer et al., 2010).To build on that study we investigate the strength of seven downscaling methods for the same metrics using 3-day peak flow to represent flood and, 7-day low flow, drought.Two low-flow periods are investigated because the lowest discharge takes place in the months of October-April in sub-basins of the Peace River (Fig. 2) and summer low flows (July-September) are of interest to agriculture and ecology.Hydrologic models can have low flows in different seasons than observations due to their poor parameterization of baseflow conditions and because calibration approaches favour good performance for peak flow (Najafi et al., 2011).This issue can be exaggerated by downscaling approaches (Shrestha et al., 2014b).Thus, narrowing the window over which low flows are accessed is important to prevent low flows in one season being compared to low flows in another.Peak flows are analysed between May and July.

Statistical tests
The seven statistical downscaling methods vary in their approach, which can result in differing strengths and weaknesses.We chose our statistical tests to fully evaluate these downscaling techniques for the climate and hydrologic results and to follow the framework of Bürger et al. (2012a).The time period for calibration of the downscaling techniques was selected to match Bürger et al. (2012aBürger et al. ( ) (pre-1991, depending on the availability of the reanalyses).Longer calibration periods available for NCEP1 and 20CR were also seen as favourable when applying bias correction based downscaling methods, especially when working with extremes (Huang et al., 2014;Themeßl et al., 2011), and assisted with evaluating the two gridded observations.Validation was set to 1991-2005 to accommodate the overlap of available reanalyses, gridded observations, and observed streamflow records.ERA40 is an exception, with the last full year of available record for 2001.Validation results for ERA40 are provided for 1991-2001.
All statistical tests used in this study are conducted at the 5 % significance level, meaning that the tests are conducted in such a way that rejection of the null hypothesis is expected to occur in 5 % of tests when the null hypothesis is true.Statistical hypothesis testing with absolute certainty is impossible.The choice of significance level reflects a balance between the rate at which false rejection of the null hypoth-esis is expected to occur (so-called type I error) and the rate at which a given test will correctly reject the null hypothesis when it is false (the so-called power of the test), with the choice of a more conservative significance level, such as 1 %, leading to lower power in exchange for a lower type-I error rate (e.g.von Storch and Zwiers, 1999).
Two statistical tests are applied to the ClimDEX results over the Peace River basin: the Kolmogorov-Smirnov (KS) test and the test for Pearson's correlation.The KS test is used to see how well the distribution of climate indices for the statistically downscaled reanalyses matches the distribution of those calculated from the gridded observations used as downscaling targets.The KS test is a nonparametric test of the equality of continuous one-dimensional probability distributions.Here, it is used to compare two samples, namely annual climate indices for the statistically downscaled reanalyses and the associated gridded observation.The KS test statistic is used to quantify the distance between empirical distribution functions of these two samples.The null hypothesis is that the two samples are drawn from the same distribution.The distributions considered under the null hypothesis have to be continuous distributions, but are otherwise unrestricted.While some of the climate indices are not strictly continuous (e.g.frost days), asymptotic critical values may still be used in the presence of a small number of ties (Janssen, 1994).Pearson's correlation is used to test the temporal correspondence between the annual climate indices for the statistically downscaled reanalyses and the associated gridded observation.Pearson's product moment correlation coefficient is used to measure the linear correlation between climate indices from downscaled reanalyses and indices from observations.The null hypothesis is that the downscaled and observed samples are not linearly correlated.
The 101 000 km 2 Peace River basin is represented by 3975 grid cells at the 1/16 • resolution used to run the VIC hydrologic model.The KS test and Pearson's correlation are evaluated on each of the grid cells in the Peace River basin for each climate index.The statistical significance of the KS test and Pearson's correlation results over the basin as a whole are measured using a field significance test: the Walker field significance test (Wilks, 2006), where the evaluation of field significance is done by using the minimum local p value as the global test statistic.The Walker field significance test was selected because it is relatively insensitive to correlations among local tests, allowing global tests based on data exhibiting both spatial and temporal correlations to be conducted.Temporal and spatial correlations between climate indices grids would require a cumbersome procedure to address correctly with conventional resampling tests.Walker's test can be seen to be closely related to the conventional field significance test (Storch, 1982) based on counting significant local results, except that Walker's test statistic is based on the smallest of the K local p values, rather than the number of K local tests that are significant at some level.
The KS test and the test for Pearson's correlation were applied to the 3-day peak flow and 7-day low flow in winter and summer for hydrologic data from the five sub-basins of the Peace River.In this case, with the KS test the null hypothesis is that the distribution of the hydrologic extremes created by driving the VIC model with the statistically downscaled reanalyses are drawn from the same sample as those derived from driving the VIC model with the two gridded observations.The null hypothesis for Pearson's correlation is that the hydrologic extremes created by driving VIC with downscaled reanalyses versus gridded observations are not linearly correlated.

Gridded observations and reanalyses
Four reanalyses (NCEP1, ERA40, ERAInt, and 20CR) are compared to two gridded observations (VIC Forcings and ANUSPLIN) over the Peace River basin.Daily precipitation, minimum temperature, and maximum temperature are converted to total monthly precipitation and average monthly temperatures over the 1950-2005.Average minimum and maximum temperatures in ANUSPLIN and VIC Forcings are similar from year to year in most months (Figs. 4 and 5).However, prior to 1970, ANUSPLIN can be up to 5 • cooler than the VIC Forcings and reanalyses from December to February.Precipitation totals are similar from year to year for all months in the two gridded observations, except October, when precipitation difference can be up to 50 mm (Fig. 6).This could be because there is greater station coverage in the VIC Forcings and an elevation adjustment is made with Cli-mateWNA.Differences in these two products resulting from these factors might be more apparent in the shoulder season.
There is a warm bias in minimum temperature in 20CR and ERA40 from May to November and a cool bias in NCEP1 from March to October relative to gridded observations (Fig. 4).The biases in NCEP1 tend to be greater over part of the record in some months, such as from 1970 to ∼ 1995 in June.ERAInt is closest to gridded observations for minimum temperature, but is only available after 1979.Some of the patterns seen in minimum temperature are repeated in maximum temperature (Fig. 5).NCEP1 values are noticeably cooler than observations and other reanalyses in May, June, July, September, and October in some years.In April, maximum temperatures in 20CR and NCEP1 are close to each other and roughly 5 degrees less than the other reanalyses and gridded observations.Maximum temperatures for ERA40 and ERAInt are closest to gridded observations from year to year in all months.Monthly precipitation in the NCEP1 and ERA40 reanalyses has similar magnitudes and variability as the gridded observations (Fig. 6).ERAInt is close to observations in the autumn and winter months, but has higher precipitation values in March through to August.20CR stands apart from the other reanalyses and both gridded observations with consistently larger precipitation amounts, roughly twice the magnitude as observations in September through to April.However, sequencing of events is similar between 20CR and observations.This confirms that near surface temperature and precipitation values from the selected reanalyses have different characteristics due to their different resolutions, model physics, and contributing data in the Peace River basin.The two gridded observations also displayed some dissimilarity in time.Differences between these four reanalyses in this particular region should act as a stringent test of the downscaling techniques applied.However, we expect that the time-dependent differences between gridded observations and NCEP1 for minimum and maximum temperature, and precipitation, will reduce the success rate of any of the downscaling techniques (Maurer et al., 2013).Nevertheless, we carry NCEP1 through the analysis to quantify the impacts of using a potentially flawed reanalysis and also to evaluate VIC Forcings and ANUSPLIN over their full record  with two reanalyses (NCEP1 and 20CR).

Impact of the downscaling approach and reanalyses on ClimDEX results
Downscaled minimum temperature, maximum temperature, and precipitation from seven gridded downscaling methods, two gridded observations, and four reanalyses were used to generate 26 ClimDEX indices.Results were compared to the indices generated from the respective gridded observations at their native resolution (VIC Forcings (∼ 6 km) and ANUS-PLIN (∼ 10 km)) for their ability to match the timing (Pearson's correlation) and distribution (KS test) of values over the Peace River basin using the Walker field significance test (Wilks, 2006).
In the calibration (1950-1990) and validation (1991-2005)   ble 5), Namely, PRCPTOT, annual total wet day precipitation (> 1 mm), in ANUSPLIN is 18 and 21 % less than VIC Forcings in the calibration and validation periods, respectively.The events on a given day are larger in VIC Forcings than ANUSPLIN as shown by the higher R95p, RX1day, RX5day, R10mm, and R20mm values.Between the validation and the calibration period, PRCPTOT increases more in VIC Forcings than in ANUSPLIN.The increase in VIC Forcings comes from an increase in precipitation days (R1mm) rather than an increase in intensity.Magnitudes of the larger precipitation events actually decrease for VIC Forcings, while they increase for ANUSPLIN, although these events are still larger in VIC Forcings than ANUSPLIN in the validation period.The percentage of cool nights decreases and the duration of warm spells increases somewhat equally for both gridded observations.However, increases in the percentage of warm days and warm nights, and decreases in the percentage of cool days and duration of cold spells, are greater in ANUSPLIN than VIC Forcings, which suggests that the warming signal in ANUSPLIN is stronger.Statistically significant increases in annual minimum temperatures were found by Rodenhuis et al. (2009) in this region.Differing trends in climate extremes are common in gridded observations due to differences in stations, interpolation tech-niques, and potential corrections for temporal inhomogeneity.Donat et al. (2014) found that decadal trends in maximum 5-day precipitation amounts (Rx5day) over 1979-2008 ranged from −15 to 5 mm decade −1 in the Peace River basin region, depending on the gridded observations they studied.VIC Forcings included a monthly temporal adjustment to increase homogeneity (Hamlet and Lettenmaier, 2005), while ANUSPLIN did not.Additionally, stations were allowed to drop in and out on a daily bases in ANUSPLIN, whereas stations had to be available for a minimum of 1 year of consecutive days and 5 years over the record to be included in VIC Forcings.Hence, trends in some climate extremes differ for these gridded observations and may or may not match those of "reality" and/or reanalyses.Irrespective of downscaling method or reanalysis, those methods calibrated and validated against the ANUSPLIN gridded observations were more successful versus those based on VIC Forcings overall (Table 6), although there were some cases where VIC Forcings passed more tests than ANUSPLIN (Table 8).For example, under the BCCA method, precipitation amounts on extremely wet days (R95p) for all reanalyses based on VIC Forcings failed the Walker field significance test for the Pearson correlation, while those for ANUSPLIN passed (Fig. 7).(Note: time series shown are averages of all of the VIC Forcings or ANUSPLIN cells in the Peace basin, while the significance of results was based on the Walker field significance of the correlation tested on each grid cell in the basin.)The largest differences in the number of tests passed primarily occur for precipitation based indices where ANUSPLIN passes more than VIC Forcings.VIC Forcings passes 29 more tests than ANUS-PLIN for DTR (Table 7).This result is not unexpected because the differences between the calibration and the validation period are precipitation related in VIC Forcings and temperature related in ANUSPLIN (Table 5).
Step changes in daily temperature range (DTR) from 1950 to 2005 are apparent in ANUSPLIN (Fig. 8).DTR is a strong driver of snowpack generation and melt, and errors in simulating realistic DTR could affect hydrologic modelling results.The sequencing of precipitation indices, such as CWD, PRCPTOT, R10mm, R20mm, R95p, R99p, Rx1day, Rx5day, and SDII, is most difficult to replicate for all methods, especially under VIC Forcings.VIC Forcings has a higher station density than ANUSPLIN because it includes stations from BC Hydro, the BC Ministry of Forests Lands and Natural Resource Operations, and the Ministry of Environment's BC River Forecast Centre Snow Survey Network in addition to those from Environment Canada (Werner et al., 2015).The BC Hydro network provided a large number of stations in the Peace River basin, most of which were not available until the 1980s (Werner et al., 2015).The increase in the number of stations after 1980 in the VIC Forcings likely resulted in more complex spatial patterns in precipitation, despite the monthly temporal adjustment, because it is designed to maintain spatial variability (Hamlet and Lettenmaier, 2005).Increased spatial variability in the validation period, coupled with a different interpolation method in VIC Forcings, could have made precipitation patterns harder to replicate with downscaling.If we are going to rely on these data sets to investigate changes to extreme climate and hydrology, we should develop a way to maintain temporal and spatial homogeneity for daily values while allowing data sets to reflect natural trends.Minimizing homogeneity problems throughout the record is favourable when using gridded ob- servations to calibrate statistical downscaling methods (Gutmann et al., 2014;Livneh et al., 2013;Maurer et al., 2002).
Considering results for all downscaling methods and both gridded observations, results based on ERAInt had the highest score of all four reanalyses for the Pearson correlation and KS tests combined (Table 6).ERAInt results matched sequencing of events most often, as indicated by frequent rejection of the null hypothesis for the Pearson correlation test (Fig. 7; Table 8), and ERA40 results matched distributions most often according to the KS test (Fig. 9; Table 8).The zero-correlation null hypothesis was rejected when comparing ERAInt for the ANUSPLIN and VIC Forcings gridded observations for the number of heavy precipitation days (R10mm), but was not rejected with other reanalyses (Fig. 7).ERA40 and ERAInt monthly average minimum and maximum temperature and total precipitation matched those of the gridded observations most closely (see Sect. 4.1).ERAInt is the highest resolution (1.5 • ) and both ERAInt andERA40 excluded 1950-1958 in their calibration when NCEP1 and 20CR did not (Table 2), which may have avoided potential problems with the gridded observations caused by lower station availability earlier in the record and with reanalysis data from the pre-satellite era (1979-) and before the expansion and standardization of a global radiosonde network (1958-).Results for SDII for VIC Forcings and ANUSPLIN under all seven downscaling methods show large differences between gridded observations and downscaled NCEP1 prior to 1958 (Fig. 10).Gutmann et al. (2014) tested four downscaling methods with NCEP1 focusing on the period containing satellite microwave and infrared atmospheric soundings (1979-) and still found that temporal instabilities in NCEP1 contributed to failure in downscaling techniques for some metrics.Root mean square error in sea level pressure decreases from 1950 to 2008 strongly in NCEP1, somewhat in ERA40, and minimally in 20CR (see Fig. 10 in Compo et al., 2011).Assimilating only surface pressure reports and using observed monthly sea surface temperature and sea ice distributions as boundary conditions to create 20CR has resulted in a more temporally consistent product.However, it has still improved over time.Changes in 20CR in combina- The highest ranked downscaling method based on the combined results for field significance of Pearson's correlation and the KS test for all gridded observations, reanalyses, and ClimDEX indices was DBCCA (Table 6).It tied for highest rank with CI for correlation, while BCCAQ superseded all other methods for distribution.Bias remains in results of the BCCA method for precipitation due to the linear combination of fine-scale analogues and uncorrected "drizzle" and related biases (Guttmann et al., 2014).All downscaling methods, except CI, include a quantile mapping bias correction step and are expected to do well in matching distributions with their respective gridded observation.All methods except CI pass 86 % or more of the tests for distribu-  Forcings and ANUSPLIN for 1991-2005(1991-2001  tion (KS test), while CI passes 78 %.The correlation of DTR was a problem for all the downscaling methods and both gridded observations (Fig. 8) and for distribution based on ANUSPLIN (except BCCAQ), but not when based on VIC Forcings.BCCAQ in combination with ANUSPLIN matched DTR distributions for ERAInt, ERA40, and 20CR when all other methods failed, which points to the success of its approach of post-processing BCCA with a final quantile mapping bias correction based on BCCI.As mentioned above, DTR is an important driver in snowpacks.Additionally, it plays a key role in evaporation (Sheffield et al., 2012).Rates of evaporation are an important component of projecting future water availability and drought (Sherwood and Fu, 2014).Therefore, accurately downscaling DTR should be a priority.Including minimum and maximum monthly temperature predictors in BCSDX did not improve the correlation of DTR as was hypothesized in previous studies (Bürger et al., 2012a).

Impact of the downscaling approach and reanalyses on hydrologic extremes
The   1991-2005(1991-2001 ERA40) ERA40).Dark grey boxes indicate cases in which the null hypothesis is rejected at the 5 % significance level.
tremes when calibrated to one gridded observation versus another.NCEP1 has routinely been used to compare the performance of statistical downscaling methods in terms of climate and hydrologic extremes (e.g.Bürger et al., 2012a andMaurer et al., 2010).We thus continue our comparison of multiple gridded observations, reanalyses, and downscaling techniques for hydrologic extremes.Results are compared for 15 years from 1991 to 2005 (inclusive) for the five subbasins, except for ERA40 (11 years; 1991-2001).We evaluate methods for their ability to replicate the timing (Pearson's correlation) and distribution (KS test) of the 3-day peak flow, 7-day low flow in summer, and 7-day low flow in winter.
Irrespective of reanalysis or downscaling method, VIC hydrologic model simulations based on the VIC Forcings gridded observations passed 8 % more tests than those based on the ANUSPLIN gridded observations (Table 9), whereas for the ClimDEX indices ANUSPLIN passed 7 % more tests than VIC Forcings (Table 6).The difference in the number of tests passed is not great.Therefore, the success of the downscaling methods does not depend strongly on which of the gridded observations is applied overall.However, the greater number of tests passed for hydrologic modelling with the VIC Forcings gridded observations could relate to VIC Forcings being created at the native resolution of the VIC hydrologic model (1/16 • ), whereas the ANUSPLIN data were created at 1/12 • and remapped to 1/16 • using bilinear interpolation.Additionally, a larger precipitation bias correction was required during calibration with the ANUSPLIN data than the VIC Forcings data, suggesting that ANUSPLIN precip- itation is less representative than VIC Forcings.Out of the two statistical tests and three metrics the only case where ANUSPLIN passed more tests than VIC Forcings was for correlation in summer 7-day low flow (Table 10), especially when driven with NCEP1 and 20CR downscaled via BCCA and DBCCA.Similar results were found for ANUSPLIN and BCCA and DBCCA with the ClimDEX indices (Sect.4.2).
This suggests that there is potential for ClimDEX results to act as a predictor of hydrologic extremes.
When considering results regardless of gridded observation or downscaling technique, the number of tests passed under ERA40 was the highest overall (Table 9).Additionally, the number of tests passed for Pearson's correlation and the KS test were both highest for ERA40.The truncated val-  idation period for ERA40, 1990ERA40, -2001ERA40, versus 1990ERA40, -2005 for other reanalyses, could have avoided some challenging hydrologic extreme events in 2002-2005. However, ERAInt, which was validated over 1990-2005, passed nearly the same number of tests as ERA40.Thus, the shorter calibration period in ERA40 and ERAInt avoids step changes in the gridded observations and reanalyses prior to 1958.Peculiarities with the gridded observations were apparent from 1950 to 1958 for the monthly average minimum and maximum temperatures (Figs. 4 and 5) and for the DTR and SDII ClimDEX indices (Figs. 8 and 10).Avoiding these years could have reduced artefacts in the downscaled products and hydrologic model results.Nevertheless, many studies have demonstrated that ERA40 and ERAInt are superior products versus NCEP1 (Donat et al., 2014;Ma et al., 2008Ma et al., , 2009;;Sillmann et al., 2013a).In our own analysis ERA40 and ERAInt have similar timing and magnitude in minimum and maximum temperature and precipitation (Figs. 4,5,and 6) to the gridded observations when NCEP1 and 20CR do not.These results confirm that downscaling methods will succeed when applied to reanalyses that have correct timing, magnitude, and trends such as ERA40 and ERAInt, more so than when applied to reanalyses such as NCEP1 and 20CR that have irregular step changes (Maraun, 2013).We should be able to assume that although the biases in GCMs will be greater than those found in reanalyses, they are consistent over time.The strength of downscaling methods when downscaling ERA40 and ERAInt versus NCEP1 and 20CR was also found with the ClimDEX indices.
The BCCAQ method was the best overall performer for the three hydrologic extremes.It was the best method according to Pearson's correlation and tied for second place with DBCCA and BCCI, after BCSD and BCSDX, for the KS test.BCSD and BCSDX passed the fewest number of tests for correlation, while CI passed the fewest for distribution.In the case of ClimDEX, BCCAQ ranked third after BCCA and BCCI.The strength of the BCCAQ method when tested in terms of basin-wide hydrologic modelling and hydrologic extremes, rather than in terms of ClimDEX indices at individual grid cells, comes from the maintenance of daily spatial patterns resulting from the combination of BCCA and BCCI methods.Event-scale spatial gradients and magnitudes are preserved by reordering the BCCI outputs based on the rank order structure from BCCA.In effect, this removes the overly smooth representation of sub reanalysis-grid-scale variability Table 10.Number of basins where the null hypothesis that the downscaled and observed (VIC Forcings and ANUSPLIN) derived 3-day peak flows are not linearly correlated was rejected and the number of basins where the null hypothesis that the downscaled and observed based distributions are drawn from the same sample was not rejected, by downscaling method/reanalysis combinations for 1991-2005(1991-2001 ERA40) ERA40) from BCCI (Maraun, 2013) and largely corrects remnant biases in magnitude from BCCA (Guttmann et al., 2014).Spatial covariability is much more relevant in hydrologic modelling than the comparison of climate indices between products on a grid cell to grid cell basis.This method is also better at maintaining long-term trends, which might explain failed tests in some of the sub-basins when downscaling NCEP1 and 20CR, which, as shown earlier, exhibit inhomogeneities between calibration and validation periods.BCCAQ could be failing for the "right reason" when the trend in VIC Forcings or ANUSPLIN for a given metric is opposite that in NCEP1 or 20CR.BCCAQ is the only method to pass the Pearson correlation and KS test in all five sub-basins when downscaling ERA40 or ERAInt to VIC Forcings or ANUSPLIN for all three hydrologic extremes.BCCAQ has overcome some of the challenges of BCCA that Maurer et al. (2010) would not have been able to find using NCEP1 alone as surrogate GCM.
It is also more successful than the BCCI method, which is analogous to the statistical downscaling and bias correction (SDBC) method in Ahmed et al. (2013) and asynchronous regression (AR) in Gutmann et al. (2014), by avoiding overestimates of extreme events at aggregate scales (Maraun, 2013).
The BCSD methods pass the most tests for distribution for all basins and reanalyses, while they fail more tests than any other downscaling method for correlation due to their reliance on random sampling of historical months when temporally disaggregating from the monthly to daily time step (Table 6).Thus, these methods will get the frequency and magnitude of events correct, but will get the timing of when these events occur wrong.Again, including the minimum and maximum temperatures from the large-scale model (reanalysis) does not improve the number of tests passed with BCSDX versus BCSD.For 3-day peak flow (Table 11; Fig. 11) and 7-day low flow in summer (Table 10; Fig. 12) these methods pass the majority of tests for correlation.Very few tests are passed for correlation in 7-day low flow in winter (Table 12; Fig. 13).Winter low flows are challenging to monitor and to model.There could be ice on the river causing the stagedischarge relationships to be incorrect.Also, as mentioned, models are not parametrized or calibrated to best represent base flow.However, BCSD and BCSDX have more trouble than any of the other downscaling methods.Due to the resampling of daily events from the historical gridded observations there can be precipitation occurring in combination with temperatures warm enough to generate runoff (Fig. 14).This is because of the stochastic resampling of the historical precipitation, but is also related to temperature since runoff is occurring when conditions should be near freezing.Additionally, the random selection of months from the historical record can lead to large discontinuities across month boundaries, such as in December-January (Fig. 14).This is when it is important to get daily events from the GCM or reanalyses (e.g. as in the CI, BCCI, BCCA, DBCCA, and BCCAQ methods).As calibrated, the VIC model is known to have limited performance for low flows and additional errors were suspected to have been contributed by BCSD in downscaled 20C3M GCM results (Shrestha et al., 2014b).Some sharp spikes on the rising limb of the hydrograph suggest rain-on- snow events caused by the downscaling-driven results that are not displayed in the runs based on gridded observations.The CI method is the closest to the delta method that we have investigated.The median and ranges for CI are much lower for winter 7-day low flow (not shown).The poorer perfor-mance of the CI method for the KS test is due to the lack of quantile mapping bias correction in this method.

Conclusions
We have tested the applicability of seven techniques for downscaling coarse-scale climate models in terms of ClimDEX indices and hydrologic extremes.The seven approaches investigated include several methods commonly used in hydrologic modelling.Some of these had been explored before (i.e.BCSD and BCCA), but not using multiple reanalyses.Choice of reanalysis was found to affect the number of tests passed for a given downscaling technique.Downscaling methods were more successful under ERA40 or ERAInt than they were under NCEP1 or 20CR.
The quality of reanalyses and gridded observations changed over the calibration period due to changes in availability of satellite/radiosonde data and station observations.NCEP1, the reanalysis used as a surrogate GCM in many previous downscaling intercomparisons, had an obviously erroneous step change in temperature over the Peace River basin.Between the calibration and the validation period, changes in ClimDEX indices were greater for precipitation with VIC Forcings but greater for temperature with ANUSPLIN.Thus, trends in ClimDEX indices differed in these gridded observations.ANUSPLIN passed 5 % more tests than VIC Forcings, mostly for precipitation-related ClimDEX indices.Through  this work we learned a lot about these gridded observations and discovered evaluation procedures that will be useful for future studies.BCSDX, DBCCA, and BCCAQ downscaling methods had not been evaluated in terms of ClimDEX indices and hydrologic extremes before now.The BCSDX method included minimum and maximum temperature from the reanalyses instead of mean as is done in BCSD, but this did not improve its ability to resolve temperature indices, such as diurnal temperature range or hydrologic extremes.DBCCA was an improvement over BCCA and passed the greatest number of tests for the ClimDEX indices.The double bias correction proved capable of reducing some of the drizzle and remnant bias in precipitation amounts found in BCCA.The BCCAQ method, which combines BCCA and BCCI, performed well in terms of number of tests passed for the ClimDEX indices, but it really shone for use with modelling hydrologic extremes.In this context, it exceeded all other methods.BC-CAQ provides a more accurate representation of event-scale spatial gradients, removing the overly smooth representation of sub reanalysis-grid-scale variability inherited from BCCI and correcting biases from BCCA.These attributes are important for simulating the climate events that occur over a basin that drive runoff.All methods passed correlation and Based on results from this study, use of a daily downscaling method, such as BCCAQ, in conjunction with a rigorously constructed and validated observational data set, is recommended to supplement the existing hydrologic modelling efforts at PCIC and improve projections of hydrologic extremes.
We can build on this work to develop tools that predict changes to hydrologic extremes from changes in climate extremes without the direct application of a hydrologic model.Similar emulations have been made by drawing on the relationship between GCMs and hydrologic model projections (Schnorbus and Cannon, 2014) and by identifying relationships between GCMs and RCMs (Li et al., 2011).The next step is to identify which of the 26 ClimDEX indices are predictors of 3-day peak flow and 7-day low flow and avoid those downscaling methods that simulate them poorly.

Figure 1 .Figure 2 .
Figure 1.The Peace River basin (above Taylor, BC) study area analysed for ClimDEX indices (black boundary) and the five subbasins investigated for hydrologic extremes, including the Finlay River above the Akie River (FINAK), the Ingenika River above the Swannell River (INGEN), the Parsnip River above the Misinchinka River (PARMS), the Peace River above the Pine River (PEAPN), and the Peace River at Bennett Dam (BCGMS).

Figure 3 .
Figure 3. (a) Diagram of the bias corrected spatial disaggregation (BCSD), bias corrected constructed analogues (BCCA), and bias corrected climate imprint (BCCI) downscaling methods and a summary of adjustments made to these methods to create BCSD with monthly minimum and maximum temperature (BCSDX), double BCCA (DBCCA), climate imprint (CI), and BCCA corrected to BCCI (BCCAQ).(b) Workflow diagram for assessment of downscaling techniques in replicating ClimDEX and hydrologic extremes.
periods, the VIC Forcings and ANUSPLIN data sets are similar for most temperature based indices and show some large differences for precipitation based indices (Ta-
previous section shows how raw reanalyses and observations differ in the Peace River basin and how downscaled reanalyses can differ in their representation of climate ex-
Figure 11.Boxplots, time series, and distributions of 3-day peak flow in the spring months (May-July) for NCEP1, 20CR, ERA40, and ERAInt in the BCGMS basin based on VIC Forcings (top) and ANUSPLIN (bottom).Legend same as Fig. 9.

Figure 12 .
Figure 12.Time series of 7-day low flow in the summer months (July-September) for NCEP1, 20CR, ERA40, and ERAInt in the BCGMS basin based on VIC Forcings (top) and ANUSPLIN (bottom).Legend same as Fig. 9.

Figure 13 .
Figure 13.Time series of 7-day low flow in the winter months (November-April) for NCEP1, 20CR, ERA40, and ERAInt in the BCGMS basin based on VIC Forcings (top) and ANUSPLIN (bottom).Legend same as Fig. 9.

Table 1 .
Availability of gridded observations and reanalyses.

Table 2 .
Calibration and validation periods for downscaling methods by reanalyses.

Table 3 .
Metadata for five select sub-basins of the Peace River basin.

Table 4 .
Calibration and validation statistics for five select sub-basins of the Peace River basin under the under VIC Forcings and ANUSPLIN gridded observational data sets including the Nash-Sutcliff efficiency score (NS), the Nash-Sutcliff efficiency score of the log-transformed discharge (LNS), and the percent volume bias error (%VB).

Table 5 .
Mean annual ClimDEX values for VIC Forcings and ANUSPLIN averaged over the Peace River basin.

Table 6 .
Summ-2005f the number of tests passed for Pearson's correlations and similarity in distributions (KS test) based on the Walker field significance test between ClimDEX indices for downscaled reanalyses versus target gridded observation over the Peace River basin for1991-2005  (1991 -2001 ERA40) ERA40), summarized by gridded observation, reanalysis, and the downscaling method.Max indicates the maximum possible tests to pass in that category.

Table 7 .
Number of tests passed for each ClimDEX index for VIC in ERA40).

Table 8 .
Summary of the number of tests passed for Pearson's correlations and similarity in distributions (KS test) based on the Walker field significance test between ClimDEX indices for downscaled reanalyses versus target gridded observation over the Peace River basin for1991-2005 (1991-2001)for reanalysis (ERA40) versus the downscaling method for each gridded observation.

Table 9 .
Summ-2005f the number of tests passed for Pearson's correlations and similarity in distributions (KS test) based on the Walker field significance test between hydrologic extremes for downscaled reanalyses versus target gridded observation over the Peace basin for1991-2005  (1991 -2001 ERA40) ERA40), summarized by gridded observation, reanalysis, and downscaling method.Max indicates the maximum possible tests to pass in that category. .

Table 11 .
As in Table 10 but for summer 7-day low flow.

Table 12 .
As in Table10but for winter 7-day low flow.