Land surface model performance using cosmic-ray and point-scale soil moisture measurements for calibration

. At very high resolution scale (i.e. grid cells of 1 km 2 ) , land surface model parameters can be calibrated with eddy-covariance ﬂux data and point-scale soil moisture data. However, measurement scales of eddy-covariance and point-scale data differ substantially. In our study, we investigated the impact of reducing the scale mismatch between surface energy ﬂux and soil moisture observations by replacing point-scale soil moisture data with observations derived from Cosmic-Ray Neutron Sensors (CRNSs) made at larger spatial scales. Five soil and evapotranspiration parameters of the Joint UK Land Environment Simulator (JULES) were calibrated against point-scale and Cosmic-Ray Neutron Sensor soil moisture data separately. We calibrated the model for 12 sites in the USA representing a range of climatic, soil, and vegetation conditions. The improvement in latent heat ﬂux estimation for the two calibration solutions was assessed by comparison to eddy-covariance ﬂux data and to JULES simulations with default parameter values. Calibrations against the two soil moisture products alone did show an advantage for the cosmic-ray technique. However, further analyses of two-objective calibrations


Introduction
The land surface water and energy balances are coupled through the process of evapotranspiration. Soil moisture is one of the main water reservoirs near the land surface and can hence importantly control the surface water and energy balances. Soil moisture provides a first-order (i.e. direct) control on evapotranspiration when there is insufficient water to meet the evaporative demand (Manabe, 1969;Budyko, 1956;Seneviratne et al., 2010). An indirect effect of soil moisture on surface energy flux partitioning is for instance the damping effect on soil and air temperature, which in its turn affects humidity, evapotranspiration, boundary-layer stability, and in some cases precipitation (Seneviratne et al., 2010). The control of soil moisture on temperature at seasonal scales is especially strong in transitional climate regions (Koster et al., 2004).
Land surface models (LSMs) solve the surface mass (including water), energy, and momentum balances to provide the weather and climate prediction models with lower boundary conditions. The land surface has been shown to play an important role in global atmospheric circulation (Koster et al., 2004). Because the soil moisture state and surface fluxes are so closely connected, it is important to accurately simulate these simultaneously (Henderson-Sellers et al., 1996;Richter et al., 2004;Seneviratne et al., 2010;Dirmeyer, 2011;Dirmeyer et al., 2013).
The increasing complexity of land surface models over the past decades (Sellers et al., 1997;Seneviratne et al., 2010) has brought with it an increasing number of parameters, with values not easily defined with in situ measurements because of scale mismatch. For instance, stomatal resistance mea-J. Iwema et al.: Land surface model performance using cosmic-ray measurements sured at leaf level is not the same as canopy stomatal resistance needed for LSMs (Blyth et al., 1993). Soil hydraulic parameter values (e.g. soil hydraulic conductivity) are often obtained from laboratory experiments on soil cores for cubic centimetres to cubic decimetres. The soil properties and processes at this scale can however differ from those at the LSM grid cell sizes, which are often as large as hundreds to tens of thousands of square kilometres (Pitman, 2003). Due to the different governing processes, upscaling the soil hydraulic properties from soil core scale to field scale is nontrivial (Vinnikov et al., 1996;Crow et al., 2012).
In an effort to develop global hydrometeorological monitoring and prediction capabilities (Wood et al., 2011), hydrological models and LSMs are now increasingly being applied at the finer "hyper-resolution scale" with grid cells of about 1 km 2 . Typically, parameters are calibrated and validated at this hyper-resolution against in situ measurements from sources such as eddy-covariance flux towers and pointscale (PS) soil moisture sensors (e.g. time domain transmissivity and time domain reflectometry; Stockli et al., 2008;Richter et al., 2004;Blyth et al., 2010Blyth et al., , 2011Rosolem et al., 2013). Such calibration/validation data have become more widely available at hyper-resolution scale (Baldocchi et al., 2001;Smith et al., 2012). However, the horizontal footprints of different measurement techniques vary from each other: eddy-covariance surface energy flux data represent a downwind footprint of 100 m 2 to 1 km 2 , while in situ soil moisture probes link to much smaller surface areas by representing a support volume (Blöschl and Sivapalan, 1995) of ∼ 4 dm 3 only (Running et al., 1999;Kurc and Small, 2007;Vivoni et al., 2008). Soil moisture is spatially non-uniform within the eddy-covariance footprint due to heterogeneity in soil properties, vegetation, and topography. Therefore, soil moisture measurements best (i.e. most effectively) representing the soil below the eddy-covariance tower's footprint should be used when the performance of a land surface model is evaluated. If soil moisture is measured at only a single or a few locations with limited support volume, like with PS sensors, the measured soil moisture content might be different from the effective soil moisture state that controls the surface exchange processes. It is therefore often assumed that soil moisture measured at a scale closer to the footprint of an eddy-covariance tower is more informative than a single or a couple of PS sensor profiles for studying land surface processes and constraining model parameters at scales > ∼ 100 m 2 (Robinson et al., 2008). This poses a potential scale mismatch issue when a single or a few PS sensors are used. On the other hand, past research has shown that soil moisture measured at only one or a few points within an area of similar size to an eddy-covariance footprint can have similar values to surface energy flux simulation as soil moisture measured at a larger scale (e.g. Vachaud et al., 1985;Teuling et al., 2006;Mittelbach and Seneviratne, 2012). These studies showed that points within a soil moisture observation network keep their rank with respect to the mean soil moisture (anomaly), i.e. they either underestimate or overestimate the mean (anomaly), so-called spatio-temporal stability. The physical principle behind the spatio-temporal stability theory is that different time variant and hydrological processes either create or destroy spatial soil moisture variability, whereas time invariant land surface characteristics induce an effective offset in the spatial variability (Albertson and Montaldo, 2003;Teuling and Troch, 2005;Vanderlinden et al., 2012). When soil moisture reaches values below the critical point (i.e. transpiration becomes moisture limited), spatial variability in soil moisture and fluxes is reduced (Teuling and Troch, 2005). Soil moisture dynamics was found to be a small portion of total soil moisture variability (Mittelbach and Seneviratne, 2012) while having a greater effect on surface energy fluxes than absolute soil moisture in land surface models (Dirmeyer et al., 1999;Teuling et al., 2009). The implication of the spatio-temporal stability theory therefore is that the spatial-scale mismatch issue might have limited implications to surface energy flux simulation. The question which then arises is, does reduced observation-scale mismatch improve LSM energy flux estimates?
Based on the spatio-temporal stability theory, we phrased the following hypothesis for our research question: reduced observation-scale mismatch does not lead to LSM energy fluxes closer to eddy-covariance observations.
In recent years, new soil moisture measurement techniques have been developed that have, compared to point-scale soil moisture sensing techniques, a reduced scale mismatch with eddy-covariance surface energy flux measurements. Improvement in wireless technology and remote data collection technology have made the development of PS soil moisture sensor networks more feasible (Cardell-Oliver et al., 2005;Ritsema et al., 2009;Trubilowicz et al., 2009;Bogena et al., 2009;Robinson et al., 2008). Newer soil moisture sensor techniques, for instance one which makes use of the Global Positioning System (GPS; Larson et al., 2008Larson et al., , 2010, and the Cosmic-Ray Neutron Sensor (CRNS; Zreda et al., 2008) have the advantage that their installation requires less time and work effort because only a single above-ground sensor is needed. The CRNS (Zreda et al., 2008) is an above-ground passive sensor which utilizes natural cosmic-ray neutron radiation to estimate soil moisture content in the top 10-70 cm. The sensor's footprint area has a radius of about 100 to 300 m surrounding the above-ground sensor (Desilets and Zreda, 2013;Köhli et al., 2015).  showed that soil moisture estimated from CRNS neutron counts differed less than 20 % from the average of a co-located point-scale soil moisture sensor network at a site in Arizona. Networks of the CRNS have been established in various countries, e.g. the Cosmicray Soil Moisture Observation System, COSMOS (Zreda  (Evans et al., 2016), the Australian National Cosmic-Ray Soil Moisture Monitoring Facility CosmOZ (Hawdon et al., 2004), and TERrestrial ENvironmental Observatoria, TERENO (Baatz et al., 2015).
Unlike wireless point-scale sensor networks, both the GPS and CRNS technology provide an integrated soil moisture measurement over the entire support volume (Larson et al., 2008;Zreda et al., 2008). We chose to answer our research question using the CRNS technology because the COSMOS network provides publicly available data for multiple years at a range of sites co-located with AmeriFlux/FLUXNET eddycovariance towers (ORNL-DAAC, 2015). Twelve of these sites provided sufficient LSM forcing data, PS soil moisture data, CRNS data, and eddy-covariance LE and sensible heat flux (H) data.
Before our modelling exercise we first compared the PS and CRNS data. The outcomes of this data analysis were mainly used to see whether the results from the calibration and validation yielded larger differences in surface energy flux estimation at sites where the two soil moisture observation products showed higher deviation from each other. To investigate our research question we made the LSM simulated soil moisture content match the observed PS or CRNS data as closely as possible. We did this by calibrating parameters of the Joint UK Land Environment Simulator (JULES; Best et al., 2011) against point-scale and Cosmic-Ray Neutron data separately. We subsequently validated the results against eddy-covariance observed data over the same periods. To assess the change in soil moisture and surface energy fluxes after calibration we compared the calibrated runs against a default run with parameter values computed from a widely used soil properties database. We emphasize that we compared the two different soil moisture measurement techniques' scales and not the techniques as such.
2 Data and methods 2.1 Calibration and validation data: PS, CRNS, and eddy-covariance data Point-scale (PS) soil moisture and CRNS neutron count data from 12 AmeriFlux/COSMOS sites were used ( Fig. 1; full COSMOS site names are shown in this figure). These 12 sites covered 8 of 20 Ecoclimatic domains of the US National Ecological Observatory Network (NEON; www. neonscience.org) (Fig. 1). These 12 sites hence represent a variety of climates and land cover types, but also different soil types (Table 1). Hourly PS data for nine sites were obtained from the publicly available AmeriFlux Level 2 data source (ORNL). Data for the three California Climate Gradient sites (DC, CS, and SO) were obtained at http://www.ess.uci.edu/~california/ (data version 3.4; Goulden, 2015). The number of PS profiles, the installation depths, and sensor types differed between the 12 study sites (Table A1 in Appendix A). We used point-scale soil moisture data from the soil layers up to 30 cm depth only for consistency among all sites. There were only two sites reporting soil moisture data at greater depths: WR at 50 cm and MO at 100 cm. Our main objective was to investigate the difference in information content due to two soil moisture measurement techniques' different horizontal scales in relation to the eddy-covariance footprint, rather than to compare the measurement techniques themselves. Quality control was applied to filter out spurious and unrealistic data points due to sensor errors. The PS data were then interpolated to the JULES soil layer on which the model was calibrated.
Hourly CRNS neutron count data were obtained from the COSMOS network website (http://cosmos.hwr.arizona. edu/). Corrections were applied as by Zreda et al. (2012). Water vapour corrections  were applied with respect to dry air (Bogena et al., 2013). The quality control approach used for the PS analysis was also applied to CRNS neutron count data series to remove unrealistic points. Snow cover periods were also removed for the analysis. A 5 h moving average window was applied to the observed CRNS neutron counts (following Shuttleworth et al., 2013;Rosolem et al., 2014). We used the exact same data for the soil moisture data comparison as for the calibration (providing that both PS and CRNS were available in the same period). As validation data, latent heat (LE) flux hourly data from the AmeriFlux Level 2 data source were used for nine sites. We used data version 3.4 (Goulden et al., 2015) for the three California Climate Gradient sites. Quality control was applied to the LE and sensible heat flux (H) flux data to remove outliers and unrealistic data points.

Soil moisture data comparison methodology
To compare PS and CRNS soil moisture data, we computed the mean squared deviation (MSD) and its decomposition (Gupta et al., 2009) where µ (m 3 m −3 ) is the observed mean, σ (m 3 m −3 ) is the standard deviation, and r (-) is the coefficient of linear correlation. We however scaled the relative contributions of the three MSD components to the root mean squared deviation (RMSD) instead of MSD, to keep units of m 3 m −3 . We then ranked the sites from lowest to highest RMSD. According to our hypothesis, we would expect to see a larger difference in simulated surface energy fluxes after calibration when the two soil moisture time series differ most. We compared PS soil moisture with CRNS soil moisture values computed from vertically homogeneous soil moisture values obtained from the observed neutron counts using the COsmic-ray SoiI Moisture Interaction Code (COSMIC; Shuttleworth et al., 2013).

JULES forcing data and initial conditions
JULES requires precipitation, air temperature, atmospheric pressure, wind speed, specific humidity, downward shortwave radiation, and downward longwave radiation as meteorological forcing data. Quality controlled hourly data were obtained from AmeriFlux Level 2 and the three California Climate Gradient sites. At some of the sites, however, certain specific forcing data were not available from AmeriFlux, and hence data from different sources were used (Table A2 of Appendix A).
Model input data were gap-filled following  because JULES requires continuous time series except for precipitation where gaps were set to zero. For all sites, gaps smaller than 3 h were filled by linear interpolation. Larger gaps of up to 30 days were gap-filled using the average diurnal pattern of the preceding and following 15 days. In addition, some remaining gaps in downward shortwave radiation and downward longwave radiation at Wind River (WR) were filled using linear least squares relationships with NLDAS-2 data (http://disc.sci.gsfc.nasa. gov/uui/datasets?keywords=NLDAS). At Soaproot (SO) and Coastal Sage (CS), data gaps in the atmospheric pressure time series were filled with NLDAS data. At CS NLDAS data were also used to gap-fill air temperature. Gap-filling at Santa Rita Creosote was done with data from the nearby Sahuarita site, followed by the gap-filling procedure described above.

Joint UK Land Environment Simulator (JULES)
We used the Joint UK Land Environment Simulator (JULES; Best et al., 2011;Clark et al., 2011) in this study. JULES can be coupled as a lower boundary condition to the UK Met Office Unified Model (Cullen, 1993). Within JULES choices can be made (e.g. canopy radiation model type) and certain modules (e.g. vegetation dynamics) can be switched on or off to operate at different levels of complexity. In addition, we chose the UK Variable resolution configuration (UKV) because it is the standard setting when JULES is run coupled with the UK Met Office Unified Model. The UKV land grid cell size is 1 km by 1 km. However, our study focused on JULES standalone simulations at the 12 grid points located at the sites investigated. The UKV setting employs the multilayer canopy radiation module with surface heat capacity and snow beneath the canopy, the single canopy layer "big leaf" approach for leaf-level photosynthesis (which computes radiation absorption with Beer's law). Soil heat conductivity was calculated using the approach of Dharssi et al. (2009). We used the default JULES-UKV soil layering (Supplement 1, Fig. S1.1 in the Supplement). The hydraulic bottom boundary condition in JULES is free drainage.
JULES computes the transport of water through the soil using a finite difference representation of the Richards equation. The vertical fluxes are computed with the Buckingham-Darcy equation. JULES-UKV uses the Mualem-Van Genuchten (Van Genuchten, 1980;Mualem, 1976) soil water retention equations. The Van Genuchten equation calculates soil water content θ (m 3 m −3 ) from soil hydraulic pressure head ψ (m): with shape parameter n (-), α (m −1 ) representing the inverse of the water entry pressure, θ res (m 3 m −3 ) is the empirical residual soil moisture content (without physical meaning), and θ sat (or smsat; m 3 m −3 ) is the saturated soil moisture content. In JULES parameter n is defined as b = 1/(n − 1) and sathh = α −1 (m). The Mualem equation computes the unsaturated hydraulic conductivity K: where K sat (or satcon; mm s −1 ) is the saturated hydraulic water conductivity.
The values of the Mualem-Van Genuchten parameters need to be defined by the user for each grid cell/point based on soil characteristics.
In JULES soil moisture directly interacts with transpiration (through root water uptake) and bare soil evaporation as described hereafter (see also Fig. S1.2). JULES first computes the potential photosynthesis, which is a function of three limiting factors: Rubisco limitation, radiation limitation, and photosynthetic product transport limitation. The potential photosynthesis is multiplied by a soil moisture reduction factor to obtain the actual photosynthesis. To obtain this soil moisture reduction factor, the model first computes a limiting factor for each layer: where θ i is the unfrozen soil moisture content in layer i, θ crit is the critical point soil moisture content below which soil moisture is limiting the root water uptake (matrix water potential −330 cm in JULES), and θ wilt is the wilting point soil moisture content below which no root water uptake occurs (−15 000 cm in JULES). These reduction factors are multiplied by the root density in the layer. These weighted reduction factors are then summed to obtain the root zone soil moisture reduction factor. From the actual photosynthesis the plant stomatal conductance is computed. Separately the bare soil surface conductance, which is a function of the soil moisture content in the upper soil layer and the critical soil moisture, is computed. The surface conductance is then computed as a function of the stomatal conductance and the bare soil surface conductance. The potential evapotranspiration is also calculated separately. This variable is multiplied by the saturated land fraction to compute the free water evaporation (e.g. lake and canopy evaporation). The rest of the potential evapotranspiration is multiplied by the surface conductance to obtain the bare soil evaporation + plant transpiration. Together these fluxes are the actual evapotranspiration (water) or latent heat flux (energy). The amount of water drawn from the top soil layer through bare soil evaporation depends on the bare soil surface conductance. The distribution of the root water uptake between the layers depends on the weighted soil moisture limitation factor for each layer. The water extraction from the soil in its turn directly affects the soil moisture content in the different layers at the start of the next time step. These soil moisture contents then affect the soil moisture redistribution, surface runoff, and deep drainage.
JULES-UKV also requires a number of initial conditions: the amount of unfrozen water and snow stored on the surface (on the canopy and on the soil surface; set to zero in this study), snow properties (set to JULES-UKV defaults), the surface temperature (set to the air temperature of the hour before the first simulation time step), the soil temperature of each layer (set to the soil temperature from AmeriFlux data the hour before the initial time step), and the soil water content in each layer (set to the soil water content from the PS observed moisture content of the hour before the first simulation time step and applied homogeneously throughout the profile). Soil moisture was spun up by running a maximum of five cycles and stopped when soil moisture convergence was lower than or equal to 10 % compared to the previous cycle.

Calibration and validation approaches
At each site we calibrated JULES against PS observed soil moisture and against CRNS observed neutron counts respectively. We chose to calibrate simulated neutron counts against CRNS observed neutron counts using COSMIC  to translate simulated soil moisture profiles into equivalent neutron counts. We computed the root mean squared error (RMSE) between simulated and observed hourly time series. To better match the observed soil moisture/neutron count time series, we calibrated five JULES parameters that influence the model soil moisture state (Fig. S1.2). These included the Mualem-Van Genuchten shape parameter (b), the water entry pressure parameter (sathh), and the saturation hydraulic conductivity (K sat ). The critical point (θ crit ) and wilting point (θ wilt ) soil moisture content parameters from the evapotranspiration limitation factor were also calibrated. We chose these parameters because they are, in theory, directly linked to the movement of moisture in the soil and to the effects of soil moisture on transpiration in JULES.
To assess the effects of calibration on soil moisture and surface energy flux simulation, we compared the calibrated solutions against a default run at each site. The parameter values for the default case were derived from soil properties (percentage clay, loam, and organic matter) reported by the Harmonised World Soil Database (HWSD; FAO, 2009) for each of the 12 sites. These properties were used in the Wösten pedotransfer function (Wösten, 1997) to obtain values for b, sathh, and K sat . Parameter values for θ crit and θ wilt were subsequently obtained with the Van Genuchten formula.
The parameter calibration ranges were the same for all sites (Table 2). They were constructed by computing the minimum and maximum parameter values for the entire soil texture triangle (based on the Wösten pedotransfer function). Three organic matter contents were taken into consideration, yielding three triangles. Clay percentages above 70 % were excluded to avoid extreme values for parameter b especially. The range for θ crit was set to 10-90 % of the saturated soil moisture content. The saturated soil moisture content was computed as a function of the dry soil bulk density (ρ bd,dry ): θ sat = 1−ρ bd,dry /2.65 (Brady and Weil, 1996). Soil bulk density values obtained from the COSMOS network were used. This yielded different θ crit ranges in terms of soil moisture content (m 3 m −3 ) for different sites. The residual soil moisture content (defined implicitly in JULES) was set to zero because the Wösten pedotransfer function does not consider it.
JULES' remaining two ancillary parameters, the soil heat conductivity and soil heat capacity, were for each site computed as a function of soil properties (HWSD) with De Vries' (1963) method. The bare soil albedo was set constant at 0.38 (-) for all sites. Plant functional type parameters were set to JULES defaults except for the e-folding rooting depth (depth above which 86 % of plant roots are present) and the canopy height, at sites where more specific information was available from AmeriFlux/COSMOS site information or from site-specific literature.
We calibrated JULES using the BORG Multi-Objective Algorithm (BORG-MOEA or BORG; Hadka and Reed, 2013). This calibration tool was designed for multi-objective problems but also works for single-objective calibration. BORG employs multiple optimization algorithms simultaneously to obtain convergence while also keeping the searched parameter space wide. The algorithm measures progress with the epsilon-progress technique, which uses the objective function space divided into boxes with sides of size epsilon. Epsilon is a user-defined value for each objective function (we used epsilon values of 0.001 m 3 m −3 for PS and 1 cph for CRNS, Kollat et al., 2012). Only if a new solution resides inside a box with a better objective function value does BORG consider it progress. If no progress was obtained after 200 runs, the algorithm had stagnated. In this case the BORG algorithm triggers a restart, which consists of (among other techniques) changing population size to maintain a diverse population and to escape local optima. We used a maximum number of 3000 runs and an initial population size of 100 runs.
As a validation metric, we chose the RMSE between the observed and simulated latent heat flux (LE). Because data quality issues with eddy-covariance data are often observed during nighttime (Goulden et al., 1996(Goulden et al., , 2006(Goulden et al., , 2012Aubinet et al., 2010), we computed these metrics over daytime hours only. We defined daytime as downward shortwave radiation > 20 Wm −2 . To avoid extreme RMSE-EF values, we used hours with both observed and simulated LE values ≥ 1 Wm −2 only. Otherwise a few hours with small LE or H values would have dominated the RMSE values, while this would probably have been due to forcing or eddy-covariance data inconsistencies and would not relate to soil moisture temporal variability.
It is known that single-objective calibration is often insufficient to constrain parameters to simulate different states and fluxes well (Gupta et al., 1999). In addition, Vereecken et al. (2008) argued that calibrating soil hydraulic parameters against soil moisture only does not guarantee better surface energy fluxes. To find out whether it was actually feasible to expect better surface energy flux simulation when soil moisture was improved, we performed calibrations where we op- timized the model for two objectives simultaneously. We employed the BORG algorithm to simultaneously optimize the RMSE of (daytime) latent heat flux and the RMSE of allday soil moisture (using PS soil moisture and CRNS neutron counts separately). We analysed the trade-off between the two-objective functions. We computed the compromise solution for each two-objective calibration. Plotted within the normalized two-objective solution space, the compromise solution is the model run which has the smallest distance to the origin. This means no other solution can be obtained that yields a better approximation for one objective function without deteriorating the other.
3 Results and discussion

Soil moisture data analyses
In Fig. 2 comparison between PS and CRNS soil moisture time series shows that the seasonal trends of the two soil moisture observation products were similar. The two soil moisture products however also differed from each other in distinct ways at different sites. PS soil moisture observations were systematically higher than CRNS soil moisture observations at 8 of 12 sites. At three sites (DC, SO, CS) PS soil moisture dried down quicker than CRNS soil moisture, while at ME the opposite behaviour was observed. At KE, MM, TR, and MO PS showed a higher seasonality signal (up to 50 % higher) than CRNS. Peaks in PS soil moisture were at three sites (UM, KE, TR) up to twice as high as in CRNS soil moisture. In addition, the CRNS data appear noisier than the PS data, which is an effect of inherent randomness in neutron radiation reaching the CRNS sensor element . This effect was more pronounced for lower neutron intensity. The differences seen in Fig. 2 are also summarized in Fig. 3, which shows a gradual site-to-site increase in RMSD between PS and CRNS soil moisture (RMSD-SM obs ). Overall, bias contributed 50 % or more of the total error at 7 out of 12 sites.
Additional analyses indicated that differences between the two soil moisture estimates could not be clearly related to any differences in site physical characteristics other than the mean soil wetness. Dominant vegetation type seemed not to have an effect on the similarity between the two soil moisture data products: both forested and grass sites included those with relatively small RMSD-SM obs and those with relatively high RMSD-SM obs . Only bare/shrub-covered sites (DC, SR, CS) were all below the sites' average RMSD- SM obs of 0.05 m 3 m −3 . These four sites were however also relatively dry. Soil type and soil bulk density were also investigated for correlations with RMSD-SM obs , but no trends were discovered. Larger differences between PS and CRNS soil moisture could be expected at sites with more heterogeneous soil or vegetation. Static satellite photos of the sites from the COSMOS website did not indicate systematically more heterogeneous conditions at the sites where PS and CRNS soil moisture differed more. Site info (e.g. topography, presence of rocks) from COSMOS and AmeriFlux also did not clearly show more horizontally heterogeneous soil properties for those sites. The fact that the soil moisture time series of PS and CRNS differed from each other in various ways could be related to a number of issues. First, PS sensor types and numbers of sensors differed between the sites (Table A1 of Appendix A). Secondly, the exact installation locations of the PS sensors may in certain cases have been for instance next to a macropore, or near roots, while at other sites they were coincidently installed in a homogeneous soil patch.
The presence of neutron mitigating factors other than soil moisture (e.g. atmospheric pressure, sensor type, biomass, intercepted water, and water in the litter layer) also affects the observed CRNS neutron count. Because different hydrogen pools are more present at certain sites than at others, the uncertainty in neutron count observations varies between sites. The results did not however show effects of land cover and soil properties on the similarity between the two soil mois- ture products. Another factor is that the quality of the calibration of COSMIC could possibly be different for different sites. Finally, at multiple sites, the PS and CRNS soil moisture time series were similar, as shown with RMSD values. This could be expected at rather homogeneous sites. Moreover, Köhli et al. (2015) suggested the CRNS footprint to be around 150-200 m instead of 300 m as reported by Desilets and Zreda (2013). In that case the differences between the two soil moisture observation techniques could be smaller than initially thought.
We derived vertically constant CRNS soil moisture values from observed neutron counts with COSMIC. This method contains inherent uncertainty because in reality soil moisture is often not vertically homogeneous. The outcomes of the calibration (against PS and CRNS) and validation provide insight into the effects of the differences between the two soil moisture products on JULES' surface energy flux partitioning and latent heat flux simulation.

Single-objective calibration against soil moisture observations
The degree to which the objective function (RMSE-SM; Fig. 4) values decreased differed between sites and the two calibration strategies (PS or CRNS), with decreases of 21 (AR-PS) to 93 % (UM-CRNS). While the errors of the default runs consisted mostly of systematic bias, after calibration the difference in dynamics was the largest source of uncertainty, and in 16 out of 24 cases this contribution actually increased in absolute terms. The increase in difference in dynamics was due to the selected objective function (RMSE), which reduces the mean error between modelled and observed data. Previous research (e.g. Teuling et al., 2009) has shown that calibrating soil parameters has a large effect on simulated absolute soil moisture values but substantially less on soil moisture seasonality and dynamics. Our finding supports this.
The RMSE values reduced relatively more for the CRNS calibration (70 % on average over the 12 sites) than for the PS calibration (55 % on average). The calibration method could possibly explain this. CRNS calibration was against observed neutron counts, while PS calibration was against observed soil moisture contents. Because neutron counts have an inverse relationship with soil moisture content, the PS calibration was possibly governed by avoiding larger errors occurring during a few brief soil moisture peaks. While focussing on getting the fitting for those peaks right, the PS calibration neglected the smaller errors during dry periods. This would then result in a relatively smaller decrease in RMSE values than for the CRNS calibrations because those were fitted with heavier weights to the drier periods. Figure 4 also shows that the relative improvement was not systematically lower or higher for sites with higher similarity between the two observed soil moisture time series. Actually the largest improvement was for the CRNS calibration at UM. Therefore, it appears that the quality of the default runs was hence predominated by the quality of the chosen default parameter values. Figure 5 shows soil moisture time series for four selected sites: UM was chosen because PS and CRNS soil moisture were most similar there, SR was a site with moderate difference, and at WR and MO PS and CRNS soil moisture were most different. Simulated soil moisture improved especially at UM and WR, where the default runs overestimated soil moisture contents. Simulated soil moisture dynamics became more similar to both PS and CRNS observed soil moisture dynamics, even though observed soil moisture dynamics differed from each other substantially at sites WR and MO. The relatively high noise in the CRNS soil moisture time series is due to the relatively low neutron count at this site, and possibly due to temporal variations in other hydrogen pools like water intercepted in the forest canopy and in the litter layer.

Validation of the single-objective calibrations against eddy-covariance observations
While calibration errors decreased for soil moisture, latent heat flux estimation improved for 14 out of 24 calibrations (Fig. 6). This means that an improvement in simulated soil moisture did not necessarily lead to better estimation of surface energy fluxes. Calibration against PS soil moisture improved RMSE-LE at six sites (UM, KE, DC, SR, SO, and ME), while calibration against CRNS neutron counts improved RMSE-LE at eight sites (KE, DC, SR, ME, SO, AR, MM, and WR). At five sites RMSE-LE improved after calibration against both PS soil moisture and CRNS neutron counts. Calibration yielded lower RMSE-LE after calibration against CRNS neutron counts than after calibration against PS soil moisture at all but three sites (UM, KE, and WR). Figure 6 also shows that RMSE-SM decreased substantially less (i.e. > 20 % difference) after calibration against PS soil moisture than after calibration against CRNS neutron counts (i.e. > 20 % difference between both strategies on the horizontal axis of Fig. 6, which occurred at six sites). At five of these six sites the relative change in surface energy flux performance was also smaller (sites MO, AR, TR, ME, and SR). This indicates that further improvement in soil moisture simulation after calibration against PS data could have yielded better surface energy fluxes. The differences between the two calibration strategies can also be seen from the different locations of the centres-of-mass of the two point clouds. The cloud of the CRNS strategy is clearly located more to the lower left corner in Fig. 6. In Fig. 6, 10 % change in latent heat flux estimation was chosen to distinguish substantial change from nonsubstantial change, derived from the approximate error in eddy-covariance data (Sellers and Hall, 1992;Finkelstein and Sims, 2001). Improvement in latent heat flux was actually substantial in four cases for PS calibration and five cases for CRNS calibration. Using this threshold also revealed that RMSE-LE did not change substantially in six cases for cal-ibrations with a more than 60 % change in RMSE-SM. This again shows that a change in simulated soil moisture did not necessarily mean a substantial change in surface energy flux simulation. Analysis of the RMSE of evaporative fraction (EF = LE/(LE + H)), which shows the ability to simulate surface energy partitioning, yielded similar overall results to our analysis of the RMSE of latent heat flux (Fig. S2.1).
One factor causing some of these limited improvements was that, when mean simulated root zone soil moisture (weighted with root density) increased after calibration, the values of the wilting point and critical point soil moisture parameters moved along (data not shown), yielding similar soil moisture stress. This happened for both calibration strategies at site KE, and for the calibration against CRNS neutron counts at sites MO and TR. This could relate to the limited value of simulated absolute soil moisture for surface energy flux estimation in land surface models (Dirmeyer al., 2000;Koster et al., 2009). However, we also found the distance between wilting point values and critical point values to decrease after calibration. This occurred with a simultaneous decrease in the standard deviation of the simulated soil moisture. The self-adjusting behaviour of the wilting point and critical point parameters was also indicated by parameter sensitivity analysis (Appendix B), which showed soil moisture was substantially more sensitive to a change in critical point value than latent heat flux.
Another issue, which occurred for instance for the PS calibration at SO and for the CRNS calibrations at site AR, was that while surface energy flux estimation improved for a certain period, it deteriorated for another period (data not shown). A third cause of limited improvement in surface energy flux estimation occurred for example at site ME (data not shown). Soil moisture stress was relatively limited (beta mostly above 0.6) and during periods of soil moisture stress, the latent heat flux was dominated by different stress factors.
To see whether validation results would be different if only those periods during which soil moisture stress occurred were evaluated, we also analysed the performance over these periods (data not shown). During these periods plant water uptake limitation factor β was below 1 (data not shown). Only at site SO did we see somewhat better performance during these periods compared to the original validation period, after both PS and CRNS calibration.
We explored trends between relative improvement in surface energy flux estimation and soil wetness, precipitation, vegetation type, vegetation height, and soil characteristics. No clear trends were discovered. The only feature that could be distinguished was that for the two sites with a bare soil tile (DC and SR), PS and CRNS calibration improved EF and LE. Rooting depth also did not explain relative improvement in surface energy flux estimation. Finally, larger differences between the two observed soil moisture time series did generally yield more different simulated surface energy flux time series, with a clear exception for site UM, where the two cal-ibration approaches yielded a more than 20 % difference in change in RMSE-LE.
In Fig. 7 the monthly mean diurnal latent heat cycles of four sites (UM, SR, WR, and MO) are shown (nighttime data included but not used for calibration). Both calibration strategies yielded overestimation in March and April and underestimation from June to August at site UM. A month long gap in longwave and shortwave downward model forcing data occurred in April 2012. This might have affected the results for this site. At SR, both PS and CRNS calibration improved latent heat flux during periods of low evapotranspiration, while during the other periods only the CRNS calibrations yielded better results. At WR calibration against PS soil moisture yielded overestimation of latent heat flux, while CRNS calibration yielded overly low latent heat flux. The PS calibration at MO yielded latent heat flux underestimation, whereas CRNS calibration did not change LE substantially.
In summary, the single-objective calibrations against CRNS neutron counts yielded larger improvements in simulated surface energy fluxes than single-objective calibrations against PS soil moisture (Fig. 6). Improvements in surface energy flux estimation were however substantial for four calibrations against PS soil moisture and five calibrations against CRNS neutron counts. Limited improvements in surface energy flux estimation after calibration could partly be attributed to the limited value of absolute soil moisture to estimate surface energy fluxes with land surface models. This seems reasonable because calibration mostly affected absolute soil moisture (Fig. 4). This result corresponds to earlier research that showed model soil moisture dynamics and seasonality have a larger effect on surface energy flux simulation than absolute soil moisture (e.g. Teuling et al., 2009;Dirmeyer et al., 1999). Previous research has also indicated that soil moisture alone is insufficient to estimate soil hydraulic parameters (Vereecken et al., 2008. Our findings support this for some sites. To better understand these implications, the two-objective simulations against soil moisture and latent heat flux simultaneously (discussed in the next section) provide further insight into these results.

Two-objective calibration against soil moisture and latent heat flux
The results from the two-objective calibrations suggest that latent heat flux estimation improvement with respect to the compromise solutions (%-change RMSE-LE, vertical axis in Fig. 8) was similar for both two-objective calibration strategies (Fig. 8). Only at three sites was the difference in improvement (%-change RMSE-LE, vertical axis in Fig. 8) between the two-objective calibration strategies more than 5 % (at sites SR, DC, and UM). Improvements were substantial (according to our 10 % threshold for improvement in RMSE of latent heat flux) for six calibrations with PS soil moisture and for eight calibrations with CRNS soil moisture.  Scatterplots of normalized RMSE of latent heat flux (LE) and normalized RMSE of soil moisture for all sites are shown in Figs. 9 and 10 for the PS and CRNS two-objective calibration strategies respectively. The RMSE values were normalized with respect to the default solutions, which therefore have normalized RMSEs of 1 (-). The black dots represent individual model runs and the default model run for each site is represented by a red cross. The single-objective calibration solutions are shown as blue triangles and the compromise so-lution of each two-objective calibration is shown as a green triangle. Figures 9 and 10 show that the substantial (i.e. more than 5 %) differences between the two two-objective calibration strategies in improvement of simulated latent heat flux observed for SR, DC, and UM, in Fig. 8, were less meaningful than seemed initially from analysis of Fig. 8. These findings are based on the shapes of the black point clouds in Figs. 9 and 10. We first look at the left edges of the black point clouds. When these edges are close to vertical, a small deterioration in RMSE-SM or RMSE-N (e.g. less than 0.05 (-)) would yield a large deterioration in normalized RMSE-LE (e.g. greater than 0.2 (-)). We observed this for three calibrations with PS soil moisture (SO, AR, and WR; indicated with a pink line for WR) and in all these three cases, a less than 0.05 (-) change in normalized RMSE-SM would have yielded worse simulated latent heat flux than with the default parameter set. Seven calibrations with CRNS neutron counts (UM, DC, SO, KE, SR, CS, and WR; indicated with a pink line for WR) showed a close to vertical edge. In four cases (UM, DC, KE, and CS) this would have yielded worse simulated latent heat flux than with the default parameter set. A negative slope for this side of the point cloud means an improvement in soil moisture estimation would mean a deterioration in latent heat flux estimation. We observed such a negative slope for four calibrations with PS soil moisture (CS, MM, TR, and MO; indicated with a continuous black line for MO) and for two calibrations with CRNS neutron counts.
Next, we look at the lower edges of the point clouds in Figs. 9 and 10. When this edge is horizontal, this means good latent heat flux estimation can be obtained for a wide range of soil moisture estimation performances (for instance site ME, indicated with pink lines in both figures). When this edge has a negative slope, this means the best latent heat flux estimation is obtained for worse soil moisture (for instance site TR, indicated with pink lines in both figures). We observed these two features for all two-objective calibrations. Singleobjective calibration against latent heat flux would however only have necessarily yielded worse soil moisture than the default parameter set for six calibrations with respect to PS soil moisture (sites UM, KE, SR, TR, AR, and WR) and two calibrations with respect to CRNS neutron counts (sites TR and AR). The implication of these results is that the quality of latent heat flux simulation did not depend strongly on the quality of soil moisture simulation.
In summary, the two-objective calibrations against soil moisture (or neutron counts) together with latent heat flux showed, compared to the single-objective calibrations, fewer substantial differences between calibration with PS soil moisture and calibration with CRNS neutron counts. These results indicate that the differences between both singleobjective calibration strategies, which showed an advantage for CRNS observations, were possibly not as substantial as they seemed at first. The spatio-temporal stability theory, which implies limited spatial variability in surface energy fluxes, could be one explanation for this (Vachaud et al.,Figure 10. Same as Fig. 9 but for the two-objective calibrations against CRNS neutron counts (representing soil moisture) and daytime hourly latent heat flux (LE) at each site.
1985; Teuling et al., 2006;Mittelbach and Seneviratne, 2012;Albertson and Montaldo, 2003). Another factor that has possibly played a role is that the spatial-scale advantage of the CRNS was masked out by the possibly lower quality of its measurements (see also Sect. 3.1). Different hydrogen pools than soil moisture affect the neutron measurements at various temporal resolutions (e.g. rainfall interception; Baroni and Oswald, 2015). However, PS sensors also have their limitations. Different (electromagnetic) PS sensors have different designs and properties, which affect the data quality (Robinson et al., 2008;Blonquist et al., 2005). Soil type and soilspecific calibration also affect the accuracy and precision of the PS data. For instance, the relationship between electrical permittivity and soil moisture content is strong for quartzrich soils, but less accurate for clay soils (e.g. Ishida et al., 2000;Robinson et al., 2008). An issue that could have had an effect on our results is the occurrence of gaps in the model forcing data. Gaps in observed meteorological data are often inevitable and must be filled for a land surface model to be run. Percentages of missing hours differed between 0 and 15 % at our sites. Gaps larger than 15 days filled with the moving window gap-filling procedure occurred mainly for downward shortwave and/or downward longwave radiation at sites UM, KE, SR, DC, TR, and CS. At site WR a data gap of 29 days in precipitation and air pressure occurred. Especially the gap in precipitation data can have negatively affected the model results at this site. Gaps larger than 30 days were filled with NLDAS-2 data at sites WR, CS, and SO. At site WR a single gap of 110 days in shortwave and longwave radiation was filled. At site SO a gap of 66 days in atmospheric pressure was filled with NLDAS-2 data. At site CS two gaps of 200 days in atmospheric pressure were filled and the entire time series of air temperature was filled with a linear relationship based on data from 4 preceding years. At site SR, gaps before applying the moving window procedure varied between 3 and 15 % of the time for precipitation and wind speed respectively and atmospheric humidity was missing for 35 % of the time. These gaps were mostly filled with a linear relationship with data from the Sahuarita site, located approximately 14 km to the north-west from the SR site, except for downward longwave radiation. Remaining gaps were filled with the moving window procedure. In this study we used PS sensor data in the upper 30 cm of the soil only, even though data from deeper sensors were publicly available at two sites (WR and MO). This choice can have provided a disadvantage to the PS data in our comparison, especially at sites with deeper roots and during soil moisture limiting conditions in the upper soil. Investigating the role of deeper roots was however beyond the scope of this study. Our goal was to compare the effects of the difference in horizontal footprint on latent heat flux simulation after using the two different measurement techniques to calibrate model parameters.

Conclusions
We investigated whether the spatial-scale mismatch between the surface energy flux data and soil moisture data could be reduced through the use of Cosmic-Ray Neutron Sensors (CRNSs). Five soil and evapotranspiration parameters of LSM JULES were calibrated against point-scale (PS) and CRNS soil moisture data separately, for 12 COS-MOS/AmeriFlux sites with different climate, land cover, and soil properties. Next, at each site, the improvement in latent heat flux for the two calibration solutions was assessed by comparing the fit with eddy-covariance data and a version of LSM JULES runs with default parameter values based on a widely used soil database. Before the calibrations we compared the observed soil moisture data from the two sensor types. These analyses showed that the differences between PS and CRNS soil moisture varied in nature at the investigated sites, but such differences were small. While at eight sites there were mainly systematic biases, at three other sites the seasonality was most different, and at one other site different time series dynamics were the main cause of differences between the two soil moisture observations. The single-objective calibration of JULES parameters against Cosmic-Ray Neutron Sensor neutron counts yielded better simulated latent heat flux than single-objective calibration against point-scale soil moisture. The analysis of multiobjective calibrations (against (1) PS soil moisture and latent heat flux and (2) CRNS neutron counts and latent heat flux) however revealed that differences between calibrations with these two soil moisture observation methods did overall not yield substantially different surface energy flux estimations.
These outcomes did not provide sufficient evidence to reject our null hypothesis "reduced scale mismatch does not lead to LSM flux estimates closer to eddy-covariance observations". The spatio-temporal stability theory in soil moisture, on which our hypothesis was based, can possibly explain the limited differences in surface energy flux estimation. This theory implies that spatial variability in surface energy fluxes is relatively limited within an eddy-covariance tower footprint. Therefore simulated surface energy fluxes after calibration with point-scale soil moisture data would not necessarily be better than simulated surface energy fluxes after calibrations with CRNS neutron count data. Calibrating soil parameters had mostly an effect on absolute soil moisture values rather than soil moisture dynamics. Soil moisture dynamics have a greater effect on surface energy flux simulation in land surface models than absolute soil moisture values do. Related to this we observed that after calibration the wilting point and critical point soil moisture parameter values adjusted themselves in a similar way as the root zone soil moisture did, yielding similar soil moisture control on transpiration despite changes in soil moisture values. In other cases we found calibration against soil moisture to improve surface energy fluxes during certain periods, but to deteriorate surface energy fluxes during other periods, yielding similar overall performance. Yet in other cases evapotranspiration was not limited by soil moisture stress. The potential scale advantage of the CRNS was possibly masked out by the possibly lower measurement quality of this sensor because other hydrogen pools than soil moisture affect the neutron count observations. Future use of CRNS soil moisture data could however benefit from improved knowledge of the effects of additional hydrogen pools (e.g. Baroni and Oswald, 2015) and of the sensor footprint (Köhli et al., 2015). In this study, our results are conditioned to a single land surface model (JULES). For additional understanding of the importance of both PS and CRNS measurements for surface energy flux simulation, this study can be extended to other models in the future.
J. Iwema et al.: Land surface model performance using cosmic-ray measurements Appendix A: Additional data and methods tables and figures

Appendix B: Parameter sensitivity analysis
We investigated whether a change in some parameters had a relatively small effect on soil moisture while at the same time having a large effect on latent heat flux. In such a case calibration could yield better soil moisture, but the parameter value might be inappropriate for latent heat flux. The inverse (influential on SM but not on LE) could also occur. We explored this by performing a sensitivity analysis with Morris' method (Morris, 1991) as implemented in the SAFE Toolbox (Pianosi et al., 2015) on the exact same parameter value ranges as used during the calibration. We computed the sensitivity indices (mean and standard deviations of the elementary effects) on the RMSE values (i.e. our objective functions) of simulated vs. observed PS soil moisture, simulated vs. observed CRNS neutron counts, and simulated vs. observed latent heat flux (daytime only). The results (shown for four sites in Fig. B1 of Appendix B) were consistent across most sites: all three objective functions were most sensitive to changes in parameter b and least sensitive to the wilting point multiplier. Finch and Haria (2006) also found JULES parameter b to be most influential on soil moisture and latent heat, at a UK chalk site. The critical point soil moisture content was influential with respect to soil moisture/neutron counts but had at all sites less effect on latent heat flux. This could relate to the selfadjusting behaviour of the wilting point and critical point soil moisture described in Sect. 3.3. The lack of effect from the wilting point soil moisture content can probably be attributed to the use of the multiplier, despite being a common approach (Prihodko et al., 2008;; a certain multiplier value could be good in combination with a certain value for the critical point, but not for a different critical point value. We have evaluated the differences between default and calibrated parameters and their degree of physical realism in Supplement 3. If parameter values obtained after calibration were not physically feasible (e.g. representing a sandy soil while there was a clay soil), then, if model structure is assumed to represent biophysical processes sufficiently well, that could yield undesirable results. The parameter value ranges were especially wide for the air entry pressure (sath) and the saturation hydraulic conductivity (satcon). However, physically realistic parameters also often provide unrealistic results (Gupta, 1998).