The WACMOS-ET project – Part 1: Tower-scale evaluation of four remote-sensing-based evapotranspiration algorithms

Abstract. The WAter Cycle Multi-mission Observation Strategy – EvapoTranspiration (WACMOS-ET) project has compiled a forcing data set covering the period 2005–2007 that aims to maximize the exploitation of European Earth Observations data sets for evapotranspiration (ET) estimation. The data set was used to run four established ET algorithms: the Priestley–Taylor Jet Propulsion Laboratory model (PT-JPL), the Penman–Monteith algorithm from the MODerate resolution Imaging Spectroradiometer (MODIS) evaporation product (PM-MOD), the Surface Energy Balance System (SEBS) and the Global Land Evaporation Amsterdam Model (GLEAM). In addition, in situ meteorological data from 24 FLUXNET towers were used to force the models, with results from both forcing sets compared to tower-based flux observations. Model performance was assessed on several timescales using both sub-daily and daily forcings. The PT-JPL model and GLEAM provide the best performance for both satellite- and tower-based forcing as well as for the considered temporal resolutions. Simulations using the PM-MOD were mostly underestimated, while the SEBS performance was characterized by a systematic overestimation. In general, all four algorithms produce the best results in wet and moderately wet climate regimes. In dry regimes, the correlation and the absolute agreement with the reference tower ET observations were consistently lower. While ET derived with in situ forcing data agrees best with the tower measurements (R2  =  0.67), the agreement of the satellite-based ET estimates is only marginally lower (R2  =  0.58). Results also show similar model performance at daily and sub-daily (3-hourly) resolutions. Overall, our validation experiments against in situ measurements indicate that there is no single best-performing algorithm across all biome and forcing types. An extension of the evaluation to a larger selection of 85 towers (model inputs resampled to a common grid to facilitate global estimates) confirmed the original findings.

and water cycles. Turbulent fluxes of sensible and latent heat determine the development of the planetary boundary layer and thus govern the interactions between the Earth surface and the atmosphere. Evapotranspiration (ET) represents the time-integrated flux of latent heat and is an essential component of the energy and water cycle, playing a key role in meteorology and climate as well as agriculture (see, e.g., Ershadi et al., 2014).
Historically, there has been a lack of reliable estimates of turbulent fluxes, since the partitioning of the available energy at the Earth's surface into these fluxes is complex and characterized by large spatial and temporal variability. Also, these components of the energy balance cannot be monitored directly on a global scale by remote sensing techniques. Thus, efforts to produce satellite-based estimates typically involve combining multi-sensor data sets with predictive formulations of varying complexity, ranging from relatively simple empirical formulations to more complex modeling approaches (see, e.g., Courault et al. (2005), Kalma et al. (2008) and Wang and Dickinson (2012) for comprehensive reviews). In recent years, such efforts have generated global ET products (Mu et al., 2007;Fisher et al., 2008;Jung et al., 2010;Zhang et al., 2010;Vinukollu et al., 2011;Miralles et al., 2011a) that have typically been evaluated by comparing them individually to in situ data or by inter-comparing them with other existing global heat flux estimates. For example, within the Evaluation and inter-comparison of existing land evapotranspiration products (Landflux-Eval initiative) initiative of the Global Energy and Water cycle Exchanges (GEWEX) Data and Assessments Panels (GDAP), Jiménez et al. (2011) investigated 3 years (1993)(1994)(1995) of global sensible and latent fluxes from a selection of 12 products, including satellite-based estimates, atmospheric reanalyses and offline land surface model simulations, while Mueller et al. (2011) examined a total of 30 observation-based ET estimates from similar sources over the longer period of 1989-1995, while also providing a comparison with global climate model simulations contributing to the Intergovernmental Panel on Climate Change (IPCC) fourth assessment report. More recently, Mueller et al. (2013) extended the previous LandFlux-EVAL evaluations and presented two monthly global ET synthesis products, merged from individual data sets spanning the periods 1989-1995 and 1989-2005. Based on the Mueller et al. (2013) methodology, Mao et al. (2015) synthesized a global ET time series for the period 1982-2010 based on a set of diagnostic ET products (including data sets produced with PT-JPL (Priestley-Taylor Jet Propulsion Laboratory model) and GLEAM (Global Land Evaporation Amsterdam Model)) to investigate the role of anthropogenic and climatic controls on ET trends. For the period 1982-2013 Zhang et al. (2015) produced a satellite-based global ET data set using remote sensing data and daily surface meteorology records for the investigation of multidecadal changes in ET, which was validated against precipitation and discharge records.
The GEWEX-LandFlux initiative is currently working towards producing an observation-based data set of heat fluxes that can be used together with related GDAP products to enable a joint analysis of the water and energy cycles (Jimenez et al., 2012). To contribute towards that goal, the European Space Agency (ESA) has conducted the Water Cycle Multi-mission Observation Strategy EvapoTranspiration project (WACMOS-ET), aimed at the identification of appropriate algorithms to develop regional and global ET products. WACMOS-ET efforts, which aimed at maximizing the use of European Earth Observation assets, have also included the compilation of a multi-sensor data set to run the ET methodologies for a 3-year period (2005)(2006)(2007).
In WACMOS-ET, the methodologies by Su (2002) (Surface Energy Balance System, hereafter referred to as SEBS), Mu et al. (2011) (Penman-Monteith algorithm from the MODerate resolution Imaging Spectroradiometer (MODIS) evaporation product, PM-MOD), Fisher et al. (2008) (Priestley-Taylor Jet Propulsion Laboratory model, PT-JPL) and Miralles et al. (2011a, b) (Global Land Evaporation Amsterdam Model, GLEAM) were selected to produce ET estimates on different temporal and spatial scales. The same algorithms have also been examined at a selection of different tower sites in a recent paper by McCabe et al. (2015) in preparation for the GEWEX-LandFlux product. In McCabe et al. (2015) the algorithms are run at 3-hourly time steps with both point-scale inputs (from tower meteorological observations) and gridded inputs (from the GEWEX-LandFlux global forcing data set) over a longer time period. Here, the ET algorithms are run with the WACMOS-ET forcings (see Sect. 2.2) and the analyses of model performance are extended to evaluate different timescales (3-hourly and daily) and time integrations (nighttime, daytime and full day). In a companion paper, Miralles et al. (2016) present the second part of the WACMOS-ET study, in which PT-JPL, GLEAM and PM-MOD are evaluated on the global scale.
The paper is structured as follows. First, the WACMOS-ET input data set is described in detail, together with the tower flux data used for driving and evaluating the ET models. This is followed by an evaluation of ET model performance on the tower scale using the tower eddycovariance (EC) fluxes as the reference data set. The model evaluation is first performed over a selection of 24 stations covering nine biomes in three continents (Europe, North America and Australia), in which models are run based on in situ and remote sensing forcing. Then the validation is extended to embrace a larger sample of 85 towers, in which models are driven only by satellite data resampled to a common grid. Finally, the main findings of our model evaluation on the pixel scale are summarized. Here the ET methodologies comprising the WACMOS-ET effort together with the input data sets that have been compiled to run the models and evaluate the ET estimates are presented. A summary of the data sets and the model-specific forcing requirements is provided in Table 1.

ET models
The four algorithms selected here estimate the fraction of the available energy at the surface used by the soil and canopy evaporation processes. Therefore, the available energy (i.e., the difference between the surface net radiation and the ground heat flux) is a key input for all algorithms. However, this evaporative fraction is parameterized differently by each model. SEBS is an energy balance model (Su, 2002) based on a detailed parameterization of the sensible heat flux at the surface, where ET is estimated as the residual of the surface energy balance once the sensible heat flux is calculated. Therefore, key inputs are the surface temperature and the temperature gradient between the surface and the air, and a key component of the model is the aerodynamic resistance to sensible heat transfer. PM-MOD (Mu et al., 2011) derives ET directly based on the Penman-Monteith equation (Monteith, 1965), which relates the latent heat flux to the vapor pressure deficit between the surface and the overlying air and requires a resistance parameter to characterize the canopy transpiration. PT-JPL (Fisher et al., 2008) and GLEAM (Miralles et al., 2011a) are based on first determining the potential evaporation by applying the Priestley and Taylor equation (Priestley and Taylor, 1972), followed by reducing the potential evaporation to actual evaporation with a number of evaporative stress factors. The derivation of these stress factors is different between both models. PT-JPL requires the vapor pressure deficit, relative humidity and visible-infrared-related vegetation indexes, while GLEAM combines microwave data of vegetation optical depth and soil moisture. A more detailed description of the input requirements for each model can be found in Table 1, while a more comprehensive description of each individual model is given in the following sections.

SEBS
SEBS (Su, 2002) is a one-source energy balance algorithm that is arguably one of the most widely used energy balance approaches to derive turbulent fluxes. The SEBS model calculates the sensible heat flux (H ) based on the Monin and Obukhov theory (Monin and Obukhov, 1954): where u is the wind speed, u * is the friction velocity, k is the von Kármán constant, z is the height above the surface, d 0 is the zero plane displacement height, z 0m and z 0h are the roughness heights for momentum and heat transfer, and m and h are the stability correction functions for momentum and sensible heat transfer, respectively. L refers to the Obukhov length, ρ is the air density, θ 0 is potential land surface temperature, θ a is the potential air temperature at height z, g is the gravity acceleration and θ v is the potential virtual air temperature at level z. When the suitable data are available, the only unknowns are H , u * and L. This allows the calculation of H and the further estimation of ET based on closing the energy balance at the surface, i.e., ET is estimated as the difference between net radiation (R n ) and the sum of the calculated H and ground flux (G).
Additionally, in order to constrain H estimates, two limiting cases are considered that set upper and lower bounds for the evaporative fraction. Under very dry conditions, ET becomes 0 and H is at its maximum, set by R n − G. Under wet conditions, ET occurs at potential rates and therefore H is at its minimum. In this wet case, H is calculated via reverse application of the Penman-Monteith equation (see Sect. 2.1.2) assuming that the surface resistance is zero.
SEBS has been extensively validated against tower measurements and has proved to estimate realistic evaporation rates on a variety of scales, ranging from local to regional (Jia et al., 2003;Su et al., 2005;McCabe and Wood, 2006). As an example, Chen et al. (2015) recently reported an average correlation of ∼ 0.8 and root mean square difference (RMSD) of 0.7 mm day −1 against eddy-covariance measurements in a validation of monthly SEBS ET aggregates over China.
As a one-source energy model, SEBS does not separate the contributing components of ET (i.e., transpiration, interception loss, bare-soil evaporation), unlike the other models studied in WACMOS-ET, which provide independent estimates of these vapor sources. Although not examined here, we note that two-source energy balance models can also treat the soil and vegetation components separately (e.g., Kustas and Norman, 2000;Anderson et al., 2007) but have had limited application on the global scale.

PM-MOD
PM-MOD (Mu et al., 2011) is based on the Penman-Monteith equation. It estimates ET as the sum of interception loss (I ), transpiration (ET t ) and evaporation from the soil (ET s ). The interception loss is modeled as where is the slope of the curve relating saturated water vapor pressure to temperature, VPD is the vapor pressure deficit, γ is the psychrometric constant, f c is the canopy fraction, f wet is the wet cover fraction based on the derivation by Fisher et al. (2008) (see Eq. (9) in Sect. 2.1.3), and r wc a and r wc s are aerodynamic and surface resistances against evaporation of intercepted water (calculated as a function of air temperature and leaf area index, LAI).
Canopy transpiration is estimated as where r t a and r t s are the aerodynamic and surface resistances against transpiration. r t a is determined in a similar way to r wc a , and r t s is a function of stomatal conductance, biome-constant values of cuticular conductance and canopy boundary-layer conductance. The values of stomatal conductance are a function of air temperature, VPD and LAI.
Evaporation from the soil surface is the sum of evaporation from wet soil and evaporation from saturated soil, which are both calculated separately based on Eq. (7) with specific values of aerodynamic and surface resistances for bare soils and a soil moisture constraint (f sm ) depending on relative humidity (taken from Fisher et al. (2008), see Sect. 2.1.3).
The Mu et al. (2011) daily ET estimates have been previously validated against EC measurements from 46 FLUXNET towers in North America, reporting for the daily estimates an average RMSD of ∼ 0.9 mm day −1 and a ∼ 0.6 average correlation coefficient.

PT-JPL
PT-JPL (Fisher et al., 2008) is based upon the Priestley and Taylor equation. As in PM-MOD, ET is estimated as the sum of interception loss I , transpiration ET t and evaporation from the soil ET s . The driving equations in the model are where α is known as the Priestley and Taylor (PT) coefficient and is considered here as a constant value (1.26) (Priestley and Taylor, 1972) that aims to summarize the atmospheric term in the Penman-Monteith equation (Eq. 5), λ is the latent heat of vaporization and f wet , f g , f M , f sm and f T are ecophysiological constraint functions with values between 0 and 1 referred to as f functions. The f functions are given by where f wet is the relative surface wetness, f g is green canopy fraction, f APAR (f IPAR) is the fraction of absorbed (intercepted) photosynthetically active radiation, f M is a plant moisture constraint, f APAR max is the maximum of f APAR, f sm is a soil moisture constraint, f T is a plant temperature constraint and T opt is the optimum plant growth temperature, estimated as the air temperature at the time of peak canopy activity when the highest f APAR and minimum VPD occur. Note that as the input data set does not include f IPAR; f IPAR is derived from the rescaled project LAI by inverting the model original relationships between LAI and f IPAR. Using this methodology, monthly estimates of ET were tested against EC measurements from 16 FLUXNET towers worldwide (Baldocchi et al., 2001) with a reported average RMSD of ∼ 0.4 mm day −1 and a ∼ 0.9 average correlation coefficient (Fisher et al., 2008). Note that unlike the above statistics reported for SEBS and PM-MOD, these numbers come from the model run with the tower meteorology, instead of global forcings.

GLEAM
GLEAM (Miralles et al., 2011a, b) calculates ET via the PT equation, a soil moisture stress computation and a Gash analytical model of rainfall interception loss (Gash, 1979). In the absence of snow, evaporation from land is calculated as where ET tc is transpiration from tall canopy, ET sc is transpiration from short vegetation, ET s is soil evaporation and I is tall canopy interception loss. β is a constant used to account for the times at which vegetation is wet; thus, transpiration occurs at lower rates (β = 0.93) (Gash and Stewart, 1977). The first three terms in Eq. (14) are derived using the Priestley and Taylor equation, so ET becomes where the subscripts "tc", "sc" and "s" correspond to tall vegetation, short vegetation and bare soil (respectively) and the fraction of each of these three cover types per pixel is represented by f . Different cover types have different values of α and parameterizations of G; additionally, R n is distributed within the cover fractions using average albedo ratios from the literature. S represents the evaporative stress due to soil moisture deficit and vegetation phenology. Soil moisture deficit is estimated using a multilayer running water balance to describe the infiltration of observed precipitation through the vertical soil profile. Microwave soil moisture observations are assimilated into the soil profile . In vegetated land covers, phenology effects on ET are based on microwave observations of vegetation optical depth, used as a proxy of vegetation water content. I is independently derived using a Gash analytical model (Gash, 1979), in which a running water balance for canopies and trunks is driven by observations of precipitation. The derivation of the parameters and the global implementation and validation of this I model are described in Miralles et al. (2010). For regions covered by ice and snow, sublimation is calculated based on a PT equation parameterized for ice and supercooled waters (Murphy and Koop, 2005).
The ET estimates from GLEAM have been validated against eddy covariance towers worldwide; reported average correlations are 0.83 and 0.90 for daily and monthly estimates, respectively, with an average RMSD of ∼ 0.3 mm day −1 , based on a sample of 43 towers (Miralles et al., 2011a), and correlations of 0.71-0.75 and 0.81-0.86 for daily and monthly estimates, respectively, based on a sample of 163 towers and different satellite products as forcing .

Surface meteorology
The European Centre for Medium-range Weather Forecasts (ECMWF) Re-Analysis-Interim (ERA-Interim) (Dee et al., 2011) was selected to provide the near-surface meteorology every 3 h at a spatial resolution of ∼ 75 km. The use of reanalysis data is necessary as satellite observations are generally unable to retrieve the surface variables needed, such as temperature, humidity and wind speed, with sufficient accuracy or at a suitable sub-daily temporal resolution. Although products of near-surface air temperature and humidity derived from satellite sounders exist (Ferguson and Wood, 2010), atmospheric reanalyses have the advantage of providing regular sub-daily estimates for all weather conditions. ERA-Interim is also used in the derivation of the land surface temperature product (see Sect. 2.2.2), to ensure interproduct consistency between air and surface temperatures. In terms of accuracy, ERA-Interim data have been evaluated through comparison with other reanalyses and weather stations over specific areas, showing a good general performance (e.g., Mooney et al., 2011;Szczypta et al., 2011).

Land surface temperature
Land surface temperature (LST) estimates have been internally generated by the project from level 1 radiances from the Advanced Along-Track Scanning Radiometer (AATSR) onboard ESA's Environmental Satellite (EnviSat) polarorbiting satellite, from Multi-functional Transport Satellites (MT-SAT) 2 (over Australia), Meteosat Second Generation (MSG) 2 and Geostationary Operational Environmental Satellite (GOES) 12. The data sets are provided over a sinusoidal grid with 1 km resolution for AATSR at the two satellite overpasses per day (∼ 10:00 LT descending node) and 5 km for the remaining sensors (1-hourly estimates for MSG and MT-SAT, 3-hourly for GOES). Ancillary atmospheric information for the inversion of the L1 radiances comes from ERA-Interim. Estimates of surface emissivity are taken from the Global Infrared Land Surface Emissivity UW-Madison Baseline Fit Emissivity Database developed by Seemann et al. (2008).

Surface radiation
The National Aeronautics and Space Administration (NASA)-GEWEX Surface Radiation Budget (SRB) satellite product version 3.1 (Stackhouse et al., 2004) is used to provide the surface net radiation input to the ET models. The SRB product is used by a large number of global ET algorithms to characterize the radiation at the surface, given its relatively long data record and sub-daily temporal resolution. SRB data sets include global 3-hourly averages of surface and top-of-atmosphere longwave and shortwave radiative parameters on a ∼ 100 km grid.

LAI and f APAR
To characterize the vegetation state using visible and nearinfrared wavebands, estimates of LAI and f APAR have been derived by applying the Joint Research Centre (JRC) two-stream inversion package (hereafter TIP) (Pinty et al., 2007(Pinty et al., , 2011a to the ESA GlobAlbedo bihemispherical reflectances. Here, LAI is defined as the one-sided leaf area per unit ground area and f APAR as the fraction of absorbed photosynthetically active radiation in the 400-700 nm region. The application of the TIP LAI and f APAR with our ET models required some LAI and f APAR calibration. The TIP LAI is a one-dimensional (1-D) equivalent LAI for solving the radiative transfer in a three-dimensional (3-D) medium, and it is consistent with the fluxes from which it is derived. It is not consistent with LAI derived using a 3-D radiative scheme that allows some form of horizontal canopy clumping (e.g., the MODIS MOD15A2 LAI product). In practical terms, this means that if an ET model was constructed to use a MODIS-like LAI and f APAR, a straight use of the project LAI and f APAR will result in the ET model producing lower values than expected for those biomes where horizontal clumping is significant (e.g., for forests). While the ET dynamics may not be highly affected (there is a high degree of correlation between different LAI and f APAR estimates), the absolute values would be. As the SEBS, PM-MOD and PT-JPL models have typically been used with the MODIS vegetation product, a rescaling of our TIP-derived LAI and f APAR products against the MODIS product has been undertaken. For running the models on the tower scale, a local rescaling is conducted by a linear regression between the MOD15A2 and the TIP values co-registered at each tower. For global model simulations, individual rescaling per biome or climate classification is conducted. For PT-JPL, given the model internal relationships between these variables and the vegetation indexes used as model inputs (see Table 1 in Fisher et al., 2008), it can be discussed whether the original TIP LAI and f APAR or the rescaled LAI and f APAR are the most appropriate to be used as model inputs. For simplicity we will apply also the rescaled LAI and f APAR, but this choice will be further evaluated in future applications of the model with the TIP LAI and f APAR inputs. Figure 1 illustrates an example of both products at two towers. The station Quebec -Eastern Boreal, Mature Black Spruce (CA-Qfo) is located in an evergreen needleleaf forest and shows that the MOD15A2 and WACMOS-ET LAI and f APAR absolute values differ considerably. This is expected, as allowing some form of horizontal clumping (MODIS 3-D radiative transfer scheme) or not (TIP 1-D) can result in large differences in the estimated LAI and f APAR in forests. It can be seen that the local calibration of the MODIS-like product retains the dynamics of the WACMOS-ET product, while adding absolute values close to the MODIS product. The station Brookings (US-Bkg) is situated in a cropland area, where the effects of clumping are much less severe, and the different LAI and f APAR values are much closer.

Vegetation height
Vegetation height on the global scale is required by SEBS. For shrubland and forest biomes the product developed by Simard et al. (2011) was used as static canopy height cover. For grassland and cropland biomes, where the temporal dynamics of canopy height can be more considerable, we approximated canopy height with the method by Chen et al. (2015), with the minimum and maximum canopy height obtained from the static vegetation table of the North American Land Data Assimilation System (NLDAS).

Soil moisture and vegetation optical depth
A soil moisture product combining observations from active and passive microwave sensors has been developed as part of the ESA Climate Change Initiative (CCI) and is adopted here to provide the surface soil moisture data that are assimilated into GLEAM. Details on the product algorithm and evaluation can be found in Liu et al. (2011b). The data are provided on a regular grid with a resolution of 0.25 • × 0.25 • . The product performs well in moderately vegetated regions but shows higher uncertainties in densely vegetated regions (as vegetation attenuates the microwave signal from the ground) and mountainous areas (due to the high surface roughness) (Liu et al., 2011b).
The retrieval of soil moisture from passive sensors discussed in Sect. 2.2.6 can be accompanied by an estimation of the vegetation optical depth (VOD). VOD can be used to account for the development of vegetation over the year as it is a good proxy of vegetation water content (Liu et al., 2015). Although most ET models traditionally use parameters derived from visible and near-infrared wavelengths, microwave VOD is used by GLEAM. Here the long-term record by Liu et al. (2011a) based on the application of the Land Parameter Retrieval Model by Owe et al. (2001) is used by GLEAM.

Precipitation and snow
Observations of precipitation and snow water equivalent are also required by GLEAM only. Precipitation is used both to estimate the effects of soil water limitations on ET and to calculate interception loss. To run the model on the tower scale we use the Climate Prediction Center (CPC) Morphing Technique (CMORPH) (Joyce et al., 2004). CMORPH transports the features of precipitation estimates derived from low orbiter satellite microwave observations using information from geostationary satellite infrared (IR) data. Precipitation estimates are available every 30 min on a grid with a spacing of 8 km at the Equator, although the resolution of the individual satellite-derived estimates is coarser at ∼ 12 km × 15 km. The spatial coverage ranges from 60 • N-60 • S. To run the model globally, we use the National Centers for Environmental Prediction (NCEP) Climate Forecast System Reanalysis (CFSR) land precipitation estimates (Coccia and Wood, 2015). These precipitation estimates come from the hourly CFSR output (Saha et al., 2010) but are corrected using the observation-based data sets of the CPC (Xie and Arkin, 1997) and the Global Precipitation Climatology Project (GPCP) (Adler et al., 2003). Finally, snow wa-ter equivalent estimates come from ESA GlobSnow. Since GlobSnow covers the Northern Hemisphere only, data from the National Snow and Ice Data Center (NSIDC) are used in snow-covered regions of the Southern Hemisphere (Kelly et al., 2003). The product combines satellite passive microwave measurements with ground-based weather station data in a data assimilation scheme (Luojus and Pulliainen, 2010). The products exist at a daily resolution and a spatial resolution of 25 km.

Tower selection
Model simulations are evaluated by comparison with the turbulent latent fluxes measured by the eddy-covariance technique at a selection of tower sites from FLUXNET (Baldocchi et al., 2001). A first sample of towers was compiled by selecting those stations from the FLUXNET La Thuile synthesis data set which contain latent flux measurements in the 2005-2007 period, as well as the meteorological and radiation inputs required to run the ET models at the towers' locations. The 24 selected stations are described in Table 2 and their geographical location is displayed in Fig. 2. While some meteorological variables such as near-surface air temperature or humidity are measured at nearly all towers, other inputs such as the surface net radiation or the ground heat flux are measured at only a few towers. Some stations that were very close to the shore or in places with regular flooding were discarded. The final selection of 24 towers represents a significant number of biomes and a reasonable sample of dry and wet climate regimes.
In a later step, by removing the constraint of requiring local measurements of all the model inputs, the first selection of 24 towers is extended to a total of 85 stations. This second selection is used to evaluate model performance when the models are run with the satellite data used for the global runs.

In situ surface energy balance
While, in principle, the surface energy balance should close at the tower, this is rarely the case: a lack of closure in the surface energy balance of about 10-30 % is commonly found when comparing the EC measurements against the energy balance residual (ER) term, i.e., the difference between net radiation and the sum of the sensible, latent and ground fluxes (e.g., Foken et al., 2006). Consequently, throughout the paper the model evaluation is discussed by comparing it with both the EC measurements and the in situ ER estimates.

In situ LST
To run SEBS, the broadband longwave radiometer measurements need to be converted into LST estimates. This is done by inverting the equation relating the upwelling spectral radiance measured by the radiometer and the LST. Broadband emissivity is required, and it is estimated from the MODIS-based Global Infrared Land Surface Emissivity Database (Seemann et al., 2008) operated by the Cooperative Institute for Meteorological Satellite Studies (CIMSS). The estimates are calculated by following the approach suggested by Wang et al. (2005) using a linear combination of narrowband emissivities at 8.5, 11 and 12 µm.

In situ vegetation height
SEBS also requires vegetation height to derive the surface roughness values. In most cases a mean annual value can be obtained from the tower metadata, and this value is adopted here as vegetation height at the tower. However, a clear limitation in this assumption is that it does not include dynamic changes in vegetation height over time. As discussed in Sect. 2.2, the importance of neglecting the temporal variability in height is biome-dependent; for instance, in forests the mean vegetation height is typically more constant than in, e.g., croplands, where the changes derived from agricultural practices can be large. Longitude q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q

Evaluation times
The model performance is investigated on sub-daily and daily timescales. The tower data are available at 0.5 h intervals and have been time-integrated to 3 hours in order to run the ET models at that sub-daily resolution. The satellite data have been time-matched to the 3-hourly or daily resolutions from their native resolution in different ways (see below), depending on the type of data and original resolution. The 3hourly inputs were then aggregated to daily values in order to run the models with tower-based daily inputs. The tower data record is not always time-continuous, as in some instances there are gaps in the record. This is not a problem for the PM-MOD, PT-JPL and SEBS models because the ET estimates depends only on the instantaneous atmospheric or surface state. When inputs to the models and/or ET for the evaluation are missing, those three models are not run. Conversely, GLEAM requires continuous data records to update the soil moisture state variable. To facilitate running GLEAM with tower inputs, the tower measurements are gap-filled with the corresponding pixel data (see Sect. 2.2). ET estimates from these periods are removed after the runs, so, as before, only the time steps where tower forcing data are available are used for model evaluation.
The models are validated against the tower ET only under dry (non-raining) conditions, as EC gas analyzers are not reliable during rain events due to disturbance of the infrared signal by droplets on the sensor (Burba et al., 2010;Hirschi et al., 2016). Therefore, any days with precipitation as indicated by the tower or satellite precipitation are removed from the validation, and the interception component from PM-MOD, PT-JPL and GLEAM is not considered in the validations.

Nighttime ET
Only PM-MOD and GLEAM specifically deal with nighttime evaporation. Nevertheless, nighttime values are required from all models to integrate the 3-hourly ET estimates to daily values. For SEBS and PT-JPL negative nighttime estimates are set to 0 to allow the daily integration for those models. To separate day and night, daylight times are identified by calculating the solar zenith angle. Time intervals, where the cosine of the zenith angle is larger than 0.2, are kept as day values. This day and night separation may be less accurate than using a solar downward radiation threshold, but it allows a day-night flag for those stations without solar radiation measurements. The impact of setting ET from SEBS and PT-JPL to 0, as these models cannot specifically simulate nighttime conditions, is addressed in Sects. 3.1 and 3.2, where sub-daily periods, including daytime, are investigated.

ET production
The following ET estimates are generated to evaluate model performance.

D. Michel et al.: The WACMOS-ET project -Part 1
-Tower-based ET: ET generated by the four models using the 3-hourly or daily in situ data (surface radiation, LST, air temperature, air humidity, and wind speed and precipitation), the scaled WACMOS-ET LAI and f APAR and gridded soil moisture and VOD data. Note that the 3-hourly tower-based ET estimates are also timeintegrated to daily values, so daily ET estimates exist both from the runs with daily inputs and from the integration of the 3-hourly ET estimates.
-Original-resolution satellite-based ET: ET generated by the four models using the 3-hourly or daily satellite data (SRB surface radiation, ERA-Interim air temperature, humidity, wind speed, CMORPH precipitation, ESA-CCI soil moisture, scaled WACMOS-ET LAI and f APAR) at their original resolutions. In situ LST is still used here in order to have the same number of SEBS estimates as in the tower-based ET (cloudiness, satellite overpass time and revisiting times would have notably reduced the number of SEBS estimates if the satellite LST had been used). As for the tower-based ET, 3-hourly tower-based ET estimates are also timeintegrated to daily values.
-Common-grid satellite-based ET: ET generated by the four models using the 3-hourly satellite data resampled to a common grid. In contrast to the previous runs, the satellite data are not applied at their original resolutions but after resampling them to the sinusoidal grid at ∼ 25 km, adopted to produce the global model runs.
Note that the CMORPH precipitation is replaced by the CFSR-Land product in order to have global coverage and that the LST is based on the AATSR observations.

Results and discussion
Here we look at the model performance against the in situ measurements, when the models are run with tower-based and satellite inputs. This section is divided into the three subsections, each of them dealing with one of the three experiments introduced in the previous Sect. 2.4.3. First, the 3-hourly and daily runs based on in situ forcing at 24 FLUXNET stations (see Table 2) are investigated. In the second part we look at the model performance at the same stations using 3-hourly and daily resolution satellite forcing. Finally, the ET estimates from the run using 3-hourly common-grid satellite forcing are compared to the in situ measurements at 85 FLUXNET stations.

Three-hourly and daily tower-based ET
The agreement of modeled evaporative fraction (EF) -defined here as λE/R n , using modeled λE and the net radiation from the respective forcing -with the measured EF (i.e., based on tower measurements of λE and R n ) gives an indication of the algorithm skill to model evaporative stress. Figure 3 (top panel) illustrates the agreement of modeled evaporative fraction with in situ measurements (derived using both EC and ER measurements of evapotranspiration; see Sect. 2.3.2), when models are run with tower inputs. GLEAM generally ranges between the EC and ER measurements, even at dry stations in open shrubland (OSH), woody savannas (WSA) and grassland (GRA) biomes, e.g., Sardinia/Arca di Noe (IT-Noe), Audubon Research Ranch (US-Aud), Santa Rita Mesquite (US-SRM) and Walnut Gulch Kendall Grasslands (US-Wkg). Only in the evergreen needleleaf forests (ENF) does GLEAM exceed the range of in situ measurements. PT-JPL mostly agrees with the reference as well, although it presents positive biases at some dry sites, like Wind River Crane (US-Wrc) and IT-Noe. PM-MOD underestimates EF for most stations (but it is very close to the EC measurements at six stations), while SEBS is characterized by an overall overestimation (for six stations SEBS EF is within the tower EC-ER range). In terms of the model performance per biome type, it can be stated that models generally perform the best in croplands (CRO) and deciduous broadleaf forest (DBF); at least this is the case for PT-JPL, PM-MOD and GLEAM. SEBS seems to perform better in grassland and savanna biomes (SAV). It is, however, difficult to derive robust conclusions on the model performance as a function of biome due to the low number of stations per biome type. As the surface meteorology plays an important role in the ET production, we also compare the point-scale model performance with the gridded ERA-Interim ET data set (ERA) in Fig. 3 (top panel). ERA-Interim estimates are mostly within the range of EF measurements. The good agreement between ERA EF and the in situ measurements indicates that the ERA-Interim meteorology reliably captures the station conditions. It can also be stated that the point-scale towerforced EF derived with PT-JPL and GLEAM match the ERA-Interim product based on a ∼ 75 km resolution.
A statistical assessment of the model performance is given in Fig. 4, which shows the correlation (R 2 ), the RMSD and the average of the bias normalized by the reference (MBD) between modeled ET and tower measurement of ET (i.e., using the EC approach). In the left column of Fig. 4 the station averages of the statistical inferences are shown according to measured EF, i.e., from wet to dry. In general, the correlation to in situ data is high in wet and in moderately wet biomes for most sites and for all models. This is also true for SEBS, despite its substantial overestimation of EF (see Fig. 3). However, there seems to be a distinct decrease in R 2 from wet to dry biomes for all models; this decrease in performance is lower for GLEAM and higher for PM-MOD, which presents correlations (R 2 < 0.4) at dry sites. PT-JPL stands out amongst the ensembles with the highest correlation at most sites and especially in dry conditions. In comparison to the mostly underestimated evaporative fraction derived with PM-MOD (see Fig. 3   sponds to PT-JPL and GLEAM and even produces the lowest maximum value (0.13 mm h −1 ), followed by GLEAM (0.17 mm h −1 ). Note that the large positive MBD values of PT-JPL and SEBS (> 200 %) may partly result from forcing ET to 0 during nighttime (see Sect. 2.4.2), when tower ET is negative, and thus leading to large relative errors, even for small negative reference ET values.
In order to evaluate the impact of using EC measurements as reference (in contrast to the ER method in Fig. 4), Table 3 shows the overall average 3-hourly model performance (i.e., the average of all station statistics) using both EC and ER data as reference. Overall, the average statistics of PT-JPL and GLEAM appear more favorable than those of SEBS and PM-Mu, although the RMSD and MBD of PM-MOD and the R 2 of SEBS are in general comparable to those of GLEAM and PT-JPL. This is again to a large extent affected by the overall overestimation and underestimation by SEBS and PM-MOD, respectively. The RMSD of SEBS is significantly smaller when using the ER method as reference (0.10 mm h −1 ) as opposed to using EC (0.13 mm h −1 ); on the other hand, the RMSD of PM-MOD is larger compared to ER (0.12 mm h −1 ) than compared to EC (from 0.06 mm h −1 ). Note that the transpiration resistances in PM-MOD are calibrated based on a biome-dependent annual ET derived from EC observations, which may explain the smaller RMSD and MBD when using EC as a reference. Finally, the RMSD station averages are similar to both in situ references for by PT-JPL (0.08, 0.09 mm h −1 ) and GLEAM (0.08, 0.08 mm h −1 ).
Here the skill of models at representing ET at specific times of the day is examined. Note that small nighttime ET values from models and measurements may produce small absolute errors and thus can improve the overall full-day model performance in comparison to daytime periods, even if the relative bias is large.
The Taylor diagrams in Fig. 5 show that correlations to in situ observations (using the EC method) considering the entire daily cycle (left panel) are very similar compared to those considering daytime values only (left panel, top row) or nighttime values only (right panel, top row). The overall R 2 with tower forcing including all models is 0.67 for full-day as well as daytime evaluation and 0.68 for nighttime evaluation; this indicates that the results are independent of the timescale. Note that nighttime is identified as cases when the cosine of the zenith angle is < 0.2.
We can see (right panel, top row) that forcing negative nighttime ET values of PT-JPL and SEBS to 0 (in contrast to specific negative ET produced by PM-MOD and GLEAM; see Sect. 2.4.2) does not have a substantial impact on the overall agreement with tower measurements. However, it should be noted that the uncertainty of nighttime EC mea-  surements is high because of low turbulence. Hence, large nighttime errors can be present not only in the ET simulations but also in the EC data. Sub-daily resolution is desirable in evaporation modeling, as it allows investigation of the underlying land-atmospheric interactions during the daily cycle of the planetary boundary layer. Given the short timescale of these interactions, one may expect that models that are able to reproduce short-term variability in ET would also be able to provide more reliable aggregates on daily timescales. Therefore, we investigate whether the model performance would benefit from solving evaporation at a 3-hourly resolution and aggregating it to daily values, as opposed to generating the estimates with daily input directly. Figure 5 (bottom row) clearly shows that not much more skill is gained by producing daily ET based on 3-hourly input (i.e., resolved diurnal cycles in the meteorological inputs) as opposed to forcing the models with the original daily input; results are almost identical when using aggregated 3-hourly output (left panel, bottom row) or using daily forcing (right panel, bottom row). In fact, for GLEAM the correlation to the EC reference is slightly higher when daily input is used, even if the standard deviation agrees marginally less well with the reference. Figure 6 shows the statistics of the models' evaluation after forcing them with daily inputs. As expected, the general correlations become lower when daily (as opposed to 3-hourly) estimates are validated, since the daily cycle no longer plays a role in the enhancement of correlations -this was already highlighted by Table 3. Comparison of Figs. 4 and 6 shows that the decline in average R 2 from wet to dry stations is less evident at a daily resolution. This may be due to the smaller sample size when daily values are analyzed. PM-MOD and SEBS in particular correlate poorly at dry stations (also at other stations, such as the moderately wet AU-How). PT-JPL and GLEAM perform worse (compared with the 3-hourly resolution) at dry stations when they are run at Tower Satellite • Forcing Tower Satellite Figure 5. Taylor diagrams of 3-hourly model performance against EC reference in sub-daily periods (top-row panels) and as function of temporal resolution (bottom-row panels). The left panel shows the average model statistics for full-day (compare to top row) and 3-hourly output data (compare to bottom row). Daytime is defined as cases when the cosine of the sun elevation azimuth is > 0.2; nighttime is defined as cases when the cosine of the sun elevation azimuth is < 0.2. Shown are the normalized standard deviation, the normalized RMSD and the correlation coefficient (R). Table 3. Summary of 24 stations average statistics for 3-hourly and daily tower forcing. EC denotes the model agreement with the evapotranspiration reference from eddy-covariance measurements, and ER is the model agreement with the evapotranspiration reference based on the in situ energy residual. RMSD is given in millimeters per hour for both 3-hourly data (3 h) and daily data (d). daily resolutions. In terms of the RMSD and MBD, the results are quite similar to the 3-hourly findings, but in most cases worst performance at the daily resolution is found at dry stations. An exception is GLEAM, which shows smaller RMSD at the dry stations when using daily rather than 3hourly resolution. The change in overall MBD (against the EC reference) from using 3-hourly tower input to using daily tower input is from 53.1 to 47.8 % for PT-JPL, from −6.7 to 3.8 % for PM-MOD, from 125.9 to 113.5 % for SEBS, and from 31.9 to 15.6 % for GLEAM. While the pattern of EF (Fig. 3) and MBD (Fig. 4) indicates a substantial underestimation of 3hourly ET by PM-MOD, this underestimation is attenuated when daily input is used (−18.2 to −11.3 % compared to the ER reference). Note that even if we employ the term daily input, the PM-MOD model estimates day and night ET separately by using integrated day and night inputs (as opposed to PT-JPL, SEBS and GLEAM, which use daily integrated inputs) and then combines them to provide a daily value. This is how the PM-MOD model was originally used and how it is implemented in this study for daily estimation. The better agreement on a daily scale thus may reflect a more appropriate use of the inputs.
The similarity of the results for different temporal resolutions underlines the robustness of the modeling processes. PT-JPL and GLEAM agree best with the in situ measurements, while SEBS yields a good correlation in comparison to the other models yet produces the largest absolute errors due to its large overestimation. PM-MOD produces the lowest correlation but agrees rather well in terms of absolute deviations. Table 3 summarizes the main statistics of the model evaluation for the 3-hourly and daily tower inputs. . Station mean statistics of daily data from daily input against EC reference. Left-column panels: tower-forced ET; right-column panels: satellite-forced ET; top-row panels: R 2 correlation coefficient (left y axis); middle-row panels: mean bias deviation (MBD, left y axis); bottom-row panels: root mean square difference (RMSD, left y axis). For all plots the evaporative fraction is given by the grey area (right y axis).

Three-hourly and daily original-resolution satellite ET
In this section we discuss the model performance using 3hourly and daily satellite forcing with original resolution at the selected 24 FLUXNET stations. The findings are compared to the results of the tower forcing in the previous section in order to allocate model uncertainty to either the algorithms used or the common forcing. The evaluation of 3-hourly modeled EF using satellite forcing (Fig. 3, bottom panel) shows a very similar picture of agreement with the reference compared to the results of the tower forcing. Note that the satellite EF shown here slightly differs from tower-forced EF, as the data availability of the input time series may be different at some stations. The ET overestimation by SEBS seems to be slightly emphasized when using satellite input in comparison to the tower forcing. Note that the LST used in SEBS is still obtained from the tower measurements, as discussed in Sect. 2.4.1. EF derived with PT-JPL and GLEAM still agrees well with the reference, yet GLEAM overestimates EF in dry biomes when using satellite forcing but is more accurate at needleleaf forest sites. The good model performance of PT-JPL and GLEAM, independent of forcing type, indicates a robust performance of the models on the one hand and a reliable satellite forcing -in the sense of their meteorology comparing well with the in situ tower data -on the other hand.
In Fig. 3 (bottom panel) we also compare the model performance with the gridded ERA-Interim ET data set. Note that while the tower forcing runs (top panel) are independent of ERA-Interim, the satellite runs use ERA-Interim estimates as inputs for the surface meteorology. As shown in Sect. 3.1, the ERA-Interim EF product agrees with the in situ measurements. The correlation of the models to ERA-Interim is not substantially improved with satellite input in comparison to the tower forcing, although the models now use the ERA-Interim meteorology as input.
The station averages of the statistical indices R 2 , RMSD and MBD of the models forced with satellite observations (Fig. 4, bottom panel) against the in situ measurements underline the previously reported high similarity of modeled ET based on tower and satellite forcing. Only the RMSD of SEBS is slightly attenuated with remotely sensed forcing. However, the algorithm is still characterized by substantial overestimation.
In the following we compare the model performance with daily satellite forcing to the model performance with daily tower forcing. In accordance with the evaluation of 3-hourly data (see Fig. 4), Fig. 6 indicates that the daily satellitebased ET products also correspond to the tower-based modeled ET. We want to highlight, however, that in contrast to the 3-hourly runs, the RMSD of SEBS substantially increases when satellite input is used. This suggests that the SEBS physical modeling captures the ET processes more accurately with the high temporal resolution inputs (3-hourly vs. daily). Table 4 provides a summary of the main statistics of the model evaluation for the 3-hourly and daily satellite inputs.

Three-hourly common-grid satellite ET
Here the ET algorithms are tested against 85 FLUXNET stations using the gridded sinusoidal (∼ 25 km) satellite input (as opposed to using their original input resolutions) in order to evaluate the common-gridded global ET estimates on the tower scale. Only the evaluation over the towers is discussed here, with the evaluation on the global scale discussed in the companion paper of Miralles et al. (2016). Note that the spatial mismatch between the tower fetch and the ∼ 750 km 2 of the gridded cells is very large, and the agreement between the tower fluxes and the modeled ET certainly depends on the tower conditions being representative of the corresponding gridded pixel. This was also the case for some of the original-resolution satellite inputs used over the 24 stations, such as the SRB radiation or the ERA-Interim meteorology. The results of the satellite runs using common-grid forcing are compared to the results using the tower and satellite inputs on the tower scale presented in Sects. 3.1 and 3.2.
The top panel of Fig. 7 shows the mean 3-hourly EF over 70 stations for PM-MOD, PT-JPL and GLEAM. For 15 of the 85 stations the surface radiation or the ground flux was not available; hence, the ER reference could not be calculated. As the gridded inputs use satellite LST from AATSR, SEBS ET is only estimated at the midmorning AATSR overpass. The bottom panel of Fig. 7 shows the annual midmorning evaporative fraction, this time including SEBS. Due to the 3-day revisiting time of AATSR and the lack of measurements in cloudy conditions, the number of available SEBS ET estimates reduces drastically, compared with the previous simulations using tower LST. The bottom panel of Fig. 7 Table 4. Summary of 24 station average statistics for 3-hourly and daily satellite forcing. ERA denotes the agreement of ERA-Interim evapotranspiration with the in situ reference evapotranspiration. For other abbreviations, see Table 3. RMSD is given in millimeters per hour for both 3-hourly data (3 h) daily data (d). shows station averages from all models only when SEBS ET is available. Thus, it is based on fewer data and with the number of stations reduced to 67.
The 3-hourly model performances from PM-MOD, PT-JPL and GLEAM correspond closely to the performance in the analysis using the 24 towers and the original-resolution satellite inputs. The EF station averages produced by PT-JPL and GLEAM are very close at all locations and respond well to the hydrological and energetic conditions expected in the respective biome. The overall agreement with the range between EC and ER in situ measurements is comparable to what has previously been found in the smaller sample of stations (see Fig. 3). PM-MOD keeps underestimating ET, except for the cropland biome, where the majority of station averages matches well with the reference. Concerning the midmorning evaporative fractions, the PM-MOD, PT-JPL and GLEAM patterns are all very similar to the case with the full diurnal cycle. SEBS again tends to overestimate over a large number of stations, compared with the in situ measurements. Overall, it can be stated that the model accuracy and inter-model agreement obtained with in situ and satellite forcing on the tower scale could be reproduced with the common-grid satellite forcing. Figure 8 summarizes the results above by displaying standard deviation, correlation and RMSD of the modeled ET shown in Fig. 7 against the EC reference. The Taylor plots highlight the fact that the variability of PT-JPL, PM-MOD and GLEAM is not substantially influenced by the low sample size for cases when SEBS ET is available. Again, the similarity between Fig. 5 (left panel) for satellite forcing on the tower scale and Fig. 8 for gridded input data confirms the robustness of the analyses independent of tower and time sampling.

Conclusions
In this first part of the WACMOS-ET study, the skill of the PT-JPL, PM-MOD, SEBS and GLEAM ET algorithms has been tested on the tower scale against in situ measurements at 24 FLUXNET sites. The algorithms are forced using in situ meteorological data from these towers, covering the period 2005-2007 on three continents and across nine different biomes, while ensuring spatial consistency between input and reference data. Additionally, the models are run for the same period with reanalysis and satellite forcing of varying spatial resolutions, including ERA-Interim (surface meteo-rology), SRB (radiation), AATSR (LST), GlobAlbedo (LAI and f APAR), CMORPH (precipitation) and WACMOS-CCI (soil moisture). The models were run with 3-hourly and daily input to assess the robustness of their performance for subdaily and daily resolution.
Our analyses have shown that the four models' performance is robust in terms of changes in forcing types and temporal resolutions (i.e., the changes do not alter the model behavior at the selected stations significantly). Against the in situ 3-hourly energy residual estimates at the tower, the tower-based model simulations are ranked (according to station averages) as follows: GLEAM (0.80, 0.08), PT-JPL (0.78, 0.09), SEBS (0.78, 0.10) and PM-MOD (0.55, 0.12). The first value in the brackets denotes R 2 and the second value denotes RMSD in millimeters per hour. Compared to the eddy-covariance measurements, however, the station averages of RMSD do not reflect the same outcome. Due to more substantial overestimation at two stations each, the RMSD of PT-JPL (0.77, 0.08) and GLEAM (0.70, 0.08) are larger than that of PM-MOD (0.58, 0.06). However, correlations are consistently higher for GLEAM and PT-JPL. Thus, over our selection of towers and the reference period (2005)(2006)(2007), we judge GLEAM and PT-JPL as the algorithms more closely matching the in situ observations. At some stations, PM-MOD and SEBS also agree well with the observations, but in general the PM-MOD and SEBS performance is characterized by under-and overestimation, respectively.
For the satellite forcing, the RMSD between the models and the reference yields very similar numbers as for tower forcing. Correlations are closer but in most situations slightly smaller for the satellite forcings. This can be the result of discrepancies between the spatial resolution of satellite observations and tower measurements, although different inputs errors (in situ vs. satellite) may also play a role. This performance closeness between in situ and satellite-derived values can be an indication of the spatial representativeness of the tower measurements (i.e., reasonable spatial homogeneity around the tower) and the consistency of the input data set across forcing types. This is underlined by a comparison to the ∼ 75 km resolved reanalysis ET product of ERA-Interim, which agrees well with the modeled ET across the different biomes.
Regarding the analysis over the 85 stations, a similar overall picture is obtained using the ∼ 25 km common-grid ET prepared for the global runs. The evaluations of McCabe et al. (2015) of a different selection of towers (45 stations), over a more extended period (1997)(1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007) and with different satellite forcings (LandFlux forcings) also results in an overall similar analysis, confirming the robustness of the model performance evaluations.
Using daily input data reduces the RMSD of the models with the tower measurements but results in slightly worse correlations. This is due to the lower variability of daily values in contrast to 3-hourly data (variability accentuated by the diurnal cycle). However, the consistency of the model agreements with the reference with regard to 3-hourly and daily ET estimates highlights the robustness of the integration method applied to the models. This is also underlined by the good agreements of modeled daily ET from aggregated 3-hourly output data with modeled daily ET from daily input.
While GLEAM and PM-MOD can produce negative ET, PT-JPL and SEBS cannot operate under these conditions (mostly at nighttime when the flux of available energy reverses sign) and their negative values are forced to 0. This does not have a large impact on their full-day performance, since these values occur at night, when tower ET is negative and with generally low values. Only for the relative bias is the effect significant, since the two models consequently overestimate ET in these cases.
In terms of high and low temporal input resolution, it was found that using 3-hourly input data does not significantly increase the accuracy of the models for producing daily ET. Hence, it is sufficient to use daily input to achieve a similar result if the intended application of the ET product does not demand a reproduction of the diurnal cycle.
The ET models generally perform best in wet biomes and tend to overestimate values in dry biomes, where ET is constrained by water availability. Focusing on water stress in the model development within the community would thus provide the opportunity to obtain more robust simulations of surface fluxes for global-scale employment.
The conducted analyses based on in situ ET are useful to evaluate model performance, but there are some clear limitations. Our requirements for tower selection resulted in a somewhat limited number of stations, so it would be desirable to extend the evaluations to larger regions in order to better cover different climate and biome conditions. Therefore, in the companion paper of Miralles et al. (2016) our analyses are extended by looking at the global spatiotemporal variability of the modeled ET, the closure of regional water budgets, and the discrete estimation of land evaporation components or sources (i.e., transpiration, interception loss and direct soil evaporation).