A coupled human–natural system to assess the operational value of weather and climate services for agriculture

Recent advances in weather and climate (W&C) services are showing increasing forecast skills over seasonal and longer timescales, potentially providing valuable support in informing decisions in a variety of economic sectors. Quantifying this value, however, might not be straightforward as better forecast quality does not necessarily imply better decisions by the end users, especially when forecasts do not reach their final users, when providers are not trusted, or when forecasts are not appropriately understood. In this study, we contribute an assessment framework to evaluate the operational value of W&C services for informing agricultural practices by complementing traditional forecast quality assessments with a coupled human–natural system behavioural model which reproduces farmers’ decisions. This allows a more critical assessment of the forecast value mediated by the end users’ perspective, including farmers’ risk attitudes and behavioural factors. The application to an agricultural area in northern Italy shows that the quality of state-of-the-art W&C services is still limited in predicting the weather and the crop yield of the incoming agricultural season, with ECMWF annual products simulated by the IFS/HOPE model being the most skillful product in the study area. However, we also show that the accuracy of estimating crop yield and the probability of making optimal decisions are not necessarily linearly correlated, with the overall assessment procedure being strongly impacted by the behavioural attitudes of farmers, which can produce rank reversals in the quantification of the operational value of W&C services depending on the different perceptions of risk and uncertainty.


Referee comment #1
The subject of the paper "A coupled human-natural system to assess the operational value of weather and climate services for irrigated agriculture" is of direct interest to the Journal of Hydrology and Earth System Sciences. Authors introduce and apply a framework in the context of measuring the operational value of weather and climate services (WCs). The validation of the usefulness of the WCs to the final users is a much needed step towards the realization of these services.
SPECIFIC COMMENTS
1. One of my concerns is the limited duration of the analysis period (2001-2005). Why didn't the authors extend the analysis beyond 2005? Is it due to the limited data availability? If yes, it would also be interesting to see similar results for a longer time period, even for fewer forecast products.
The motivation for limiting the analysis to the period 2001-2005 is manifold: 1) the historical observations available for running the model cover the period 1993-2005, which was divided into two periods, the first used for post-processing the forecast products and the second for performing the analysis; 2) ECMWF forecast products are obtained from the "ENSEMBLES" project, which provides hindcasts over the period ; 3) CFSv2 and CanSIPS cover the period (1981-2010), but they are outperformed by ECMWF products. We clarified this point in the discussion of limitations/assumptions of the study that we included in the conclusion section (see page 22, lines 10-30).
2. Could a large ensemble developed by combining several products, instead of a single forecast product, improve the forecast quality or result in better decisions compared to single products?
We agree with the reviewer that a larger ensemble (note that all the products we used are in the form of ensemble forecasts) might attain a better performance in terms of forecast quality and, possibly, also in terms of operational value. However, the use of multi-model ensembles opens up a number of challenges (such as how to limit the smoothing effect on the extreme events, how to combine multiple products with different levels of accuracy, and how to simplify the uptake of the resulting large ensemble), which go beyond the scope of this paper and can be explored in a future analysis. In the revised version of the paper, we included this point in the list of limitations/assumptions added to the conclusions section, suggesting the opportunity of exploring it in future research (see page 22, lines 10-30).
3. Another subject that could also be discussed is the limitations and/or assumptions of the study. I think that a limitations section should be added to the paper in order to summarize the main simplifications or assumptions considered in the work. For example, the determinant yield factor is the water availability, regardless of the farmers' agricultural practices during the cultivation period. Maybe such a section could also include some references to works in which these aspects have been treated in a different way.
The point raised by the reviewer is well taken. In this work, we did not explore the impacts of agricultural practices (primarily the use of nutrients and fertilizers), as water availability is the predominant factor in the considered case study. The validity of this assumption is discussed in our previous work. However, we agree that other determinant factors can be explored in future work. Following the reviewer suggestion, we added a list of
P5 -L26: add cross reference for Table 1.
Following the reviewer suggestion, we will add a cross-reference to the table in the revised manuscript (page 5 line 5).
P9 -L5: Why do you set the resolution to 250m? Is this resolution adequate for representing the spatial detail of the crops/properties?
The resolution of the model was set in previous works (e.g., Vassena et al., 2012) to allow a proper characterization of the spatial distributions of all the components of the model, especially in terms of the water balance module. We clarified this point in the revised manuscript (page 9 lines 7-9).
Vassena et al. (2012), Modeling water resources of a highly irrigated alluvial plain (Italy): calibrating soil and groundwater models, Hydrogeology Journal, 20(3): 449-467.
P9 -L25: Does this model take into account the behavioral dependency on the preceding year? Meaning that the farmers' decision is affected, for example, by a "previous (i-1) dry year" and, as a result, the potentially optimistic decision of year i would be more pessimistic?
In principle, our model can account for this type of behavioral dependency. However, the calibration of a decision model implementing such behavioral dependency requires long behavioral time series to identify the proper lag-time as well as the magnitude of the effect for different levels of drought intensity. In the absence of such a large observational dataset, we decided to partially explore this point by 1) simulating farmers' decisions made assuming the next year is equal to the previous one or to the average of the last two (see EmpPast and Emp2Ave experiments); 2) running a sensitivity analysis using different levels of risk aversion. In the revised version of the paper, we included this point in the list of limitations added to the conclusions section, suggesting the opportunity of exploring it in future research when enough observational data are available (see page 22, lines 10-30).
P12 -Table1: are the products listed here single-member experiments, or is there a number of realizations?
All the forecast products are in the form of ensembles: ECMWF products have 9 ensemble members (or 3 in the case of decadal products), CanSIPS has 10 ensemble members, and CFSv2 has 4 ensemble members. We included this information in the revised Table 1.
Figure 4: You could also add total precipitation and average temperature for each year (row) on the right part of the figure (on the left from the legend).
We thank the reviewer for this useful suggestion (which also allows solving the other issue of Figure 4). In the revised version of the paper, we added information about the total annual precipitation and average temperature of each year.
P15 - Figure 6: it would be easier to read if you place the legend of each product on the corresponding sub-plot. Otherwise you could arrange the legend in similar order as the subplots because it is hard to detect. It would be also helpful if you could highlight the dry years.
Following the reviewer suggestion (shared also by R#2), in the revised version we moved the legends inside the subplots to improve the interpretability of the figure.

P16 - Figure 7: The differences are hard to distinguish. You could plot the anomalies instead or adjust the range of the temperature axis (for example from 17 to 23 °C). Again it would be also helpful if you could highlight the dry years.
We agree with the reviewer and, following his/her suggestion, we modified the figure by adjusting the range of the temperature axis and by moving the legends inside the subplots.
P18 - Figure 8: You could use a continuous line for the deterministic simulation.
We thank the reviewer for the suggestion and, in the revised version of the paper, we modified the figure accordingly.
P23 -Line 32: remove the space from "f armers".
We fixed the typo in the revised version (page 23 line 24).

Referee comment #2
The presented manuscript describes and applies a methodological framework to assess the operational value of weather and climate forecast products on irrigated agriculture. It combines a set of forecast products with an agronomic model that simulates the crop yield based on meteorological inputs and an agent-based model that establishes the optimal cropping pattern depending on the forecasts available and the risk profile of the farmers. The novelty of the paper consists in the joint assessment of the forecast quality and its impact on management decisions and farmers' risk profile. The methodology is well described and the structure and organization of the paper are coherent and adequate. The results point at the fact that the forecast quality is not necessarily correlated with its impact on management decisions. The paper fits the scope of the journal and has a clear potential for publication, given the increasing momentum of weather and climate services and how their "real" impact can be measured. I have no major concerns about the manuscript, although some improvements would further increase its quality. Therefore, I would consider it ready for publication after fixing the minor concerns I point at below.
We thank the referee for the positive comment.
TECHNICAL COMMENTS
1. Page 2, lines 28-31: In my opinion, the first sentence of this paragraph is just a summary of the previous one. I would delete it and reflect in the previous paragraph that an alternative promising metric would be the quality obtained on predicting decision-relevant variables.
Following the reviewer comment, we deleted this sentence.
2. Page 3, line 23: Although it becomes clear when moving forward that "post-processing" means "downscaling and bias-correction", I would add a remark here just to clarify it.
We thank the reviewer for the suggestion and we clarified from the beginning the meaning of post-processing (page 3 line 22).
3. Page 4, line 6: what do you mean when you state "pilot"? I think it is a synonym of "case study", but sometimes the term "pilot" implies you run field experiments to apply the method developed. Please clarify the term.
We agree with the reviewer that the term pilot might be misinterpreted. Since we used it as a synonym for case study, we removed this term in the revised version.
4. Page 5, lines 33-34: can you provide information to support the assumption of using crop yield as main driver of the cropping pattern decisions? Sometimes other variables like management complexity or profit predictability are more important than crop yield. In my opinion, you should clarify, if it is the case, that you make this assumption in the absence of more detailed information about the farmers' decision-making process.
We agree with the reviewer that this assumption should be clarified. In the revised version of the paper (page 5 lines 10-11), we mentioned that, in the absence of more detailed information about the farmers' decision-making process, we introduced this assumption on the basis of other similar studies (Hansen, 2004; Baigorria et al., 2008).
Hansen, J. (2004). "Linking dynamic seasonal climate forecasts with crop simulation for maize yield prediction in semi-arid Kenya". In: Agricultural and Forest Meteorology 125.1-2, pp. 143-157.
Baigorria, G. A., J. W. Jones, and J. J. O'Brien (2008). "Potential predictability of crop yield using an ensemble climate forecast by a regional circulation model". In: Agricultural and Forest Meteorology 148.8-9, pp. 1353-1361.
5. Page 6, line 21: as far as I know, the quantile-based mapping is a bias correction procedure. It is true that it has some downscaling component due to matching CDFs obtained at different spatial scales but, on a broader view, it is considered as a bias correction technique. In fact, you previously named it as a bias correction technique. Please fix this.
Following the reviewer suggestion, we fixed this point by consistently characterizing the quantile mapping as a bias-correction technique (page 6 line 24).
6. Page 8, lines 15-16: the way in which the aggregation is performed is not clear. I assume you aggregate the daily data of the same month, but it may also mean you aggregate the ensemble members. Please clarify it. If you aggregate the ensemble members to obtain a unique factor, I would rather suggest keeping the factor obtained by each ensemble member and generating synthetic daily time series with all of them. In this way, you will have a better representation of the extremes, which are flattened when taking the average.
We agree with the reviewer that this step is not clear. We perform the following aggregations: first we aggregate the daily data of the same month, then we estimate a monthly perturbing factor for each ensemble member, and finally we take the average factor across the ensemble members. We are aware that in this way we lose some information on the extremes, and we agree with the reviewer that performing the entire assessment on each single ensemble member would allow a better characterization of the extremes as well as exploring how this uncertainty is propagated when moving from the forecast quality to the operational value. Yet, this would be computationally challenging, as it would require running 96 simulations per year, for a total of around 500 computational hours. This computational effort goes beyond the scope of this paper. Moreover, the use of large ensembles opens up a number of challenges (see the reply to the second point raised by R#1), and the consequences of aggregating or not aggregating the ensemble members can be analyzed in detail, potentially focusing on a single forecast product, in future work.
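As an illustration only, the three aggregation steps described in this reply could be sketched as follows. The reply does not define the perturbing factor, so this sketch assumes it is the ratio between a member's monthly total and a reference monthly value; the function name, argument names, and the `reference_monthly` baseline are hypothetical and not taken from the manuscript.

```python
import numpy as np

def ensemble_mean_monthly_factor(daily, months, reference_monthly):
    """Return one ensemble-averaged perturbing factor per month.

    daily             : array (n_members, n_days) of daily forecast values
    months            : array (n_days,) with the month (1..12) of each day
    reference_monthly : dict month -> reference monthly total
                        (hypothetical baseline, e.g. a climatological value)
    """
    daily = np.asarray(daily, dtype=float)
    months = np.asarray(months)
    factors = {}
    for m in np.unique(months):
        # Step 1: aggregate the daily data of the same month, per member.
        monthly_total = daily[:, months == m].sum(axis=1)
        # Step 2: one perturbing factor per ensemble member
        # (assumed here to be a simple ratio to the reference value).
        member_factor = monthly_total / reference_monthly[int(m)]
        # Step 3: average the factor across the ensemble members.
        factors[int(m)] = float(member_factor.mean())
    return factors
```

Returning `member_factor` instead of its mean in step 3 would keep the members separate, which is the alternative the reviewer suggests for preserving the extremes.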
In the revised version of the paper, we clarified how we perform this aggregation (page 8 lines 17-19) and we included this aspect in the list of assumptions added in the conclusions section (see R#1 suggestion), suggesting as a possible follow-up work the opportunity of refining our analysis keeping all the ensemble members separated (see page 22, lines 10-30).
7. Page 12, table 1: please include the ensemble members of each WCS used unless all the products provide just one ensemble. In this last case, you should indicate in the text that all of them provide a unique ensemble member.
All the forecast products are in the form of ensembles: ECMWF products have 9 ensemble members (or 3 in the case of decadal products), CanSIPS has 10 ensemble members, and CFSv2 has 4 ensemble members. We added this information in the revised Table 1.
Furthermore, I would also provide the value of an average score for the time series inside each plot (for example the MAE). In this way, the reader has a numerical way to easily compare the accuracy of each WCS product type for each plot.
We thank the reviewer for the suggestion and, in the revised version of the paper, we moved the legends of the subplots. As suggested, we also added the mean absolute error of each product in predicting the crop productivity.
12. Page 20, lines 1-14: In my opinion, the fact that the neutral or optimistic risk profiles did not obtain the best performance for the best forecast deserves more explanation. How can you justify this issue? In the absence of more information, I would doubt the suitability of the score used (median and variance of MAE). Maybe the IFS/HOPE product does not predict extremes as ECHAM5/MPIOM does, and for this reason the latter offers the best performance on both the neutral and the optimistic risk profiles. Please add some explanation or theory about this unexpected finding.
This unexpected finding can be explained by the fact that forecast accuracy metrics quantify the error in predicting the agricultural production, while the operational value estimated through the decision model relies on the ranking of the available options (cropping patterns). Sub-optimal decisions are made when the forecasted productivity of the crops produces a ranking different from the one observed at the end of the agricultural season. However, such rank reversals are not linearly related to the forecast accuracy: large but consistent errors (e.g., systematic over- or underestimation) for all the crops may preserve the ranking and result in optimal decisions, while smaller but variable errors can produce sub-optimal decisions. This is quite clear if we consider the forecast accuracy of ECMWF (annual) IFS/HOPE and ECHAM5/MPIOM reported in Fig. 8: looking at the values in 2001, ECHAM5/MPIOM (which in Fig. 10 has the best performance) systematically overestimates the productivity of all the crops; IFS/HOPE instead underestimates the productivity of tomato while overestimating that of rice, potentially reversing the ranking of these crops and producing sub-optimal decisions. Following the reviewer suggestion, we clarified this point in the revised manuscript (page 20 lines 11-14; page 21 lines 1-6).
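The argument above can be illustrated with a small numerical sketch. All productivity values below are invented for illustration and are not taken from Fig. 8 or from the manuscript; the helper names are hypothetical. The point is only that a forecast with a larger mean absolute error can preserve the crop ranking, while a more accurate one can reverse it.

```python
def best_crop(productivity):
    """Crop with the highest productivity, i.e. the top-ranked option."""
    return max(productivity, key=productivity.get)

def mean_absolute_error(forecast, observed):
    """Mean absolute error across crops."""
    return sum(abs(forecast[c] - observed[c]) for c in observed) / len(observed)

# Hypothetical observed productivities at the end of the season.
observed = {"tomato": 10.0, "rice": 8.0, "maize": 6.0}

# Forecast A: large but systematic overestimation (+50% for every crop).
systematic = {crop: 1.5 * value for crop, value in observed.items()}

# Forecast B: smaller errors with opposite signs for tomato and rice.
variable = {"tomato": 8.5, "rice": 9.0, "maize": 6.0}

for name, forecast in [("systematic", systematic), ("variable", variable)]:
    error = mean_absolute_error(forecast, observed)
    same_choice = best_crop(forecast) == best_crop(observed)
    print(f"{name}: MAE={error:.2f}, optimal decision={same_choice}")
```

Here the systematic forecast has the larger error yet still selects the crop that turns out best, whereas the variable forecast has the smaller error but swaps tomato and rice.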

Referee comment #3
General comments: The paper is interesting and novel and it certainly falls within the scope of HESS. The paper presents a novel approach to evaluate climate predictions through the impacts they have on the user decisions. This is an important aspect in the evaluation of the predictions which is often overlooked in the context of climate services. The paper tries to reach some substantial and interesting conclusions but the results are somewhat weakened by the design of the experiments and the methodology that has been followed. The assumptions made are clearly outlined but the scientific methods (bias-correction) and datasets used (ENSEMBLES) lag a bit behind what I would consider the current state of the art.
Specific comments: More information on the bias correction methodology should be provided to allow the reproduction of the results by fellow scientists. In particular, reading section 3 it is not clear whether the bias correction is applied to the forecast on a lead-time basis or whether instead the authors perform the Q-Q bias correction using a CDF obtained looking at the entire forecast period. If, as it seems, it is the latter, the approach is likely to lead to incorrect results as the forecast bias is lead-time dependent (e.g. Doblas-Reyes et al 2013) whilst the CDF would be calculated on a full 7 month forecast. This is unlikely to be a major problem in regions characterised by a limited seasonal cycle and a small model drift, as you could assume the relationship linking model output and observations to be roughly the same throughout the year. Unfortunately I don't think such an assumption would hold in the region of study.
We agree that part of Section 3 was probably not completely clear. Specifically, given the strong intra-annual seasonal cycle of our study site, the bias-correction was applied on a monthly basis and not using a CDF calculated on the full 7 month forecast period. We clarified this point in the revised manuscript (page 6 lines 20-23).
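To make the distinction concrete, the following is a minimal sketch of an empirical quantile mapping applied month by month rather than over the whole forecast period. It is not the authors' implementation: the use of 101 empirical quantiles with linear interpolation is an assumption, and all function and argument names are hypothetical.

```python
import numpy as np

def quantile_map(hindcast, observed, forecast):
    """Map forecast values from the hindcast CDF onto the observed CDF."""
    probs = np.linspace(0.0, 1.0, 101)         # assumed quantile resolution
    hindcast_q = np.quantile(hindcast, probs)  # empirical model quantiles
    observed_q = np.quantile(observed, probs)  # empirical observed quantiles
    # Values outside the hindcast range are clamped to the observed extremes.
    # Note: many tied values (e.g. dry days) would need special handling.
    return np.interp(forecast, hindcast_q, observed_q)

def monthly_quantile_map(hind, hind_months, obs, obs_months, fcst, fcst_months):
    """Apply the mapping separately to each calendar month."""
    hind, obs, fcst = (np.asarray(a, dtype=float) for a in (hind, obs, fcst))
    hind_months, obs_months, fcst_months = (
        np.asarray(a) for a in (hind_months, obs_months, fcst_months)
    )
    corrected = np.empty_like(fcst)
    for m in np.unique(fcst_months):
        sel = fcst_months == m
        corrected[sel] = quantile_map(
            hind[hind_months == m], obs[obs_months == m], fcst[sel]
        )
    return corrected
```

Because each month has its own pair of CDFs, a bias that changes through the season (additive in one month, multiplicative in another, for instance) is corrected separately instead of being blended into a single transfer function.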
The paper appears to be based on a set of seasonal prediction ensembles characterised by a relatively small ensemble size. Given that we now know that, at least in the case of the NAO in Europe, the climate model signal strength depends on the number of ensemble members (e.g. Scaife et al. 2014), the results presented here may significantly under-represent the real usefulness of seasonal climate prediction for the target users.
We agree with the reviewer comment, which is shared by other reviewers, that a larger ensemble (note that all the products we used are in the form of ensemble forecasts) might attain a better performance in terms of forecast quality and, possibly, also in terms of operational value. However, the use of large ensembles, potentially multi-model ensembles, opens up a number of challenges (such as how to limit the smoothing effect on the extreme events, how to combine multiple products with different levels of accuracy, and how to simplify the uptake of the resulting large ensemble), which go beyond the scope of this paper and can be explored in a future analysis. We clarified this point in the discussion of limitations of the study that we added in the conclusion section as suggested by R#1 (see page 22 lines 10-30).
As noted by other reviewers, the evaluation was made on an extremely short time period, something which can only further reduce the significance of the results.
In the light of the points raised above I am not convinced the approach, despite its novelty and user-consideration, is necessarily fair in the analysis of the seasonal predictions and their value for informing decision makers.
The motivation for limiting the analysis to the period 2001-2005 is manifold: 1) the historical observations available for running the model cover the period 1993-2005, and we used the first part of this period for post-processing the forecast products and the second part for performing the analysis; 2) ECMWF forecast products are obtained from the "ENSEMBLES" project, which provides hindcasts over the period ; 3) CFSv2 and CanSIPS cover the period (1981-2010), but they are outperformed by ECMWF products. We clarified this point in the discussion of limitations/assumptions of the study that we added in the conclusion section as suggested by R#1 (see page 22 lines 10-30).
Technical comments: Weather and Climate Services (WCS) is not an acronym I came across before. Given the fundamental difference between the way in which climate and weather model output are typically dealt with, I am not sure this is particularly useful. Furthermore, World Climate Services (WCS) is also a trade name of a MeteoGroup product.
A web search for WCS does return the acronym with the meaning used in the paper. In any case, to avoid confusion, we changed it to W&C services throughout the paper.
Stream 2 was an experiment in the context of the ENSEMBLES project rather than a project per se, as erroneously stated in section 4.
We thank the reviewer for pointing this out. In the revised version of the manuscript we specified that Stream 2 was part of the ENSEMBLES project (page 12 line 4).
The statement about usefulness of seasonal prediction in agricultural application that appears in line 9 of the abstract is too general to be correct, as there are regions of the world where these kinds of predictions are known to be usable and useful.
We agree with the reviewer that this sentence is too vague. In the revised version of the manuscript, we modified it by specifying that this conclusion holds for the case study analyzed in the paper (page 1 lines 8-11).
Cloke and Pappenberger 2009 doesn't strike me as being the most relevant reference to describe the recent development of WCS, especially considering it is nearly 10 years old now.
Following the reviewer suggestion of citing more recent works, in the revised version of the paper we added the following references (
Weather and climate (W&C) services, defined as information on past, present, and future weather and climate useful to assist decision-making (GFCS, 2014), can provide valuable aid to a variety of economic sectors, ranging from hydropower production (e.g., Garcia-Morales and Dubus, 2007) and drought management (e.g., Mwangi et al., 2014) to flood protection (e.g., Cloke et al., 2017) and disease spread control (e.g., Thomson et al., 2006). These services are particularly important in agriculture (Hammer et al., 2001), where weather-sensitive decisions, such as crop choices or irrigation scheduling (e.g., Dutra et al., 2013; Winsemius et al., 2014; Wetterhall et al., 2014), must frequently be taken. Here, W&C services are expected to be even more helpful over the next years, when extreme weather conditions will be more frequent and intense (Dai, 2011).
Over the past decades, W&C services have undergone a broad development in many parts of the world (Cloke and Pappenberger, 2009; Bauer et al., 2015; Brunet et al., 2015). The existence of slow, and hence predictable, variations in sea surface temperature, sea ice, soil moisture, and snow cover, which interact with the atmosphere and impact the global climate, can be used for extending predictability at the seasonal time scale (Palmer and Hagedorn, 2006). Although some limitations still exist (e.g., Palmer et al., 2005; Lee et al., 2011), the recent increase of model resolutions (e.g., Prodhomme et al., 2016a), the improvement of the initialization procedures (e.g., Prodhomme et al., 2016b), and the more accurate representation of some physical processes (e.g., Hourdin et al., 2013) considerably advanced the accuracy of W&C services, with current state-of-the-art products showing good forecast skills even over seasonal and longer time scales (Doblas-Reyes et al., 2013).
A causal link between better forecast quality and higher operational value is however not necessarily straightforward (e.g., Ritchie et al., 2004; Ramos et al., 2013), especially when forecasts do not reach their final users, when the provider is not trusted, or when forecasts are not appropriately understood (e.g., Ramos et al., 2010; Frick and Hegg, 2011). In other words, while quantifying forecast quality is a necessary step in the assessment of W&C services, other indicators should be considered for capturing the stakeholders' judgment on the value of the forecast products, i.e. their operational value, particularly when this evaluation differs from the opinion of forecasters (Hartmann et al., 2002). Yet, most assessments reported in the literature focus solely on forecast quality, defined as the similarity between the forecast estimates and the actual observations of weather or hydrological variables based on some statistically formulated performance metrics (Murphy, 1993).
Recent attempts at assessing the operational value of W&C services tend to apply long-term forecasts for feeding simulation models in order to predict decision-relevant information, such as soil water availability for irrigation scheduling (Wang and Cai, 2009; Calanca et al., 2011) or crop production for cropping pattern decisions (e.g., Hansen, 2004; Baigorria et al., 2008).
The use of process-based simulation models contributes to a better understanding of W&C services by stakeholders and users, as it allows transforming weather forecasts (e.g., precipitation and temperature) into decision-relevant information (e.g., crop yield) through a transparent, objective, and reproducible procedure. For example, although farmers may be able to quantify the risks associated with predictions of a dry season, they would benefit much more from anticipated information on the crop yield and the associated risk of crop failure (e.g., Challinor et al., 2005). In addition, the relationship between weather and decision-relevant variables is often nonlinear, and an error in the weather forecast will not be linearly propagated into an error of the same magnitude in the crop yield prediction. The quality of forecast products evaluated on weather variables can differ from the evaluation performed on crop yield: two forecast products characterized by different levels of accuracy in predicting weather variables can provide similar predictions of crop productivity; vice versa, two products having similar skills in predicting temperature and precipitation can attain different performance in predicting crop yield. Quantifying the value in terms of forecast accuracy in predicting decision-relevant information is therefore crucial for improving stakeholders' trust in W&C services.
Although the model-based prediction of decision-relevant information is surely a step further towards the end-users' perspective, high quality forecasts may still be unused by stakeholders (e.g., Rayner et al., 2005; Coulibaly et al., 2015). For example, an attempt to increase the forecast accuracy for providing more early warnings often implies the risk of increasing the number of false alarms, ultimately discouraging the use of W&C services in operational contexts due to different perceptions of risk and uncertainty (Demeritt et al., 2007). In addition, many studies have shown how stakeholders' adoption of weather forecasts depends on their social context (e.g., Hansen, 2002; Suarez and Patt, 2004; Crane et al., 2010). Such evidence motivates exploring how users' behavioral factors influence the uptake and the use of W&C services, and suggests the need to quantify the operational value of W&C services as the improvement in the system performance obtained by informing stakeholders' decisions with W&C services (e.g., Zhu et al., 2002; Mylne, 2002; Giuliani et al., 2015; Denaro et al., 2017).
In this work, we propose a new framework for assessing the operational value of W&C services, which puts humans in the loop by integrating traditional forecast quality assessments with a behavioral model reproducing farmers' decisions. The proposed framework relies on a three-stage procedure, which starts by investigating the quality of post-processed forecast products. These forecasts are then used as input to an integrated model representing a Coupled Human-Natural System (CHNS; see Liu et al., 2007). This includes process-based models of the physical environment to predict decision-relevant information, coupled with decision models, which describe the farmers' decision-making process. Given the predicted climate forcing as inputs, the integrated CHNS model simulates the productions of different crops, among which each farmer selects the crop to cultivate by maximizing the expected net profit at the end of the agricultural season. This combination of process-based and decision models contributes a comprehensive framework for assessing W&C services and allows the evaluation of both the forecast quality and the operational value. In addition, the decision model includes heterogeneous behavioral factors, specifically diverse levels of farmers' risk aversion (or degree of trust) with respect to forecast uncertainty, which allow exploring the sensitivity of the overall assessment of W&C services with respect to the variability of stakeholders' behaviors.
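The role of risk attitudes in the crop choice can be illustrated with a deliberately simple sketch. The text above does not specify how risk aversion enters the decision model, so the criteria used here (worst case for a risk-averse farmer, mean for a risk-neutral one, best case for a risk-prone one) are an assumption made purely for illustration, and the profit scenarios are invented numbers rather than model output.

```python
from statistics import mean

# Assumed decision criteria, one per risk attitude (not the paper's formulation).
CRITERIA = {
    "averse": min,    # judge each crop by its worst predicted profit
    "neutral": mean,  # judge each crop by its expected profit
    "prone": max,     # judge each crop by its best predicted profit
}

def choose_crop(profit_scenarios, attitude):
    """Pick the crop whose predicted profits score highest under the attitude."""
    score = CRITERIA[attitude]
    return max(profit_scenarios, key=lambda crop: score(profit_scenarios[crop]))

# Hypothetical predicted net profits for three crops under three scenarios.
profits = {
    "maize": [85, 120, 125],
    "tomato": [20, 100, 200],
    "rice": [95, 98, 101],
}

for attitude in CRITERIA:
    print(attitude, "->", choose_crop(profits, attitude))
```

With these numbers the same forecast leads to three different choices (rice, maize, and tomato respectively), which is the kind of behavioural sensitivity the framework is designed to explore.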
We demonstrate the potential of our approach by developing an application in the Muzza agricultural district, in northern Italy. The district is organized in 39 irrigation units, each including a number of farms receiving a continuous water supply through an extensive irrigation network. A set of state-of-the-art long-range climate forecast products is collected from the European Centre for Medium-Range Weather Forecasts (ECMWF), the National Centers for Environmental Prediction (NCEP), and the Canadian Seasonal to Inter-annual Prediction System (CanSIPS). The forecast horizon ranges from 7 months to 10 years.
Post-processing (i.e., downscaling and bias-correction) is then used to address the mismatch of temporal and spatial resolution between the simulation models and the raw forecast products, as well as to resolve the systematic bias and uncertainty in the ensemble forecasts. Finally, by simulating the combined process-based and decision models over the period 2001-2005, with 2003 and 2005 being extreme drought years, we perform the proposed three-stage assessment of forecast quality and operational value of W&C services. First, we assess the traditional forecast quality by comparing forecast meteorological variables against observed data. Then, we measure, via model simulations, the prediction accuracy of crop yield as an intermediate assessment of decision-relevant information for supporting farmers in improving their practices. Finally, we quantify the operational value in terms of payoff (or opportunity cost) of using W&C services for informing the selection of the cropping pattern. This value is contrasted with the upper bound of the system performance obtained using 'perfect forecasts' as well as with a baseline situation where farmers use a few simple empirical forecast models, including climatology or past observations. In addition, our decision models allow exploring alternative uses of W&C services, which depend on the personal behavioral attitude of the farmers and on their level of trust in the forecast products. In particular, we explore three different levels of farmers' risk aversion, namely risk averse, risk neutral, or risk prone, which create a spectrum of possible behavioral attitudes (e.g., Mosley and Verschoor, 2005).
The paper is organized as follows: in the next section we describe the study area, while section 3 provides details about the methodology, including the data preparation and the modeling framework. Results and discussion are then reported in sections 5-6. Finally, conclusions and directions for future research are presented in section 7.

Study site
In this work, the assessment of W&C services is conducted on the CHNS of the Muzza irrigation district, located southeast of the city of Milan (see Figure 1). The selected district is one of the largest agricultural areas in the region, with an arable land of around 700 km². Maize (ca. 74% of the surface) and temporary grasslands (ca. 20% of the surface) are currently the major cultivated crops, with minor crops including rice, soybean, wheat, tomato, and barley. Irrigation is provided through an extensive irrigation network (more than 4,000 km in total length) served by the Adda River and feeding 39 irrigation units, which are organized in 1722 comizi and include around 12,000 farms. These critical events are predicted to become more and more frequent over the next years (Lehner et al., 2006), representing a major challenge for the sustainability of the agricultural practices in this region.
In this context, the use of W&C services offers a promising option for supporting agricultural activities, as the improved forecast skill over medium to long lead times provides valuable information about the future agricultural season prior to the sowing date. Such information is key for better informing cropping pattern decisions, i.e., for selecting the best crops with respect to the farmers' objectives (e.g., the crop characterized by the highest expected profit). Moreover, most W&C services are freely available online and thus represent a cost-effective solution to improve the resilience of agricultural systems without introducing infrastructural changes, such as modifying or expanding the irrigation canal network.

Methodology
The overall workflow of our assessment framework is composed of three main steps, as detailed in Figure 2: (i) forecast quality assessment of post-processed W&C services using retrospective forecast (i.e., hindcast) products; (ii) extension of the forecast quality analysis via model-based prediction of decision-relevant variables, namely the crop production at the end of the agricultural season; (iii) evaluation of cropping pattern decisions in terms of farmers' payoff at the end of the agricultural season as simulated by the integrated CHNS model. In this last step, different levels of risk aversion can be simulated to explore the sensitivity of the overall assessment with respect to farmers' behavioral attributes.
The first step of the framework (upper block in Figure 2) starts with the post-processing of the hindcast data (see section as high precipitation frequency and low precipitation intensity (Ines and Hansen, 2006). The bias-corrected dataset is further downscaled using a stochastic weather generator in order to resolve the mismatch in spatial and temporal scales between hindcast data and model inputs. In particular, the generator performs not only the spatial downscaling but also the temporal disaggregation needed to obtain daily precipitation and temperature forecasts from products that have a monthly time resolution (see Table 1). The comparison of the post-processed precipitation and temperature forecast products with the on-site historical observations provides a first estimate of the forecast quality.
The post-processed hindcast dataset is then fed into the process-based component of our integrated CHNS model (middle block in Figure 2). This includes a spatially distributed process-based representation of the Muzza irrigation district (see section 3.2), which extends the assessment of the forecast quality by looking at the difference between the forecasted crop yield and the one simulated using observed time series of precipitation and temperature, assuming that the expected crop yield represents the main determinant of farmers' cropping pattern decisions, as in other similar applications (Hansen, 2004; Baigorria et al., 2008).

The human component of the CHNS (bottom block in Figure 2) is finally introduced in the form of an agent-based decision model (see section 3.3), which allows simulating farmers' cropping pattern decisions driven by different forecast information.
This decision model couples the simulation of the process-based model and the prediction of crops' profitability with the selection, by each farmer-agent, of the best cropping pattern as the one characterized by the highest profitability. The agent-based model allows testing different behavioral criteria, capturing alternative levels of farmers' risk aversion (or degree of trust) with respect to the forecast uncertainty. In particular, we consider a spectrum of behaviors ranging from a fully optimistic farmer, who makes decisions on the basis of the best possible situation, to an extremely pessimistic farmer, who, instead, looks at the worst-case performance. Then, given the cropping pattern selected by each farmer-agent, the model is simulated using the observed values of precipitation and temperature to obtain the production and the associated profit at the end of the agricultural season. The estimated agents' profit is compared with the one obtained under the hypothesis of perfect foresight, which represents the ideal upper bound of the system performance. The operational value of W&C services is finally estimated as the percentage of agents making optimal decisions using the forecast products, which represents the opportunity cost of using W&C services with respect to having perfect foresight. The results are then validated against the profit obtained by the agents when informed with simple empirical forecasts.
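The final comparison described above can be illustrated with a minimal sketch. It assumes that each agent's choice is stored as a crop label and that the perfect-foresight choice is available for the same agents; the function name and the example labels are hypothetical and not taken from the authors' model.

```python
def fraction_optimal(chosen_crops, perfect_foresight_crops):
    """Share of agents whose forecast-based crop equals the perfect-foresight crop.

    Both arguments are sequences with one crop label per agent (hypothetical layout).
    """
    if len(chosen_crops) != len(perfect_foresight_crops):
        raise ValueError("one decision per agent is required in both sequences")
    matches = sum(c == o for c, o in zip(chosen_crops, perfect_foresight_crops))
    return matches / len(perfect_foresight_crops)


# Illustrative case: 39 agents, one of which deviates from the optimal crop.
optimal = ["maize"] * 39
chosen = ["maize"] * 38 + ["rice"]
share = fraction_optimal(chosen, optimal)
```

The complement of this share is the fraction of agents selecting a sub-optimal cropping pattern.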
Details about each step of the proposed framework, corresponding to a different block in Figure 2, are reported in the next sections.

Post-processing of forecast products
The first step of the proposed procedure (upper box in Figure 2) aims at post-processing the forecast products. Depending on the characteristics of the forecast, we perform bias correction by means of the change factor approach (Crochemore et al., 2016) or the quantile mapping technique (Déqué, 2007). Given the strong intra-annual seasonal cycle of our study site, the bias correction was applied on a monthly basis. This also means that the corrections are differentiated according to the lead time (Doblas-Reyes et al., 2013), as the forecasts are considered for the same starting month (i.e., April, when the agricultural season starts).
The quantile-based mapping technique is a bias correction method that builds a transfer function by mapping the cumulative distribution function (CDF) of the climate model outputs onto that of the site-based observations. The calibrated transfer function is later used to derive corrected estimates from new incoming outputs, resolving the mismatch between the observed site measurements and the simulated climate outputs. The quantile-based mapping is applied to forecast products providing daily trajectories of precipitation and temperature, which allow a proper estimation of the corresponding CDFs. This step becomes questionable in the case of monthly hindcasts due to the limited size of the dataset. In this case, we apply the change factor approach, in which a multiplicative factor is used to scale the value of precipitation, while an additive factor is used to adjust the temperature for each month.
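The two bias-correction options described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the empirical-quantile transfer function, the number of quantiles, and the definition of the change factors as ratios or differences between observed and hindcast monthly climatologies are all assumptions, and every function name is hypothetical.

```python
import numpy as np


def fit_quantile_mapping(model_hist, obs_hist, n_quantiles=99):
    """Pair the empirical quantiles of the hindcast with those of the observations."""
    probs = np.linspace(0.01, 0.99, n_quantiles)
    return np.quantile(model_hist, probs), np.quantile(obs_hist, probs)


def apply_quantile_mapping(values, model_q, obs_q):
    """Map new model values onto the observed distribution.

    np.interp clamps values outside the calibrated range to the end quantiles;
    heavily tied data (e.g. many dry days) would need a dedicated treatment.
    """
    return np.interp(values, model_q, obs_q)


def change_factor_precip(forecast, model_clim, obs_clim):
    """Multiplicative correction for monthly precipitation (assumed form)."""
    return forecast * (obs_clim / model_clim)


def change_factor_temp(forecast, model_clim, obs_clim):
    """Additive correction for monthly temperature (assumed form)."""
    return forecast + (obs_clim - model_clim)


# Illustrative data: a hindcast that is uniformly 2 degrees too warm.
obs = np.linspace(10.0, 30.0, 101)
model = obs + 2.0
model_q, obs_q = fit_quantile_mapping(model, obs)
corrected = apply_quantile_mapping(np.array([22.0]), model_q, obs_q)
```

In this toy case the corrected value is close to 20, i.e. the constant warm bias is removed; in practice the calibration would be repeated for each month, consistently with the monthly bias correction described in the text.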
Although the systematic bias in the hindcast dataset can be partially removed by bias correction, dealing with the uncertainty of ensemble forecasts remains a challenge. Previous studies (e.g., Tippett et al., 2007) have suggested a probabilistic use of long-range weather forecasts by deriving statistical signatures from the ensemble forecasts, such as the mean or the anomaly values. This statistic is then compared with the climatology to indicate whether the incoming year is expected to be normal, wet, or dry. As a consequence, the information on the intra-annual variability of the climate, which is critical for crop growth and agricultural management, is not preserved. In this work, instead, the multi-ensemble data is assimilated into a stochastic weather generator, whose parameters are calibrated from observations and then perturbed based on the forecast conditions. This allows us to generate synthetic time series of precipitation and temperature that maintain the information estimated by the ensemble forecast. In addition, the stochastic weather generator can also disaggregate monthly forecasts into daily values, which are needed to run the process-based model in the next step, without losing the generality of the statistical behavior of the variables. The LARS-WG model (Semenov and Barrow, 1997) is selected for this task as it has been reported to outperform many other weather generators (Hashmi et al., 2011).

BIAS CORRECTION
The perturbation factors of the mean daily precipitation intensity (F P,pert These are determined according to eq. (1), where m is the number of days in the i-th month and 1(·) is the binary operator that returns 1 if the daily precipitation intensity P i,j is larger than 1 mm (wet), and 0 otherwise (dry) (Ceballos et al., 2004).
In particular, eqs. (1a)-(1b) represent the expected change of precipitation frequency with respect to the average historical observations (P h ) measured by the local stations located in the study area, while eq. (1c) specifies the change of precipitation intensity conditioned on the rainy days. For precipitation, the computed perturbation factor is used to scale the original parameter values up or down. The change of temperature is formulated in eq. (1d) as an additive term. Specifically, we estimated a monthly perturbation factor for each ensemble member by aggregating the daily hindcast data, and then we considered the average factor across the ensemble members. The perturbation parameters are then used to generate synthetic daily time series for one year according to the considered forecast information.
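The logic of the perturbation factors can be illustrated with a short sketch. Since eq. (1) is not legible in this section, the forms below are assumptions: the frequency and intensity factors are taken as ratios between forecast and historical values, the temperature term as a difference, and a day is counted as wet when precipitation exceeds 1 mm as stated in the text. All names are hypothetical.

```python
import numpy as np

WET_THRESHOLD_MM = 1.0  # wet-day threshold stated in the text


def wet_day_frequency(daily_precip):
    """Fraction of days in the month with precipitation above the threshold."""
    p = np.asarray(daily_precip, dtype=float)
    return float(np.mean(p > WET_THRESHOLD_MM))


def mean_wet_day_intensity(daily_precip):
    """Mean precipitation over wet days only (0 if the month has no wet day)."""
    p = np.asarray(daily_precip, dtype=float)
    wet = p[p > WET_THRESHOLD_MM]
    return float(wet.mean()) if wet.size else 0.0


def precipitation_factors(forecast_daily, historical_daily):
    """Assumed multiplicative factors for wet-day frequency and intensity."""
    freq = wet_day_frequency(forecast_daily) / wet_day_frequency(historical_daily)
    inten = mean_wet_day_intensity(forecast_daily) / mean_wet_day_intensity(historical_daily)
    return freq, inten


def temperature_shift(forecast_daily, historical_daily):
    """Assumed additive perturbation: difference of monthly mean temperatures."""
    return float(np.mean(forecast_daily) - np.mean(historical_daily))


def ensemble_average(member_factors):
    """Average a monthly factor across ensemble members, as described in the text."""
    return float(np.mean(member_factors))


# Toy four-day "month": the forecast has more and heavier wet days than history.
freq_factor, intensity_factor = precipitation_factors([0, 2, 4, 0], [0, 2, 0, 0])
```

For this toy input the frequency factor is 2.0 and the intensity factor is 1.5, which would scale the corresponding weather generator parameters upward.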

The process-based agricultural model
The second step of our procedure (middle block in Figure 2) aims at estimating the expected crop yield at the end of the agricultural season, which is assumed to be decision-relevant information for the considered farmer-agents. For this purpose, we rely on a spatially distributed process-based model of the Muzza irrigation district, which is composed of three interlaced modules: i) a distributed-parameter water balance module that simulates water resources, conveyance, distribution, and soil-crop water balance (Facchi et al., 2004; Gandolfi et al., 2006); ii) a heat units module that simulates the sequence of growth stages as a function of the temperature (Neitsch et al., 2011); iii) a crop growth module that estimates the optimal and actual yields, accounting for the effects of water stresses due to insufficient water supply that may have occurred during the agricultural season.
The water balance module partitions the irrigation district with a regular mesh of cells with a side length of 250 m, which was selected to properly reproduce the spatial distributions of all the modeled processes, especially in terms of water balance (Vassena et al., 2012). Each individual cell identifies a soil volume which extends from the soil surface to the lower limit of the root zone. This soil volume is subdivided into two layers, modeled as two non-linear reservoirs in cascade: the upper one (evaporative layer) represents the upper 15 cm of the soil; the bottom one (transpirative layer) represents the root zone and has a time-varying depth. The water percolating out of the bottom layer constitutes the recharge to the groundwater system.
The heat units module defines the relationships between the temperature and some variables and parameters related to the crop growth stage (e.g., root length, basal coefficient, leaf area index), which also influences the water balance module.
According to the heat units theory (Neitsch et al., 2011), the crop growth stage at time t in the i-th cell is defined as a function of the cumulated heat units $HU_t^{(i)}$. A range is defined for each crop: the minimum is the base temperature $T_b$, which defines the day of sowing (i.e., when $HU_t^{(i)} > T_b$), and the maximum is the cut-off temperature over which the heat units are no longer cumulated.
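A minimal sketch of heat unit accumulation is given below. It follows the common formulation in which each day contributes the excess of the mean temperature over the base temperature; treating the cut-off as a cap on the daily temperature is an assumption, since the exact expression is not legible in this section, and the function name is hypothetical.

```python
import numpy as np


def cumulated_heat_units(t_mean_daily, t_base, t_cutoff):
    """Cumulated heat units from daily mean temperatures (degrees Celsius).

    Assumptions: days colder than t_base contribute nothing, and temperatures
    above t_cutoff are capped so that they add no further heat units.
    """
    t = np.minimum(np.asarray(t_mean_daily, dtype=float), t_cutoff)
    daily_units = np.maximum(t - t_base, 0.0)
    return np.cumsum(daily_units)


# Toy series: one cold day, two mild days, one day above the cut-off.
hu = cumulated_heat_units([5.0, 12.0, 20.0, 35.0], t_base=8.0, t_cutoff=30.0)
```

With a base temperature of 8 and a cut-off of 30, the daily contributions are 0, 4, 12 and 22, giving a cumulated series of 0, 4, 16 and 38.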
Finally, the crop growth module first estimates the maximum yield achievable in optimal conditions and then reduces it to take into account the stresses due to insufficient water supply from precipitation and irrigation during the agricultural season. The yield response to water stresses is estimated according to the empirical function proposed in the AquaCrop model (Raes et al., 2009) and based on the approach proposed by FAO (Doorenbos et al., 1979):
$$1 - \frac{Y^{(i)}_{real}}{Y^{(i)}_{opt}} = k_y \left( 1 - \frac{Tr^{(i)}_{real,tot}}{Tr^{(i)}_{0,tot}} \right) \qquad (2)$$
where $Y^{(i)}_{real}$ and $Y^{(i)}_{opt}$ are the actual and optimal yield in the i-th cell, $Tr^{(i)}_{real,tot}$ and $Tr^{(i)}_{0,tot}$ are the actual and optimal transpiration in the i-th cell during the whole growth period, and $k_y$ is a crop-specific coefficient relating yield decline and water stress.
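The yield response relation, in which the relative yield loss is proportional to the relative transpiration deficit through $k_y$, can be sketched as a single function. The clipping of the transpiration ratio to 1 and of the yield to zero is an assumption added for robustness, and the numerical values in the example are purely illustrative.

```python
def actual_yield(y_opt, tr_actual, tr_optimal, k_y):
    """Actual yield from the FAO-type yield response to water stress.

    y_opt: optimal yield; tr_actual, tr_optimal: seasonal transpiration totals;
    k_y: crop-specific yield response coefficient.
    """
    if tr_optimal <= 0:
        raise ValueError("optimal transpiration must be positive")
    ratio = min(tr_actual / tr_optimal, 1.0)  # no bonus above optimal supply (assumption)
    relative_loss = k_y * (1.0 - ratio)
    return max(y_opt * (1.0 - relative_loss), 0.0)  # yield cannot be negative (assumption)


# Illustrative numbers: a 25% transpiration deficit with k_y = 1.25.
y = actual_yield(y_opt=10000.0, tr_actual=300.0, tr_optimal=400.0, k_y=1.25)
```

Here the 25% transpiration deficit translates into a 31.25% yield loss, so the actual yield is 6875 for an optimal yield of 10000.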

The agent-based decision models
In the last step of our procedure (bottom block in Figure 2), the process-based model described in the previous section is combined with an agent-based model representing the decisions made by the farmers in the 39 irrigation units of the Muzza irrigation district. In particular, each irrigation unit is modeled as a single agent and the decision of each agent is limited to a single crop in each agricultural season. The possible crop choices include tomato, corn, soybean and rice, which represent the most common crops in the study area. The crop growing period varies slightly from one crop to the other, with maize being the crop with the longest growing period (see Figure 3). Note that the modeled agents do not represent individual farmers in the system, but rather a group of farmers located in one of the 39 irrigation units. This hypothesis is tantamount to describing the median behavior of the ensemble of farmers aggregated at the irrigation unit level under the assumption of rational behaviors, and provides a simple and effective way to capture the inter-annual dynamics of land use at the district scale.
The agent's decision problem is hence formalized as Problem (3), where P(·) is the net profit obtained at the end of the agricultural season from the yield Y real (γ k ) of crop γ k (estimated from eq. 2), p(γ k ) and c(γ k ) are the corresponding price and cost, respectively, and σ(A k ) are the subsidies for the k-th agent (with k = 1, . . . , N ). The subsidies, which depend on the cultivated area A k and not on the selected type of crop (Gandolfi et al., 2014), derive from the EU's Common Agricultural Policy (CAP), which complements a system of direct payments to farmers with measures to help rural areas in facing a wide range of economic, environmental, and social challenges (Britz et al., 2003).
In Problem (3), the optimal cropping pattern decision γ * k is conditioned on the forecast information ε, with the statistic Ψ filtering the uncertainty in the forecast products and capturing the personal risk aversion of each farmer-agent (Giuliani and Castelletti, 2016). In fact, depending on its personal behavioral attitude and on its level of trust in the forecast products, an agent can use the forecast information in different ways, particularly when this is provided in the form of prediction ensembles.
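One compact way to write Problem (3), assembled only from the symbol definitions given in the text, is sketched below. The exact form is an assumption: in particular, whether the yield term is scaled by the cultivated area $A_k$ and how price and cost are normalized cannot be confirmed from this section.

```latex
\gamma_k^{*} = \arg\max_{\gamma_k} \; \Psi_{\varepsilon}\!\left[ P(\gamma_k, \varepsilon) \right],
\qquad
P(\gamma_k, \varepsilon) = p(\gamma_k)\, Y_{real}(\gamma_k) - c(\gamma_k) + \sigma(A_k),
\qquad k = 1, \dots, N
```

Under this reading, the subsidy term $\sigma(A_k)$ does not depend on the crop and therefore shifts the profit of every option equally without affecting which crop is selected.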
In this work, we explore three different levels of farmers' risk aversion creating a spectrum of behavioral attitudes, namely risk averse, risk neutral, or risk prone (e.g., Rogers, 1975; Mosley and Verschoor, 2005; Koundouri et al., 2006; Djanibekov and Villamor, 2017). A risk averse, pessimistic behavior (or a low level of trust in the forecast products) implies that agents decide on the basis of the worst-case realization, which means they will select the cropping patterns able to ensure the highest profit in the most adverse conditions. Yet, these decisions may prove overly conservative if the actual realization is different from the worst possible one. Conversely, a risk prone, optimistic behavior produces decisions that rely on the best possible situation.
This choice, however, increases the risk of cultivating crops that are highly productive under favorable weather conditions, but might also be highly vulnerable in more adverse seasons. Finally, risk neutral agents with a sufficient level of trust in the forecast products ground their decisions on the expected profitability of the crops using the probability of realizations derived from the forecast information.
These alternative behaviors are formalized by means of the following three statistics Ψ, which are used in eq. (3) to filter the uncertainty in the forecast products:
- Risk averse behaviors are modeled using the minimax regret metric (Savage, 1951), where decisions are based on the regret, defined as the difference between the performance of the best alternative, given that the predicted $\varepsilon_j$ is the true realization of precipitation and temperature, and the performance of a given cropping pattern $\gamma$ under the same weather conditions $\varepsilon_j$, i.e.
$$r(\gamma, \varepsilon_j) = \max_{\gamma} P(\gamma, \varepsilon_j) - P(\gamma, \varepsilon_j) \qquad (4)$$
Then, this metric selects the best cropping pattern $\gamma^*$ adopting a pessimistic approach, namely by minimizing the maximum regret across all the members of the forecast ensemble $\varepsilon \in \Xi$, i.e.
$$\gamma^* = \arg\min_{\gamma} \max_{\varepsilon \in \Xi} r(\gamma, \varepsilon) \qquad (5)$$

- Risk neutral behaviors are modeled using the principle of insufficient reason (Laplace, 1951), where decisions are made by assigning equal probability to each forecast ensemble member. Then, the best cropping pattern $\gamma^*$ is selected as the one associated with the maximum expected performance, i.e.
$$\gamma^* = \arg\max_{\gamma} \frac{1}{n} \sum_{j=1}^{n} P(\gamma, \varepsilon_j) \qquad (6)$$
where n is the number of members in the ensemble.

- Risk prone behaviors are modeled using the maximax metric (French, 1988), where decisions are made by looking at the best possible performance of each decision and selecting the cropping pattern $\gamma^*$ such that
$$\gamma^* = \arg\max_{\gamma} \max_{\varepsilon \in \Xi} P(\gamma, \varepsilon) \qquad (7)$$
This metric is generally associated with an optimistic point of view as it assumes that the best state of the world will be realized.
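The three decision criteria can be compared on a small profit matrix. The sketch below assumes that the simulated profit of every crop under every ensemble member is already available as a 2-D array; the array layout, the function names and the numbers are illustrative assumptions, not values from the study.

```python
import numpy as np


def minimax_regret(profit):
    """Risk averse: minimize the maximum regret across ensemble members.

    profit[i, j] is the profit of crop i under ensemble member j.
    """
    regret = profit.max(axis=0, keepdims=True) - profit
    return int(np.argmin(regret.max(axis=1)))


def insufficient_reason(profit):
    """Risk neutral: maximize the mean profit, all members being equally likely."""
    return int(np.argmax(profit.mean(axis=1)))


def maximax(profit):
    """Risk prone: maximize the best-case profit."""
    return int(np.argmax(profit.max(axis=1)))


# Rows: three hypothetical crops; columns: three hypothetical ensemble members.
profit = np.array([
    [1000.0,   0.0,   0.0],  # crop 0: very profitable in one member only
    [ 600.0, 500.0, 100.0],  # crop 1: intermediate
    [ 500.0, 450.0, 300.0],  # crop 2: stable across members
])
choices = (minimax_regret(profit), insufficient_reason(profit), maximax(profit))
```

In this toy example the three criteria select three different crops (crop 1, crop 2 and crop 0, respectively), which illustrates how the same ensemble can lead to different decisions depending on the agent's risk attitude.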

Experiment settings
Hindcasts of precipitation and surface temperature are collected from the ECMWF ENSEMBLE project, NCEP, and the Canadian Centre for Climate Modelling and Analysis (CCCma). Table 1 reports some general information about the considered forecast products.

The ECMWF hindcast consists of a comprehensive set of seasonal, annual, and decadal products. The Climate Forecast System version 2 (CFS v2) from NCEP is similar to the ECMWF products and is generated using fully coupled models representing the interactions between the Earth's atmosphere, oceans, land and sea-ice (Saha et al., 2014). The Canadian Seasonal to Interannual Prediction System (CanSIPS) is a long-range multi-model prediction system whose objective is to forecast the evolution of global climate conditions. There are two versions of coupled climate models inside the CanSIPS system, namely the CamCM3 model (Arora et al., 2011) and the CamCM4 model (Scinocca et al., 2008). To account for the impact of uncertainties in the initial conditions, most models run a number of simulations with slightly different atmospheric and oceanic initial states to generate ensemble outputs.
In addition to the institutional forecast products, we also include in the analysis three simple empirical models representing farmers' prior knowledge based on past observations. Specifically, EmpPast refers to the empirical forecast obtained by duplicating the past year's observations. Emp2Ave stands for the simple forecast averaging the past two years' observations, which is analogous to a climatology forecast with a 2-year memory and reflects farmers' best possible capacity.
Lastly, the EmpClima is simply the climatology forecast over past observations.
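The three empirical baselines can be written in a few lines. The sketch assumes that past observations are stored as an array with one row per year and one column per time step of the season, ordered from oldest to most recent; this layout and the function names are assumptions made for illustration.

```python
import numpy as np


def emp_past(history):
    """EmpPast: repeat the most recent year's observations."""
    return history[-1]


def emp_2ave(history):
    """Emp2Ave: average the observations of the two most recent years."""
    return history[-2:].mean(axis=0)


def emp_clima(history):
    """EmpClima: climatology, i.e. the average over all past observations."""
    return history.mean(axis=0)


# Toy record: three past years, two time steps each (oldest year first).
history = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 9.0]])
forecasts = {
    "EmpPast": emp_past(history),
    "Emp2Ave": emp_2ave(history),
    "EmpClima": emp_clima(history),
}
```

Each function returns a forecast with the same time resolution as the observations, so the baselines can be fed to the process-based model in the same way as the post-processed institutional products.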

Numerical results
The first step of our framework (Figure 2) aims at evaluating the forecast quality in terms of the difference between the post-processed forecast variables and the observed ones. Figure 6 shows the post-processed forecast of precipitation against the observed one during the crop growing season across different forecast products. The empirical memory-based forecast (blue bars in Figure 6) assumes that the weather in the incoming year is similar to previous conditions. This mechanism leads to significant forecast errors, such as in 2002, which was predicted as a normal year but was wet, or in 2003, which was predicted as wet but was extremely dry. The climatology-based forecast assumes the realization of average conditions determined from historical observations. This strategy works in normal years, such as 2001 or 2004, while extreme weather conditions tend to be filtered out during the averaging and year-to-year variations are less significant. Among the institutional forecast products, the CFS product seems to work well in normal years while being less accurate in wet and dry years. In particular, it is not able to capture the variation from high to low precipitation in 2002-2003. Similar results can be observed for the CanSIPS products, with CamCM4 generally underestimating the precipitation more than CamCM3. Estimating the total precipitation for wet/dry years is also challenging for the ECMWF products, which involve multiple forecast systems at various lead times. Nevertheless, there are some exceptions, such as IFS/HOPE model from annual forecast products, which seems to be weather data for removing possible model biases and focus only on the forecast errors) crop productions is reported in Figure 8, with the mean absolute error (MAE) of each forecast product over the five years reported between brackets in the legend of each plot.
In general, the fluctuations of the production follow the fluctuations of the climate variables, especially the precipitation, with the highest productions in 2002 and the lowest in 2003. For most institutional forecast products, the predicted crop productivity in wet/dry years is significantly different from the one obtained with the observed climate. In many cases, the products tend to overestimate crop yield in dry years and to underestimate it in favorable wet years. One exception is again represented by the IFS/HOPE model, which is able to provide a quite accurate forecast of crop yield (i.e., average MAE across the four crops equal to 17.1 kg/ha). These results suggest that, as expected, forecasting crop yields is a more complex task than forecasting precipitation and temperature. This is further confirmed by the poor performance attained by the empirical products when their forecast quality is evaluated in terms of crop yield (i.e., average MAE across the four crops and the three empirical forecast products equal to 24.4 kg/ha). Especially for water-demanding, profitable crops like tomato, erroneously forecasting a wet year causes an over-optimistic expectation, which differs significantly from the actual outcome. Similarly, some products (e.g., the decadal forecast from the ECMWF ECHAM5/MPIOM models) may forecast a wet year which instead turns out to be dry, such as 2005, and produce a large overestimation of crops' productivity. Finally, these results also show the emergence of some differences between the accuracy of precipitation and temperature forecasts and that of the corresponding prediction of crop yield. A clear example is 2001, for which CFS v2 exhibits a significantly higher accuracy in predicting the precipitation than the IFS/HOPE model of the ECMWF annual product (see Figure 6).
Yet, this superiority does not imply a better forecast of the crops' production, and the two products indeed have similar levels of accuracy across all four simulated crops (see Figure 8).
Looking at the accuracy of the predicted precipitation and temperature as well as the predicted crop yields provides a measure of the forecast quality without exploring the potential benefit of using W&C services to inform the farmers' decisions.
The quantification of the operational value of W&C services is performed in the third step of our framework (Figure 2), where we use our agent-based model to simulate farmers' decision-making process and estimate the profit obtained from the cultivation of the selected crops. This is contrasted with the profit obtained under the assumption of perfect foresight to estimate the opportunity cost of using W&C services. It is worth pointing out that, although perfect forecast accuracy can hardly be achieved, farmers' decisions under forecast information may coincide with those selected under perfect foresight. Figure 9 illustrates the relationship between the performance of agents' decisions (x-axis), measured in terms of the fraction of farmers making optimal decisions (i.e., selecting the same cropping pattern as in the perfect foresight case) and attaining an opportunity cost equal to zero, and the associated forecast quality (y-axis), evaluated in terms of mean absolute error (MAE) of the selected crops. The scatterplot is divided into four zones, where the bottom right corner indicates that a good prediction skill leads to better decision outcomes, while the upper left corner corresponds to the situation where forecast errors induce a large opportunity cost. Both the empirical and the institutional forecast products are spread along the y-axis, confirming the variability of forecast quality in predicting crops' productivity. Numerical results show that most of the points characterized by a good forecast quality, defined as MAE below 1000 kg/ha, correspond to institutional products. These high-quality forecast products provide valuable information to support agents' decisions, as demonstrated by the fact that all the points below the 1000 kg/ha line successfully inform a large fraction of agents (i.e., 90%-100%), who are able to make optimal decisions. However, many empirical products are also able to achieve zero opportunity cost, even though their forecast quality is generally worse than that of the institutional forecasts.
This can be explained by considering that agents decide by looking at the ranking of crops' profitability rather than at their absolute expected profitability. As a consequence, an overall under/overestimation of the profitability of all the crops (e.g., the profit of each crop is predicted to be 10% lower than reality) results in a poor forecast quality but, at the same time, this forecast error does not generate a rank reversal and the agents select the optimal cropping pattern anyway.
Figure 9. Scatterplot of forecast quality of predicted crop productivity (y-axis) and farmers' crop decisions performance (x-axis) under different forecast products.
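The rank-preservation argument can be checked with a toy calculation. The profit values below are invented for illustration only: a uniform 10% underestimation yields a larger mean absolute error than a small, crop-dependent error, yet only the latter changes the crop that ranks first.

```python
import numpy as np


def mean_absolute_error(predicted, actual):
    """Mean absolute error between predicted and actual profits."""
    return float(np.mean(np.abs(np.asarray(predicted) - np.asarray(actual))))


# Hypothetical true profits of three crops.
true_profit = np.array([1000.0, 950.0, 700.0])

# Forecast A: every crop is underestimated by 10% (large but consistent error).
uniform_bias = 0.9 * true_profit

# Forecast B: smaller errors of opposite sign on the two best crops.
variable_error = np.array([960.0, 990.0, 700.0])

mae_a = mean_absolute_error(uniform_bias, true_profit)
mae_b = mean_absolute_error(variable_error, true_profit)
best_true = int(np.argmax(true_profit))
best_a = int(np.argmax(uniform_bias))
best_b = int(np.argmax(variable_error))
```

Forecast A has the higher error but still ranks the truly best crop first, whereas forecast B has the lower error and reverses the ranking of the two most profitable crops.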

Impacts of farmers' behavioral attitudes
The results presented in the previous section are obtained assuming risk-neutral agents, where the most profitable cropping pattern is selected by the modeled agents on the basis of the crops' profitability predicted by the agricultural model when simulated under a single synthetic time series of post-processed precipitation and temperature (see Figure 2). Yet, in a more realistic setting, farmers are exposed to uncertain forecasts and, moreover, their behavioral factors may influence the use of W&C services. In this section, we explore how different levels of risk aversion impact the agents' decisions and, consequently, the estimated operational value of W&C services. In particular, we focus on the ECMWF annual product, which attained both high forecast quality and high operational value, and we generate 100 synthetic time series of precipitation and temperature crops' profitability at the end of the agricultural season. This uncertainty is then filtered by agents through a proper statistic capturing their personal risk aversion, including risk neutral, risk prone, and risk averse behaviors (see section 3.3).
The results obtained adopting these different levels of risk aversion are reported in Figure 10, where the left y-axis shows the distributions of the forecast quality for the three considered models, while the right y-axis shows the fraction of farmer-agents making optimal decisions. The figure shows that, although we are considering a single forecast product, the forecast quality varies according to the model used for producing the forecast, with IFS/HOPE characterized by the lowest MAE, both in terms of median and variance, and outperforming both ECHAM5/MPIOM and HadGEM2-AO. Interestingly, these differences in terms of forecast quality are not linearly transferred to the performance of agents' decisions. Our results show that the level of agents' risk aversion significantly impacts their use of forecast products. Risk averse behaviors (i.e., agents deciding on the basis of minimax regret, represented by the solid red line) attain a performance that decreases when moving from high to low quality forecasts. However, this does not hold for risk neutral or risk prone behaviors, simulated as agents deciding according to the principle of insufficient reason (red dashed line) and the maximax metric (red dotted line), respectively. In both cases, the highest fraction of agents making optimal decisions is obtained by using the ECHAM5/MPIOM forecast, although this product has a lower quality than IFS/HOPE. This unexpected finding can be explained by the fact that forecast accuracy metrics quantify the error in predicting the agricultural production, while the operational value estimated through the decision model relies on the ranking of the available options (i.e., cropping patterns). Sub-optimal decisions are made when the forecasted productivity of the crops produces a ranking different from the one resulting at the end of the agricultural season.
However, such rank reversals are not linearly related to the forecast accuracy: large but consistent errors (e.g., systematic over/underestimation) for all the crops may produce the same ranking and result in optimal decisions, while smaller and variable errors can produce sub-optimal decisions.
For example, the values of forecast accuracy reported in Figure 8 show that in 2001 ECHAM5/MPIOM (which in Figure 10 attains the highest decision performance) systematically overestimates the productivity of all the crops, while IFS/HOPE underestimates the productivity of tomato and overestimates that of rice, potentially reversing the ranking of these crops and producing sub-optimal decisions.
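The distinction between forecast error and ranking can be illustrated with a minimal sketch. All crop productivities below are hypothetical numbers chosen for illustration, not values from the study, and the helper names `mae` and `ranking` are assumptions introduced here. Forecast A has a large systematic overestimation and forecast B has smaller errors of opposite sign.

```python
# Hypothetical end-of-season productivity per crop (arbitrary units, illustrative only).
observed = {"tomato": 10.0, "rice": 9.5, "maize": 8.0}

# Forecast A: large but systematic overestimation (+3 for every crop).
forecast_a = {crop: value + 3.0 for crop, value in observed.items()}

# Forecast B: smaller errors of opposite sign (tomato underestimated, rice overestimated).
forecast_b = {"tomato": 9.5, "rice": 10.0, "maize": 8.0}


def mae(forecast, truth):
    """Mean absolute error over the crops."""
    return sum(abs(forecast[c] - truth[c]) for c in truth) / len(truth)


def ranking(values):
    """Crops ordered from the most to the least productive."""
    return sorted(values, key=values.get, reverse=True)


for name, forecast in (("A", forecast_a), ("B", forecast_b)):
    same = ranking(forecast) == ranking(observed)
    print(f"forecast {name}: MAE={mae(forecast, observed):.2f}, "
          f"ranking preserved={same}, preferred crop={ranking(forecast)[0]}")
```

In this toy case forecast A has the larger MAE but preserves the observed ranking, whereas forecast B has the smaller MAE and swaps tomato and rice, so an agent who picks the top-ranked crop would decide sub-optimally with the more accurate forecast.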

Finally, it is worth noting that the criterion associated with the largest fraction of agents making optimal decisions, which might be considered the "best" way of taking advantage of W&C services, varies across the models. The minimax regret is the best when applied to the IFS/HOPE forecast, while the principle of insufficient reason is superior when used for the ECHAM5/MPIOM and HadGEM2-AO products. A misrepresentation of the stakeholders' perception of W&C services, here explored in terms of risk aversion, may hence introduce a strong bias in the analysis of W&C services operational value. For example, the opportunity cost of using ECHAM5/MPIOM simulated assuming the principle of insufficient reason is equal to 3%, meaning that one agent out of the 39 considered in our model selects a sub-optimal cropping pattern. The opportunity cost for the same product simulated assuming the minimax regret is instead equal to 4%, meaning that two agents out of 39 select sub-optimal cropping patterns. Finally, the simulation of risk prone agents adopting the maximax criterion produces an opportunity cost of 10%, meaning that four agents select sub-optimal cropping patterns. These results provide strong evidence of the importance of considering personal, behavioral attributes to produce a proper assessment of W&C services operational value.
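To make the three decision criteria concrete, the sketch below applies their textbook definitions to a small payoff table. The table (net profit of three cropping patterns under three forecast scenarios), the pattern names, and the function names are hypothetical and serve only as an illustration. The decision model used in the study may implement these criteria differently.

```python
# Hypothetical net profit of three cropping patterns under three forecast
# scenarios (e.g. ensemble members). The numbers are illustrative only.
payoffs = {
    "pattern_A": [80, 10, 5],
    "pattern_B": [70, 65, 0],
    "pattern_C": [50, 45, 35],
}


def maximax(payoffs):
    """Risk prone: pick the option with the highest best-case payoff."""
    return max(payoffs, key=lambda option: max(payoffs[option]))


def insufficient_reason(payoffs):
    """Risk neutral: treat scenarios as equally likely and pick the best mean."""
    return max(payoffs, key=lambda option: sum(payoffs[option]) / len(payoffs[option]))


def minimax_regret(payoffs):
    """Risk averse: pick the option whose worst-case regret is smallest."""
    n_scenarios = len(next(iter(payoffs.values())))
    # Best achievable payoff in each scenario, across all options.
    best = [max(p[s] for p in payoffs.values()) for s in range(n_scenarios)]
    # Regret = payoff lost relative to the best option in that scenario.
    worst_regret = {
        option: max(best[s] - p[s] for s in range(n_scenarios))
        for option, p in payoffs.items()
    }
    return min(worst_regret, key=worst_regret.get)


for criterion in (maximax, insufficient_reason, minimax_regret):
    print(f"{criterion.__name__}: {criterion(payoffs)}")
```

With these numbers the three criteria select three different patterns (A, B, and C, respectively), which shows how the same forecast information can lead to different decisions depending on the assumed risk attitude.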

Conclusions
In this work, we propose a novel framework for assessing the operational value of several Weather and Climate Services. This approach, which relies on an integrated model of a Coupled Human-Natural System, is applied in the Muzza irrigation district (Italy), a complex agricultural system where farmer-agents select the crops to cultivate by maximizing the expected net profit at the end of the agricultural season. Our framework allows quantifying the quality of the considered forecast products both in terms of climatic and decision-relevant variables as well as estimating the associated payoff for the farmers, also exploring the impacts of behavioral attributes on the uptake and use of W&C services.
Our study shows that, at present, the accuracy of most state-of-the-art weather forecast products is still limited, especially in the prediction of precipitation with a lead-time of 7 months or longer. The ECMWF annual forecasts simulated by the IFS/HOPE model displayed the maximum forecast skill among the considered products and were also able to predict some extreme events, including the intense drought of 2003. The predictions of crop yield obtained via simulations of process-based models using the predicted values of precipitation and temperature as climate forcing show similar performance in terms of forecast quality.
Numerical results on the use of these forecasts to inform agents' decisions show that the accuracy of estimating crop yield and the probability of making optimal decisions are not necessarily linearly correlated. The assessment of the operational value of W&C services should therefore include a decision model reproducing the actual users' adoption of forecast products within their decision-making process. Some institutional forecasts (e.g., ECMWF products) attain both high forecast quality and high performance of agents' decisions. However, our results also show that in many cases the agents' decisions are still optimal even though informed by products with low forecast quality (e.g., CFS products). From the farmers' point of view, the operational values of ECMWF and CFS products are therefore equivalent, even though ECMWF largely outperforms CFS in terms of forecast quality. Finally, we provide numerical evidence about the impacts of different farmers' behavioral attributes (i.e., levels of risk aversion) on the quantification of W&C services operational value. The exploration of this behavioral uncertainty further highlights the key role of the decision model in the assessment procedure. Our results show that the opportunity cost of the same forecast product increases from 3% to 10% when moving from risk neutral to risk prone decisions, potentially producing rank reversals in the quantification of the W&C services operational value.
To generalize the results obtained in this work, future research efforts should focus on the following directions: extending the evaluation horizon, using large multi-model ensembles, exploring the socio-economic dimension of the problem, and simulating dynamic attitudes of farmers. Our analysis is limited to the period 2001–2005 because the historical observations available for running the model cover the period 1993–2005, which was divided into two periods, with the first used for post-processing the forecast products and the second for performing the analysis. Moreover, ECMWF forecast products are obtained from the ENSEMBLES project, which provides hindcasts over the period . Although this time period includes a fairly balanced number of normal, wet, and dry agricultural seasons with variable temperature patterns, a longer time horizon including more extreme events would produce more robust findings. The forecast products considered in this work are characterized by a relatively small ensemble size. The use of larger ensembles (possibly multi-model ensembles) has the potential to attain a better performance in terms of forecast quality and, possibly, also in terms of operational value, especially if the analysis is performed on each single ensemble member to better represent the extreme events. However, the uptake of such large ensembles opens up a number of additional challenges, such as how to limit the smoothing effect on the extreme events or how to combine multiple products with different levels of accuracy, which go beyond the scope of this paper and can be explored in a future analysis. Our model assumes that the predicted water availability is the main factor influencing farmers' decisions, while additional drivers (e.g., expected crop prices, use of nutrients and fertilizers) are assumed to be deterministically known.
A further step of our research will explore the role of the socio-economic dimension of the problem and its impacts on farmers' decisions. The behavioral attitudes considered in our analysis include diverse levels of farmers' risk aversion with respect to forecast uncertainty. Future research will focus on capturing dynamic behavioral dependencies, where the attitude of the farmers in making decisions for the incoming agricultural season is affected by the yield in the previous one. However, the calibration of a decision model implementing such behavioral dependency requires long behavioral time series to identify the proper lag-time as well as the magnitude of the effect for different levels of drought intensity.