Authors: A. Carmona-Cabrero and R. Muñoz-Carpena, University of Florida, carpena@ufl.edu
Date: March 2024
Now that we have our model outputs, we can analyze its outputs with global sensitivity and uncertainty analysis. Although both analyses are related and are often done simultaneously, they are not the same:
Sensitivity analysis explores how important the model inputs are for the model uncertainty. This is how much changes in the model input distributions will affect the distribution of the model outputs.
Uncertainty analysis explores the distribution of the model outputs when there is input uncertainty and/or the model presents stochastic behavior. This is how uncertain are the model outputs when the input uncertainty is considered.
We will decompose the total variance in two components (deterministic and stochastic) following two parallel branches:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from SALib.sample import saltelli
from SALib.analyze import sobol
from ring_plot_funct import ring_plot
We define our problem (the same used to generate the first notebook for GSA sample), read the model outputs, and name them.
input_names = ['MutMxSz', 'MutSSz', 'MutSpP', 'GlutMnSz', 'GlutSSz', 'GlutFzP', 'GlutFzA', 'GlutIniPop']
problem = {'num_vars': len(input_names),
'names': input_names,
'bounds': [[1,10], [0.1,5], [0.01,0.3], [0.1,2], [0.1,5], [0.01,0.2], [0.01,0.2], [3,9]]
}
n_real = 100 # Number of realizations (model runs with the same inputs)
output_names = ['AvgGlutPop', 'AvgGlutRad', 'AvgGlutRadGn', 'AvgMutPop', 'MaxMutPop', 'MinMutPop'] # Output names
data = pd.read_csv('wolf.out', header=None, sep=' ') # This is with the modified file with just the output in the 4th column (AvgMut)
data.drop(6, axis=1, inplace=True) # There is a space in the last column (if there is no space write axis=0)
data.columns = output_names
data.head()
AvgGlutPop | AvgGlutRad | AvgGlutRadGn | AvgMutPop | MaxMutPop | MinMutPop | |
---|---|---|---|---|---|---|
0 | 9.0 | 11.286599 | 7.157074 | 44.0 | 170.0 | 0.0 |
1 | 9.0 | 9.756715 | 1.126875 | 41.0 | 120.0 | 0.0 |
2 | 9.0 | 10.265316 | 4.335474 | 35.0 | 97.0 | 0.0 |
3 | 9.0 | 8.084866 | 18.202648 | 58.0 | 127.0 | 0.0 |
4 | 9.0 | 7.882991 | 5.344454 | 33.0 | 171.0 | 0.0 |
data.drop(['MinMutPop', 'AvgGlutPop'], axis=1, inplace=True) # We drop AvgGlutPop as it is fixed in GLutmut program, and MinMutPop as it is always 0
output_names = output_names[:-1]
output_names = output_names[1:]
data.head()
AvgGlutRad | AvgGlutRadGn | AvgMutPop | MaxMutPop | |
---|---|---|---|---|
0 | 11.286599 | 7.157074 | 44.0 | 170.0 |
1 | 9.756715 | 1.126875 | 41.0 | 120.0 |
2 | 10.265316 | 4.335474 | 35.0 | 97.0 |
3 | 8.084866 | 18.202648 | 58.0 | 127.0 |
4 | 7.882991 | 5.344454 | 33.0 | 171.0 |
We will employ the Law of total variance to decompose deterministic (Vd) and stochastic (Es) variance components:
V[Y] = V[E[Y|X]] + E[V[Y|X]] = V(Yd) + E(Ψs) = Vd + Es (eq 1)
In this step, we calculate how much of the total output variance, V(Y), is attributed to deterministic effects (average of model realizations for a given Sobol sample). This fraction of the total output variance is Vd.
For that, we first estimate the expected value of the model realizations E[Y|X], Yd:
exp = pd.DataFrame(data.values.reshape(-1, n_real, data.shape[1]).mean(1)) # Mean of the model realizations, E[Y|X] = Y_d
exp.columns = output_names
exp.head()
AvgGlutRad | AvgGlutRadGn | AvgMutPop | MaxMutPop | |
---|---|---|---|---|
0 | 8.632314 | 8.760255 | 44.02 | 125.37 |
1 | 69.496089 | 30.452940 | 4.02 | 54.42 |
2 | 8.735444 | 8.722916 | 50.84 | 138.52 |
3 | 2.243327 | 9.651485 | 33.92 | 57.76 |
4 | 8.600231 | 9.113118 | 46.22 | 127.47 |
We can also estimate the variance of the model realizations V[Y|X], Ψs.
var = pd.DataFrame(data.values.reshape(-1, n_real, data.shape[1]).var(1)) # Variance of the model realizations, V[Y|X] = \Psi_s
var.columns = output_names
var.head()
AvgGlutRad | AvgGlutRadGn | AvgMutPop | MaxMutPop | |
---|---|---|---|---|
0 | 1.616960 | 70.061261 | 555.7996 | 640.2731 |
1 | 217.446319 | 741.147586 | 356.4996 | 1778.8836 |
2 | 1.223908 | 69.342294 | 666.8144 | 496.8296 |
3 | 0.150063 | 75.278266 | 71.7736 | 72.1624 |
4 | 2.101137 | 67.433012 | 626.7316 | 534.3291 |
We can check that the Law of total variance is satisfied by calculating Vd + Es = V(Y). For this we will calcfulate V(Y) with all the model outputs (all realizations for all the Sobol Samples, MxN). Then we will calculate the deterministic (V[E[Y|X]]), and stochastic (E[V[Y|X]]) components and sum them up to compare to the value of V(Y) with all outputs. They should match closely.
v_exp = exp.var()
e_var = var.mean()
for variable in output_names:
print('Output {}: V(Y) = {}. Vd + Es = {}'.format(variable, data.var()[variable], v_exp[variable] + e_var[variable]))
Output AvgGlutRad: V(Y) = 6063.752154326935. Vd + Es = 6064.321044416751 Output AvgGlutRadGn: V(Y) = 2998.681254928705. Vd + Es = 2998.780016559973 Output AvgMutPop: V(Y) = 133418.15569267675. Vd + Es = 133420.99322904038 Output MaxMutPop: V(Y) = 235418.66930422568. Vd + Es = 235430.38992564275
And we estimate the fraction of the total variance that is attributed to the deterministic component:
vy = data.var() # Total variance, V(Y)
v_exp = exp.var() # Deterministic component, V_s
s_exp = v_exp/vy
print('Fraction of deterministic variance:\n', s_exp)
Fraction of deterministic variance: AvgGlutRad 0.970698 AvgGlutRadGn 0.347255 AvgMutPop 0.227784 MaxMutPop 0.519812 dtype: float64
Here, we estimate the importance of each factor for the deterministic component. From the sobol.analyze call, SALib returns a dictionary with sensitivity indices (Si, Si,j and STi) and confidence intervals for their numerical estimation for individuals outputs. Hence, it is necessary to iterate over the model outputs we are interested on.
To take into account the stochastic uncertainty, we scale the sensitivity indices with the fraction of total variance attributed to deterministic effects (Vd/V[Y]). We will denote with prefix o the original indexes that assume that all the variance is deterministic, and those without that prefix correspond to the scaled down indexes after considering stochasticiy (i.e. the fraction of deterministic variance). Hence, we acknowledge that a fraction of the total variance is not caused by deterministic effects.
keys = ['S1', 'ST', 'S1_conf', 'ST_conf'] # Keys in the SALib returned dictionary
# Initialize the DataFrame containing our stochastic sensitivity indices
df_exp = pd.DataFrame()
for out in output_names:
# Analyze the GSA outputs
s_indices = sobol.analyze(problem, np.array(exp[out]), calc_second_order=False)
# Use a dictionary to store the sensitivity indices and their confidence intervals
s_dict = {x: s_indices[x] for x in keys}
# Save the original sensitivity indices
s_dict['oSi'] = s_dict['S1']
s_dict['oST'] = s_dict['ST']
# Scale the sensitivity indices with the fraction of deterministic variance
s_dict['S1'] = s_dict['S1'] * s_exp[out]
s_dict['ST'] = s_dict['ST'] * s_exp[out]
# Name the inputs and outputs
s_dict['input'] = problem['names']
s_dict['output'] = [out] * len(s_indices['S1'])
# Append the results for each of the outputs to the same dataframe
temp = pd.DataFrame(s_dict)
df_exp = pd.concat([df_exp, temp], ignore_index=1)
# Rename the DataFrame columns, changing the notation from S1 to Si
df_exp.columns = ['Si', 'ST', 'Si_conf', 'ST_conf', 'oSi', 'oST', 'input', 'output']
df_exp.head()
Si | ST | Si_conf | ST_conf | oSi | oST | input | output | |
---|---|---|---|---|---|---|---|---|
0 | 0.243631 | 0.506919 | 0.081006 | 0.093726 | 0.250986 | 0.522221 | MutMxSz | AvgGlutRad |
1 | -0.000591 | 0.000310 | 0.001364 | 0.000090 | -0.000609 | 0.000319 | MutSSz | AvgGlutRad |
2 | 0.095691 | 0.286182 | 0.036759 | 0.055437 | 0.098579 | 0.294821 | MutSpP | AvgGlutRad |
3 | -0.000012 | 0.000331 | 0.001273 | 0.000090 | -0.000012 | 0.000341 | GlutMnSz | AvgGlutRad |
4 | 0.065383 | 0.263578 | 0.047685 | 0.050147 | 0.067356 | 0.271535 | GlutSSz | AvgGlutRad |
We add a column with the fraction of variance due to deterministic effects for each factor.
# Initialize a column using the name of the output
df_exp['S_exp'] = df_exp['output'].values
# Substitute that column with the corresponding fraction of variance due to deterministic effects based on the output
df_exp['S_exp'] = df_exp['S_exp'].map(s_exp.to_dict())
print(df_exp.head())
Si ST Si_conf ST_conf oSi oST input \ 0 0.243631 0.506919 0.081006 0.093726 0.250986 0.522221 MutMxSz 1 -0.000591 0.000310 0.001364 0.000090 -0.000609 0.000319 MutSSz 2 0.095691 0.286182 0.036759 0.055437 0.098579 0.294821 MutSpP 3 -0.000012 0.000331 0.001273 0.000090 -0.000012 0.000341 GlutMnSz 4 0.065383 0.263578 0.047685 0.050147 0.067356 0.271535 GlutSSz output S_exp 0 AvgGlutRad 0.970698 1 AvgGlutRad 0.970698 2 AvgGlutRad 0.970698 3 AvgGlutRad 0.970698 4 AvgGlutRad 0.970698
Note: In some cases, the estimation of sensitivity indices can lead to negative values. These are caused by numerical errors. If the sensitivity indices are negative and of small absolute value (e.g., -0.001), they can be assumed to be equal to 0. However, if they are big (e.g., -0.1), they may indicate that our sampling intensity (l) was too small. Here, if present we will replace negative values of the sensitivity indices for 0 and save the sensitivity indices as a CSV file:
# Replace negative values with 0
df_exp[df_exp.select_dtypes("float64").columns] = df_exp.select_dtypes("float64").clip(lower=0)
df_exp.to_csv('det_S_ind.csv', index=False) # Save deterministic sensitivity indices
df_exp.head()
Si | ST | Si_conf | ST_conf | oSi | oST | input | output | S_exp | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.243631 | 0.506919 | 0.081006 | 0.093726 | 0.250986 | 0.522221 | MutMxSz | AvgGlutRad | 0.970698 |
1 | 0.000000 | 0.000310 | 0.001364 | 0.000090 | 0.000000 | 0.000319 | MutSSz | AvgGlutRad | 0.970698 |
2 | 0.095691 | 0.286182 | 0.036759 | 0.055437 | 0.098579 | 0.294821 | MutSpP | AvgGlutRad | 0.970698 |
3 | 0.000000 | 0.000331 | 0.001273 | 0.000090 | 0.000000 | 0.000341 | GlutMnSz | AvgGlutRad | 0.970698 |
4 | 0.065383 | 0.263578 | 0.047685 | 0.050147 | 0.067356 | 0.271535 | GlutSSz | AvgGlutRad | 0.970698 |
Now, we can call the function to make the ring plots. The outter ring represents how much of the analyzed variance is deterministic or stochastic. The middle ring is the total importance of the model inputs (STi). The inner ring indicates how much of each total-order input importance is caused by first-order effects (Si) or interactions.
This function will create a folder named results in the working directory with the figures.
ring_plot(df_exp, det_sto='det') # The det_sto argument indicates whether our sensitivity indices are deterministic or stochastic
We can see the most important factor for the deterministic component of all the model outputs.
For complicated and opaque models, such as ABMs, GSA also serves as a way of soft validating the model. If our understanding of the model does not match the results of the GSA, we may have committed some programing or modeling errors. From these plots, we can learn that in the time set for the model to run, 200 timesteps, either the model has not reached the dynamic stability from the initial conditions or low initial populations can lead, in some cases, to extinction. We can conclude this, as the initial number of sheep and wolves explains part of the total variance.
As we can see about a 77% of the AvgMutPop model output variance cannot be explained by the deterministic effects of the inputs. For some model outputs, this fraction may be very big, indicating that interventions on the model inputs will not have a strong effect on the expected value of that output (remember population 3 of the five cities model in the Carmona Cabrero et al. 2024 paper cited above). How the model inputs affect the stochastic uncertainty of the model can be explored by decomposing the stochastic component of the variance.
If the model is stochastic and the simulations were done with replications (i.e. n_real > 2), we proceed to decompose the stochastic variance. Here we decompose the second term variance E(V[Y|X])=E(Ψs) = Es. Hence, we refer the stochastic sensitivity indices to the total stochastic uncertainty (Es) and not the total variance V(Y) as we did for the deterministic indeces.
The stochastic sensitivity indices require a higher number of realizations than the deterministic indices. This is due to the variance of a random variable converging slower than its expected value. If the stochastic sensitivity indices are not robust (this can be assessed by bootstrapping the realizations that go into estimating the sensitivity indices), more realizations can be run and added to the pool.
keys = ['S1', 'ST', 'S1_conf', 'ST_conf'] # Keys in the SALib returned dictionary
if n_real > 2:
# Initialize the DataFrame containing our stochastic sensitivity indices
df_var = pd.DataFrame()
for out in output_names:
# Analyze the GSA outputs
s_indices = sobol.analyze(problem, np.array(var[out]), calc_second_order=False)
# Use a dictionary to store the sensitivity indices and their confidence intervals
s_dict = {x: s_indices[x] for x in keys}
# Save the original sensitivity indices
s_dict['oSi'] = s_dict['S1']
s_dict['oST'] = s_dict['ST']
# We do not scale the sensitivity indices as the variance decomposition is not continued from V[Y] (i.e., original ST = ST)
s_dict['S1'] = s_dict['S1']
s_dict['ST'] = s_dict['ST']
# Name the inputs and outputs
s_dict['input'] = problem['names']
s_dict['output'] = [out] * len(s_indices['S1'])
# Append the results for each of the outputs to the same dataframe
temp = pd.DataFrame(s_dict)
df_var = pd.concat([df_var, temp], ignore_index=1)
# Rename the DataFrame columns, changing the notation from S1 to Si
df_var.columns = ['Si', 'ST', 'Si_conf', 'ST_conf', 'oSi', 'oST', 'input', 'output']
if not df_var.empty:
print(df_var.head())
else:
print("No stochastic variance can be calculated without model repetitions (n_real=1):")
else:
print("ERROR: No stochastic variance can be calculated without model repetitions (n_rel=1)")
Si ST Si_conf ST_conf oSi oST input \ 0 0.183997 0.618793 0.071131 0.162488 0.183997 0.618793 MutMxSz 1 0.010782 0.021679 0.021707 0.007525 0.010782 0.021679 MutSSz 2 0.201237 0.456813 0.093894 0.127559 0.201237 0.456813 MutSpP 3 0.009562 0.024853 0.011170 0.010834 0.009562 0.024853 GlutMnSz 4 0.023953 0.290596 0.038795 0.083373 0.023953 0.290596 GlutSSz output 0 AvgGlutRad 1 AvgGlutRad 2 AvgGlutRad 3 AvgGlutRad 4 AvgGlutRad
As before here, if present, we replace negative values of the sensitivity indices for 0 and save the sensitivity indices as a CSV file:
if n_real > 2:
df_var[df_var.select_dtypes("float64").columns] = df_var.select_dtypes("float64").clip(lower=0)
df_var.to_csv('sto_S_ind.csv', index=False)
print(df_var.head())
else:
print("ERROR: No stochastic variance can be calculated without model repetitions(n_real=1)")
Si ST Si_conf ST_conf oSi oST input \ 0 0.183997 0.618793 0.071131 0.162488 0.183997 0.618793 MutMxSz 1 0.010782 0.021679 0.021707 0.007525 0.010782 0.021679 MutSSz 2 0.201237 0.456813 0.093894 0.127559 0.201237 0.456813 MutSpP 3 0.009562 0.024853 0.011170 0.010834 0.009562 0.024853 GlutMnSz 4 0.023953 0.290596 0.038795 0.083373 0.023953 0.290596 GlutSSz output 0 AvgGlutRad 1 AvgGlutRad 2 AvgGlutRad 3 AvgGlutRad 4 AvgGlutRad
And produce the ring plots:
if n_real > 2:
ring_plot(df_exp, det_sto='sto')
else:
print("ERROR: No stochastic variance can be calculated without model repetitions(n_real=1)")
We can see that the uncertainty in the model realizations is mostly affected by the sheep_gain_from_food input. Both the wolf and sheep populations are mostly affected by the same inputs. This makes sense as our model is very simple and the populations strongly depend on each other. A more complex system that included other processes such as competition with other predators and preys or diseases would result in more different GSA between outputs. However, GSA does not inform of the directionality of the effects of the inputs. Hence, an important input for two outputs could have a positive effect for one of the outputs and negative for the other output.
From the complete decomposition of the GSA (deterministic and stochastic components), we can conclude that our sheep and wolves populations are strongly affected by the same model inputs. We also learn that the sheep_gain_from_food input affects the most both components and that an intervention on this input will affect both the deterministic and stochastic uncertainties of the model. An intervention on this input will do both: 1) push the animal population in the direction we want and 2) reduce the dispersion around the animal average population (making extreme values less common and therefore increase the population resilience).
We have explored how sensitive the outputs are to the inputs. Now, we explore the uncertainty associated to the values of the outputs. This allows us to define management interventions.
# Threshold for our behavioral output distribution
threshold = 70
# Plot the output distributions
for output in ['AvgMutPop']:
data[output].hist(alpha=0.25, label=output, density=True, bins=50)
# Plot threshold line for behavioral distribution
plt.plot([threshold, threshold], [0, 0.006], '--k')
plt.legend()
plt.xlabel('Population')
plt.ylabel('Density')
Text(0, 0.5, 'Density')
We can calculate how likely it is that the sheep would be over 70:
print('Probability of AvgMutPop > {}: '.format(threshold), (data['AvgMutPop'] > threshold).sum()/len(data))
Probability of AvgMutPop > 70: 0.161720703125
We can also see some statistics of our outputs:
data.describe()
AvgGlutRad | AvgGlutRadGn | AvgMutPop | MaxMutPop | |
---|---|---|---|---|
count | 1.024000e+06 | 1.024000e+06 | 1.024000e+06 | 1.024000e+06 |
mean | 4.641985e+01 | 3.964457e+01 | 9.355988e+01 | 2.122230e+02 |
std | 7.787010e+01 | 5.476022e+01 | 3.652645e+02 | 4.851996e+02 |
min | 1.009277e-01 | 1.116429e-12 | 0.000000e+00 | 3.000000e+00 |
25% | 3.589138e+00 | 2.996924e+00 | 1.000000e+00 | 2.400000e+01 |
50% | 1.355338e+01 | 1.514644e+01 | 8.000000e+00 | 5.000000e+01 |
75% | 4.856505e+01 | 5.431364e+01 | 3.700000e+01 | 1.410000e+02 |
max | 5.172543e+02 | 3.131049e+02 | 7.877000e+03 | 8.266000e+03 |
Now, let's relate those output values to the input values that generated them.
# Read the GSA sample matrix that generated the outputs
inputs = pd.read_csv('wolf_gsa_matrix_100real.txt', sep='\t', header=None).drop([0], axis=1) # The file included a column of sens_class and an empty column
inputs.columns = input_names
# Let's check everything looks good
inputs.head()
MutMxSz | MutSSz | MutSpP | GlutMnSz | GlutSSz | GlutFzP | GlutFzA | GlutIniPop | |
---|---|---|---|---|---|---|---|---|
0 | 2.977539 | 0.57373 | 0.160381 | 1.38584 | 1.47334 | 0.182373 | 0.018721 | 8.396484 |
1 | 2.977539 | 0.57373 | 0.160381 | 1.38584 | 1.47334 | 0.182373 | 0.018721 | 8.396484 |
2 | 2.977539 | 0.57373 | 0.160381 | 1.38584 | 1.47334 | 0.182373 | 0.018721 | 8.396484 |
3 | 2.977539 | 0.57373 | 0.160381 | 1.38584 | 1.47334 | 0.182373 | 0.018721 | 8.396484 |
4 | 2.977539 | 0.57373 | 0.160381 | 1.38584 | 1.47334 | 0.182373 | 0.018721 | 8.396484 |
# Concatenate the columns of the outputs and the inputs
data_inputs = pd.concat([data, inputs], axis=1)
# Let's check everything looks good
data_inputs.tail()
AvgGlutRad | AvgGlutRadGn | AvgMutPop | MaxMutPop | MutMxSz | MutSSz | MutSpP | GlutMnSz | GlutSSz | GlutFzP | GlutFzA | GlutIniPop | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1023995 | 5.129673 | 8.557813 | 86.0 | 96.0 | 4.449707 | 0.293799 | 0.254263 | 1.247607 | 4.035791 | 0.132368 | 0.191929 | 5.006836 |
1023996 | 6.359249 | 8.077404 | 52.0 | 111.0 | 4.449707 | 0.293799 | 0.254263 | 1.247607 | 4.035791 | 0.132368 | 0.191929 | 5.006836 |
1023997 | 7.688465 | 18.294091 | 38.0 | 114.0 | 4.449707 | 0.293799 | 0.254263 | 1.247607 | 4.035791 | 0.132368 | 0.191929 | 5.006836 |
1023998 | 7.205223 | 39.904821 | 43.0 | 97.0 | 4.449707 | 0.293799 | 0.254263 | 1.247607 | 4.035791 | 0.132368 | 0.191929 | 5.006836 |
1023999 | 9.166289 | 46.482640 | 40.0 | 114.0 | 4.449707 | 0.293799 | 0.254263 | 1.247607 | 4.035791 | 0.132368 | 0.191929 | 5.006836 |
# Set the condition for behavioral distribution
behavioral = data_inputs[data_inputs['AvgMutPop'] > threshold]
print(behavioral.head())
AvgGlutRad AvgGlutRadGn AvgMutPop MaxMutPop MutMxSz MutSSz \ 6 8.490367 0.026615 78.0 126.0 2.977539 0.57373 13 10.786193 2.638279 80.0 95.0 2.977539 0.57373 15 7.652617 0.800113 121.0 121.0 2.977539 0.57373 17 10.407952 0.298364 82.0 135.0 2.977539 0.57373 29 8.359962 3.549761 116.0 117.0 2.977539 0.57373 MutSpP GlutMnSz GlutSSz GlutFzP GlutFzA GlutIniPop 6 0.160381 1.38584 1.47334 0.182373 0.018721 8.396484 13 0.160381 1.38584 1.47334 0.182373 0.018721 8.396484 15 0.160381 1.38584 1.47334 0.182373 0.018721 8.396484 17 0.160381 1.38584 1.47334 0.182373 0.018721 8.396484 29 0.160381 1.38584 1.47334 0.182373 0.018721 8.396484
Now, we can map the input values of the simulations to those output values that led to behavioral output values (i.e., what input values are more likely to cause values of sheep > 70). A quick way of doing this is by plotting the distributions of the input values that led to behavioral outputs.
for input_var in input_names:
plt.figure(input_var)
behavioral[input_var].hist()
plt.ylabel('Frequency')
plt.xlabel(input_var)
An important concept that the Stochastic GSA framework allows to explore is that of the Probability of failure (and correspondening risk) affronted by the group of shepherds in the region vs. that of an individual shepherd there. For this, let's examine the probability distributions of the individual shepherds (all 100 model realizations for each case -Sobol sample- NxM), and compare that against the distributions representing the group of shepherds in the region (the expected value of the 100 shepherds for each Sobol sample) in the region.
for input_var in output_names:
# Create a figure with two subplots (density histogram and cumulative probability)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
# Plot the histograms of pre-intervention and post-intervention populations in the first subplot
exp[input_var].hist(alpha=0.25, label='E[Y|X]', density=True, bins=500, ax=ax1)
data[input_var].hist(alpha=0.25, label='All values', density=True, bins=500, ax=ax1)
# Plot threshold line for behavioral distribution if input_var is 'AvgMutPop'
if input_var == 'AvgMutPop':
ax1.plot([threshold, threshold], [0, 0.006], '--k')
ax1.legend()
ax1.set_xlabel(input_var) # Use input_var as x-axis label
ax1.set_ylabel('Density')
ax1.set_xlim(0, 200) # Set X-axis limit up to 200 units
# Plot cumulative probabilities of pre-intervention and post-intervention populations in the second subplot
sorted_exp = exp[input_var].sort_values()
sorted_data = data[input_var].sort_values()
cumulative_exp = (sorted_exp.rank() - 1) / len(sorted_exp)
cumulative_data = (sorted_data.rank() - 1) / len(sorted_data)
ax2.plot(sorted_exp, cumulative_exp, label='E[Y|X]')
ax2.plot(sorted_data, cumulative_data, label='All values')
# Plot threshold line if input_var is 'AvgMutPop'
if input_var == 'AvgMutPop':
ax2.axvline(x=threshold, color='r', linestyle='--')
ax2.set_xlim(0, 200) # Set X-axis limit up to 200 units
ax2.legend()
ax2.set_xlabel(input_var) # Use input_var as x-axis label
ax2.set_ylabel('Cumulative Probability')
Recall that the stochasticity of each output (fraction of the deterministic variance) was:
print('Fraction of deterministic variance:\n', s_exp)
Fraction of deterministic variance: AvgGlutRad 0.970698 AvgGlutRadGn 0.347255 AvgMutPop 0.227784 MaxMutPop 0.519812 dtype: float64
Comparing these last values with the differences between the probabilities for the individuals and the group we can pose that a main driver of the differences between the two is the stochasticity of the model, i.e. for the most deterministic output (AvgGlutRad) both probabilities are almost identical, where the biggest difference can be found for the most stochastic output in our case (AvgMutPop). This is an interesting result for many stochastic processes.
For example, this explains why when we play a risky game (like stock market) it makes sense to invest on a portfolio of individual stocks, as the expected risk from that will be lower compared to that of a single value in the portfolio.
Notice that the anlaysis of both the expected values (assuming the number of realizations is sufficient to achieve a robust mean), and that of all individual realizations is of interest for different applications. If the objective is to evaluate the risk of a group (i.e. a regional manager or company with a portfolio of shepherds), then the conventional (deterministic) GSA using average values for each sample is warranted. However, if the objective is to evaluate that of an individual shepherd in the region, the extended stochastic GSA should be used.