Published:
Motivation
Copulas are functions that link marginal distributions to form joint distributions. They separate the modeling of individual variables (marginals) from their dependence structure. This is particularly useful when dealing with non-normal data or when you want to preserve relationships between the variables.
We have previously seen copula-related posts on this blog:
- Dave Gold wrote a WaterProgramming post with a nice Introduction to Copulas.
- Rohini Gupta also made a nice “Copula Coding Exercise” with demonstrations for joint flood modeling in R (the link is worth a click, just to see Rohini’s copula meme at the start).
However, this is the first WaterProgramming post which (i) uses and shares Python code demonstrating copulas and (ii) uses copulas for modeling drought characteristics.
Drought characteristics used in this example
In this example, I am using drought characteristics calculated using an Standardized Streamflow Index (SSI) approach for drought classification. The data was generated using an ensemble of stochastic streamflow. In total, the stochastic ensemble contained 70,000 years of streamflow data.
For more information about the SSI drought classification, take a look at Rohini’s blog post on Introduction to Drought Metrics.
In these 70,000 years, there are 11,032 droughts identified (on average, one drought every 6 years approximately).
For each drought, I measure the following metrics:
- Drought Severity: The minimum SSI value during the drought period
- Drought Magnitude: The accumulated negative SSI over the drought period
With that in mind, the two different drought metrics correspond to different characteristics of the drought period. The severity is an indication of absolute deviation below average flow conditions while the magnitude corresponds to the accumulated deficit and persistence of the drought.
In a water resources and risk management context, it may be important to understand both of these drought characteristics together.
Specifically, stakeholders are likely most concerned about droughts that have both extreme severity and magnitude (or droughts that are very dry and long).
Modeling drought characteristics with copulas
The following steps are going to walk through:
- Loading and visualizing the data
- Fitting marginal distributions
- Visualizing the CDF-transformed uniform marginals
- Fitting a gaussian copula
- Visual verification of the fitted marginals and copula
- Using the copula to calculate joint exceedance probability of a drought with a certain severity and magnitude
Load and visualize the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import multivariate_t, multivariate_normal, t, norm
from scipy.optimize import minimize
import warnings
warnings.filterwarnings('ignore')
# Load data
fname = "ensemble_ssi12_drought_events.csv"
df = pd.read_csv(fname, index_col=0)
df['magnitude'] = np.log(np.abs(df['magnitude']))
df['severity'] = np.log(np.abs(df['severity']))
Let’s start by looking at the relationship between the transformed magnitude and severity metrics.
# Plot joint scatter w/ marginals
sns.jointplot(data=df, x='severity', y='magnitude', kind="scatter", height=8, marginal_kws=dict(bins=50, fill=True))
| ![[Use of Copula’s for modeling joint probability of drought characteristics - joint scatter.png | center | 300]] |
As shown in the figure above, the distributions of the marginals (histograms on x- and y-axis) are clearly not the same distribution. In this case, we cannot simply fit a multivariate gaussian model. The copula will be helpful to link the different distributions.
First, we need to fit appropriate distributions to the marginals.
Fitting marginal distributions
I’ve already gone through the process of comparing multiple different distributions and comparing their fits using the Kolmogorov–Smirnov (KS) test. The KS test was used to determine how well the CDF-transformed data aligned with the uniform distribution. Since we are ultimately going to use the copula, and the copula is built to link uniform marginals, we can use this KS metric to determine
While I won’t share the full distribution selection workflow for brevity, it generally followed:
dist_candidates = [
scipy.genexpon,
scipy.gamma,
...
]
for dist in dist_candidates:
params = dist.fit(data, **fit_kwargs)
u_transformed = dist.cdf(data, *params)
ks_stat, ks_pval = stats.kstest(u_transformed, 'uniform')
# compare the ks_stat of different candidate distributions
Using this approach, I found that the generalized exponential distribution fits the log-transformed drought severity the best, while the log normal distribution was the best fit for the log-transformed drought magnitude.
The code block below shows how I fit, then plotted the distributions to provide a visual check of their goodness-of-fit.
### Fitting marginals
best_sev_dist = stats.gengamma
best_sev_params = best_sev_dist.fit(df['severity'])
best_mag_dist = stats.lognorm
best_mag_params = best_mag_dist.fit(df['magnitude'])
# print the fitted parameters
print(f"Severity: {best_sev_dist.name} with params {best_sev_params}")
print(f"Magnitude: {best_mag_dist.name} with params {best_mag_params}")
### PLot fitted marginal distributions
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Severity marginal
x_sev = np.linspace(0, df['severity'].max(), 1000)
ax1.hist(df['severity'], bins=200, density=True, alpha=0.7, color='skyblue', label='Observed (Severity)')
ax1.plot(x_sev, best_sev_dist.pdf(x_sev, *best_sev_params), 'r-', lw=2, label=f'{best_sev_dist.name.title()}')
ax1.set_xlabel('Severity')
ax1.set_ylabel('Density')
ax1.set_title('Severity Marginal Distribution')
ax1.legend()
ax1.grid(True, alpha=0.3)
# magnitude marginal
x_mag = np.linspace(0, df['magnitude'].max(), 1000)
ax2.hist(df['magnitude'], bins=100, density=True, alpha=0.7, color='lightcoral', label='Observed')
ax2.plot(x_mag, best_mag_dist.pdf(x_mag, *best_mag_params), 'r-', lw=2, label=f'{best_mag_dist.name.title()}')
ax2.set_xlabel('magnitude')
ax2.set_ylabel('Density')
ax2.set_title('Magnitude Marginal Distribution')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
| ![[Use of Copula’s for modeling joint probability of drought characteristics - marginal dists.png | center | 500]] |
The figure above shows some pretty nice alignment. Of course, if we had a larger sample size then the empirical data may be smoother. Although, this seems good for our purposes and data.
Now we can use the CDF transformation from the fitted marginal distributions to get the uniform-transformed marginal distributions.
### Use fitted distributions to transform data to uniform
# Transform to uniform using best fitted distributions
u_severity = best_sev_dist.cdf(df['severity'], *best_sev_params)
u_magnitude = best_mag_dist.cdf(df['magnitude'], *best_mag_params)
# Update uniform data matrix for copula
U = np.column_stack([u_severity, u_magnitude])
Let’s now plot the uniform-transformed data, and visualize the marginal distributions to make sure that they do appear to be ~uniform.
# Create joint plot with marginals using seaborn
g = sns.jointplot(x=U[:, 0], y=U[:, 1],
kind='scatter',
alpha=0.1,
s=20,
height=8,
marginal_kws={'bins': 50, 'alpha': 0.7})
g.set_axis_labels('U1 (Severity uniform)', 'U2 (Magnitude uniform)')
g.figure.suptitle('Empirical Copula with Marginal Distributions', y=1.02)
g.ax_joint.axhline(y=0.5, color='red', linestyle='--', alpha=0.5)
g.ax_joint.axvline(x=0.5, color='red', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
| ![[Use of Copula’s for modeling joint probability of drought characteristics - joint uniform marginals.png | center | 500]] |
Looking at the figure above, it appears that the fitted distributions result in nice uniformity of the CDF-transformed marginals.
At this point it is a good idea to visually inspect the above joint distribution and make some initial estimates of the best copula.
Fitting the copula
In this case, I tested both Gaussian and Student-t copulas. The student-t copula has more tail-dependence at high- and low-ends of the distribution compared to the Gaussian.
However, in my case, the parameters of the Student-t copula were converging toward the Gaussian copula (the nu parameter value was very large).
The Gaussian copula is very easy to fit, as we only need to calculate the correlation (rho) between the uniform marginals shown above.
### Fit gaussian copula
u1 = U[:, 0]
u2 = U[:, 1]
rho = np.corrcoef(norm.ppf(u1), norm.ppf(u2))[0,1]
best_copula = 'gaussian'
best_params = [rho]
Having fit the copula, we want to now compare our empirical data with samples drawn from the fitted copula.
The figure below includes 4 panels:
- Q-Q plots used to evaluate the severity and magnitude marginal distribution fits (top)
- Plots of the joint uniform distributions for both observed and copula-sampled data (bottom)
# Start fit
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12, 10))
# severity Q-Q plot
stats.probplot(df['severity'], dist=best_sev_dist, sparams=best_sev_params, plot=ax1)
ax1.set_title('Severity Q-Q Plot')
# magnitude Q-Q plot
stats.probplot(df['magnitude'], dist=best_mag_dist, sparams=best_mag_params, plot=ax2)
ax2.set_title('Magnitude Q-Q Plot')
## compare empirical data with samples from fitted copula
n_sim = U.shape[0]
# sample from fitted copula
rho = best_params[0]
z1, z2 = np.random.multivariate_normal([0, 0], [[1, rho], [rho, 1]], n_sim).T
u1_sim, u2_sim = stats.norm.cdf(z1), stats.norm.cdf(z2)
ax3.scatter(U[:, 0], U[:, 1], alpha=0.2, s=5, label='Observed', color='blue')
ax4.scatter(u1_sim, u2_sim, alpha=0.2, s=5, label='Simulated', color='red')
ax3.set_title(f'Empirical data')
ax4.set_title(f'Simulated data (fitted {best_copula} copula)')
plt.tight_layout()
plt.savefig("copula_diagnostics.png", dpi=300)
![[Use of Copula’s for modeling joint probability of drought characteristics - copula validation.png]]
Lastly, we can use the copula to calculate the joint exceedance probability of a drought occurrence with a certain severity AND magnitude threshold.
def joint_exceedance_prob(sev_threshold, mag_threshold):
# NOTE: thresholds must be on the same scale as the fitted marginals.
u = best_sev_dist.cdf(sev_threshold, *best_sev_params) # F_D(d)
v = best_mag_dist.cdf(mag_threshold, *best_mag_params) # F_S(s)
z1, z2 = stats.norm.ppf(u), stats.norm.ppf(v)
rho = best_params[0]
C_uv = stats.multivariate_normal.cdf([z1, z2], cov=[[1, rho], [rho, 1]])
return 1.0 - u - v + C_uv
# Individual marginal probabilities
sev_95th = best_sev_dist.ppf(0.95, *best_sev_params)
mag_95th = best_mag_dist.ppf(0.95, *best_mag_params)
print(f"95th percentile severity: {sev_95th:.2f}")
print(f"95th percentile magnitude: {mag_95th:.2f}")
sev_marg = 1 - best_sev_dist.cdf(sev_95th, *best_sev_params)
mag_marg = 1 - best_mag_dist.cdf(mag_95th, *best_mag_params)
print(f"P(Severity > {sev_95th:.2f}) = {sev_marg:.3f}")
print(f"P(Magnitude > {mag_95th:.2f}) = {mag_marg:.3f}")
# Joint exceedance probability
joint = joint_exceedance_prob(sev_95th, mag_95th)
print(f"P(Severity > {sev_95th:.2f} AND Magnitude > {mag_95th:.2f}) = {joint:.4f}")
# Compare to independence assumption
independent = sev_marg * mag_marg
print(f"Under independence assumption: {independent:.4f}")
OUTPUT:
95th percentile severity: 0.88
95th percentile magnitude: 3.67
P(Severity > 0.88) = 0.050
P(Magnitude > 3.67) = 0.050
P(Severity > 0.88 AND Magnitude > 3.67) = 0.0155
Under independence assumption: 0.0025
This analysis helps demonstrate the importance of modeling the dependencies in the drought characteristics.
- Individual risks: Each variable has a 5% chance of exceeding the 95th percentile (by definition)
- Joint risk: The probability of both exceeding their 95th percentile thresholds simultaneously is 1.55%
- Independence assumption: If we assumed the variables were independent, this probability would be only 0.25%
The 6x higher joint probability under the copula model (vs. independence) highlights the importance of modeling dependence structure in risk assessment.
