This notebook serves as supplementary material to the manuscript on low-level jets submitted to Wind Energy Science. The supplementary material consists of six notebooks. In this notebook (5/6), we use the scikit-learn package to train a statistical model to predict low-level jet occurrence based on ERA5 data.
In the previous notebook I have shown that the ERA5 data contains valuable information related to low-level jet occurrence. Even though the 1:1 correspondence of low-level jets between the ERA5 data and the observed data is quite poor, we have shown that the seasonal cycle, the Lamb weather type, and atmospheric stability all have a pronounced impact on the probability of low-level jets. This raises the question of whether we can correlate these variables with the observed jet occurrence, and use the inferred correlation to postprocess the long-term ERA5 data into reliable long-term (climatological) low-level jet characteristics. (This procedure is roughly analogous to Model Output Statistics or Measure-Correlate-Predict methods, whichever you prefer; see the first figure below for a visual illustration of the procedure.)
We will use the machine-learning package scikit-learn to do the statistics for us. Note that even though this introduces a lot of jargon with a 'black-box' connotation, all we do is quite straightforward statistics. In machine-learning terms, the aforementioned features (month, Lamb weather type, etc.) might be good predictors. Finding a suitable correlation translates to "training a model to predict the most probable outcome", i.e. whether or not a low-level jet will occur given new inputs.
The goal of this notebook is to illustrate that such a statistical postprocessing approach is feasible. There are many choices involved in setting up such a model: the possibilities are endless, and experienced data scientists will probably be able to obtain better results. We will focus on simplicity, as a proof of principle.
Since we are predicting a binary outcome (a jet will either occur, or not), we find ourselves in the branch of machine learning algorithms known as classification. We use a logistic (aka logit) regression model, i.e. we fit the parameters of a logistic function (aka sigmoid or s-curve) to our training data. The logistic model gives the probability that a low-level jet will occur, given the input parameters.
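For reference, the logistic model expresses the jet probability as p(jet | x) = 1 / (1 + exp(-(b0 + b·x))), where x contains the input parameters; fitting the model amounts to determining the coefficients b0 and b from the training data. The linear combination b0 + b·x is thereby mapped onto the interval (0, 1).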
scikit-learn implements various algorithms that can train (determine the parameters of) the logistic model. The default implementation (sklearn.linear_model.LogisticRegression) uses coordinate descent, but also provides a stochastic average gradient descent solver. An alternative is the generalized stochastic gradient descent algorithm (sklearn.linear_model.SGDClassifier) with a logistic loss function. Other models based on support vector machines, Gaussian processes, neural networks or naive Bayes algorithms are available as well, but will not be considered in this notebook.
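As a minimal sketch of how these two routes map onto the scikit-learn API (the settings shown are placeholders, not tuned values):
from sklearn.linear_model import LogisticRegression, SGDClassifier
# Route 1: the dedicated logistic regression estimator
lrm_cd = LogisticRegression(solver='liblinear')   # coordinate descent
lrm_sag = LogisticRegression(solver='sag')        # stochastic average gradient
# Route 2: generic stochastic gradient descent with a logistic loss
sgd = SGDClassifier(loss='log', max_iter=1000)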
Several regularization settings are available for fitting the logistic model. Regularization can prevent 'overfitting' and mitigate collinearity effects, which occur when several predictors are correlated. Ridge regression or L2 regularization basically decreases the contribution of (some of) the input parameters. Lasso regression or L1 regularization can set some of the coefficients to exactly zero; in other words, it prefers 'simpler' models. This is a very effective method to deal with collinearity effects. Elasticnet is a combination of ridge and lasso. This option is available for the SGD solver only.
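A sketch of the corresponding settings (again with placeholder values):
from sklearn.linear_model import LogisticRegression, SGDClassifier
# C is the INVERSE regularization strength: smaller C means stronger regularization
lrm_l1 = LogisticRegression(penalty='l1', C=0.1, solver='saga')  # lasso
lrm_l2 = LogisticRegression(penalty='l2', C=0.1)                 # ridge
# elasticnet is only available through the SGD classifier
sgd_en = SGDClassifier(loss='log', penalty='elasticnet', alpha=1e-4, l1_ratio=0.5)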
Cross-validation can be used to determine the best possible configuration of the model, in particular the strength of the regularization. To this end, the training data is split into several 'folds'. The model is trained on all but one fold, and its performance is evaluated on the held-out fold; this is repeated with each fold taking a turn as evaluation set, and for each candidate value of the regularization strength. scikit-learn offers an additional implementation of the logistic regression model with built-in cross-validation (LogisticRegressionCV). However, the built-in evaluation statistics are not suitable for our application: e.g. the accuracy increases while the model's ability to predict the seasonal cycle clearly decreases.
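Such a search could look as follows; TimeSeriesSplit keeps the folds in chronological order, which seems appropriate for autocorrelated weather data (a sketch only, since the default accuracy score is exactly what is problematic here):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
folds = TimeSeriesSplit(n_splits=5)
grid = GridSearchCV(LogisticRegression(penalty='l1', solver='saga'),
                    param_grid={'C': [0.001, 0.01, 0.1, 1, 10]},
                    cv=folds)
# grid.fit(X_train, y_train) would then select the C with the best (default: accuracy) score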
The logistic model can also be used to predict individual instances: if the probability of a low-level jet (given the input) is larger than 50%, it predicts a jet, otherwise it does not. Let's use an example to illustrate what this implies: if only the months of the year are used as predictors, the highest probability occurs e.g. if month = June. However, the typical low-level jet rate in June in the training data is only about 15%. Given a new instance where the month is June, the model will thus predict a 'no jet' profile, because that is by far the most probable outcome. Now, if we predict 100 instances of the month June, we will get 100 'no jet' outcomes. This does not represent the true expectation. Averaging the probability predictions for those 100 instances, on the other hand, gives us an average probability of 15%. This example illustrates that the logistic model is very useful for deriving long-term low-level jet statistics, but if we want to use it to predict individual jet instances, we need to pin down the probability very precisely. That is beyond the scope of this notebook.
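A self-contained toy demonstration of this effect, using synthetic data with a 15% event rate and a single uninformative predictor:
import numpy as np
from sklearn.linear_model import LogisticRegression
rng = np.random.RandomState(0)
X_toy = np.ones((1000, 1))                    # one constant, uninformative feature
y_toy = (rng.rand(1000) < 0.15).astype(int)   # ~15% 'jet' rate, as in June
lrm = LogisticRegression().fit(X_toy, y_toy)
print(lrm.predict(X_toy).mean())              # 0.0: every instance is classified as 'no jet'
print(lrm.predict_proba(X_toy)[:, 1].mean())  # ~0.15: the climatological rate is preserved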
Some of the above-mentioned variables are categorical, for example the month and the Lamb weather type. There are various ways to treat these classes. One way is to create dummy variables, that is: convert one variable with 27 values into 26 variables with only 2 values (1 or 0). However, this quickly increases the number of dimensions. Another method is to convert each category number to its binary representation (e.g. the binary representation of 27 is 11011). By splitting the digits of the binary representation, we can decompose the category into a combination of 5 binary features. A binary encoder is available as a side-package to scikit-learn: category_encoders.
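A small sketch contrasting the two encodings on a toy Lamb weather type column:
import pandas as pd
import category_encoders as ce
toy = pd.DataFrame({'LWT': [0, 5, 13, 26]})
print(ce.OneHotEncoder(cols=['LWT']).fit_transform(toy))  # one 0/1 (dummy) column per category
print(ce.BinaryEncoder(cols=['LWT']).fit_transform(toy))  # only ~log2(27) = 5 binary columns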
Alternatively, we can use the fact that the month (or week) of the year is a cyclic variable. It can thus be represented in terms of a sine and a cosine function. Similarly, the Lamb weather type is built up out of a zonal and a meridional component, in conjunction with a vorticity. These raw components supposedly contain the same (or more) information as their weather type abstractions.
In order to use L1 regularization, the input features need to be standardized, otherwise the weights are meaningless. Since standardization also centers the data, we can set the fit_intercept option to False.
Some input variables might be correlated, redundant or just uninformative. It makes sense to train the model with good 'predictor' variables. These can be selected on the basis of physical insight (see above) or statistical preprocessing steps like a variance threshold or ANOVA. However, L1 (lasso) regularization can automatically set many 'irrelevant' input parameters to zero, which gives us some freedom to experiment with many (perhaps correlated) input variables; a sketch of the statistical route is shown below.
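For completeness, the statistical preprocessing route could look like this (not used below; X and y denote the predictor matrix and jet flags as constructed later in this notebook):
import numpy as np
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
X_reduced = VarianceThreshold(threshold=0.0).fit_transform(X)       # drop constant predictors
X_best = SelectKBest(f_classif, k=5).fit_transform(X, np.ravel(y))  # keep the 5 best ANOVA F-scores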
Given that we have - in principle - a complete 4D grid of weather data available as potential input parameters, it makes sense to try and use some of this information in our model. We might, for example, use the entire wind profile in the lower 500 m of the ERA5 data. This, however, introduces 28 new dimensions (14 levels times two wind components), which, moreover, are highly correlated. Therefore, we will use principal component analysis to capture most of the variability of these wind profiles in only a few input parameters.
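A sketch of this dimension reduction, assuming a hypothetical array profiles of shape (n_samples, 28) that holds the stacked u and v components of the lowest 14 model levels:
from sklearn.decomposition import PCA
pca = PCA(n_components=5)                   # keep only the first 5 principal components
profile_pcs = pca.fit_transform(profiles)   # (n_samples, 5) input for the classifier
print(pca.explained_variance_ratio_.sum())  # fraction of the profile variance retained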
This Q&A on Cross Validated touches upon the possibilities to not just predict jet occurrence, but to predict e.g. the jet height and jet strength as well. The way to go would be a two-stage procedure that first determines whether or not a jet will occur (using e.g. a logistic regression model) and, in case of a positive outcome, predicts the jet characteristics (using e.g. a linear regression trained on previously observed jet instances only). Although this sounds plausible in theory, it is beyond the scope of this notebook: here we will limit ourselves to predicting the low-level jet probability only.
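For the record, a hypothetical sketch of such a two-stage model (X_train, jet_flag_train and jet_strength_train are placeholder names; this is not pursued below):
from sklearn.linear_model import LinearRegression, LogisticRegression
# Stage 1: does a jet occur at all?
occurrence_model = LogisticRegression().fit(X_train, jet_flag_train)
# Stage 2: given that a jet occurs, what are its characteristics?
# Train only on the instances where a jet was actually observed.
jets = jet_flag_train.astype(bool)
strength_model = LinearRegression().fit(X_train[jets], jet_strength_train[jets])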
I will first test the methods on the IJmuiden site data. I will split it into test and train data to compare methods. Later on, I will try whether the model trained on IJmuiden also performs well for other sites. I might do a small sensitivity test to see the impact of the length of the training set. Finally, I will use the 'best' method to correct the observed climatology at each platform and derive 'reliable, long-term low-level jet climatologies'.
sklearn comes with built-in scoring parameters; for logistic regression the default scoring is accuracy. However, this operates on the discrete binary predictions - which are usually negative ('no jet'). Since we are interested in using the probability information rather than the discrete predictions, we need a scoring function that evaluates the output of predict_proba. One such score is the log-loss.
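Such a scorer can be constructed with make_scorer; log-loss is a cost, hence greater_is_better=False (scikit-learn then negates it internally):
from sklearn.metrics import log_loss, make_scorer
proba_scorer = make_scorer(log_loss, greater_is_better=False, needs_proba=True)
# proba_scorer can then be passed as scoring=... to e.g. GridSearchCV or LogisticRegressionCV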
import xarray as xr
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn import preprocessing
from sklearn.metrics import log_loss, make_scorer
import category_encoders as ce
import sys
# Low-level jet detection
sys.path.append('../../code/utils')
import lljs
plt.xkcd()
The figure below illustrates the MOS/MCP procedure for a continuous variable. The red box represents the 'machine-learning' model. In the first phase, it is trained to identify a suitable correlation between ERA5 and the observations. In the second phase, it is used to predict observed values based on ERA5 and the identified correlation. The method is analogous for a binary variable, but in this case the training model is a logistic regression model and the predicted values are probabilities of the outcome, rather than the discrete output. The ERA5 data may consist of various informative variables.
def correlated_signal(n_samples, corr, mu=0, sigma=1):
    '''Generate a synthetic autocorrelated time series.
    Borrowed from https://stackoverflow.com/a/33904277/6012085'''
    assert 0 < corr < 1, "Auto-correlation must be between 0 and 1"
    c = mu * (1 - corr)
    sigma_e = np.sqrt((sigma ** 2) * (1 - corr ** 2))
    # Sample the auto-regressive process.
    signal = [c + np.random.normal(0, sigma_e)]
    for _ in range(1, n_samples):
        signal.append(c + corr * signal[-1] + np.random.normal(0, sigma_e))
    return np.array(signal)
fig, axs = plt.subplots(2, 1, figsize=(10,4), sharex=True, sharey=True)
x_model = np.arange(1000)
y_model = correlated_signal(1000, 0.75) # np.random.randn(1000)
x_obs = np.arange(200)
y_obs = y_model[:200] + np.random.randn(200)*0.5
x_inf = np.arange(200, 1000)
y_inf = y_model[200:] + np.random.randn(800)*0.5
# Plot the time series
axs[0].plot(x_model, y_model, color='k')
axs[1].plot(x_obs, y_obs, color='k')
axs[1].plot(x_inf, y_inf, color='lightgrey', ls='-')
# Add a fancy box
import matplotlib.patches as mpatches
fancybox = mpatches.FancyBboxPatch([-5, -4], 210, 15, clip_on=False, zorder=5,
boxstyle=mpatches.BoxStyle("Round", pad=0.02), edgecolor='r', facecolor='none', lw=10)
axs[1].add_patch(fancybox)
# Add an arrow
axs[1].annotate('', (500, 4), (180, 4), ha="center", va="center", annotation_clip=False, fontsize=18, zorder=6,
arrowprops=dict(width=7, headlength=50,headwidth=20, fc="r", ec="r"))
# Aesthetics
axs[1].set_xticks([])
axs[1].set_yticks([])
axs[1].set_xlabel(r'Timeseries')
axs[1].set_ylabel('Observations')
axs[0].set_ylabel('ERA5')
axs[1].set_ylim(-3, 3)
# Load ERA5 wind and thermodynamic profiles
era_wind = xr.open_dataset('../../data/interim/ERA5platforms/MMIJ.nc').squeeze().sel(level=slice(None, 124, -1))
era_z_nd = xr.open_dataset('../../data/interim/ERA5platforms/MMIJ_z.nc').squeeze()
era_qt = xr.open_dataset('../../data/interim/ERA5platforms/MMIJ_qt.nc').squeeze()
era_sfc = xr.open_mfdataset('../../data/external/ECMWF/ERA5/era5_20*_sfc.nc').sel(longitude=3.3, latitude=52.8, method='nearest')
era_mmij = xr.merge([era_wind, era_z_nd, era_sfc]).drop(['latitude', 'longitude'])
# Detect LLJS in the ERA5 wind profiles (these will be our target arrays as a test dataset)
ds = era_wind
wspd = (era_wind.u**2+era_wind.v**2)**.5
falloff, strength, index = lljs.detect_llj_vectorized(wspd.values, axis=1, output='all', mask_inv=True, inverse=False)
jetflag = xr.DataArray(falloff>2, coords={'time': ds.time}, dims=['time']).to_dataframe(name='jet').astype(int)
# Load Lamb Weather type data
lwtfile = '/home/peter919/NoordzeeData/ERA5/synoptic_classification/data/processed/labels_LWT_Jones.csv'
lwt = pd.read_csv(lwtfile, index_col=0, parse_dates=True)#[['W', 'S', 'Z']] #['LWT']
# start with a simple dataset (only from the ERA5 surface variables)
predictors = era_mmij[['u10', 'v10','u100','v100', 'msl','sst', 't2m']].to_dataframe()
predictors[['LWT_U', 'LWT_V', 'LWT_Z']] = lwt[['W', 'S', 'Z']]
# Add month info to predictors
# Option 1: convert months to sin/cos components
# month = predictors.index.month.values
# predictors['M_sin'] = np.sin(month*np.pi/6)
# predictors['M_cos'] = np.cos(month*np.pi/6)
# Option 2: more precise: day of year (factor 2*pi for a full annual cycle)
# doy = predictors.index.dayofyear.values
# predictors['DOY_sin'] = np.sin(2*np.pi*doy/365)
# predictors['DOY_cos'] = np.cos(2*np.pi*doy/365)
# Option 3: encode it as a categorical variable (see below)
predictors['month'] = predictors.index.month
# Add bulk Richardson number: Ri_b = g/T * (dT/dz_T) * (z_u/U)^2,
# with the air-sea temperature difference taken over 2 m and the wind at 10 m
rib = lambda df: 9.81/df.t2m*(df.t2m-df.sst)/2/(df.u10**2+df.v10**2)*10*10
predictors['RiB'] = rib(predictors)
predictors['stab'] = (predictors.RiB>0).astype(int)  # stable (1) vs unstable (0)
# Add lamb weather type category
predictors['LWT'] = lwt.LWT
# Encode the categorical variables, either as binary or as dummy
import category_encoders as ce
# enc = ce.BinaryEncoder(cols=['month', 'LWT'])
enc = ce.OneHotEncoder(cols=['month', 'LWT'])
predictors = enc.fit_transform(predictors)
predictors.head()
Here we do a logistic regression. We split the MMIJ data into a train and a test set. We use the ERA5 low-level jets at MMIJ as the target variable (later on, we will use observed low-level jets instead). We compute the seasonal cycle of the test data and of the predicted probabilities; the RMSE between these two seasonal cycles is used to quantify the model's performance. We also print the contribution of each variable as determined by the regularization, so that we can see which variables are the most meaningful according to the algorithm.
def do_logistic_regression(X, y, algorithm=None, plot=True, print_coeff=True):
    # In the first test phase I'll split the data in half
    half = len(y)//2
    y_train, y_test = y[:half], y[half:]
    X_train, X_test = X[:half], X[half:]
    # Scale the data
    scaler = preprocessing.StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    # Set up and train the logistic regression model
    if algorithm is None:
        lrm = LogisticRegression(penalty='l1', C=0.01, solver='saga', multi_class='ovr', fit_intercept=True)
    else:
        lrm = algorithm
    lrm.fit(X_train, y_train.values.ravel())
    # Predict the PROBABILITIES of outcome y given X_test
    _, y_pred = lrm.predict_proba(X_test).T
    y_pred = pd.DataFrame(y_pred, index=y_test.index)
    # Print scores
    # print('score:', lrm.score(X_test, y_test))
    # print(log_loss(y_test, y_pred))
    def monthly_RMSE(y_true, y_pred):
        """Compute RMSE on the monthly mean LLJ rate.

        The default score is accuracy, but that evaluates the discrete
        predictions rather than the probability estimates. Log-loss does
        evaluate the probabilities, but favours high-bias/low-variance models.
        """
        monthlymean = lambda df: df.groupby(df.index.month).mean().values*100
        return ((monthlymean(y_pred) - monthlymean(y_true))**2).mean()**.5
    print(monthly_RMSE(y_test, y_pred))
    # Visualization of the predicted seasonal cycle
    if plot:
        monthgrouper = lambda df: df.month  # groupby applies this to each index timestamp
        fig, ax = plt.subplots()
        (y_pred.groupby(monthgrouper).mean()*100).plot(ax=ax)
        (y_test.groupby(monthgrouper).mean()*100).plot(ax=ax)
        ax.legend(['predicted', 'testdata'])
        plt.show()
    if print_coeff:
        for label, coef in zip(X.columns.values, lrm.coef_[0]):
            print('predictor: {:s}, coeff: {:.2f}'.format(label, coef))
    return  # lrm.coef_
y = jetflag
X = predictors
do_logistic_regression(X, y)
We see that the model makes a reasonable estimate of the seasonal cycle, but the RMSE is still about 2 percentage points. We also find that most of the input parameters are set to 0, i.e. they are not valuable predictors according to the model.
We try a number of values for the strength of the regularization. We also compare the coordinate descent and gradient descent solvers.
y = jetflag
X = predictors
# X = predictors[['LWT_U', 'LWT_V', 'LWT_Z', 'sst']]
# X = predictors[['LWT_U', 'LWT_V', 'LWT_Z', 'month_0', 'month_1', 'month_2', 'month_3', 'month_4']]
# This illustrates the bias/variance trade-off:
# at very low C (strong regularization) the model prefers to
# predict a constant LLJ rate, which leads to a very constant
# but not very specific low-level jet prediction. Interestingly,
# this also receives the highest score in terms of log-loss.
# This indicates that log-loss is not a suitable score for our purpose.
# The differences between saga and liblinear are small or non-existent.
for c in [0.001, 1, 1000, 100000]:
    for solver in ['saga', 'liblinear']:
        print(solver, c)
        lrm = LogisticRegression(penalty='l1', C=c, solver=solver, multi_class='ovr', fit_intercept=True)
        do_logistic_regression(X, y, algorithm=lrm, plot=True, print_coeff=False)
For small values of the regularization parameter C we find that the model tends to predict a constant rate of low-level jets throughout the year. This points to a too strong regularization (recall that C is the inverse regularization strength). There is little difference between the two solvers.
I also played around with the alpha parameter (the regularization strength of SGDClassifier).
# for alpha in ...
for penalty in ['l1', 'l2', 'elasticnet']:
    clf = SGDClassifier(loss='log', penalty=penalty, alpha=0.000003, l1_ratio=0.5, max_iter=1000)
    do_logistic_regression(X, y, algorithm=clf, plot=True, print_coeff=False)
# Load observed low-level jets
lljs_mmij = xr.open_dataset('../../data/interim/lljs_obs_platforms.nc').sel(platform='MMIJ', version='1hmean').drop(['platform', 'version'])
lljs_mmij = lljs_mmij.dropna(dim='time')  # dropna returns a new dataset, so assign it
jetflag = lljs_mmij.falloff.rename('jetflag').to_dataframe().dropna()>2
y = jetflag
X = predictors.loc[jetflag.index]
X.head()
lrm = LogisticRegression(penalty='l1', C=0.05, solver='saga', multi_class='ovr', fit_intercept=True)
do_logistic_regression(X, y, algorithm=lrm, plot=True, print_coeff=False)
# Start anew with a selection of predictor variables
# Stability as the difference between sst and t2m gives better performance than only stable/unstable, and also better than Ri_b
predictors = era_mmij[['sst', 't2m']].to_dataframe()
predictors['stab'] = (predictors.t2m-predictors.sst)#.astype(int)
predictors.drop(['sst', 't2m'], inplace=True, axis=1)
# Lamb weather type as binary encoded gives much better performance than continuous LWT components.
predictors['LWT'] = lwt.LWT
# DOY encoded as sine/cosine contributions gives better results than categorical 'month'
# (factor 2*pi so that the encoding spans a full annual cycle)
doy = predictors.index.dayofyear.values
predictors['DOY_sin'] = np.sin(2*np.pi*doy/365)
predictors['DOY_cos'] = np.cos(2*np.pi*doy/365)
# Adding time doesn't improve the performance
# Binary encoding of LWT gives better results than OHE
enc = ce.BinaryEncoder(cols=['LWT'])
predictors = enc.fit_transform(predictors)
# Show the selection
predictors.head()
y = jetflag
X = predictors.loc[jetflag.index]
lrm = LogisticRegression(penalty='l2', C=1000000, solver='saga', multi_class='ovr', fit_intercept=True)
do_logistic_regression(X, y, algorithm=lrm, plot=True, print_coeff=False)
# Now that we have a sleeker selection of variables, let's try the different regularization methods again
for penalty in ['l1', 'l2', 'elasticnet']:
    clf = SGDClassifier(loss='log', penalty=penalty, alpha=0.000003, l1_ratio=0.5, max_iter=1000)
    do_logistic_regression(X, y, algorithm=clf, plot=True, print_coeff=False)
Let's quickly try whether it is possible to train the model at one platform and then predict for another.
predictors.head()
# Load observed low-level jets at MMIJ as train data
lljs_mmij = xr.open_dataset('../../data/interim/lljs_obs_platforms.nc').sel(platform='MMIJ', version='1hmean').drop(['platform', 'version'])
lljs_mmij = lljs_mmij.dropna(dim='time')
jetflag_mmij = lljs_mmij.falloff.rename('jetflag').to_dataframe().dropna()>2
y_mmij = jetflag_mmij
X_mmij = predictors.loc[jetflag_mmij.index]
# Load observed low-level jets at LEG as test data
lljs_leg = xr.open_dataset('../../data/interim/lljs_obs_platforms.nc').sel(platform='LEG', version='1hmean').drop(['platform', 'version'])
lljs_leg = lljs_leg.dropna(dim='time')
jetflag_leg = lljs_leg.falloff.rename('jetflag').to_dataframe().dropna()>2
y_leg = jetflag_leg
# Determine the predictors for LEG
# NOTE: the file paths and coordinates below still point at the MMIJ location;
# presumably they should reference the LEG equivalents
era_wind = xr.open_dataset('../../data/interim/ERA5platforms/MMIJ.nc').squeeze().sel(level=slice(None, 124, -1))
era_z_nd = xr.open_dataset('../../data/interim/ERA5platforms/MMIJ_z.nc').squeeze()
era_qt = xr.open_dataset('../../data/interim/ERA5platforms/MMIJ_qt.nc').squeeze()
era_sfc = xr.open_mfdataset('../../data/external/ECMWF/ERA5/era5_20*_sfc.nc').sel(longitude=3.3, latitude=52.8, method='nearest')
era_leg = xr.merge([era_wind, era_z_nd, era_sfc]).drop(['latitude', 'longitude'])
predictors_leg = era_leg[['sst', 't2m']].to_dataframe()
predictors_leg['stab'] = (predictors_leg.t2m-predictors_leg.sst)
predictors_leg.drop(['sst', 't2m'], inplace=True, axis=1)
predictors_leg['LWT'] = lwt.LWT
doy = predictors_leg.index.dayofyear.values
predictors_leg['DOY_sin'] = np.sin(2*np.pi*doy/365)
predictors_leg['DOY_cos'] = np.cos(2*np.pi*doy/365)
# NB: to guarantee identical category codes, the encoder fitted on the MMIJ
# training data should ideally be reused here (enc.transform) rather than refitted
enc = ce.BinaryEncoder(cols=['LWT'])
predictors_leg = enc.fit_transform(predictors_leg)
X_leg = predictors_leg.loc[jetflag_leg.index]
# Scale the data
scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_mmij)
X_test = scaler.transform(X_leg)
# Rename platform sets to train/test data
y_train = y_mmij
y_test = y_leg
# Set up and train the model
lrm = LogisticRegression(penalty='l1', C=1000000, solver='saga', multi_class='ovr', fit_intercept=True)
lrm.fit(X_train, y_train.values.ravel())
# Predict the PROBABILITIES of outcome y given X_test
_, y_pred = lrm.predict_proba(X_test).T
y_pred = pd.DataFrame(y_pred, index=y_test.index)
def monthly_RMSE(y_true, y_pred):
    """Compute RMSE on the monthly mean LLJ rate.

    The default score is accuracy, but that evaluates the discrete
    predictions rather than the probability estimates. Log-loss does
    evaluate the probabilities, but favours high-bias/low-variance models.
    """
    monthlymean = lambda df: df.groupby(df.index.month).mean().values*100
    return ((monthlymean(y_pred) - monthlymean(y_true))**2).mean()**.5
print(monthly_RMSE(y_test, y_pred))
# Visualization of predicted seasonal cycle
monthgrouper = lambda df: df.month
fig, ax = plt.subplots()
(y_pred.groupby(monthgrouper).mean()*100).plot(ax=ax)
(y_test.groupby(monthgrouper).mean()*100).plot(ax=ax)
ax.legend(['predicted', 'testdata'])
plt.show()
This doesn't work well. Let's forget about it.
# Load ERA surface data for MMIJ
era_sfc = xr.open_mfdataset('../../data/external/ECMWF/ERA5/era5_20*_sfc.nc')
era_mmij = era_sfc.sel(longitude=3.3, latitude=52.8, method='nearest').drop(['latitude', 'longitude'])
# Stability as the difference between sst and t2m gives better performance than only stable/unstable, and also better than Ri_b
predictors = era_mmij[['sst', 't2m']].to_dataframe()
predictors['stab'] = (predictors.t2m-predictors.sst)#.astype(int)
predictors.drop(['sst', 't2m'], inplace=True, axis=1)
# Lamb weather type as binary encoded gives much better performance than continuous LWT components.
predictors['LWT'] = lwt.LWT
# DOY encoded as sine/cosine contributions gives better results than categorical 'month'
# (factor 2*pi so that the encoding spans a full annual cycle)
doy = predictors.index.dayofyear.values
predictors['DOY_sin'] = np.sin(2*np.pi*doy/365)
predictors['DOY_cos'] = np.cos(2*np.pi*doy/365)
# Adding time doesn't improve the performance
# Binary encoding on LWT gives better results than OHE
enc = ce.BinaryEncoder(cols=['LWT'])
predictors = enc.fit_transform(predictors)
# Scale the continuous predictor variables
for var in ['stab', 'DOY_sin', 'DOY_cos']:
    scaler = preprocessing.StandardScaler()
    predictors[var] = scaler.fit_transform(predictors[var].values.reshape(-1, 1))
# Show the selection
predictors.head()
# Initialize a figure
fig = plt.figure(figsize=(10, 3), dpi=200)
ax1 = fig.add_subplot(221)
ax2 = fig.add_subplot(223)
ax3 = fig.add_subplot(122)
########################################################
# First plot conceptual illustration
def correlated_signal(n_samples, corr, mu=0, sigma=1):
    '''Generate a synthetic autocorrelated time series.
    Borrowed from https://stackoverflow.com/a/33904277/6012085'''
    assert 0 < corr < 1, "Auto-correlation must be between 0 and 1"
    c = mu * (1 - corr)
    sigma_e = np.sqrt((sigma ** 2) * (1 - corr ** 2))
    # Sample the auto-regressive process.
    signal = [c + np.random.normal(0, sigma_e)]
    for _ in range(1, n_samples):
        signal.append(c + corr * signal[-1] + np.random.normal(0, sigma_e))
    return np.array(signal)
x_model = np.arange(1000)
y_model = correlated_signal(1000, 0.75) # np.random.randn(1000)
x_obs = np.arange(200)
y_obs = y_model[:200] + np.random.randn(200)*0.5
x_inf = np.arange(200, 1000)
y_inf = y_model[200:] + np.random.randn(800)*0.5
# Plot the time series
ax1.plot(x_model, y_model, color='k')
ax2.plot(x_obs, y_obs, color='k')
ax2.plot(x_inf, y_inf, color='lightgrey', ls='-')
# Add a fancy box
import matplotlib.patches as mpatches
fancybox = mpatches.FancyBboxPatch([-5, -4], 210, 15, clip_on=False, zorder=5,
boxstyle=mpatches.BoxStyle("Round", pad=0.02), edgecolor='r', facecolor='none', lw=6)
ax2.add_patch(fancybox)
# Add an arrow
ax2.annotate('', (500, 4), (180, 4), ha="center", va="center", annotation_clip=False, fontsize=18, zorder=6,
arrowprops=dict(width=7, headlength=50,headwidth=20, fc="r", ec="r"))
# Aesthetics
ax1.set_xticks([])
ax1.set_yticks([])
ax2.set_xticks([])
ax2.set_yticks([])
ax2.set_xlabel(r'Timeseries')
ax2.set_ylabel('Observations')
ax1.set_ylabel('ERA5')
ax2.set_ylim(-3, 3)
#########################################################
# Now, we add results from the ML procedure
# Load observed low-level jets at MMIJ as train data, and align the model data with the observation period
lljs_mmij = xr.open_dataset('../../data/interim/lljs_obs_platforms.nc').sel(platform='MMIJ', version='1hmean').drop(['platform', 'version'])
lljs_mmij = lljs_mmij.dropna(dim='time')
jetflag_mmij = lljs_mmij.falloff.rename('jetflag').to_dataframe().dropna()>2
y_mmij = jetflag_mmij
X_mmij = predictors.loc[jetflag_mmij.index]
##########################################################
# Step 1: split MMIJ data in two, and train/test the model
n = len(X_mmij)//2
X_train, X_test = X_mmij[:n], X_mmij[n:]
y_train, y_test = y_mmij[:n], y_mmij[n:]
# Set up and train the model
lrm = clf  # reuse the last SGDClassifier from above; alternatively: LogisticRegression(penalty='l1', C=1000, solver='saga', multi_class='ovr', fit_intercept=True)
lrm.fit(X_train, y_train.values.ravel())
# Predict the PROBABILITIES of outcome y given X_test
_, y_pred = lrm.predict_proba(X_test).T
y_pred = pd.DataFrame(y_pred, index=y_test.index)
# Verify the score on the test half of the data
print(monthly_RMSE(y_test, y_pred))
# Visualize the result so far
(y_train.groupby(monthgrouper).mean()*100).plot(ax=ax3, alpha=0.3)
(y_pred.groupby(monthgrouper).mean()*100).plot(ax=ax3, alpha=0.3)
(y_test.groupby(monthgrouper).mean()*100).plot(ax=ax3, alpha=0.3)
##########################################################
# Step 2: train on the full mmij data, predict the future
X_train, X_test = X_mmij, predictors
y_train = y_mmij
# Set up and train the model
lrm = LogisticRegression(penalty='l1', C=1000, solver='saga', multi_class='ovr', fit_intercept=True)
lrm.fit(X_train, y_train.values.ravel())
# Predict the PROBABILITIES of outcome y given X_test
_, y_pred = lrm.predict_proba(X_test).T
y_pred = pd.DataFrame(y_pred, index=X_test.index)
# Visualize the result
(y_train.groupby(monthgrouper).mean()*100).plot(ax=ax3)
(y_pred.groupby(monthgrouper).mean()*100).plot(ax=ax3)
ax3.legend(['2-y train', '2-y pred', '2-y test', '4-y train', '10-y pred'])
ax3.set_ylabel('LLJ rate (%)')
ax3.set_xticks(np.arange(1,13))
ax3.set_xticklabels('jfmamjjasond')
##### Decoration for paper
ax1.set_title('A. Conceptual illustration', loc='right')
ax3.set_title('B. Predicted seasonal cycle')
fig.savefig('../../reports/figures/ML_result', bbox_inches='tight')
plt.show()
Running this cell multiple times gives slightly varying results for the RMSE, but it is always around 1 percentage point (the lowest I saw was 0.7, the highest about 1.3). This is due to a stochastic component in the fitting algorithm. The predicted long-term seasonal cycle is always very similar, though.