Supplementary material 3/6

This notebook serves as supplementary material to the manuscript on low-level jets submitted to Wind Energy Science. The supplementary material consists of 6 notebooks:

  1. A very quick overview of the measurement data from the LiDARs.
  2. Overview of the ERA5 data and especially the procedure of aligning it with the observations.
  3. Some error statistics on wind speed from ERA5 and observations
  4. Main results on low-level jets
  5. Reconstruction of seasonal cycle using a machine learning package
  6. Spatial figures from ERA5 (data-intensive).

In this notebook (5/6), we use the scikit-learn package to train a statistical model to predict low-level jet occurrence based on ERA5 data.

Can we statistically infer the long-term LLJ characteristics based on ML?

Intro

In the other notebook I have shown that the ERA5 data encompasses valuable information related to low-level jet occurrence. Even though the 1:1 correspondence of low-level jets between the ERA5 data and the observed data is quite poor, we have shown that the seasonal cycle, the lamb weather type, and atmospheric stability all have a pronounced impact on the probability of low-level jets. This raises the question whether we can correlate these variables to the observed jet occurrence, and use the inferred correlation to postprocess the long-term ERA5 data to derive reliable long-term (climatological) low-level jet characteristics. (This procedure is roughly analogous to Model Output Statistics or Measure-Correlate-Predict methods, whichever you prefer; see the first figure below for a visual illustration of the procedure).

We will use the machine-learning package Scikit-learn to do the statistics for us. Note that even though this introduces a lot of jargon with a 'black-box' connotation, all we do is quite straightforward statistics. In machine learning terms, the aforementioned features (month, lamb weather type, etc.) might be good predictors. Finding a suitable correlation translates to "training a model to predict the most probable outcome", i.e. whether or not a low-level jet will occur given new inputs.

The goal in this notebook is to illustrate that

  • it is possible to train a model to predict the probability of low-level jets with reasonable accuracy, and
  • that these probability estimates can be used to obtain corrected estimates of the long-term low-level jet climatology.

There are many choices involved in setting up such a model:

  • Which variables (or: features) do we use as predictors and how do we represent and preprocess them?
  • what type of model do we use and which algorithm do we use to fit it to the training data (minimize the cost function)?
  • Which type of regularization do we use to prevent overfitting or reduce multicollinearity effects, and what is the best value for the regularization parameter(s)?
  • Which scoring parameter or visualization do we use to evaluate the results?

The possibilities are endless and experienced data-scientists will probably be able to yield better results. We will focus on simplicity, as a proof-of-principle.

Selecting a suitable model

Since we are predicting a binary outcome (a jet will either occur, or not), we find ourselves in the branch of machine learning algorithms known as classification. We use a logistic (aka logit) regression model, i.e. we fit the parameters of a logistic function (aka sigmoid or s-curve) to our training data. The logistic model gives the probability that a low-level jet will occur, given the input parameters.

scikit-learn implements various algorithms that can train (determine the parameters of) the logistic model. The default implementation (sklearn.linear_model.LogisticRegression) uses coordinate descend, but also allows for stochastic average gradient descend solver. An alternative is to use the generalized stochastic gradient descend algorithm ( sklearn.linear_model.SGDClassifier) with a logistic loss function. Other models based on support vector machines, Gaussian processes, neural networks or naive Bayes algorimths are availble as well, but will not be considered in this notebook.

Several regularization settings are avaible for fitting the logistic model. Regularization can prevent 'overfitting' and avoid collinearity effects, that occur if several predictors are correlated. Ridge regression or L2 regularization basically decreases the contribution of (some of) the input parameters. Lasso regression or L2 regularization can set some of the coefficients to zero. In other words, it seeks to prefer 'simpler' models. This is a very effective method to deal with collinearity effects. Elasticnet is a combination of Ridge and Lasso. This option is available for the SGD solver only.

Cross validation can be used to determine the best possible configuration of the model, in particular the strength of the regularization. To this end, the training data is split into several 'folds'. First, the model is trained on the first fold of the data, and the performance is evaluated based on the second fold. Then, the regularization strength is adjusted and the model is trained again, but now the next fold is used to evaluate the model performance, and so on. scikit-learn offers an additional implementation of the logistic regression model with built-in cross-validation. However, the built-in evaluation statistics are not suitable for our application, e.g. the accuracy increases while the model's ability to predict the seasonal cycle clearly decreases.

Note: predicting probabilities rather than actual outcome

The logistic model can be used to predict individual instances: if the probability of a low-level jet (given the input) is larger than 50%, it will predict a jet, otherwise it will not. Let's use an example to illustrate what this implies: if only the months of the year are used as predictors, the highest probability occurs e.g. if month = June. However, the typical low-level jet rate in June in the training data is only about 15%. Given a new instance of the month is June, the model will thus predict a 'no jet' profile, because that is by far the most probable outcome. Now, if we predict 100 instances of the month June, we will get 100 'no jet' outcomes. This does not represent the true expectation. Averaging the probability predictions for those 100 instances, on the other hand, gives us an average probability of 15%. This example illustrates that the logistic model is very useful to derive long-term low-level jet statistics, but if we want to use it to predict individual jet instances, we need to pin down the probability very precisely. That is beyond the scope of this notebook.

Determining the input parameters/predictors

Relevant parameters / promising predictors / (potentially interesting...)

  • month (or week?) of year
  • hour of day
  • lamb weather type
  • modelled wind profile
  • modelled thermodynamic profile
  • historic wind/thermodynamic profiles
  • surrounding wind/thermodynamic structure
  • historic surrounding wind/thermodynamic structure
  • modelled stability
  • modelled surface fluxes
  • modelled (sea) surface temperature
  • modelled baroclinicty
  • modelled sea level pressure (field)
  • tendency of modelled sea level pressure (field)
  • modelled low-level jet flag/falloff/strenght/height

Using categorical variables

Some of the above mentioned variables are categorical, for example the month and the lamb weather type. There are various ways to treat these classes. One way is to create dummy variables. That is: convert one variable with 27 values to 26 variables with only 2 values (1 or 0). However, this quickly increases the number of dimensions. Another method is to convert 27 to a binary representation of the same number (the binary representation of 27 is 11011). By splitting the digits of the binary representation, we can decompose the category into a combination of 5 binary features. A binary encoder is available as a side-package to scikit-learn: category_encoders.

Alternatively, we can use the fact that the month (or week) of the year is a cyclic variable. It can thus be represented in terms of a sine and cosine function. Similarly, the lamb weather type is built up out of a zonal and a meridional component, in conjunction with a vorticity. These raw components supposedly encompass the same (or more) information than their weather type abstractions.

Scaling

In order to use L1-regularization, the input features need to be standardized, otherwise the weights are meaningsless. If we also center the data (i.e. standardization), we can set the fit_intercept option to false.

Feature selection

Some input variables might be correlated, reduntant on just uninformative. It makes sense to train the model with good 'predictor' variables. These can be selected on the basis of physical insight (see above) or statistical preprocessing steps like variance threshold or anova. However, using the L1 (lasso) regularization is able to automatically set many 'irrelevant' input parameters to zero, so that gives us some freedom in experimenting with many (perhaps correlated) input variables.

Reducing dimensionality

Given that we have - in principle - a complete 4D grid of weather data availble as potential input parameters, it makes sense to try and use some of this information in our model. We might, for example, use the entire wind profile in the lower 500 m of the ERA5 data. This, however, introduces 28 new dimensions (14 levels times two wind components), which, moreover, are highly correlated. Therefore, we will use principal component analysis to capture most of the variability of these wind profiles in only a few input parameters.

Note: predicting jet characteristics

This Q&A on Cross Validated touches upon the possibilies to not just predict jet occurrence, but to predict e.g. the jet height and jet strength as well. The way to go would be a two-stage procedure that first determines whether or not a jet will occur (using e.g. a logistic regression model) and, in case of a positive outcome, predict the jet characteristics (using e.g. a linear regression trained on previously observed jet instances only). Although this sounds plausible in theory, it is beyond the scope of this notebook. One of the difficulties is that in this notebook, we will limit ourselves to predicting the low-level jet probability only.

Approach

I will first test the methods based on the IJmuiden site data. I will split it in test and train data to compare methods. Later on, I will try wether the model trained on IJmuiden also performs well for other sites. I might do a small sensitivity test to see the impact of the length of the trainingset. Finally, I will use the 'best' method to correct the observed climatology at each platform and derive 'reliable, long term low-level jet climatologies'.

Note on loss or scoring function

sklearn comes with built in scoring parameters, for logistic regression the default scoring is accuracy. However, this operates on the discrete binary predictions - which are ususally negative. Since we are interested in using the probability information rather than the discrete predictions, we need to use a scoring function that evaluates the output of predict_proba. One such score is the log-loss.

References

Loading stuff

In [2]:
import xarray as xr
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn import preprocessing
from sklearn.metrics import log_loss, make_scorer

import category_encoders as ce

import sys
# Low-level jet detection
sys.path.append('../../code/utils')
import lljs

plt.xkcd()
/home/peter919/miniconda3/envs/thesis/lib/python3.7/site-packages/sklearn/utils/__init__.py:4: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Sequence
Out[2]:
<matplotlib.rc_context at 0x2b1eeb842b70>

Illustrate the goal

The figure below illustrates the MOS/MCP procedure for a continuous variable. The red box represents the 'machine-learning' model. In the first phase, it is trained to identify a suitable correlation between ERA5 and the observations. In the second phase, it is used to predict observed values based on ERA5 and the identified correlation. The method is analogous for a binary variable, but in this case the training model is a logistic regression model and the predicted values are probabilities of the outcome, rather than the discrete output. The ERA5 data may consist of various informative variables.

In [3]:
def correlated_signal(n_samples, corr, mu=0, sigma=1):
    ''' Generate a synthetic autocorrelated time series
        Borrowed from https://stackoverflow.com/a/33904277/6012085 '''
    assert 0 < corr < 1, "Auto-correlation must be between 0 and 1"

    c = mu * (1 - corr)
    sigma_e = np.sqrt((sigma ** 2) * (1 - corr ** 2))

    # Sample the auto-regressive process.
    signal = [c + np.random.normal(0, sigma_e)]
    for _ in range(1, n_samples):
        signal.append(c + corr * signal[-1] + np.random.normal(0, sigma_e))

    return np.array(signal)

fig, axs = plt.subplots(2, 1, figsize=(10,4), sharex=True, sharey=True)
x_model = np.arange(1000)
y_model = correlated_signal(1000, 0.75) # np.random.randn(1000)

x_obs = np.arange(200)
y_obs = y_model[:200] + np.random.randn(200)*0.5

x_inf = np.arange(200, 1000)
y_inf = y_model[200:] + np.random.randn(800)*0.5

# Plot the time series
axs[0].plot(x_model, y_model, color='k')
axs[1].plot(x_obs, y_obs, color='k')
axs[1].plot(x_inf, y_inf, color='lightgrey', ls='-')

# Add a fancy box
import matplotlib.patches as mpatches
fancybox = mpatches.FancyBboxPatch([-5, -4], 210, 15, clip_on=False, zorder=5,
        boxstyle=mpatches.BoxStyle("Round", pad=0.02), edgecolor='r', facecolor='none', lw=10)
axs[1].add_patch(fancybox)

# Add an arrow
axs[1].annotate('', (500, 4), (180, 4), ha="center", va="center", annotation_clip=False, fontsize=18, zorder=6,
            arrowprops=dict(width=7, headlength=50,headwidth=20, fc="r", ec="r"))

# Aestethics
axs[1].set_xticks([])
axs[1].set_yticks([])
axs[1].set_xlabel(r'Timeseries')
axs[1].set_ylabel('Observations')
axs[0].set_ylabel('ERA5')
axs[1].set_ylim(-3, 3)
Out[3]:
(-3, 3)

Load sample data

In [4]:
# Load ERA5 wind and thermodynamic profiles
era_wind = xr.open_dataset('../../data/interim/ERA5platforms/MMIJ.nc').squeeze().sel(level=slice(None, 124, -1))
era_z_nd = xr.open_dataset('../../data/interim/ERA5platforms/MMIJ_z.nc').squeeze()
era_qt = xr.open_dataset('../../data/interim/ERA5platforms/MMIJ_qt.nc').squeeze()
era_sfc = xr.open_mfdataset('../../data/external/ECMWF/ERA5/era5_20*_sfc.nc').sel(longitude=3.3, latitude=52.8, method='nearest')
era_mmij = xr.merge([era_wind, era_z_nd, era_sfc]).drop(['latitude', 'longitude'])

# Detect LLJS in the ERA5 wind profiles (these will be our target arrays as a test dataset)
ds = era_wind
wspd = (era_wind.u**2+era_wind.v**2)**.5
falloff, strength, index = lljs.detect_llj_vectorized(wspd.values, axis=1, output='all', mask_inv=True, inverse=False)
jetflag = xr.DataArray(falloff>2, coords={'time': ds.time}, dims=['time']).to_dataframe(name='jet').astype(int)

# Load Lamb Weather type data
lwtfile = '/home/peter919/NoordzeeData/ERA5/synoptic_classification/data/processed/labels_LWT_Jones.csv'
lwt = pd.read_csv(lwtfile, index_col=0, parse_dates=True)#[['W', 'S', 'Z']] #['LWT']
In [5]:
# start with a simple dataset (only from the ERA5 surface variables)
predictors = era_mmij[['u10', 'v10','u100','v100', 'msl','sst', 't2m']].to_dataframe()
predictors[['LWT_U', 'LWT_V', 'LWT_Z']] = lwt[['W', 'S', 'Z']]

# # Add month info to predictors

# Option 1: convert months to sin/cos component
# month = predictors.index.month.values
# predictors['M_sin'] = np.sin(month*np.pi/6)
# predictors['M_cos'] = np.cos(month*np.pi/6)

# Option 2: More precise: DayOfYear
# doy = predictors.index.dayofyear.values
# predictors['DOY_sin'] = np.sin(doy*np.pi/365)
# predictors['DOY_cos'] = np.cos(doy*np.pi/365)

# Option 3: Encode it (see below)
predictors['month'] = predictors.index.month

# Add Richardson number
rib = lambda df: 9.81/df.t2m*(df.t2m-df.sst)/2/(df.u10**2+df.v10**2)*10*10
predictors['RiB'] = rib(predictors)
predictors['stab'] = (predictors.RiB>0).astype(int)

# Add lamb weather type category
predictors['LWT'] = lwt.LWT

# Encode the categorical variables, either as binary or as dummy
import category_encoders as ce
# enc = ce.BinaryEncoder(cols=['month', 'LWT'])
enc = ce.OneHotEncoder(cols=['month', 'LWT'])
predictors = enc.fit_transform(predictors)

predictors.head()
Out[5]:
month_1 month_2 month_3 month_4 month_5 month_6 month_7 month_8 month_9 month_10 ... u100 v100 msl sst t2m LWT_U LWT_V LWT_Z RiB stab
time
2008-01-01 00:00:00 1 0 0 0 0 0 0 0 0 0 ... -0.675014 1.303021 102765.648438 281.097443 280.378479 0.859834 6.478652 -24.524984 -0.638623 0
2008-01-01 01:00:00 1 0 0 0 0 0 0 0 0 0 ... -1.120747 1.654987 102730.960938 281.097443 280.299591 0.557215 7.263127 -25.076325 -0.386857 0
2008-01-01 02:00:00 1 0 0 0 0 0 0 0 0 0 ... -1.727834 1.935802 102696.703125 281.097443 280.377960 0.218007 7.898596 -25.755912 -0.207993 0
2008-01-01 03:00:00 1 0 0 0 0 0 0 0 0 0 ... -2.171783 2.053440 102655.742188 281.097443 280.409821 0.135683 8.184087 -25.444475 -0.150655 0
2008-01-01 04:00:00 1 0 0 0 0 0 0 0 0 0 ... -2.698639 2.117951 102639.562500 281.097443 280.455322 -0.299570 8.814547 -25.162358 -0.106043 0

5 rows × 53 columns

Initial attempts

Here we do a logistic regression. We split the MMIJ data in a train and test set. We use the ERA5 low-level jets at MMIJ as a target variable (later on, we will use observed low-level jets instead). We compute the seasonal cycle of the test data and the predicted probabilities. The RMSE between these two seasonal cycles is used to quantify the model's performance. We also print the contributions of each variable as determined by the regularization, so that we can see which variables are the most meaningful, according to the algorithm

In [6]:
def do_logistic_regression(X, y, algorithm=None, plot=True, print_coeff=True):
    
    # In the first test phase I'll split the data in half
    half = len(y)//2
    y_train, y_test = y[:half], y[half:]
    X_train, X_test = X[:half], X[half:]

    # Scale the data
    scaler = preprocessing.StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    # Set-up and train the logistic regression model 
    if algorithm==None:
        lrm = LogisticRegression(penalty='l1', C=0.01, solver='saga', multi_class='ovr', fit_intercept=True)
    else:
        lrm = algorithm
    lrm.fit(X_train, y_train.values.ravel())
    
    # Predict the PROBABILITIES of outcome y given X_test
    _, y_pred = lrm.predict_proba(X_test).T
    y_pred = pd.DataFrame(y_pred, index=y_test.index)
    
#     Print scores
#     print('score:', lrm.score(X_test, y_test))
#     print(log_loss(y_test, y_pred))    
    
    def monthly_RMSE(y_true, y_pred):
        """ Compute RMSE on monthly mean LLJ rate
        Default score is accuracy, but this evaluates the discrete
        predictions rather than the probability estimates. Log-loss 
        does evaluate proba, but benefits from high bias/low variance.
        """
        monthlymean = lambda df: df.groupby(df.index.month).mean().values*100
        return ((monthlymean(y_pred) - monthlymean(y_true))**2).mean()**.5
    
    print(monthly_RMSE(y_test, y_pred))
    
    # Visualization of predicted seasonal cycle
    if plot==True:
        monthgrouper = lambda df: df.month
        fig, ax = plt.subplots()
        (y_pred.groupby(monthgrouper).mean()*100).plot(ax=ax)
        (y_test.groupby(monthgrouper).mean()*100).plot(ax=ax)
        ax.legend(['predicted', 'testdata'])
        plt.show()
    
    if print_coeff==True:
        for label, coef in zip(X.columns.values, lrm.coef_[0]):
            print('predictor: {:s}, coeff: {:.2f}'.format(label, coef))
    return #lrm.coef_
    
y = jetflag
X = predictors
do_logistic_regression(X, y)
/home/peter919/miniconda3/envs/thesis/lib/python3.7/site-packages/sklearn/linear_model/sag.py:326: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  "the coef_ did not converge", ConvergenceWarning)
2.0351713966191673
predictor: month_1, coeff: 0.00
predictor: month_2, coeff: 0.00
predictor: month_3, coeff: 0.06
predictor: month_4, coeff: 0.07
predictor: month_5, coeff: 0.09
predictor: month_6, coeff: 0.18
predictor: month_7, coeff: 0.01
predictor: month_8, coeff: -0.10
predictor: month_9, coeff: -0.08
predictor: month_10, coeff: 0.00
predictor: month_11, coeff: 0.00
predictor: month_12, coeff: -0.04
predictor: month_-1, coeff: 0.00
predictor: LWT_1, coeff: 0.00
predictor: LWT_2, coeff: 0.05
predictor: LWT_3, coeff: 0.09
predictor: LWT_4, coeff: 0.03
predictor: LWT_5, coeff: 0.00
predictor: LWT_6, coeff: 0.00
predictor: LWT_7, coeff: 0.00
predictor: LWT_8, coeff: -0.00
predictor: LWT_9, coeff: 0.00
predictor: LWT_10, coeff: 0.00
predictor: LWT_11, coeff: -0.02
predictor: LWT_12, coeff: 0.00
predictor: LWT_13, coeff: 0.00
predictor: LWT_14, coeff: 0.05
predictor: LWT_15, coeff: 0.00
predictor: LWT_16, coeff: -0.05
predictor: LWT_17, coeff: -0.01
predictor: LWT_18, coeff: -0.08
predictor: LWT_19, coeff: -0.07
predictor: LWT_20, coeff: -0.02
predictor: LWT_21, coeff: 0.05
predictor: LWT_22, coeff: 0.06
predictor: LWT_23, coeff: 0.02
predictor: LWT_24, coeff: 0.14
predictor: LWT_25, coeff: 0.01
predictor: LWT_26, coeff: 0.00
predictor: LWT_27, coeff: 0.00
predictor: LWT_-1, coeff: 0.00
predictor: u10, coeff: -0.06
predictor: v10, coeff: -0.03
predictor: u100, coeff: -0.71
predictor: v100, coeff: -0.15
predictor: msl, coeff: 0.20
predictor: sst, coeff: 0.00
predictor: t2m, coeff: 0.10
predictor: LWT_U, coeff: 0.00
predictor: LWT_V, coeff: 0.00
predictor: LWT_Z, coeff: 0.00
predictor: RiB, coeff: 0.00
predictor: stab, coeff: 0.68

interpretation:

We see that the model makes a reasonable estimate of the seasonal cycle, but the RMSE is still about 2 m/s. We also find that most of the input parameters are set to 0, i.e. they are not valuable predictors according to the model.

sensitivity to algorithm and regularization.

We try a number of values for the strength of the regularization. We also compare the coordinate descend and gradient descend solvers.

In [7]:
y = jetflag
X = predictors
# X = predictors[['LWT_U', 'LWT_V', 'LWT_Z', 'sst']]
# X = predictors[['LWT_U', 'LWT_V', 'LWT_Z', 'month_0', 'month_1', 'month_2', 'month_3', 'month_4']]

# This illustrates bias/variance trade-off:
# At very low c (strong regularization) the model prefers to 
# predict a constant LLJ rate which leads to a very constant
# but not very specific low-level jet prediction. Interestingly,
# this also receives the highest score in terms of log-loss.
# This indicates that log-loss is not a suitable score for our purpose.

# The differences between saga and liblinear are small or non-existent.

for c in [0.001, 1, 1000, 100000]:
    for solver in ['saga', 'liblinear']:
        print(solver, c)
        lrm = LogisticRegression(penalty='l1', C=c, solver=solver, multi_class='ovr', fit_intercept=True)
        do_logistic_regression(X, y, algorithm=lrm, plot=True, print_coeff=False)
saga 0.001
3.905120780867142
liblinear 0.001
4.24125083809848
saga 1
/home/peter919/miniconda3/envs/thesis/lib/python3.7/site-packages/sklearn/linear_model/sag.py:326: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  "the coef_ did not converge", ConvergenceWarning)
1.7624360041758471
liblinear 1
1.7495417149344668
saga 1000
1.7602860606865067
liblinear 1000
1.747351713405803
saga 100000
1.760259071780177
liblinear 100000
1.747350067729602

interpretation

for small values of the regularization parameter we find that the model tends to predict a constant rate of low-level jets throughout the year. This points to a too strong regularization. There is little difference between the two algorithms.

Using the SGD algorithm with three different regularization methods.

I played around with the alpha parameter.

In [8]:
# for alpha in ...
for penalty in ['l1', 'l2', 'elasticnet']:
    clf = SGDClassifier(loss='log', penalty=penalty, alpha=0.000003, l1_ratio=0.5, max_iter=1000)
    do_logistic_regression(X, y, algorithm=clf, plot=True, print_coeff=False)
2.481796596130585
2.147304992026117
2.085110781586727

Intermediate conclusions

  • Generally, I can predict the seasonal cycle with a RMSE of about 2%.
  • This is based on 8 years of data
  • Strong regularization leads to way too 'flat' results
  • Generally, L1 seems to perform better than L2, and elastic net sometimes does a great job but sometimes absolutely not
  • Performance sometimes seems better for dummy coded variables than for continous derivatives.
  • Rib is more useful as boolean than as continous variable
  • Difference between liblinear and saga is very small.
  • Perhaps use more 'physical insight' in selecting suitable parameters and then use L2 regularization gives better results?

TODO

  • Apply to observed LLJs
  • Estimate sensitivity to shorter periods.
  • Play more with different input parameters.
In [9]:
# Load observed low-level jets
lljs_mmij = xr.open_dataset('../../data/interim/lljs_obs_platforms.nc').sel(platform='MMIJ', version='1hmean').drop(['platform', 'version'])
lljs_mmij.dropna(dim='time')
Out[9]:
<xarray.Dataset>
Dimensions:   (time: 38160)
Coordinates:
  * time      (time) datetime64[ns] 2011-11-02 2011-11-02T01:00:00 ...
Data variables:
    falloff   (time) float64 0.0 0.0 0.0 0.02635 0.0 0.03307 0.006848 0.1995 ...
    strength  (time) float64 0.0 0.0 0.0 6.196 0.0 6.041 6.407 7.954 9.671 ...
    height    (time) float64 -1.0 -1.0 -1.0 265.0 -1.0 165.0 140.0 140.0 ...
In [10]:
jetflag = lljs_mmij.falloff.rename('jetflag').to_dataframe().dropna()>2
y = jetflag
X = predictors.loc[jetflag.index]
X.head()
Out[10]:
month_1 month_2 month_3 month_4 month_5 month_6 month_7 month_8 month_9 month_10 ... u100 v100 msl sst t2m LWT_U LWT_V LWT_Z RiB stab
time
2011-11-02 00:00:00 0 0 0 0 0 0 0 0 0 0 ... 3.026968 1.132172 101288.609375 287.663757 286.967560 6.337785 9.885352 -28.869112 -0.133943 0
2011-11-02 01:00:00 0 0 0 0 0 0 0 0 0 0 ... 2.848181 1.548656 101309.445312 287.663757 286.884094 5.798060 10.607971 -29.364690 -0.143136 0
2011-11-02 02:00:00 0 0 0 0 0 0 0 0 0 0 ... 2.498162 2.039962 101288.945312 287.663757 286.853210 4.919773 11.674754 -29.685042 -0.151180 0
2011-11-02 03:00:00 0 0 0 0 0 0 0 0 0 0 ... 2.029791 2.772865 101242.367188 287.663757 286.874878 4.360311 12.817601 -29.007795 -0.129699 0
2011-11-02 04:00:00 0 0 0 0 0 0 0 0 0 0 ... 2.307623 3.427338 101261.976562 287.663757 286.935028 4.068055 14.305860 -28.615278 -0.084056 0

5 rows × 53 columns

In [11]:
lrm = LogisticRegression(penalty='l1', C=0.05, solver='saga', multi_class='ovr', fit_intercept=True)
do_logistic_regression(X, y, algorithm=lrm, plot=True, print_coeff=False)
/home/peter919/miniconda3/envs/thesis/lib/python3.7/site-packages/sklearn/linear_model/sag.py:326: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  "the coef_ did not converge", ConvergenceWarning)
1.3548316922633037

Notes

  • Right now the model is very erratic due to the too short training period.
  • Maybe a better selection of input variables can lead to better performance?

Being more selective with the predictor variables

In [12]:
# Start anew with selection of predictor variables

# stability as difference between sst en t2m gives better performance than only stable/unstable, and also better than Ri_b
predictors = era_mmij[['sst', 't2m']].to_dataframe()
predictors['stab'] = (predictors.t2m-predictors.sst)#.astype(int)
predictors.drop(['sst', 't2m'], inplace=True, axis=1)

# Lamb Weather Type as binary encoded gives much better performance than continuous LWT components.
predictors['LWT'] = lwt.LWT

# DOY encoded in sine/cosine contribution gives better results than categorical 'month'
doy = predictors.index.dayofyear.values
predictors['DOY_sin'] = np.sin(doy*np.pi/365)
predictors['DOY_cos'] = np.cos(doy*np.pi/365)

# Adding time doesn't improve the performance

# Binary encoding on LWT gives better results than OHE
enc = ce.BinaryEncoder(cols=['LWT'])
predictors = enc.fit_transform(predictors)

# Show the selection
predictors.head()
Out[12]:
LWT_0 LWT_1 LWT_2 LWT_3 LWT_4 LWT_5 stab DOY_sin DOY_cos
time
2008-01-01 00:00:00 0 0 0 0 0 1 -0.718964 0.008607 0.999963
2008-01-01 01:00:00 0 0 0 0 0 1 -0.797852 0.008607 0.999963
2008-01-01 02:00:00 0 0 0 0 0 1 -0.719482 0.008607 0.999963
2008-01-01 03:00:00 0 0 0 0 0 1 -0.687622 0.008607 0.999963
2008-01-01 04:00:00 0 0 0 0 0 1 -0.642120 0.008607 0.999963
In [13]:
y = jetflag
X = predictors.loc[jetflag.index]
lrm = LogisticRegression(penalty='l2', C=1000000, solver='saga', multi_class='ovr', fit_intercept=True)
do_logistic_regression(X, y, algorithm=lrm, plot=True, print_coeff=False)
0.8577858017858158

Note

  • Nice! Now we can predict the seasonal cycle to within 1 m/s RMSE.
  • The model seems to perform much less well for shorter time periods than ~2 years. That means we really need to aggregate all platform data in order to get sensible results.
In [14]:
# Now that we have a sleaker selection of variables, let's try different regularization methods
for penalty in ['l1', 'l2', 'elasticnet']:
    clf = SGDClassifier(loss='log', penalty=penalty, alpha=0.000003, l1_ratio=0.5, max_iter=1000)
    do_logistic_regression(X, y, algorithm=clf, plot=True, print_coeff=False)
0.7286380986600616
0.9277145434186261
1.0046168546300525

Note

  • this algorithm does not give robust results. Sometimes performance is better than other times.
  • I'll continue to work with the saga solver.

Cross-validate to different platforms

Let's quickly try if it is possible to train the model at 1 platform and then predict for another.

In [15]:
predictors.head()
Out[15]:
LWT_0 LWT_1 LWT_2 LWT_3 LWT_4 LWT_5 stab DOY_sin DOY_cos
time
2008-01-01 00:00:00 0 0 0 0 0 1 -0.718964 0.008607 0.999963
2008-01-01 01:00:00 0 0 0 0 0 1 -0.797852 0.008607 0.999963
2008-01-01 02:00:00 0 0 0 0 0 1 -0.719482 0.008607 0.999963
2008-01-01 03:00:00 0 0 0 0 0 1 -0.687622 0.008607 0.999963
2008-01-01 04:00:00 0 0 0 0 0 1 -0.642120 0.008607 0.999963
In [16]:
# Load observed low-level jets at MMIJ as train data
lljs_mmij = xr.open_dataset('../../data/interim/lljs_obs_platforms.nc').sel(platform='MMIJ', version='1hmean').drop(['platform', 'version'])
lljs_mmij.dropna(dim='time')

jetflag_mmij = lljs_mmij.falloff.rename('jetflag').to_dataframe().dropna()>2
y_mmij = jetflag_mmij
X_mmij = predictors.loc[jetflag_mmij.index]

# Load observed low-level jets at LEG as test data
lljs_leg = xr.open_dataset('../../data/interim/lljs_obs_platforms.nc').sel(platform='LEG', version='1hmean').drop(['platform', 'version'])
lljs_leg.dropna(dim='time')
jetflag_leg = lljs_leg.falloff.rename('jetflag').to_dataframe().dropna()>2
y_leg = jetflag_leg

# Determine the predictors for LEG
era_wind = xr.open_dataset('../../data/interim/ERA5platforms/MMIJ.nc').squeeze().sel(level=slice(None, 124, -1))
era_z_nd = xr.open_dataset('../../data/interim/ERA5platforms/MMIJ_z.nc').squeeze()
era_qt = xr.open_dataset('../../data/interim/ERA5platforms/MMIJ_qt.nc').squeeze()
era_sfc = xr.open_mfdataset('../../data/external/ECMWF/ERA5/era5_20*_sfc.nc').sel(longitude=3.3, latitude=52.8, method='nearest')
era_leg = xr.merge([era_wind, era_z_nd, era_sfc]).drop(['latitude', 'longitude'])
predictors_leg = era_leg[['sst', 't2m']].to_dataframe()
predictors_leg['stab'] = (predictors_leg.t2m-predictors_leg.sst)
predictors_leg.drop(['sst', 't2m'], inplace=True, axis=1)
predictors_leg['LWT'] = lwt.LWT
doy = predictors_leg.index.dayofyear.values
predictors_leg['DOY_sin'] = np.sin(doy*np.pi/365)
predictors_leg['DOY_cos'] = np.cos(doy*np.pi/365)
enc = ce.BinaryEncoder(cols=['LWT'])
predictors_leg = enc.fit_transform(predictors_leg)
X_leg = predictors_leg.loc[jetflag_leg.index]
In [17]:
# Scale the data
scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_mmij)
X_test = scaler.transform(X_leg)

# Rename platform sets to train/test data
y_train = y_mmij
y_test = y_leg

# Set up and train the model
lrm = LogisticRegression(penalty='l1', C=1000000, solver='saga', multi_class='ovr', fit_intercept=True)
lrm.fit(X_train, y_train.values.ravel())


# Predict the PROBABILITIES of outcome y given X_test
_, y_pred = lrm.predict_proba(X_test).T
y_pred = pd.DataFrame(y_pred, index=y_test.index)

def monthly_RMSE(y_true, y_pred):
    """ Compute RMSE on monthly mean LLJ rate
    Default score is accuracy, but this evaluates the discrete
    predictions rather than the probability estimates. Log-loss 
    does evaluate proba, but benefits from high bias/low variance.
    """
    monthlymean = lambda df: df.groupby(df.index.month).mean().values*100
    return ((monthlymean(y_pred) - monthlymean(y_true))**2).mean()**.5

print(monthly_RMSE(y_test, y_pred))

# Visualization of predicted seasonal cycle
monthgrouper = lambda df: df.month
fig, ax = plt.subplots()
(y_pred.groupby(monthgrouper).mean()*100).plot(ax=ax)
(y_test.groupby(monthgrouper).mean()*100).plot(ax=ax)
ax.legend(['predicted', 'testdata'])
plt.show()
2.821717700953895

Interpretation:

This doesn't work well. Let's forget about it.

For paper: MMIJ 2 years train/validate and then 4 years train/10y predict

In [18]:
# Load ERA surface data for MMIJ
era_sfc = xr.open_mfdataset('../../data/external/ECMWF/ERA5/era5_20*_sfc.nc')
era_mmij = era_sfc.sel(longitude=3.3, latitude=52.8, method='nearest').drop(['latitude', 'longitude'])

# stability as difference between sst en t2m gives better performance than only stable/unstable, and also better than Ri_b
predictors = era_mmij[['sst', 't2m']].to_dataframe()
predictors['stab'] = (predictors.t2m-predictors.sst)#.astype(int)
predictors.drop(['sst', 't2m'], inplace=True, axis=1)

# Lamb Weather Type as binary encoded gives much better performance than continuous LWT components.
predictors['LWT'] = lwt.LWT

# DOY encoded in sine/cosine contribution gives better results than categorical 'month'
doy = predictors.index.dayofyear.values
predictors['DOY_sin'] = np.sin(doy*np.pi/365)
predictors['DOY_cos'] = np.cos(doy*np.pi/365)

# Adding time doesn't improve the performance

# Binary encoding on LWT gives better results than OHE
enc = ce.BinaryEncoder(cols=['LWT'])
predictors = enc.fit_transform(predictors)

# Scale the continuous predictor variables
for var in ['stab', 'DOY_sin', 'DOY_cos']:
    scaler = preprocessing.StandardScaler()
    predictors[var] = scaler.fit_transform(predictors[var].values.reshape(-1, 1))

# Show the selection
predictors.head()
Out[18]:
LWT_0 LWT_1 LWT_2 LWT_3 LWT_4 LWT_5 stab DOY_sin DOY_cos
time
2008-01-01 00:00:00 0 0 0 0 0 1 0.078971 -2.035981 1.41863
2008-01-01 01:00:00 0 0 0 0 0 1 0.035626 -2.035981 1.41863
2008-01-01 02:00:00 0 0 0 0 0 1 0.078686 -2.035981 1.41863
2008-01-01 03:00:00 0 0 0 0 0 1 0.096192 -2.035981 1.41863
2008-01-01 04:00:00 0 0 0 0 0 1 0.121193 -2.035981 1.41863
In [27]:
# Initialize a figure
fig = plt.figure(figsize=(10, 3), dpi=200)
ax1 = fig.add_subplot(221)
ax2 = fig.add_subplot(223)
ax3 = fig.add_subplot(122)

########################################################
# First plot conceptual illustration
def correlated_signal(n_samples, corr, mu=0, sigma=1):
    ''' Generate a synthetic autocorrelated time series
        Borrowed from https://stackoverflow.com/a/33904277/6012085 '''
    assert 0 < corr < 1, "Auto-correlation must be between 0 and 1"

    c = mu * (1 - corr)
    sigma_e = np.sqrt((sigma ** 2) * (1 - corr ** 2))

    # Sample the auto-regressive process.
    signal = [c + np.random.normal(0, sigma_e)]
    for _ in range(1, n_samples):
        signal.append(c + corr * signal[-1] + np.random.normal(0, sigma_e))

    return np.array(signal)

x_model = np.arange(1000)
y_model = correlated_signal(1000, 0.75) # np.random.randn(1000)

x_obs = np.arange(200)
y_obs = y_model[:200] + np.random.randn(200)*0.5

x_inf = np.arange(200, 1000)
y_inf = y_model[200:] + np.random.randn(800)*0.5

# Plot the time series
ax1.plot(x_model, y_model, color='k')
ax2.plot(x_obs, y_obs, color='k')
ax2.plot(x_inf, y_inf, color='lightgrey', ls='-')

# Add a fancy box
import matplotlib.patches as mpatches
fancybox = mpatches.FancyBboxPatch([-5, -4], 210, 15, clip_on=False, zorder=5,
        boxstyle=mpatches.BoxStyle("Round", pad=0.02), edgecolor='r', facecolor='none', lw=6)
ax2.add_patch(fancybox)

# Add an arrow
ax2.annotate('', (500, 4), (180, 4), ha="center", va="center", annotation_clip=False, fontsize=18, zorder=6,
            arrowprops=dict(width=7, headlength=50,headwidth=20, fc="r", ec="r"))

# Aestethics
ax1.set_xticks([])
ax1.set_yticks([])
ax2.set_xticks([])
ax2.set_yticks([])
ax2.set_xlabel(r'Timeseries')
ax2.set_ylabel('Observations')
ax1.set_ylabel('ERA5')
ax2.set_ylim(-3, 3)

#########################################################
# Now, we add results from the ML procedure

# Load observed low-level jets at MMIJ as train data, and align the model data with the observation period
lljs_mmij = xr.open_dataset('../../data/interim/lljs_obs_platforms.nc').sel(platform='MMIJ', version='1hmean').drop(['platform', 'version'])
lljs_mmij.dropna(dim='time')

jetflag_mmij = lljs_mmij.falloff.rename('jetflag').to_dataframe().dropna()>2
y_mmij = jetflag_mmij
X_mmij = predictors.loc[jetflag_mmij.index]

##########################################################
# Step 1: split MMIJ data in two, and train/test the model
n = len(X_mmij)//2
X_train, X_test = X_mmij[:n], X_mmij[n:]
y_train, y_test = y_mmij[:n], y_mmij[n:]

# Set up and train the model
lrm = clf#LogisticRegression(penalty='l1', C=1000, solver='saga', multi_class='ovr', fit_intercept=True)
lrm.fit(X_train, y_train.values.ravel())

# Predict the PROBABILITIES of outcome y given X_test
_, y_pred = lrm.predict_proba(X_test).T
y_pred = pd.DataFrame(y_pred, index=y_test.index)

# Verify the score of the training data
print(monthly_RMSE(y_test, y_pred))

# Visualize the result so far
(y_train.groupby(monthgrouper).mean()*100).plot(ax=ax3, alpha=0.3)
(y_pred.groupby(monthgrouper).mean()*100).plot(ax=ax3, alpha=0.3)
(y_test.groupby(monthgrouper).mean()*100).plot(ax=ax3, alpha=0.3)


##########################################################
# Step 2: train on the full mmij data, predict the future
X_train, X_test = X_mmij, predictors
y_train = y_mmij

# Set up and train the model
lrm = LogisticRegression(penalty='l1', C=1000, solver='saga', multi_class='ovr', fit_intercept=True)
lrm.fit(X_train, y_train.values.ravel())

# Predict the PROBABILITIES of outcome y given X_test
_, y_pred = lrm.predict_proba(X_test).T
y_pred = pd.DataFrame(y_pred, index=X_test.index)

# Visualize the result
(y_train.groupby(monthgrouper).mean()*100).plot(ax=ax3)
(y_pred.groupby(monthgrouper).mean()*100).plot(ax=ax3)

ax3.legend(['2-y train', '2-y pred', '2-y test', '4-y train', '10-y pred'])

ax3.set_ylabel('LLJ rate (%)')
ax3.set_xticks(np.arange(1,13))
ax3.set_xticklabels('jfmamjjasond')

##### Decoration for paper
ax1.set_title('A. Conceptual illustration', loc='right')
ax3.set_title('B. Predicted seasonal cycle')
fig.savefig('../../reports/figures/ML_result', bbox_inches='tight')
plt.show()
0.7145283444890124

Interpretation

Running this cell multiple times gives slightly varying results for the RMSE, but it is always around 1 m/s (lowest that I saw was 0.7, highest about 1.3). This is due to a stochastic component in the fitting algorithm. The precicted long-term seasonal cycle is always very similar, though.