# Split Time-Series Dataset

Sometimes the part that looks easiest turns out to be the hardest. Splitting a dataset into train, validation and test sets usually isn’t a complicated task. But what happens if you are using time-series data, or you want to customize your split? Then you need to adjust the split rather than use a regular one. In one of my projects I had to prepare my dataset for training and testing, and along the way I came across various ways to split a dataset. I tried several of them until I found the one that worked best for my case.

I’ll go over the different methods and provide code examples for each one of them.

For the sake of argument, I’ll use the Iris dataset in my examples.

`from sklearn.datasets import load_iris`

Then load the iris dataset.

`iris = load_iris()`

Then store the data and target value into separate variables.

`X, y = iris.data, iris.target`

First, we will use the vanilla sklearn **train_test_split** as follows:

`from sklearn.model_selection import train_test_split`

`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)`

In this simple example, I split my data into 80% training and 20% test. I set the random seed to 42, and the outcome is four arrays: features and labels for training, and features and labels for testing. Another important parameter is `shuffle`. If it is True (the default), the data is shuffled before splitting. Pay attention to this one, especially if you’re using time-series data: shuffling can mess up your results, because you don’t want data points from the future to end up in the training set for the past (look-ahead bias).

This example splits your data at a single point, which isn’t very helpful when you have time-series data.
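To make the `shuffle` point concrete, here is a minimal sketch of a chronological split with `train_test_split`. The toy arrays `X_ts` and `y_ts` are hypothetical and stand in for observations that are already sorted by time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy series: 10 observations already sorted by time.
X_ts = np.arange(10).reshape(-1, 1)
y_ts = np.arange(10)

# shuffle=False keeps the original order, so the last 20% becomes the test set.
X_train_ts, X_test_ts, y_train_ts, y_test_ts = train_test_split(
    X_ts, y_ts, test_size=0.2, shuffle=False
)

print("train:", y_train_ts)  # the first 8 observations
print("test:", y_test_ts)    # the last 2 observations
```

With `shuffle=False`, every training sample comes before every test sample, which avoids look-ahead bias for this single split.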

Now, say you don’t want to split your dataset just once. Say you have time-series data and you want to split it at fixed intervals. For this kind of task, you can use **TimeSeriesSplit**, which provides train and test indices to split time-series samples that are observed at fixed time intervals.

In each split, the test indices must be higher than in the previous one, so shuffling is inappropriate with this cross-validator. This is the plain example from the sklearn documentation:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])

tscv = TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None)

for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```
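The constructor above leaves `gap`, `max_train_size` and `test_size` at their defaults. As a rough sketch of what they do, the snippet below sets all three on a hypothetical 12-sample series (`X_demo` and the chosen values are mine, picked only for illustration).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical series of 12 time-ordered observations.
X_demo = np.arange(12).reshape(-1, 1)

# 3 folds, 2 test samples per fold, at most 4 training samples,
# and 1 sample skipped between the end of train and the start of test.
tscv_demo = TimeSeriesSplit(n_splits=3, test_size=2, gap=1, max_train_size=4)

splits = list(tscv_demo.split(X_demo))
for train_index, test_index in splits:
    print("TRAIN:", train_index, "TEST:", test_index)
```

A gap can be useful when neighbouring observations are strongly correlated, and `max_train_size` turns the expanding training window into a sliding one.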

Another option is to use TimeSeriesSplit together with **GridSearchCV**.

For those of you who aren’t familiar with GridSearchCV: in a nutshell, it iterates over specified parameter values for an estimator. You provide the parameter values you want to iterate over and the scoring method used to evaluate each candidate on the held-out folds. The result is the best-scoring parameter combination for your model.

Here is an example of using GridSearchCV along with TimeSeriesSplit:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Predefine the variables
n_splits = 5  # Number of splits
model = RandomForestRegressor()  # I used random forest as my model

# Parameters to grid search over - this is just an example
grid_params = {
    'n_estimators': [int(x) for x in np.linspace(200, 1000, 3)],
    'max_depth': [int(x) for x in np.linspace(5, 55, 11)],
    # 'auto' was removed in scikit-learn 1.3; 1.0 (all features) is its equivalent for regressors
    'max_features': [1.0, 'sqrt', 'log2'],
    'random_state': [42],
}
refit = True  # Refit an estimator using the best found parameters on the whole dataset
scoring = 'neg_mean_squared_error'  # Strategy to evaluate the performance of the cross-validated model on the test set
n_jobs = -1  # Number of jobs to run in parallel; -1 uses all processors

tscv = TimeSeriesSplit(n_splits=n_splits)
grid_search = GridSearchCV(estimator=model, param_grid=grid_params, refit=refit,
                           scoring=scoring, cv=tscv, n_jobs=n_jobs).fit(X, y)

print(f'Model: {model} best params are: {grid_search.best_params_}')
```
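The grid above can take a while to run, so here is a smaller sketch of the same pattern that finishes quickly. It is not the setup from my project: I swap in a `Ridge` model and hypothetical synthetic data (`X_demo`, `y_demo`) purely to show which attributes you can inspect after the search.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Hypothetical synthetic data: 60 time-ordered samples with 3 features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_mean_squared_error",
    cv=TimeSeriesSplit(n_splits=3),
)
search.fit(X_demo, y_demo)

print("best params:", search.best_params_)
print("best CV score (negative MSE):", search.best_score_)
print("number of CV splits used:", search.n_splits_)
```

Because the scoring is the negative mean squared error, a `best_score_` closer to zero is better.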

Lastly, say you don’t want fixed intervals but predefined intervals tailored to your needs. In this case, you can use the **PredefinedSplit** cross-validator.

It provides train/test indices to split data into train/test sets using a predefined scheme that you specify with the `test_fold` parameter.

Again, I’ll use it as the cross-validator in my GridSearchCV. The following example illustrates it best:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, PredefinedSplit

model = RandomForestRegressor()
grid_params = {
    'n_estimators': [int(x) for x in np.linspace(200, 1000, 3)],
    'max_depth': [int(x) for x in np.linspace(5, 55, 11)],
    # 'auto' was removed in scikit-learn 1.3; 1.0 (all features) is its equivalent for regressors
    'max_features': [1.0, 'sqrt', 'log2'],
    'random_state': [42],
}
refit = True
scoring = 'neg_mean_squared_error'
n_jobs = -1
validation_size = 24  # Number of most recent unique dates used for validation

# Assumes X is a pandas DataFrame indexed by 'date' and that y stays
# aligned row by row with X after sorting
X.reset_index(inplace=True)
X.sort_values('date', inplace=True)

all_dates = pd.to_datetime(X['date'].unique()).sort_values()
val_dates = all_dates[-validation_size:]
n_obs = len(X)
n_valid_obs = X['date'].isin(val_dates).sum()

# -1 keeps a sample in the training set only; 0 assigns it to the validation fold
test_fold_encoding = np.concatenate([np.full(n_obs - n_valid_obs, -1),
                                     np.zeros(n_valid_obs, dtype=int)])
cv = PredefinedSplit(test_fold=test_fold_encoding)

# Drop the date column so that only numeric features reach the model
grid_search = GridSearchCV(estimator=model, param_grid=grid_params, refit=refit,
                           scoring=scoring, cv=cv, n_jobs=n_jobs).fit(X.drop(columns='date'), y)

print(f'Model: {model} best params are: {grid_search.best_params_}')
# Credit to: Idan Schatz
```

Note that this example needs a date column in your dataset to decide which rows go into the validation set. PredefinedSplit itself only requires the `test_fold` array.
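To show how a date column can drive the split, here is a minimal sketch that derives the fold labels directly from the dates. The DataFrame `df`, its `feature` column and the validation size of 3 are hypothetical and only there to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import PredefinedSplit

# Hypothetical frame: 10 daily rows with one feature, sorted by date.
df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=10, freq="D"),
    "feature": np.arange(10, dtype=float),
})

validation_size = 3  # number of most recent unique dates kept for validation
val_dates = df["date"].drop_duplicates().sort_values().iloc[-validation_size:]

# 0 marks rows in the validation fold, -1 marks rows used only for training.
test_fold = np.where(df["date"].isin(val_dates), 0, -1)

ps = PredefinedSplit(test_fold=test_fold)
train_index, test_index = next(ps.split())
print("TRAIN:", train_index, "TEST:", test_index)
```

Labelling rows by date membership rather than by position also works when several rows share the same date.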

Remember that when using a validation set, you need to set `test_fold_encoding` to 0 for all samples that are part of the validation set, and to -1 for all other samples. More generally, the entry `test_fold_encoding[i]` is the index of the test set that sample `i` belongs to. You can exclude sample `i` from every test set by setting `test_fold_encoding[i]` to -1.
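The encoding rules above can be checked on a tiny array. In this sketch the six-element `test_fold` is hypothetical: the first two samples are never tested, and the rest are spread over two test folds.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Hypothetical encoding: -1 = never in a test set, 0 and 1 = test fold index.
test_fold = np.array([-1, -1, 0, 0, 1, 1])
ps = PredefinedSplit(test_fold=test_fold)

folds = list(ps.split())
for train_index, test_index in folds:
    print("TRAIN:", train_index, "TEST:", test_index)
# TRAIN: [0 1 4 5] TEST: [2 3]
# TRAIN: [0 1 2 3] TEST: [4 5]
```

Note that the first fold trains on samples 4 and 5, which come after its test samples. For time-series data that is look-ahead bias, which is one reason the validation setup described above uses a single fold of 0s with -1 everywhere else.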

That’s it for now. I hope this article proves useful in your future endeavors. Thank you so much for reading!