Visualizing Sklearn Cross-validation: K-Fold, Shuffle & Split, and Time Series Split | by Boriharn K | Jul, 2023


K-fold is a common method for cross-validation. First, all the data are divided into k folds. Then, in each iteration, the learning model is trained on the training set (k-1 folds) and validated on the testing set (the remaining fold).

Normally, the folds obtained from K-fold cross-validation are as equal in size as possible. Next, we are going to walk through the K-fold cross-validation process.
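To make the splitting concrete, here is a minimal, self-contained sketch (using Sklearn's KFold class, which we will import properly below) showing how 10 samples are divided into 3 folds of sizes as equal as possible:

import numpy as np
from sklearn.model_selection import KFold

# 10 samples split into 3 folds: fold sizes are 4, 3, and 3
toy = np.arange(10)
for train_idx, test_idx in KFold(n_splits=3).split(toy):
    print('train:', train_idx, 'test:', test_idx)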

Import libraries and load data

For example, this article will work with the wine dataset, which can be loaded directly from the Sklearn library. The dataset is a copy of the UCI ML Wine dataset, shared under the CC BY 4.0 license.

In total, 13 constituents were measured in each of the three types of wine. These attributes will be used to build a classification model for predicting the wine class.

import numpy as np
import pandas as pd
from sklearn.datasets import load_wine

data = load_wine()
X = pd.DataFrame(data=data.data, columns=data.feature_names)
y = pd.DataFrame(data=data.target, columns=['class'])
df = pd.concat([X, y], axis=1)
df.head()
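Before splitting, a quick sanity check of the dimensions can be useful: the wine dataset has 178 samples, 13 feature columns plus the class column, and three classes.

# quick sanity check: 178 rows, 13 features + 1 class column
print(df.shape)                     # (178, 14)
print(df['class'].value_counts())   # class 1: 71, class 0: 59, class 2: 48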

The next steps can be explained as follows: we start by applying the KFold class from Sklearn to split the data into training and testing sets. The number of folds is specified with the n_splits parameter.

Then, in each iteration, a Support Vector Machine (SVM) classifier is built with svm.SVC to classify the wine classes. Lastly, the score method is used to measure the mean accuracy of the model on the testing set.

These steps can be performed with a for loop in Python, as shown in the code below.

from sklearn.model_selection import KFold
from sklearn import svm

kf = KFold(n_splits=10)
# a list for keeping the training indices, testing indices, and obtained score
keep = []
for train, test in kf.split(df):
    # all columns except the last are features; the last column is the class
    X_train = df.iloc[train, :-1]
    y_train = df.iloc[train, -1]
    X_test = df.iloc[test, :-1]
    y_test = df.iloc[test, -1]
    clf = svm.SVC(kernel='linear').fit(X_train, y_train)
    score = clf.score(X_test, y_test)

    keep.append([train, test, score])
    print(score)
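As a side note, if only the scores are needed (without the fold indices we keep for plotting later), the same per-fold accuracies can be obtained in a single call with Sklearn's cross_val_score; a minimal sketch:

from sklearn.model_selection import cross_val_score

# equivalent one-liner: evaluate the linear SVM on the same 10 folds
scores = cross_val_score(svm.SVC(kernel='linear'),
                         df.iloc[:, :-1], df.iloc[:, -1],
                         cv=KFold(n_splits=10))
print(scores)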

Now that we have data and accuracy scores from the iterations, let’s define a function to create a DataFrame for plotting.

def create_df(input_):
    # input_ holds [training indices, testing indices, score] of one iteration
    df_train = pd.DataFrame(zip(input_[0], len(input_[0])*['train']),
                            columns=['index', 'group'])
    df_test = pd.DataFrame(zip(input_[1], len(input_[1])*['test']),
                           columns=['index', 'group'])
    df_comb = pd.concat([df_train, df_test])
    df_comb['score'] = len(df_comb)*[input_[2]]
    return df_comb

#create a DataFrame from the list
keep_df = [create_df(i) for i in keep]
df_in = pd.concat(keep_df)
df_in.reset_index(inplace=True, drop=True)
df_in.head()

Assign the iteration number to each row of the DataFrame.

# create a list of numbers for labeling the n-th iteration
list_num = [i[0] + 1 for i in list(enumerate(keep))]
list_num.reverse()

list_it = [len(df)*[i] for i in list_num]
df_kf = pd.DataFrame(sum(list_it,[]), columns=['CV iteration'])
df_kf.reset_index(inplace=True, drop=True)

df_cv = pd.concat([df_in, df_kf], axis=1)
df_cv.head()
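As an aside, the 'CV iteration' column built above could also be created more compactly with NumPy (already imported as np), repeating each iteration number once per row of df:

# equivalent: iteration numbers 10 down to 1, each repeated len(df) times
df_kf_alt = pd.DataFrame({'CV iteration': np.repeat(np.arange(len(keep), 0, -1),
                                                    len(df))})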

Visualizing K-Fold cross-validation iterations

Next, we can plot the process with scatter plots using Plotly, a useful data visualization library that can help build an interactive chart with just a few lines of code.

import plotly.express as px
fig1 = px.scatter(df_cv, x='index', y='CV iteration', color='group',
                  color_discrete_map={'test': 'red', 'train': 'blue'})

fig1.show()

Plotting the K-fold cross-validation iterations. Image by Author.

It can be noticed from the scatter plot that the testing fold shifts across the dataset in successive iterations, so every observation is used for validation exactly once.

Visualizing K-Fold cross-validation results

Let’s continue by plotting the accuracy scores on the chart to get more information. Filter the DataFrame to keep only the testing-set rows, which carry the accuracy scores.

df_score = df_cv[df_cv['group'].isin(['test'])]
df_score.head()

Plot the score values using a color scale.

import plotly.express as px
fig2 = px.scatter(df_score, x='index', y='CV iteration', color='score',
                  color_continuous_scale=px.colors.sequential.YlOrRd_r,
                  range_color=(0.6, 1))
fig2.update_layout(coloraxis_colorbar_x=-0.15)
fig2.show()

Plotting the K-fold cross-validation results. Image by Author.

Plotting the scores makes it easier to compare the outcomes. We can tell from the chart that the accuracy obtained in the 7th iteration is the lowest among all the folds.
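This can also be confirmed programmatically; for instance, the following snippet (one possible approach) picks out the iteration with the lowest accuracy:

# each iteration has a single score, so taking the first value per group suffices
worst_it = df_score.groupby('CV iteration')['score'].first().idxmin()
print(worst_it)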

Bonus!!

Fortunately, we can combine the scatter plots to see both the process and the validation scores in the same plot with Plotly. The results will be presented as an interactive chart.

import plotly.express as px
fig1 = px.scatter(df_cv, x='index', y='CV iteration', color='group',
                  color_discrete_map={'test': 'red', 'train': 'blue'})

fig2 = px.scatter(df_score, x='index', y='CV iteration', color='score',
                  color_continuous_scale=px.colors.sequential.YlOrRd_r,
                  range_color=(0.65, 1))
fig2.update_layout(coloraxis_colorbar_x=-0.15)

# overlay the train/test scatter on top of the score-colored scatter
fig2.add_traces(list(fig1.select_traces()))
fig2.show()

Ta-da…

Combining the K-fold cross-validation iterations plot and results plot. Image by Author.
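If you want to keep the interactive chart outside the notebook, Plotly can export it as a standalone HTML file (the filename below is just an example):

# save the combined interactive chart as a standalone HTML file
fig2.write_html('kfold_cv_visualization.html')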


