Before we start: This Python tutorial is a part of our series of Python Package tutorials.

Scikit-learn is one of the most widely used Machine Learning (ML) libraries in Python. It includes many supervised and unsupervised algorithms that can be used to analyze datasets and make predictions about the data.

This post will show you how to make predictions using a variety of algorithms, including:

  • Linear regression predictions
  • Decision tree predictions
  • Random forest predictions
  • Neural network predictions
  • Bayesian ridge predictions

Shapley Value Regression

The Shapley value is a concept from cooperative game theory that can be used to help explain the output of any machine learning model. In practice, Shapley value regression attempts to resolve a weakness of linear regression: its coefficients become unreliable when the predictor variables are moderately to highly correlated.

In this post, we’ll find Shapley values for each variable in a regression in order to estimate how much weight that variable carries. For any single prediction, the Shapley values of all variables should sum to the difference between that prediction and the model's average prediction.
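The additivity property described above can be checked by hand on a small case. The following is a minimal sketch in plain Python, not the algorithm the shap package uses: the weights, baseline, and sample are hypothetical values, and an "absent" feature is simply replaced by a fixed baseline value, whereas shap averages over background data.

```python
from itertools import permutations

# Hypothetical toy model: a linear function of three features.
WEIGHTS = {"age": 2.0, "bmi": 3.0, "bp": -1.0}
BASELINE = {"age": 1.0, "bmi": 2.0, "bp": 4.0}  # stand-in for the feature averages
SAMPLE = {"age": 3.0, "bmi": 5.0, "bp": 2.0}    # the row being explained

def model(row):
    return sum(WEIGHTS[name] * row[name] for name in WEIGHTS)

def value(coalition):
    """Model output when only the features in `coalition` take the sample's values."""
    row = {name: (SAMPLE[name] if name in coalition else BASELINE[name])
           for name in WEIGHTS}
    return model(row)

def shapley_values():
    """Average each feature's marginal contribution over every feature ordering."""
    names = list(WEIGHTS)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        present = set()
        for name in order:
            before = value(present)
            present.add(name)
            totals[name] += value(present) - before
    return {name: totals[name] / len(orderings) for name in names}

phi = shapley_values()
print(phi)
print(sum(phi.values()), model(SAMPLE) - model(BASELINE))
```

The two numbers on the last line agree: the contributions add up to the gap between this prediction and the baseline prediction. Enumerating every ordering grows factorially with the number of features, which is why practical tools rely on approximations or model-specific shortcuts.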

Linear Regression Example

In this example, a linear regression model is created with random data, and an estimated regression line is displayed:

# Import the packages and classes needed for this example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Create random data with numpy:
rnstate = np.random.RandomState(1)
x = 10 * rnstate.rand(50)
y = 2 * x - 5 + rnstate.randn(50)
# Fit a linear regression model with an intercept, then predict along an evenly spaced grid to draw the best-fit line:
model = LinearRegression(fit_intercept=True)
model.fit(x[:, np.newaxis], y)
xfit = np.linspace(0, 10, 1000)
yfit = model.predict(xfit[:, np.newaxis])
# Plot the estimated linear regression line with matplotlib:
plt.scatter(x, y)
plt.plot(xfit, yfit)
plt.show()

Figure 1. Estimated Linear Regression Line
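For a single feature, the line in Figure 1 can also be computed by hand with the ordinary least squares formulas. The sketch below uses only the standard library; `fit_line` is a hypothetical helper name, and the random data comes from Python's `random` module, so the points differ from the NumPy data in the example above.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Same recipe as the example: y = 2x - 5 plus unit-variance noise.
rng = random.Random(1)
xs = [10 * rng.random() for _ in range(50)]
ys = [2 * x - 5 + rng.gauss(0, 1) for x in xs]

slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

The printed slope and intercept should land close to the true values of 2 and -5, with the gap reflecting the added noise.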

Decision Tree Example

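# Import the module required for this example.
# X_train, X_test, y_train, print_accuracy and shap come from the
# shared setup block in the "Shap Regression Examples" section.
from sklearn import tree
# Create and fit the decision tree regression model:
dtree = tree.DecisionTreeRegressor(min_samples_split=20)
dtree.fit(X_train, y_train)
print_accuracy(dtree.predict)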

# Use Shap explainer to interpret values in the test set:
ex = shap.TreeExplainer(dtree)
shap_values = ex.shap_values(X_test)
# Plot Shap values:
shap.summary_plot(shap_values, X_test)
# Plot BMI (Body Mass Index) values: 
shap.dependence_plot("bmi", shap_values, X_test)

Figure 2.  BMI values distribution in a Shap Decision Tree

Random Forest Example

# Import the class required for this example:
from sklearn.ensemble import RandomForestRegressor
# Create and fit a Random Forest regression model
# (tree ensembles can be explained with shap's fast TreeExplainer):
rforest = RandomForestRegressor(n_estimators=1000, max_depth=None, min_samples_split=2, random_state=0)
rforest.fit(X_train, y_train)
print_accuracy(rforest.predict)

# Use Shap explainer to interpret values in the test set:
explainer = shap.TreeExplainer(rforest)
shap_values = explainer.shap_values(X_test)
# Plot Shap values:
shap.summary_plot(shap_values, X_test)
# Plot BMI values:
shap.dependence_plot("bmi", shap_values, X_test)

 

Figure 3.  BMI values distribution in a Shap Random Forest

Neural Network Example

# Import the library required in this example
# Create the Neural Network regression model:
from sklearn.neural_network import MLPRegressor
nn = MLPRegressor(solver='lbfgs', alpha=1e-1, hidden_layer_sizes=(5, 2), random_state=0)
nn.fit(X_train, y_train)
print_accuracy(nn.predict)
# Use Shap explainer to interpret values in the test set:
explainer = shap.KernelExplainer(nn.predict, X_train_summary)
shap_values = explainer.shap_values(X_test)
# Plot the Shap values:
shap.summary_plot(shap_values, X_test)
# Plot BMI values:
shap.dependence_plot("bmi", shap_values, X_test)

Figure 4.  BMI values distribution in a Shap Neural Network

Shap Regression Examples

The Decision Tree, Random Forest, and Neural Network examples above apply Shap values to three different regression models. All three rely on the shared imports, data, and helper function below, so run this block before any of them:

# Import the libraries required in the Shap examples, and load the diabetes dataset:
import sklearn
from sklearn.model_selection import train_test_split
import numpy as np
import shap
import time
X,y = shap.datasets.diabetes()
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Summarize the training set with 10 k-means centroids,
# each weighted by the number of points it represents:
X_train_summary = shap.kmeans(X_train, 10)
def print_accuracy(f):
    print("Root mean squared test error = {0}".format(np.sqrt(np.mean((f(X_test) - y_test)**2))))
    time.sleep(0.5) # Allow print() to take place before other processes.
shap.initjs()
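The `print_accuracy` helper above reports the root mean squared error (RMSE) on the test set. As a small worked example of that calculation, here is the same arithmetic in plain Python. The `rmse` function name and the four data points are hypothetical, chosen so the result is easy to verify by hand.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values: the errors are 1, -7, 1 and 7.
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [101.0, 143.0, 201.0, 257.0]

# Squared errors 1 + 49 + 1 + 49 = 100, mean 25, square root 5.
print(rmse(predicted, actual))  # 5.0
```

Because the errors are squared before averaging, the two misses of 7 dominate the result, so RMSE penalizes a few large errors more heavily than many small ones.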

Bayesian Ridge Example

Requirement

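The BayesianRidge estimator used in this example is part of scikit-learn's linear_model module, so the code below needs only scikit-learn, NumPy, and matplotlib. The separate Bayesian-Optimization package is optional and is not imported by this example. To install it anyway, enter the following command in a command-line or terminal: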

pip install bayesian-optimization  
or
python -m pip install bayesian-optimization 

In this example, the BayesianRidge estimator class is used to fit a regression model and predict new values. Bayesian ridge regression treats the regression coefficients as random variables with a prior distribution rather than as fixed unknowns, which regularizes the fit and is helpful when data is limited. The output, response ‘y’, is therefore described by a probability distribution around the regression line rather than by a single point estimate.

# Import libraries needed in this example:
from sklearn.linear_model import BayesianRidge
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from numpy import sqrt
# load_boston was removed in scikit-learn 1.2, so this example
# substitutes the bundled diabetes dataset:
x, y = load_diabetes(return_X_y=True)
# Split the data into training and test sets:
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)
# Define the BayesianRidge model with default parameters: 
bay_ridge = BayesianRidge()
# Fit the model to the training data:
bay_ridge.fit(xtrain, ytrain)
# score() returns the coefficient of determination (R-squared),
# measured here on the training data:
score = bay_ridge.score(xtrain, ytrain)
print("BayesianRidge regression model - score (r-squared): %.2f" % score)
# Predict the test data and measure the error:
ypredict = bay_ridge.predict(xtest)
mse = mean_squared_error(ytest, ypredict)
# print("Mean squared error level: %.2f" % mse)
print("BayesianRidge regression - root mean squared error: %.2f" % sqrt(mse))
# Plot the BayesianRidge predictions against
# the actual values in the test set:
xaxis = range(len(ytest))
plt.scatter(xaxis, ytest, s=8, color="blue", label="Actual values")
plt.plot(xaxis, ypredict, lw=0.8, color="red", label="Predicted regression")
plt.legend()
plt.show() 
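To make the idea of a "distribution rather than a point estimate" concrete, here is a minimal sketch of Bayesian linear regression with a single coefficient and no intercept, in plain Python. It is a simplification, not scikit-learn's implementation: the noise precision and prior precision are fixed by hand here, whereas BayesianRidge estimates both from the data. The function names and data points are hypothetical.

```python
import math

def posterior_slope(xs, ys, noise_precision, prior_precision):
    """Posterior mean and variance of w in y = w * x + noise,
    with prior w ~ Normal(0, 1 / prior_precision)."""
    precision = prior_precision + noise_precision * sum(x * x for x in xs)
    mean = noise_precision * sum(x * y for x, y in zip(xs, ys)) / precision
    return mean, 1.0 / precision

def predict(x_new, mean, variance, noise_precision):
    """Predictive mean and standard deviation at x_new."""
    pred_mean = mean * x_new
    # Uncertainty = irreducible noise + uncertainty about the slope.
    pred_var = 1.0 / noise_precision + x_new ** 2 * variance
    return pred_mean, math.sqrt(pred_var)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

# Fit once on three points, then on the same points repeated 20 times.
few = posterior_slope(xs, ys, noise_precision=4.0, prior_precision=1.0)
many = posterior_slope(xs * 20, ys * 20, noise_precision=4.0, prior_precision=1.0)

mean_few, std_few = predict(5.0, *few, noise_precision=4.0)
mean_many, std_many = predict(5.0, *many, noise_precision=4.0)
print(f"3 points:  prediction {mean_few:.2f} +/- {std_few:.2f}")
print(f"60 points: prediction {mean_many:.2f} +/- {std_many:.2f}")
```

With more data the predictive standard deviation shrinks toward the noise floor, while the prior keeps the estimate stable when only a few points are available. In scikit-learn, a comparable per-prediction uncertainty is available by calling `predict` with `return_std=True` on a fitted BayesianRidge model.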
