Before we start: This Python tutorial is part of our series of Python Package tutorials.
Scikit-learn is an open source data analysis library, and the gold standard for Machine Learning (ML) in the Python ecosystem. Key concepts and features include:
- Algorithmic decision-making methods, including:
- Classification: identifying and categorizing data based on patterns.
- Regression: predicting continuous values based on relationships within existing data.
- Clustering: automatic grouping of similar data into groups (clusters).
- Algorithms that support predictive analysis ranging from simple linear regression to neural network pattern recognition.
- Interoperability with NumPy, pandas, and matplotlib libraries.
ML is a technology that enables computers to learn from input data and to build/train a predictive model without explicit programming. ML is a subset of Artificial Intelligence (AI).
Why Use Scikit-Learn For Machine Learning
Whether you are just looking for an introduction to ML, want to get up and running fast, or are looking for the latest ML research tool, you will find that scikit-learn is both well-documented and easy to learn and use. As a high-level library, it lets you define a predictive data model in just a few lines of code, and then use that model to fit your data. It’s versatile and integrates well with other Python libraries, such as matplotlib for plotting, NumPy for array vectorization, and pandas for dataframes.
Basic Machine Learning Concepts
To use scikit-learn, you should first be familiar with some of the terminology typically used in ML projects.
- Accuracy – the fraction of predictions that a classification model got right.
  - In multi-class classification, accuracy is defined as follows:
    Accuracy = Correct Predictions / Total Number of Examples
  - In binary classification, accuracy has the following definition:
    Accuracy = (True Positives + True Negatives) / Total Number of Examples
- Example Data – a particular instance of data, defined as x. There are two categories of data examples:
  - Labeled Data – includes both the feature(s) and the label, defined as:
    {features, label}: (x, y)
  - Unlabeled Data – contains features but not the label, defined as:
    {features, ?}: (x, ?)
- Feature – an input variable. It is a measurable characteristic or property of the thing being observed. Every ML project has one or more features.
- Clustering – a technique that groups data points based on their similarities. Each group is called a Cluster.
- K-Means Clustering – an unsupervised learning technique that finds a fixed number (k) of centroids (means) in the data and assigns each data point to the nearest cluster (a minimal sketch follows this list).
- Model – defines the relationship between features and a label. For example, a Rumor Detection Model that associates certain features with rumors.
- Regression vs Classification – both are models that allow you to make predictions that answer questions, such as which team will win a sporting event.
- Regression models provide a numerical or continuous value.
- Classification models provide a categorical or discrete value.
- Supervised Learning – the algorithm uses a labeled dataset to ‘learn’ how to recognize correct answers, which it can then apply to new, unseen data. The algorithm’s accuracy is then evaluated and refined. Most ML projects use supervised learning.
- Unsupervised Learning – the algorithm tries to make sense of unlabeled data by ‘learning’ features and patterns on its own.
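To make the clustering terminology concrete, here is a minimal K-Means sketch in scikit-learn. The six 2-D points are made-up values chosen to form two obvious groups:

from sklearn.cluster import KMeans
import numpy as np

# Hypothetical 2-D data points forming two loose groups:
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
# Look for k=2 centroids and assign each point to the nearest one:
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two learned centroids
# Assign new, unseen points to the nearest centroid:
print(kmeans.predict([[0, 0], [12, 3]]))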
ML Algorithms
For computers to learn without being explicitly programmed, algorithms are required. Algorithms are merely sets of rules applied to computation.
ML algorithm basic concepts:
- Representation – a way to structure data so that it can be assessed. Examples include decision trees, sets of rules, instances, graphical models, neural networks, support vector machines, and model ensembles.
- Evaluation – given a hypothesis, a way of assessing its validity. Examples include accuracy, precision and recall, squared error, likelihood, posterior probability, cost, margin, entropy, and K-L divergence.
- Optimization – the process of adjusting hyperparameters in order to minimize model error, using techniques like combinatorial optimization, convex optimization, and constrained optimization (a hedged sketch follows this list).
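As an illustration of hyperparameter optimization, here is a minimal sketch using scikit-learn’s GridSearchCV. The parameter grid and the DecisionTreeClassifier settings are illustrative choices, not prescribed values:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

iris = load_iris()
# Hypothetical search space: try several tree depths and leaf sizes,
# scoring each combination by 5-fold cross-validated accuracy:
param_grid = {'max_depth': [2, 3, 4], 'min_samples_leaf': [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(iris.data, iris.target)
print(search.best_params_)  # hyperparameters that minimized error
print(search.best_score_)   # best cross-validated accuracy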
Scikit-Learn ML Algorithms
Examples of the main algorithms and methods used in scikit-learn are listed here in ascending order of complexity:
Linear Regression Algorithm Example
Linear Regression is a supervised ML algorithm in which the predicted output is a continuous value lying on a straight line. It’s used to predict values within the range of a given set of data points, not beyond it.
Simple linear regression uses the slope-intercept form of a straight line:
y = mx + b
where:
- m and b are the variables that the algorithm ‘learns’ in order to produce the most accurate predictions
- x represents the input data
- y represents the prediction
In this example, the linear regression model is fitted to three data points; the learned coefficients can then project an unknown data value onto the same line:
# Import the linear regression model:
from sklearn import linear_model

linreg = linear_model.LinearRegression()
# Use the linear regression model to fit the data:
linreg.fit([[0, 0], [2, 2], [4, 4]], [0, 2, 4])
# Print the learned coefficients (one per input feature):
print(linreg.coef_)
Output:
[0.5 0.5]
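Once fitted, the model can also project new values onto the same line. As a follow-up not in the original snippet, using a made-up input point:

# Predict the output for a new, unseen data point:
print(linreg.predict([[6, 6]]))
# Should print [6.], since 0.5*6 + 0.5*6 lies on the learned line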
Logistic Regression Algorithm Example
Logistic Regression is the go-to method for binary classification problems (e.g., is a given data point On or Off?). An equation similar to the linear regression equation is then used to estimate probabilities (e.g., how likely is it that a specific data point is On or Off?).
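Before the plotting example below, here is a minimal, hedged sketch of logistic regression on its own. The one-dimensional on/off data is invented for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 1-D inputs labeled Off (0) or On (1):
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)
# Probability that a new point is Off (class 0) vs. On (class 1):
print(clf.predict_proba([[2.0]]))
# Discrete class prediction for the same point:
print(clf.predict([[2.0]]))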
In this example, NumPy and matplotlib are used to plot the class probabilities of a sample dataset as predicted by three different classifiers and averaged by a soft VotingClassifier (which averages the predicted probabilities of the three classifiers):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

clf1 = LogisticRegression(max_iter=1000, random_state=123)
clf2 = RandomForestClassifier(n_estimators=100, random_state=123)
clf3 = GaussianNB()
X = np.array([[-1.0, -1.0], [-1.2, -1.3], [-1.4, -2.2], [2.1, 2.2]])
y = np.array([1, 1, 2, 2])
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
                        voting='soft', weights=[1, 2, 3])

# Predict the class probabilities for all classifiers:
probas = [c.fit(X, y).predict_proba(X) for c in (clf1, clf2, clf3, eclf)]

# Get the class probabilities for the first sample in the dataset:
class1_1 = [pr[0, 0] for pr in probas]
class2_1 = [pr[0, 1] for pr in probas]

# Plot the probabilities:
N = 4  # number of groups in the array
ind = np.arange(N)  # group positions
width = 0.275  # bar graph width

# Create a matplotlib subplot for the bar graphs:
fig, ax = plt.subplots()
# Create a bar graph for each individual classifier:
p1 = ax.bar(ind, np.hstack(([class1_1[:-1], [0]])), width,
            color='red', edgecolor='k')
p2 = ax.bar(ind + width, np.hstack(([class2_1[:-1], [0]])), width,
            color='blue', edgecolor='k')
# Create a bar graph for the VotingClassifier's averaged probabilities:
p3 = ax.bar(ind, [0, 0, 0, class1_1[-1]], width, color='red', edgecolor='k')
p4 = ax.bar(ind + width, [0, 0, 0, class2_1[-1]], width,
            color='blue', edgecolor='k')

# Create plot annotations to add descriptive text:
plt.axvline(2.8, color='k', linestyle='dashed')
ax.set_xticks(ind + width)
ax.set_xticklabels(['LogisticRegression\nweight 1',
                    'RandomForestClassifier\nweight 2',
                    'GaussianNB\nweight 3',
                    'VotingClassifier\n(average probabilities)'],
                   rotation=40, ha='right')
plt.ylim([0, 1])
plt.title('Class probabilities for a sample, predicted by different classifiers')
plt.legend([p1[0], p2[0]], ['class 1', 'class 2'], loc='upper left')
plt.tight_layout()
plt.show()
Decision Tree Algorithm Example
A Decision Tree algorithm builds a tree composed of a root node (the first decision point), branch nodes (intermediate decision points that test a feature against a threshold) and leaf nodes (the final predicted outcomes).
In this example, NumPy and matplotlib are used to traverse and plot a decision tree structure, which scikit-learn represents internally as a set of parallel arrays:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Train the decision tree classifier by fitting the DecisionTreeClassifier
# class on the load_iris dataset:
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
clf.fit(X_train, y_train)

# Decision tree structure:
# The decision tree classifier has an attribute called 'tree_' which allows
# access to low-level attributes such as 'node_count' (the total number of
# nodes) and 'max_depth' (the maximum depth of the tree).
# The tree structure is represented as a number of parallel arrays. The i-th
# element of each array holds information about node i. Node 0 is the tree's
# root. Some of the arrays only apply to leaves or split nodes. In this
# example, the arrays 'feature' and 'threshold' only apply to split nodes;
# their values for leaf nodes are therefore arbitrary.
#
# Array definitions:
# children_left[i]: id of the left child of node i, or -1 if a leaf node.
# children_right[i]: id of the right child of node i, or -1 if a leaf node.
# feature[i]: feature used for splitting node i.
# threshold[i]: threshold value at node i.
# n_node_samples[i]: number of training samples reaching node i.
# impurity[i]: the impurity at node i.

n_nodes = clf.tree_.node_count
children_left = clf.tree_.children_left
children_right = clf.tree_.children_right
feature = clf.tree_.feature
threshold = clf.tree_.threshold

# Traverse the tree structure to compute the depth of each node and decide
# whether or not it is a leaf:
node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, 0)]  # start with the root node id (0) and its depth (0)
while len(stack) > 0:
    # Popping ensures each node is visited only once:
    node_id, depth = stack.pop()
    node_depth[node_id] = depth
    # If the left and right child of a node differ, it is a split node:
    is_split_node = children_left[node_id] != children_right[node_id]
    # If a split node, append the left and right children and their depth
    # to `stack` so we can loop through them:
    if is_split_node:
        stack.append((children_left[node_id], depth + 1))
        stack.append((children_right[node_id], depth + 1))
    else:
        is_leaves[node_id] = True

print("The binary tree structure has {n} nodes and has "
      "the following tree structure:\n".format(n=n_nodes))
for i in range(n_nodes):
    if is_leaves[i]:
        print("{space}node={node} is a leaf node.".format(
            space=node_depth[i] * "\t", node=i))
    else:
        print("{space}node={node} is a split node: "
              "go to node {left} if X[:, {feature}] <= {threshold} "
              "else to node {right}.".format(
                  space=node_depth[i] * "\t", node=i,
                  left=children_left[i], feature=feature[i],
                  threshold=threshold[i], right=children_right[i]))

# Plot the decision tree:
tree.plot_tree(clf)
plt.title('Decision tree trained on the iris dataset')
plt.show()
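As a follow-up not part of the original example, the fitted classifier can be evaluated on the held-out test split, which ties back to the accuracy definition given earlier:

from sklearn.metrics import accuracy_score

# Evaluate the decision tree on the held-out test data:
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))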
Advanced ML Algorithms
Random Forest
A Random Forest is a model composed of multiple Decision Trees (an ensemble learning method) whose combined predictions provide better predictive performance than could be obtained from any single tree.
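Here is a minimal sketch of a random forest in scikit-learn, reusing the iris dataset from the decision tree example above (the n_estimators value is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
# An ensemble of 100 decision trees; the majority vote is the prediction:
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # mean accuracy on the test split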
Gradient Boosting
Gradient Boosting is a method that can be used for both regression and classification problems. It builds a predictive model from an ensemble of weak prediction models, typically decision trees, with each new model correcting the errors of those before it.
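A comparable hedged sketch using scikit-learn’s GradientBoostingClassifier on the same data (the hyperparameter values here are illustrative, not tuned):

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
# Build trees sequentially, each one correcting the errors of the last:
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                random_state=0)
gb.fit(X_train, y_train)
print(gb.score(X_test, y_test))  # mean accuracy on the test split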
Why use ActivePython for Data Science
While the open source distribution of Python may be satisfactory for an individual, it doesn’t always meet the support, security, or platform requirements of large organizations.
This is why organizations choose ActivePython for their data science, big data processing and statistical analysis needs.
Pre-bundled with the most important packages Data Scientists need, ActivePython is pre-compiled so you and your team don’t have to waste time configuring the open source distribution. You can focus on what’s important: spending more time building algorithms and predictive models against your big data sources, and less time on system configuration.
ActivePython is 100% compatible with the open source Python distribution, and provides the security and commercial support that your organization requires.
With ActivePython you can explore and manipulate data, run statistical analysis, and deliver visualizations to share insights with your business users and executives sooner, no matter where your data lives.
Some Popular Python Packages You Get Pre-compiled with ActivePython for Data Science/Big Data/Machine Learning
- pandas (data analysis)
- NumPy (multi-dimensional arrays)
- SciPy (algorithms to use with numpy)
- HDF5 (store & manipulate data)
- Matplotlib (data visualization)
- Jupyter (research collaboration)
- PyTables (managing HDF5 datasets)
- HDFS (C/C++ wrapper for Hadoop)
- pymongo (MongoDB driver)
- SQLAlchemy (Python SQL Toolkit)
- redis (Redis access libraries)
- pyMySQL (MySQL connector)
- scikit-learn (machine learning)
- TensorFlow (deep learning with neural networks)
- keras (high-level neural networks API)
Download ActivePython Community Edition to get started or contact us to learn more about using ActivePython in your organization.