Logistic Regression with Scikit-learn: A Beginner's Guide
Logistic regression is a statistical method for modeling the relationship between a categorical dependent variable and one or more independent variables. It is commonly used in fields such as medical research, the social sciences, and economics to analyze and predict the outcome of a binary event, that is, an event with only two possible outcomes. The method also extends to targets with more than two classes.
In this article, we will discuss how to perform logistic regression using scikit-learn, a popular Python library for machine learning.
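As a rough illustration of the idea (not of scikit-learn's internals), logistic regression passes a weighted sum of the inputs through the logistic (sigmoid) function to obtain a probability between 0 and 1. The weights, intercept, and sample values below are made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical coefficients and intercept, chosen only for illustration
weights = np.array([0.8, -0.4])
intercept = -0.2

# one hypothetical sample with two features
x = np.array([1.5, 2.0])

# linear score, then the probability of the positive class
score = np.dot(weights, x) + intercept
probability = sigmoid(score)

# assign the positive class when the probability reaches 0.5
predicted_class = int(probability >= 0.5)
print(probability, predicted_class)
```

Training a model amounts to finding the weights and intercept that make these probabilities agree with the observed labels as closely as possible.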
What is Scikit-learn?
Scikit-learn is an open-source machine learning library for Python. It provides a range of supervised and unsupervised learning algorithms for classification, regression, clustering, and dimensionality reduction. Scikit-learn is built on top of NumPy, SciPy, and matplotlib, which makes it a powerful and easy-to-use tool for data analysis and modeling.
Installing Scikit-learn
Before we dive into the details of logistic regression with Scikit-learn, let's first install the library. You can use pip, the Python package installer, to install scikit-learn. Open your terminal or command prompt and enter the following command:
pip install scikit-learn
If you're using Anaconda, you can install scikit-learn by running the following command in your terminal:
conda install scikit-learn
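A quick way to confirm that the installation worked is to import the package and print its version. This is only a sanity check; the version printed depends on your environment.

```python
import sklearn

# if this import succeeds, the library is installed in the active environment
print("scikit-learn version:", sklearn.__version__)
```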
Logistic Regression with Scikit-learn
Now that scikit-learn is installed, let's see how to perform logistic regression with it. We will use the famous iris dataset, which contains measurements of sepal length, sepal width, petal length, and petal width for three species of iris flowers. Because there are three species, this is a multiclass problem; scikit-learn's LogisticRegression handles it without extra configuration.
Importing the Required Libraries
First, we need to import the required libraries. We will use numpy for numerical operations; pandas is useful if you choose to load the dataset from a file.
import pandas as pd
import numpy as np
Loading the Dataset
Next, we will load the iris dataset. You can download it from the UCI Machine Learning Repository and read it with pandas. Alternatively, you can use the following code to load it directly from scikit-learn:
from sklearn.datasets import load_iris
iris = load_iris()
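Before going further, it can help to look at what the loaded object contains. The sketch below prints the shape of the feature matrix, the feature and species names, and the number of samples per class.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# feature matrix: one row per flower, one column per measurement
print(iris.data.shape)

# names of the four measurements
print(iris.feature_names)

# names of the three species, indexed by the integer labels 0, 1, 2
print(iris.target_names)

# number of samples in each class
print(np.bincount(iris.target))
```

The integer labels in iris.target are what the model is trained on; iris.target_names maps them back to species names.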
Preparing the Data
Before we can train our logistic regression model, we need to prepare the data. We will split the dataset into training and testing sets. The target is already encoded as the integers 0, 1, and 2 (one per species), which LogisticRegression accepts directly, so no further label encoding is needed. One-hot encoding the target with LabelBinarizer would turn it into a two-dimensional array, which the fit method rejects.
from sklearn.model_selection import train_test_split
X = iris.data
y = iris.target
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Training the Model
Now, we can train our logistic regression model using scikit-learn. We will use the LogisticRegression class to create an instance of the model and then fit it to our training data.
from sklearn.linear_model import LogisticRegression
# create an instance of the model (max_iter is raised above the default of 100 to help the solver converge)
lr = LogisticRegression(max_iter=200)
# fit the model to the training data
lr.fit(X_train, y_train)
Making Predictions
Once we have trained our logistic regression model, we can use it to make predictions on new data. We will use the predict method of the model to predict the class of the test data.
# predict the class of the test data
y_pred = lr.predict(X_test)
Evaluating the Model
To evaluate the performance of our logistic regression model, we will calculate the accuracy score and the confusion matrix. The accuracy score measures the proportion of correct predictions, while the confusion matrix counts, for each true class (rows), how many samples were assigned to each predicted class (columns).
from sklearn.metrics import accuracy_score, confusion_matrix
# calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy*100))
# calculate the confusion matrix (rows: true classes, columns: predicted classes)
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)
The output should look similar to this (the exact numbers can vary with the scikit-learn version and solver settings):
Accuracy: 96.67%
Confusion Matrix:
[[11 0 0]
[ 0 13 1]
[ 0 0 5]]
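Beyond accuracy and the confusion matrix, two other views are often useful: the predicted probability of each class, and per-class precision and recall. The sketch below is self-contained and refits the model on the same split; max_iter=200 is an assumption made here to help the solver converge, not a required setting.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

# max_iter=200 is an illustrative choice to help the solver converge
lr = LogisticRegression(max_iter=200)
lr.fit(X_train, y_train)

# class probabilities for the first three test samples (one column per class)
proba = lr.predict_proba(X_test[:3])
print(proba.round(3))

# precision, recall, and F1 score for each species
y_pred = lr.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

Each row of the probability array sums to 1, and predict returns the class whose column holds the largest value.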
Visualizing the Results
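To visualize the results of our logistic regression model, we will use a scatter plot to show the distribution of the iris flowers in the feature space, and we will add the decision boundary of a model to the plot. Because the plot has only two axes, we fit a separate model on the first two features (sepal length and sepal width); the four-feature model above cannot make predictions on two-dimensional grid points.
import matplotlib.pyplot as plt
# keep only the first two features and fit a model on them for plotting
X_2d = X[:, :2]
lr_2d = LogisticRegression(max_iter=200).fit(X_2d, y)
# create a meshgrid of the feature space
x_min, x_max = X_2d[:, 0].min() - 0.5, X_2d[:, 0].max() + 0.5
y_min, y_max = X_2d[:, 1].min() - 0.5, X_2d[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
# predict the class of the meshgrid points
Z = lr_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# plot the decision regions first so the data points are drawn on top
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.RdBu)
# plot the scatter plot of the iris data, colored by class with the same color map
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap=plt.cm.RdBu, edgecolors='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Scatter Plot of Iris Data')
plt.show()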
The scatter plot shows the distribution of the iris flowers over two of the four features: the x-axis represents the sepal length and the y-axis represents the sepal width. Each point in the plot represents an iris flower, and the color of the point indicates its class label (setosa, versicolor, or virginica).
The decision boundary of the logistic regression model is shown as a filled contour plot, which divides the feature space into three regions, one for each class label. The contour plot is drawn with a red-blue color map, where red shades the region for class 0 (setosa), blue shades the region for class 2 (virginica), and a pale intermediate shade marks the region for class 1 (versicolor). Because only two features are used, versicolor and virginica overlap noticeably in this view.
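One way to check what the boundary plot contains is to count the distinct classes predicted across the grid. The sketch below assumes a model fit on the first two features only, and uses a coarser grid step (0.05) and max_iter=200 purely as illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_2d = iris.data[:, :2]  # sepal length and sepal width only
y = iris.target

# a model fit on two features, so it can score points of the 2-D grid
lr_2d = LogisticRegression(max_iter=200).fit(X_2d, y)

# a coarser grid than the plot keeps this check fast
xx, yy = np.meshgrid(
    np.arange(X_2d[:, 0].min() - 0.5, X_2d[:, 0].max() + 0.5, 0.05),
    np.arange(X_2d[:, 1].min() - 0.5, X_2d[:, 1].max() + 0.5, 0.05),
)
Z = lr_2d.predict(np.c_[xx.ravel(), yy.ravel()])

# distinct class labels that appear somewhere on the grid
print(np.unique(Z))

# accuracy on the data when only two features are available
print(lr_2d.score(X_2d, y))
```

Each label printed corresponds to one shaded region in the plot, and the score indicates how much is lost by dropping the two petal measurements.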
Conclusion
Logistic regression is a powerful and widely used statistical method for analyzing and predicting categorical outcomes, from binary events to multiclass problems such as the iris species. Scikit-learn is a popular Python library for machine learning that provides a range of supervised and unsupervised learning algorithms, including logistic regression.
In this article, we have discussed how to perform logistic regression using scikit-learn, and we have used the iris dataset to illustrate the process. We have also shown how to evaluate the performance of the model and visualize the results using a scatter plot and a decision boundary.
If you are new to machine learning, logistic regression with scikit-learn is a great place to start. With a basic understanding of Python and some programming experience, you can easily get started with this powerful method and start exploring the world of machine learning.