Bagging and Pasting#

Data#

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_theme()

X, y = make_moons(
    n_samples=1000,
    noise=0.1,
    random_state=42
)

cluster1 = X[y == 0]
cluster2 = X[y == 1]

plt.scatter(*cluster1.T, label="Cluster 1")
plt.scatter(*cluster2.T, label="Cluster 2")
plt.legend()
plt.show()

# Split training and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
)
../_images/8ea7df4c15549d4ed1c82662868b99bc6e8b5b5351ddb01a2b713de8e80aab43.png

Bagging#

In the following, we define an ensemble classifier consisting of 500 decision trees, and it will be trained on 100 samples.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    # It is an esemble of 500 decision trees
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    
    # Train on 100 samples
    max_samples=100,
    
    # If set true, then the bagging technique is in use
    bootstrap=True,
    
    # Use all available CPU cores
    n_jobs=-1,
    
    # For reproduction
    random_state=42
)

Fit the model:

bag_clf.fit(X_train, y_train)
BaggingClassifier(estimator=DecisionTreeClassifier(), max_samples=100,
                  n_estimators=500, n_jobs=-1, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Evaluate the prediction accuracy on the test dataset:

from sklearn.metrics import accuracy_score

y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)
0.985

Out-of-Bag Evaluation#

Suppose there are in total \(n\) samples in the training dataset.