K-means clustering with Python


Introduction

K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. The procedure partitions a given data set into a chosen number K of clusters. The main idea is to define K centroids, one for each cluster. Each point in the data set is then assigned to its nearest centroid. Next, the K centroids are re-calculated from the clusters formed in the previous step, and the points are re-assigned to the nearest new centroid. As this loop repeats, the K centroids move step by step until their positions no longer change.
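As a minimal sketch of that loop (not the scikit-learn implementation used later in this post), one can write the iteration with plain NumPy; the random toy data and the choice of K = 3 are assumptions for illustration only:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                          # toy data: 300 points in 2-D
K = 3
centroids = X[rng.choice(len(X), K, replace=False)]    # pick K initial centroids from the data

for _ in range(100):
    # assign every point to its nearest centroid (Euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # re-calculate each centroid as the mean of its assigned points
    # (the sketch assumes no cluster ever becomes empty)
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centroids, centroids):          # stop once the centroids no longer move
        break
    centroids = new_centroids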

Implementation

Scikit-learn provides a full implementation of the K-means algorithm through the KMeans class. Let's have a look at several interesting situations that might occur during data clustering:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

plt.figure(figsize=(12, 12))

n_samples = 1500
random_state = 170
X, y = make_blobs(n_samples=n_samples, random_state=random_state)

# Incorrect number of clusters
y_pred = KMeans(n_clusters=2, random_state=random_state).fit_predict(X)

plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.title("Incorrect Number of Blobs")

# Anisotropicly distributed data
transformation = [[ 0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
X_aniso = np.dot(X, transformation)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_aniso)

plt.subplot(222)
plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
plt.title("Anisotropicly Distributed Blobs")

# Different variance
X_varied, y_varied = make_blobs(n_samples=n_samples,
                                cluster_std=[1.0, 2.5, 0.5],
                                random_state=random_state)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_varied)

plt.subplot(223)
plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
plt.title("Unequal Variance")

# Unevenly sized blobs
X_filtered = np.vstack((X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]))
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_filtered)

plt.subplot(224)
plt.scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)
plt.title("Unevenly Sized Blobs")

plt.show()

At first we wrongly assume that there are 2 clusters in the data (the blobs were generated with 3) and make K-means find them. Later we transform the data anisotropically. Since K-means uses Euclidean distance to associate points with centroids, it does not work well with such non-globular clusters. The last two datasets introduce blobs of unequal variance and size. Unevenly sized blobs pose little problem, but in the unequal-variance case some points of the widest blob are pulled into the neighbouring clusters, because K-means implicitly assumes clusters of similar spread.
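One common way to guard against the first problem, choosing the wrong number of clusters, is to fit KMeans for a range of cluster counts and compare the inertia_ (within-cluster sum of squared distances) of each fit. The sketch below reuses X and random_state from the listing above; the range of 1 to 8 clusters is just an assumption for illustration:

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    inertias.append(km.inertia_)          # within-cluster sum of squares for this k

plt.plot(range(1, 9), inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Inertia")
plt.title("Elbow curve")
plt.show()

The "elbow" of the resulting curve, where adding more clusters stops reducing the inertia substantially, sits near K = 3 for this data set.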

Conclusion

K-means provides us with an easy-to-use clustering algorithm. It is fast, easy to follow, and widely used in various fields, including vector quantization.
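As a rough sketch of that last application (not code from this post), a fitted KMeans model can serve as a codebook for vector quantization: each vector is stored as the index of its nearest centroid and reconstructed from the centroid itself. The random signal and the codebook size of 16 are assumptions for illustration:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
signal = rng.normal(size=1000)           # hypothetical signal to compress
vectors = signal.reshape(-1, 2)          # group samples into 2-D vectors

codebook = KMeans(n_clusters=16, n_init=10, random_state=0).fit(vectors)
codes = codebook.predict(vectors)                    # store only one centroid index per vector
quantized = codebook.cluster_centers_[codes]         # reconstruct the vectors from the codebook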
