K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
Typically, unsupervised algorithms make inferences from datasets using only input vectors without referring to known,
or labelled, outcomes.
The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this
objective, K-means looks for a fixed number (k) of clusters in a dataset.
A cluster refers to a collection of data points grouped together because of certain similarities.
We’ll define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the
imaginary or real location representing the center of the cluster.
Every data point is allocated to one of the clusters by minimizing the in-cluster sum of squares.
In other words, the K-means algorithm identifies k centroids, and then allocates every data point to the nearest
cluster, while keeping the clusters as compact as possible.
The ‘means’ in the K-means refers to averaging of the data; that is, finding the centroid.
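For instance, in a toy one-dimensional cluster, the centroid is just the mean of the points in it:

```python
# Toy 1-D cluster: the centroid is simply the mean of its points.
points = [2.0, 4.0, 6.0]
centroid = sum(points) / len(points)  # 4.0
```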
How K-means works:
To process the learning data, the K-means algorithm starts with a first group of randomly selected centroids,
which are used as the starting points for every cluster, and then performs iterative (repetitive) calculations to optimize the
positions of the centroids.
It halts creating and optimizing clusters when either:
1. The centroids have stabilized: their values no longer change between iterations, so the clustering has converged.
2. The defined number of iterations has been achieved.
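The loop described above can be sketched in plain Python. This is a minimal, illustrative implementation under our own assumptions (the function name `kmeans`, its parameters, and the toy points are ours), not a production clustering routine:

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-means sketch: pick random starting centroids, then
    alternate assignment and update steps until the centroids stop
    moving or max_iters is reached (the two stopping rules above)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids have stabilized
            break
        centroids = new_centroids
    return centroids, clusters

# Toy usage: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0)]
centroids, clusters = kmeans(points, k=2)
```

With well-separated groups like these, the loop settles on one centroid per group after a few iterations regardless of which starting points the random sample picks.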
Unlike many other machine learning techniques, K-means works on unlabeled numerical data rather than data that has
already been labelled, making it a type of unsupervised learning. It is one of the most popular unsupervised learning techniques
due to its simplicity and efficiency, helping us data scientists out when we don’t have the most organized data set.
Pros:
1. Fast and efficient.
2. Works on unlabeled numerical data.
3. Iterative technique.
Cons:
1. Must understand the context of your data well.
2. Have to choose your own k value.
3. Lots of repetition (many iterative passes over the data).
4. Does not perform well when outliers are present.
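On that last point, a single outlier can drag a centroid far away from the bulk of its cluster, because a centroid is just a mean. A quick illustration with made-up numbers:

```python
# An outlier (100.0) drags the cluster mean far from the bulk of the points.
points = [1.0, 2.0, 3.0, 100.0]
centroid = sum(points) / len(points)  # 26.5, nowhere near 1-3
```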
The Mathematics Behind K-Means:
Centroids are computed as the mean of the data points in a cluster. To decide which cluster a point belongs to,
K-means measures how far the point is from each centroid using the Euclidean distance.
It’s simply the straight-line distance formula between two points. In this case, the distance is calculated between
each data point and every centroid, and the point is assigned to the nearest one. Each centroid is then recomputed
as the mean of all the points assigned to its cluster.
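As a quick sketch, the straight-line distance formula generalizes to any number of dimensions by summing the squared differences of each coordinate (the helper name `euclidean` is ours):

```python
import math

def euclidean(p, q):
    """Straight-line (Euclidean) distance between two points:
    sqrt((p1 - q1)^2 + (p2 - q2)^2 + ...) over all coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

distance = euclidean((0, 0), (3, 4))  # the familiar 3-4-5 triangle: 5.0
```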