K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
Typically, unsupervised algorithms make inferences from datasets using only input vectors without referring to known,
or labelled, outcomes.
The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this
objective, K-means looks for a fixed number (k) of clusters in a dataset.
A cluster refers to a collection of data points grouped together because of certain similarities.
We’ll define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the
imaginary or real location representing the center of the cluster.
Every data point is allocated to one of the clusters by minimizing the in-cluster sum of squares.
In other words, the K-means algorithm identifies k centroids, and then allocates every data point to the nearest
cluster, while keeping the clusters as compact as possible.
The ‘means’ in the K-means refers to averaging of the data; that is, finding the centroid.
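For instance, in a toy one-dimensional cluster, the centroid is just the mean of the points in it:

```python
# Toy 1-D cluster: the centroid is simply the mean of its points.
points = [2.0, 4.0, 6.0]
centroid = sum(points) / len(points)  # 4.0
```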
How K-means works:
To process the learning data, the K-means algorithm starts with a first group of randomly selected centroids,
which are used as the starting points for every cluster, and then performs iterative (repetitive) calculations to optimize the
positions of the centroids.
It halts creating and optimizing clusters when either:
1. The centroids have stabilized: their values no longer change between iterations, so the clustering has converged.
2. The defined number of iterations has been achieved.
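The loop described above can be sketched in plain Python. This is a minimal, illustrative implementation under our own assumptions (the function name `kmeans`, its parameters, and the toy points are ours), not a production clustering routine:

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-means sketch: pick random starting centroids, then
    alternate assignment and update steps until the centroids stop
    moving or max_iters is reached (the two stopping rules above)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids have stabilized
            break
        centroids = new_centroids
    return centroids, clusters

# Toy usage: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0)]
centroids, clusters = kmeans(points, k=2)
```

With well-separated groups like these, the loop settles on one centroid per group after a few iterations regardless of which starting points the random sample picks.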
Unlike many other machine learning techniques, K-means works on unlabeled numerical data rather than data that has
already been labelled, making it a type of unsupervised learning. It is one of the most popular unsupervised learning techniques
due to its simplicity and efficiency, helping us data scientists out when we don’t have the most organized data set.
Pros:
1. Fast and efficient.
2. Works on unlabeled numerical data.
3. Iterative technique.
Cons:
1. Must understand the context of your data well.
2. Have to choose your own k value.
3. Lots of repetition (many iterative passes over the data).
4. Does not perform well when outliers are present.
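On that last point, a single outlier can drag a centroid far away from the bulk of its cluster, because a centroid is just a mean. A quick illustration with made-up numbers:

```python
# An outlier (100.0) drags the cluster mean far from the bulk of the points.
points = [1.0, 2.0, 3.0, 100.0]
centroid = sum(points) / len(points)  # 26.5, nowhere near 1-3
```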
The Mathematics Behind K-Means:
Centroids are computed as the mean of the data points in a cluster. To decide which cluster a point belongs to,
K-means measures how far the point is from each centroid using the Euclidean distance.
It’s simply the straight-line distance formula between two points. In this case, the distance is calculated between
each data point and every centroid, and the point is assigned to the nearest one. Each centroid is then recomputed
as the mean of all the points assigned to its cluster.
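As a quick sketch, the straight-line distance formula generalizes to any number of dimensions by summing the squared differences of each coordinate (the helper name `euclidean` is ours):

```python
import math

def euclidean(p, q):
    """Straight-line (Euclidean) distance between two points:
    sqrt((p1 - q1)^2 + (p2 - q2)^2 + ...) over all coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

distance = euclidean((0, 0), (3, 4))  # the familiar 3-4-5 triangle: 5.0
```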