Unsupervised Learning
Clustering
Clustering is a type of unsupervised learning that automatically groups a set of objects so that objects in the same group are more similar to each other than to those in other groups. A commonly used clustering algorithm is the K-means algorithm.
K-means algorithm
The K-means algorithm partitions the training set into $K$ clusters.
Cluster centroids $\mu_1, \mu_2, \dots, \mu_K$: the centers of the clusters.
- Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \dots, \mu_K$; just randomly choosing $K$ points from the training set is fine, though different initializations will produce different clusters.
- Assign each point to its closest cluster centroid; if no points are assigned to some centroid, either reduce the number of centroids to $K-1$ or reinitialize that centroid.
- Move each cluster centroid to the average of all the points that were assigned to it.
- Keep repeating the assignment and move steps until the movements of the centroids are small enough.
- Reinitialize all the centroids and rerun the whole procedure another 50-1000 times; choose the run with the lowest distortion cost $J$ as the final result of clustering.
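The steps above can be sketched in NumPy (a minimal illustration, not a reference implementation; the convergence check and the handling of empty clusters are my own assumptions):

```python
import numpy as np

def k_means(X, K, n_iters=100, seed=0):
    """One run of K-means: returns (centroids, assignments, distortion J)."""
    rng = np.random.default_rng(seed)
    # Randomly pick K training points as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Move step: each centroid moves to the mean of its assigned points.
        # If a centroid has no points, we simply keep it where it is.
        new_centroids = np.array([
            X[c == k].mean(axis=0) if np.any(c == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):  # movements small enough
            break
        centroids = new_centroids
    J = np.mean(np.linalg.norm(X - centroids[c], axis=1) ** 2)
    return centroids, c, J

def best_of_n_runs(X, K, n_runs=50):
    """Rerun with different initializations; keep the lowest-J result."""
    runs = [k_means(X, K, seed=s) for s in range(n_runs)]
    return min(runs, key=lambda r: r[2])
```

Rerunning with many random initializations and keeping the lowest-distortion run is what guards against bad local optima.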
Cost function
Definition of some notations:
$c^{(i)}$ = index of the cluster $(1, 2, \dots, K)$ to which example $x^{(i)}$ is currently assigned; $\mu_k$ = cluster centroid $k$; $\mu_{c^{(i)}}$ = cluster centroid of the cluster to which $x^{(i)}$ has been assigned.
Distortion cost function:
$$J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^2$$
The way to minimize $J$ is exactly the two steps of K-means: the assignment step minimizes $J$ with respect to $c^{(1)}, \dots, c^{(m)}$ while holding the centroids fixed, and the move-centroid step minimizes $J$ with respect to $\mu_1, \dots, \mu_K$ while holding the assignments fixed.
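As a small illustration, the distortion $J$ for a given set of assignments and centroids can be computed directly (the function and variable names below are my own, not from the notes):

```python
import numpy as np

def distortion_cost(X, c, centroids):
    """J = (1/m) * sum_i || x_i - mu_{c_i} ||^2 (the distortion)."""
    # centroids[c] gathers, for each example, the centroid it is assigned to.
    return np.mean(np.sum((X - centroids[c]) ** 2, axis=1))
```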
Choosing a value of $K$
Elbow method: plot the distortion cost $J$ against $K$ and choose the value at the elbow of the curve, where $J$ stops decreasing rapidly.
More generally, the choice of $K$ depends on what the clusters will be used for downstream, since $J$ keeps decreasing as $K$ grows and many curves have no clear elbow.
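A sketch of the elbow method on assumed toy data: run K-means for a range of $K$, record the lowest distortion over several random initializations, and look for the bend where $J$ stops dropping rapidly.

```python
import numpy as np

def distortion(X, K, n_runs=10, n_iters=50):
    """Best (lowest) distortion J over several random initializations."""
    best = np.inf
    for seed in range(n_runs):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iters):
            # Assign each point to its nearest centroid, then move centroids.
            c = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
            centroids = np.array([X[c == k].mean(axis=0) if np.any(c == k)
                                  else centroids[k] for k in range(K)])
        J = np.mean(np.sum((X - centroids[c]) ** 2, axis=1))
        best = min(best, J)
    return best

# Toy data: three well-separated blobs, so the elbow should appear at K = 3.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2)) for loc in (0.0, 5.0, 10.0)])
costs = {K: distortion(X, K) for K in range(1, 7)}
```

Plotting `costs` against $K$ shows a steep drop up to the true number of blobs and only marginal gains afterwards.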
Anomaly detection
Anomaly detection is another type of unsupervised learning that automatically identifies rare or anomalous events which are suspicious or harmful because they differ significantly from standard behaviors or patterns. The key algorithm in anomaly detection is density estimation, which builds a model of $p(x)$, the probability of seeing example $x$; examples with low probability are flagged as anomalies.
For multiple features
Assuming the $n$ features are statistically independent and each follows a Gaussian distribution, the model of $p(x)$ is
$$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$$
The procedure of anomaly detection is as follows:
- Choose $n$ features $x_j$ that you think might be indicative of anomalous examples.
- Fit parameters $\mu_1, \dots, \mu_n, \sigma_1^2, \dots, \sigma_n^2$.
- Given a new example $x$, compute $p(x)$.
- Flag an anomaly if $p(x) < \varepsilon$.
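The procedure can be sketched as follows (the toy data and the hand-picked $\varepsilon$ are assumptions for illustration; in practice $\varepsilon$ is tuned on a cross-validation set):

```python
import numpy as np

def fit_gaussian_params(X):
    """Fit per-feature mean and variance on normal examples only."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    return mu, var

def p(x, mu, var):
    """Product of per-feature Gaussian densities p(x_j; mu_j, sigma_j^2)."""
    coeff = 1.0 / np.sqrt(2 * np.pi * var)
    expo = np.exp(-((x - mu) ** 2) / (2 * var))
    return np.prod(coeff * expo)

# Illustrative data: normal examples cluster near 0 in each feature.
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 2))
mu, var = fit_gaussian_params(X_train)
eps = 1e-4  # threshold chosen by hand here; normally tuned on a CV set
is_anomaly = lambda x: p(x, mu, var) < eps
```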
Evaluating anomaly detection system
In anomaly detection, we actually know which examples are positive (anomalous) and which are not, that is, we have labeled data. However, we train the model using only normal examples, then cross-validate with both anomalous and normal examples to determine the threshold $\varepsilon$, and finally evaluate on a test set that also contains both kinds.
We only need to train the model one time, as tuning $\varepsilon$ on the cross-validation set only requires recomputing predictions, not refitting $\mu_j$ and $\sigma_j^2$.
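One common way to pick $\varepsilon$ on the cross-validation set is to scan candidate thresholds and keep the one with the best $F_1$ score, since the labels are highly skewed. A sketch (function and variable names are my own):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Scan candidate thresholds; keep the one with the best F1 score.

    p_cv: density p(x) for each CV example; y_cv: 1 = anomaly, 0 = normal.
    """
    best_eps, best_f1 = 0.0, -1.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = (p_cv < eps).astype(int)  # low density => predicted anomaly
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:
            continue  # F1 undefined / zero without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```

Because the dataset is skewed, accuracy would be misleading here; $F_1$ balances precision against recall.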
Anomaly detection and supervised learning
aspect | Anomaly detection | Supervised learning
---|---|---
number of positive examples | small | large
number of negative examples | large | large
focus | learn what negative (normal) examples look like | learn what both negative and positive examples look like
new types of anomalies | easy to detect | hard to detect
The core idea of anomaly detection is to learn under what circumstances an example is normal. Therefore, anomaly detection is more robust when the types of anomaly keep changing while the standard of normal stays stable.
Though supervised learning is also feasible when the number of positive examples is small, it can't learn much from so few positive examples. Moreover, since supervised learning learns what positive and negative examples look like from previous examples, it can't deal well with types of anomaly that didn't occur before.
Choosing good features
Choosing good features is more important in anomaly detection than in supervised learning, since supervised learning can rescale features and take the best advantage of whatever features we give it, while anomaly detection treats each feature as independent and equally important. To choose good features, there are several methods: