Feature engineering, unsupervised classification, and anomaly detection with the versatile GMM algorithm
The Gaussian Mixture Model (GMM) is a simple yet powerful unsupervised classification algorithm that builds on the intuition of K-means clustering to predict, for each instance, the probability of belonging to each cluster. This property makes GMM versatile across many applications. In this article, I will discuss how GMM can be used in feature engineering, unsupervised classification, and anomaly detection.
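Because GMM yields per-component membership probabilities rather than hard labels, those probabilities can be appended to the original features. A minimal sketch of this idea, using scikit-learn's `GaussianMixture` on synthetic data (the two-cluster setup and all variable names here are illustrative assumptions, not from the original article):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Two synthetic subpopulations in 2D (illustrative data).
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),
    rng.normal(5.0, 1.0, size=(100, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Per-component membership probabilities; each row sums to 1.
proba_features = gmm.predict_proba(X)

# Append the probabilities as new engineered features.
X_augmented = np.hstack([X, proba_features])
print(X_augmented.shape)  # (200, 4)
```

A downstream supervised model can then consume `X_augmented`, letting the mixture structure inform the prediction.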
While a Gaussian distribution fit to one or more variables of a dataset attempts to represent the entire population probabilistically, GMM assumes that the dataset contains subpopulations, each following its own normal distribution. In an unsupervised fashion, GMM attempts to learn these subpopulations and a probabilistic representation of each data point. This property allows us to use the model to find points that have a low probability of belonging to any subpopulation and, therefore, to categorize such points as outliers.
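This outlier-scoring idea can be sketched with scikit-learn: `score_samples` returns the log-likelihood of each point under the fitted mixture, and points in the lowest tail can be flagged. The data, the 1% threshold, and the variable names are assumptions for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated subpopulations plus two injected outliers.
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
cluster_b = rng.normal(loc=6.0, scale=1.0, size=(200, 2))
outliers = np.array([[20.0, 20.0], [-15.0, 12.0]])
X = np.vstack([cluster_a, cluster_b, outliers])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Log-likelihood of each point under the fitted mixture.
log_probs = gmm.score_samples(X)

# Flag the lowest 1% of likelihoods as anomalies (threshold is a choice).
threshold = np.percentile(log_probs, 1)
flags = log_probs < threshold
```

The injected points far from both clusters receive very low log-likelihoods, so they land below the threshold; in practice the cutoff percentile is tuned to the application's tolerance for false positives.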
GMM essentially extends the multivariate Gaussian distribution to the subpopulation case by using components to represent these subpopulations, altering the multivariate probability density function accordingly. As a gentle reminder, the probability density function of the multivariate Gaussian looks like this:
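The original figure is not reproduced here; for reference, the standard multivariate Gaussian density for a point $x \in \mathbb{R}^k$ with mean $\mu$ and covariance $\Sigma$ is:

```latex
p(x) = \frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}}
       \exp\!\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)
```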
In GMM, the probability of each instance is modified to be the sum of probabilities across all components, with component weights parameterized as 𝜙. GMM requires that the component weights sum to 1, so each component can be treated as a fraction of the whole. GMM also maintains feature means and covariances for each component. The model looks like this:
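The original figure is not reproduced here; written out, the mixture density with $K$ components, weights $\phi_j$, and per-component means $\mu_j$ and covariances $\Sigma_j$ is:

```latex
p(x) = \sum_{j=1}^{K} \phi_j \,
       \mathcal{N}\!\left(x \mid \mu_j, \Sigma_j\right),
\qquad \sum_{j=1}^{K} \phi_j = 1
```

Each term is the multivariate Gaussian density above, evaluated with that component's own parameters and scaled by its weight.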
Notice the parallels between the multivariate Gaussian distribution and GMM. In essence, the GMM algorithm finds the…