Normal/Gaussian Distributions


Normal distributions are the most common distributions in statistics, primarily because they describe a lot of natural phenomena. A normal distribution is also known as a ‘Gaussian distribution’ or a ‘bell curve’, because of its bell-shaped curve.

[Figure: the bell-shaped curve of a normal distribution]

Heights of people, sizes of items produced by machines, errors in measurements, blood pressure readings, marks in an examination, wages paid by a company to its employees, the life span of a species: all of these follow a normal or nearly normal distribution.

I don’t intend to cover a lot of mathematical background on normal distributions, but it won’t hurt to know a few simple mathematical properties:

  • The bell curve is symmetrical about the mean (which lies at the center)
  • mean = median = mode
  • A normal distribution is completely determined by its mean and standard deviation (see the sketch right after this list)
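Here is a minimal sketch of that last point, assuming NumPy and SciPy are available; the mean and standard deviation values (170 and 10, loosely suggesting heights in cm) are purely illustrative:

```python
import numpy as np
from scipy.stats import norm

# A normal distribution is fully specified by its mean and standard deviation.
mu, sigma = 170.0, 10.0          # illustrative values, e.g. heights in cm

dist = norm(loc=mu, scale=sigma)

# The density peaks at the mean (the top of the bell curve).
print("pdf at mean:", dist.pdf(mu))

# Draw samples; their mean/std should be close to mu/sigma.
samples = dist.rvs(size=10_000, random_state=0)
print("sample mean:", samples.mean(), "sample std:", samples.std())

# mean = median for a normal distribution (symmetry).
print("sample median:", np.median(samples))
```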

We can also obtain a normal distribution from a lot of datasets using the Central Limit Theorem (CLT). In layman’s language, the CLT states that if we repeatedly draw sufficiently large samples from a population and plot the means of those samples, the distribution of the sample means will be approximately normal, regardless of the shape of the original population (which is what lets a lot of statistical and machine learning models be applied).
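A quick way to see the CLT in action is a small sketch like the one below (assuming NumPy is installed; the exponential population, sample size of 50 and 2,000 repetitions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A population that is clearly NOT normal (exponential, heavily skewed).
population = rng.exponential(scale=2.0, size=100_000)

# Take many samples and record each sample's mean.
sample_means = np.array([
    rng.choice(population, size=50, replace=False).mean()
    for _ in range(2_000)
])

# The distribution of sample means is approximately normal (CLT),
# centred near the population mean with a much smaller spread.
print("population mean:", population.mean())
print("mean of sample means:", sample_means.mean())
print("std of sample means:", sample_means.std())  # ~ population std / sqrt(50)
```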

A lot of machine learning models assume that the data fed to them follows a normal distribution. So, after you have cleaned your data, you should definitely check what distribution it follows (a quick check is sketched right after the list below). Some of the machine learning and statistical models which assume normally distributed input data are:

  • Gaussian Naive Bayes
  • Least-squares based (regression) models
  • LDA (Linear Discriminant Analysis)
  • QDA (Quadratic Discriminant Analysis)
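One way to run that check is a normality test, such as the D’Agostino–Pearson test available in SciPy, ideally alongside a visual check like a histogram or Q–Q plot. A minimal sketch, assuming SciPy is installed; the generated `data` array is just a stand-in for your own cleaned dataset, and the 0.05 threshold is a common but arbitrary choice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=500)   # stand-in for your cleaned data

# D'Agostino-Pearson normality test: a small p-value is evidence against normality.
stat, p_value = stats.normaltest(data)
print(f"statistic={stat:.3f}, p-value={p_value:.3f}")

if p_value > 0.05:
    print("No evidence against normality; normal-assuming models are reasonable.")
else:
    print("Data looks non-normal; consider a transformation first.")
```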

It is also quite common to transform non-normal data to a normal form by applying a log, square root, or similar transformation.

If plotting the data results in a skewed plot, then it probably follows a log-normal distribution (as shown in the figure below), which you can transform into normal form simply by applying a log function to all data points.

[Figure: a skewed, log-normal distribution]
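A minimal sketch of that transformation, assuming NumPy and SciPy are available and the data is strictly positive (use `np.log1p` or shift the data if zeros are present); the log-normally generated `skewed` array only stands in for real skewed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Log-normally distributed data: skewed, with a long right tail.
skewed = rng.lognormal(mean=0.0, sigma=0.8, size=1_000)

# Applying log to every data point gives (approximately) normal data.
transformed = np.log(skewed)

print("skew before:", stats.skew(skewed))
print("skew after: ", stats.skew(transformed))

# Normality test on the transformed data.
print("p-value after log:", stats.normaltest(transformed).pvalue)
```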

Once the data is transformed into a normal distribution, you are free to use it with models that assume normally distributed input (as listed in the section above).

As a general approach, always look at the statistical/probability distribution of your data as the first step in data analysis.
