Applying And Using the Normal Distribution for Data Science


Data Science Steps

Reviewing the various applications of the normal distribution for Data Science

Emma Boudreau

Towards Data Science


One thing that can be exceedingly difficult when getting started with Data Science is figuring out where exactly that journey begins and ends. As for the end of your Data Science journey, it is important to remember that strides are being made in the field every day and there are bound to be new advancements, so be prepared to learn a lot. Data Science consists not only of science, statistics, and programming, but of several other disciplines as well.

In order to minimize the overwhelming nature of Data Science, it is important to take information in bite-size chunks. It certainly can be fun to go down rabbit holes of research and learn more about specific areas of the domain, be it data, programming, machine learning, analytics, or science. While this excites me, sometimes it is also great to narrow that focus and learn everything we possibly can about one specific topic. For beginners, these interlocking domains make it hard to know where to start. One thing I would attest is that statistics and the normal distribution are a great place to start when it comes to Data Science. I wrote an article where I outlined why this is and went into detail on how the normal distribution works. We will do a brief summary of that article here, but many details will be left out.

The normal distribution as described above is a simple Probability Density Function (PDF) that we can apply over our data. This function, which we will call f, calculates the number of standard deviations x is from the mean for f(x). Think about it: we need standard deviations from the mean, so how would we check how many standard deviations a value is from the mean? Well, first we need to see how far it is from the mean, right? Then we need to see how many standard deviations that difference is. So that is exactly what we do in the formula: for each x we subtract the mean and then divide the difference by the standard deviation. In statistics, lowercase sigma (σ) represents the standard deviation and lowercase mu (µ) represents the mean. In the formula below, x bar (x̄) represents the observation (the x in f(x) above).

f(x̄) = (x̄ − µ) / σ

Into the programming language

At the end of the last article, we brought this into a programming language: Julia. The choice of language is entirely up to the Data Scientist, but there are trade-offs to consider, and it is also important to consider what the industry is doing. For example, R is a relatively slow language, but it has analytics packages that have been refined and maintained for years by great developers, as well as great dashboard tools. The most popular choice today is likely Python, for its speedy connection to C libraries and its ease of use. Julia is a bit of a newer language, but it is my favorite programming language and one that I think most Data Scientists should be aware of. While Julia has been skyrocketing in popularity, knowing both languages also gives you access to more jobs. Luckily, most of the popular languages commonly used for Data Science have a lot in common and end up being quite easy to map to one another. Here is our normal distribution written in the Python and Julia REPLs, respectively.

python

>>> from numpy import mean, std
>>> x = [5, 10, 15]
>>> # z-score: subtract the mean, then divide by the standard deviation
>>> normed = [(i - mean(x)) / std(x) for i in x]
>>> print(normed)
[-1.224744871391589, 0.0, 1.224744871391589]

julia

julia> using Statistics: std, mean

julia> x = [5, 10, 15]
3-element Vector{Int64}:
  5
 10
 15

julia> normed = [(i - mean(x)) / std(x) for i in x]
3-element Vector{Float64}:
 -1.0
  0.0
  1.0

Notice that the two results differ in magnitude: NumPy's std computes the population standard deviation (dividing by n), while Julia's Statistics.std defaults to the corrected sample standard deviation (dividing by n − 1), so the same formula yields different scales. Here are the notebooks for each programming language, as well. I will be doing notebooks in both languages in order to not only make this tutorial accessible for everyone but also promote the idea of engaging with multiple languages. These languages are rather similar and pretty easy to read, so it is easy to compare and contrast the differences, see which language you like, and explore the deeper trade-offs of each.

notebooks

python

julia

Setting up our functions

The first thing we are going to need is a function that will give us the normal of a Vector of numbers. This is as simple as getting the mean and the standard deviation before plugging the two and our xbars into our formula. This function will take one argument, our Vector, and then it will return our normalized Vector. For this, we of course also need the mean and standard deviation — we could use dependencies for this. In Python, we would use Numpy’s mean and std functions. In Julia, we would use Statistics.mean and Statistics.std. Instead, today we will be doing everything from scratch, so here are my simple mean and standard deviation functions inside of both Python and Julia:

# python
import math as mt

def mean(x : list):
    # population mean: the sum of the observations over their count
    return(sum(x) / len(x))

def std(arr : list):
    # population standard deviation: the square root of the
    # mean squared deviation from the mean
    m = mean(arr)
    arr2 = [(i - m) ** 2 for i in arr]
    m = mean(arr2)
    m = mt.sqrt(m)
    return(m)

# julia
mean(x::Vector{<:Number}) = sum(x) / length(x)

function std(array3::Vector{<:Number})
    m = mean(array3)
    # square each deviation from the mean, then take the mean of those
    # squares before rooting it
    devs = [(i - m) ^ 2 for i in array3]
    m = mean(devs)
    sqrt(m)
end

Now that we have some functions to get the values that we need, we can wrap this all into a function. This is pretty simple: I will just get our population mean and standard deviation using the methods above and then use a comprehension to subtract the mean from each observation and divide the difference by the standard deviation.

# python
def norm(x : list):
    mu = mean(x)
    sigma = std(x)
    return([(xbar - mu) / sigma for xbar in x])

# julia
function norm(x::Vector{<:Number})
    mu::Number = mean(x)
    sigma::Number = std(x)
    [(xbar - mu) / sigma for xbar in x]::Vector{<:Number}
end

Now let’s try out our normalization function. This is an easy one to test: we provide a vector whose mean we know, because the mean of the normalized Vector should become zero. So in the case of [5, 10, 15] , 10 — the mean of [5, 10, 15] — should map to 0. 5 should map to about -1.22, that is, about 1.22 standard deviations below the mean (our population standard deviation is about 4.08 in these circumstances).

norm([5, 10, 15])

[-1.224744871391589, 0.0, 1.224744871391589]

Statistically significant values on a normal distribution generally begin to be noticed at about 2 standard deviations from the mean. In other words, if most people were about 10 inches tall and someone was 20 inches tall, that would be roughly 2.4 standard deviations from the mean here and quite statistically significant.
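As a quick sanity check on that two-standard-deviation rule of thumb, the probability of a normal value landing more than 1.96 standard deviations from the mean, counting both tails, is almost exactly 0.05. Here is a minimal sketch, assuming SciPy is available and aliasing the import so it does not clash with our own norm function:

# python
from scipy.stats import norm as standard_normal

# probability of landing more than 1.96 standard deviations from the
# mean in either direction (both tails of the bell curve)
print(2 * standard_normal.sf(1.96))  # ~0.05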

mu = mean([5, 10, 15])
sigma = std([5, 10, 15])

(15 - mu) / sigma

1.224744871391589

(20 - mu) / sigma

2.449489742783178

Normal for analysis

The Z distribution, or normal distribution, also has many applications in data analysis. This distribution can be used for testing, but it is not as commonly used as something like a T-test. The reason for this is that the normal distribution has rather short tails, so it is usually reserved for tests performed on large sample sizes where the variance is known. Comparing the normal distribution to something like the T distribution, we see that the tails of the T distribution are a lot longer. Those heavier tails account for the extra uncertainty that comes with small samples: an observation has to sit further out before it counts as statistically significant.

(figure: the normal distribution beside a T distribution. The tails of the T distribution grow longer and the center less weighted as degrees of freedom decrease; the T distribution pictured has about 8 degrees of freedom, while one with a single degree of freedom would be a lot flatter, with wider tails. image by author)
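To make the tail difference concrete, here is a small sketch comparing the probability of exceeding the same cutoff under the normal distribution and under a T distribution with 8 degrees of freedom. It assumes SciPy, though any implementation of the two distributions would do:

# python
from scipy.stats import norm as standard_normal, t as t_dist

cutoff = 2.0
# one-tailed probability of exceeding the cutoff under each distribution
print(standard_normal.sf(cutoff))  # ~0.023
print(t_dist.sf(cutoff, df=8))     # ~0.040

The same cutoff carries nearly twice the tail probability under the T distribution, which is the longer-tail behavior pictured above.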

This type of test, a Z-test, tests whether or not the population means are different enough to be statistically significant. The formula is also very similar to the formula we have seen from the PDF prior, so not much is new here. Rather than using each observation, we simply change x̄ to represent the mean of the sample we want to test: z = (x̄ − µ) / σ. (A textbook Z-test divides by the standard error σ/√n rather than σ; we will stick with the simpler form here to mirror our PDF.) This test returns something called a Z-statistic. Similarly to a T-statistic, this is run through another function to give us a probability value. Let's create a quick one-dimensional set of observations and see how we would perform such a test.

pop = [5, 10, 15, 20, 25, 30]
mu = mean(pop)
sigma = std(pop)

We will grab a random sample from the middle and calculate a Z-statistic:

xbar = mean(pop[3:5])  # [15, 20, 25] with Julia's 1-based indexing

Now we simply plug this into our formula…

(xbar - mu) / sigma
0.2927700219

This new number is our Z-statistic. The math to get these statistic values into probability values is quite complicated. There are libraries in both languages that will help with these things. For Julia, I recommend HypothesisTests, and for Python I recommend the scipy module. For this article, we are going to be using an online Z-statistic-to-probability-value calculator available here. Let's plug our Z-statistic into it:

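If you would rather stay in code, the same conversion is a short sketch away with scipy, which I already recommended above for Python. I assume the usual two-tailed convention here, and alias the import so it does not clash with our own norm function:

# python
from scipy.stats import norm as standard_normal

z = 0.2927700219  # the Z-statistic computed above
# two-tailed probability value: the chance of a statistic at least
# this far from the mean in either direction
print(2 * standard_normal.sf(abs(z)))  # ~0.77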

As we might have expected, a sample that sits this close to the rest of the population and its mean is not statistically significant at all. That being said, we can of course experiment with something far more statistically significant and reject our null hypothesis!

xbar = mean([50, 25, 38])
(xbar - mu) / sigma

2.3616781767

Running this larger Z-statistic through the same conversion gives a probability value of roughly 0.018, comfortably below the usual 0.05 threshold.

The normal distribution certainly works well for testing. The key is to understand that this form of testing needs a large sample size and does not apply to all data. In most cases, for beginners, I would recommend starting with a distribution that is easier to test with, such as the T distribution. Data matters a lot more for Z-tests: it can be hard for beginners to find large sources of data, and the short tails can make it harder to surface a statistically significant result even when one is there.

The normal distribution can also be used in some capacity for quick analysis during Data Science projects. Being able to express data in terms of its relationship to the population can be incredibly useful for everything from data visualization to figuring out how varied a given population is. There is a lot we can learn about a population by investigating our observations' relationship to the mean. If you would like to learn more about this process, I have a beginner-friendly overview that might be helpful in such a context, which you may read here:

Normal for data normalization

Another great application of the normal distribution is using it to normalize data. There are a few different things that can mess up a continuous feature, and one of the most significant of these is outliers. We need to get outliers out of our data so that our data is a generalization. Remember, the key to building great data is to build a great population. What I mean by that is that we want the totality of the data, things like the mean, to be representative of what the data would normally be, with some level of variance. That way, whenever something is different, it becomes very obvious.

Given that the normal distribution tells us how many standard deviations a value is from the mean, it might be easy to see how we could use this for data normalization. As stated prior, 2.0 is about where things start becoming significant. That being said, we can make a mask and use it to filter out bad values!

# julia
function drop_outls(vec::Vector{<:Number})
    normed = norm(vec)
    # keep only the observations within two standard deviations of the mean
    mask = [!(x <= -2 || x >= 2) for x in normed]
    vec[mask]
end

With this simple mask filtering, we have added the ability to discern whether or not values lie far outside of the mean and drop them based on that. In most cases, we might also want to replace these outliers with the mean so that we do not lose the observation on other features or our target.

# python
def drop_outls(vec : list):
    mu = mean(vec)
    normed = norm(vec)
    # flag anything at least two standard deviations from the mean
    mask = [x <= -2 or x >= 2 for x in normed]
    ret = []
    for value, is_outlier in zip(vec, mask):
        # keep ordinary observations; swap outliers for the mean
        ret.append(mu if is_outlier else value)
    return(ret)
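As a quick usage sketch, here is what the Python version does to a made-up vector with one obvious outlier; the numbers assume the from-scratch mean, std, and norm defined earlier:

# python
vec = [5, 10, 15, 20, 25, 30, 100]
print(drop_outls(vec))
# 100 sits about 2.4 standard deviations from the mean, so it is
# replaced with the mean while everything else is kept:
# [5, 10, 15, 20, 25, 30, 29.285714285714285]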

Normal for scaling

The final application of the normal distribution that is common in Data Science is the Standard Scaler. The Standard Scaler is simply the normal distribution applied over your data. It can be incredibly helpful because it translates each value of a feature into its relationship with the rest of that feature. This is incredibly helpful for machine learning and makes it really easy to increase the accuracy of a model, given that you have a continuous feature. Using the Standard Scaler is incredibly straightforward: simply apply our PDF as before and get the normalized feature.

myX = [1, 2, 3, 4, 5]
normedx = norm(myX)

This is done to data provided to a machine-learning model. The normal distribution is used this way to process continuous features in many of the machine-learning models that are deployed every day.
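For reference, the same scaling is available off the shelf. Here is a minimal sketch with scikit-learn's StandardScaler, assuming scikit-learn is installed; it expects one column per feature, and it matches our from-scratch norm because both subtract the mean and divide by the population standard deviation:

# python
import numpy as np
from sklearn.preprocessing import StandardScaler

# one continuous feature, shaped as a single column
myX = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
scaled = StandardScaler().fit_transform(myX)
print(scaled.ravel())
# [-1.41421356 -0.70710678  0.          0.70710678  1.41421356]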


