 EvanJ January 25th, 2017 06:47 PM

How Close To The Normal Distribution Is My Data?

I have 351 numbers. About half are on each side of the mean. 169 are higher than the mean and 182 are lower than the mean. Here is how my data compares to the normal distribution in terms of what percent of the numbers are within 0.5, 1, and 2 standard deviations of the mean:

Within 0.5 standard deviations: exactly 1/3rd of my numbers, 38.3% for the normal distribution
Within 1 standard deviation: 63.5% of my numbers, 68.3% for the normal distribution
Within 2 standard deviations: 97.2% of my numbers, 95.4% for the normal distribution
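
The normal-distribution percentages quoted above can be reproduced from the error function: the share of a normal distribution within k standard deviations of the mean is erf(k/sqrt(2)). A minimal Python sketch (Python is an assumption here, nobody in the thread uses it; `share_within` is a made-up helper name):

```python
import math

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (0.5, 1, 2):
    print(f"within {k} SD: {share_within(k):.1%}")
```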

The farthest any of my numbers is from the mean is 2.5334 standard deviations above it, so all of my numbers are within 3 standard deviations of the mean. Compared with the normal distribution, my numbers are less likely to be within 1 standard deviation of the mean and more likely to be between 1 and 2 standard deviations away from it. The highest 35 numbers (about 10 percent of 351) are an average of 1.728 standard deviations above the mean. The lowest 35 numbers are an average of 1.598 standard deviations below the mean.
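
For comparison with the top-35 and bottom-35 averages, the mean z-score of the top fraction p of a standard normal distribution is pdf(c) / p, where c is the cutoff with a share p above it. The sketch below uses Python's `statistics.NormalDist`; the function name `mean_of_top_fraction` is made up for illustration, and by symmetry the bottom fraction has the same magnitude with the opposite sign.

```python
from statistics import NormalDist

def mean_of_top_fraction(fraction):
    """Mean z-score of the top `fraction` of a standard normal distribution."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    cutoff = std_normal.inv_cdf(1 - fraction)
    # For a standard normal, E[Z | Z > c] = pdf(c) / (1 - cdf(c)),
    # and 1 - cdf(cutoff) equals `fraction` by construction.
    return std_normal.pdf(cutoff) / fraction

# 35 of 351 numbers, as in the post above.
print(round(mean_of_top_fraction(35 / 351), 3))
```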

Without looking at all the numbers, would you say the numbers are close to being normally distributed?

 123qwerty January 26th, 2017 06:32 AM

From your description, it sure sounds close, but you should do a chi-squared test for goodness-of-fit (or some other goodness-of-fit test) to make sure, since you didn't really provide a lot of info...

 EvanJ January 27th, 2017 10:58 AM

I have no idea how to do a chi-squared test. https://en.wikipedia.org/wiki/Pearso...i-squared_test says "the expected (theoretical) frequency of type i, asserted by the null hypothesis that the fraction of type i in the population is p_{i}" (the symbols might not look right). That sounds like it refers to whole numbers, like comparing the observed and expected number of heads from 20 coins, but that's not what I'm working with.

 123qwerty January 28th, 2017 06:44 PM

I would just copy code from the Internet and do it in R. :p

Joking aside...

What you're doing is quite similar, just the continuous analogue of that :D I've looked at a few websites and this link explains quite well what you should do. The basic idea is to make an empirical cdf from your data and determine the distance from your empirical cdf to the ideal normal cdf.
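
The "distance from your empirical cdf to the ideal normal cdf" idea can be sketched as follows. The largest vertical gap between the two curves is the statistic used by the Kolmogorov-Smirnov test, which is a different test from chi-squared. The sample values and the name `ecdf_distance` are hypothetical. When the mean and standard deviation are estimated from the same data, the standard Kolmogorov-Smirnov tables do not strictly apply, so treat the number as a rough guide only.

```python
from statistics import NormalDist, mean, stdev

def ecdf_distance(data, mu, sigma):
    """Largest vertical gap between the empirical CDF of `data` and a normal CDF."""
    ideal = NormalDist(mu, sigma)
    values = sorted(data)
    n = len(values)
    gap = 0.0
    for i, x in enumerate(values, start=1):
        f = ideal.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x; check both sides.
        gap = max(gap, i / n - f, f - (i - 1) / n)
    return gap

# Hypothetical sample, not the poster's data.
sample = [4.1, 5.0, 5.3, 5.9, 6.2, 6.8, 7.4, 8.0]
print(ecdf_distance(sample, mean(sample), stdev(sample)))
```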

 EvanJ January 28th, 2017 07:02 PM

Is there a way of taking an amount of numbers, mean, and standard deviation, and having a website generate what all the numbers would be if they were normally distributed?
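
One way to get such a list without a website is to take evenly spaced quantiles of the normal distribution. This is only a sketch: the mean of 100 and standard deviation of 15 are placeholders, `ideal_normal_values` is a made-up name, and the (i + 0.5) / count spacing is one common convention among several.

```python
from statistics import NormalDist

def ideal_normal_values(count, mu, sigma):
    """Evenly spaced quantiles of a normal distribution, lowest to highest."""
    dist = NormalDist(mu, sigma)
    # (i + 0.5) / count keeps every probability strictly between 0 and 1.
    return [dist.inv_cdf((i + 0.5) / count) for i in range(count)]

# 351 values; the mean and standard deviation here are hypothetical.
values = ideal_normal_values(351, 100.0, 15.0)
print(values[0], values[175], values[-1])
```

Sorting the real data and comparing it value by value with this list gives the same information as a normal probability plot.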

 123qwerty January 28th, 2017 07:09 PM

The website wouldn't know what the bins you want are... C'mon, you can generate what you want in R or Excel :p

 EvanJ January 29th, 2017 07:11 AM

I've never worked with bins. I have numbers in Excel, and I know how to make Excel do basics like standard deviations, but not statistical tests.

 123qwerty January 30th, 2017 04:41 AM

I'm sure you have; you just probably didn't know they were called bins. For example, when you create a histogram, you have intervals like (.5, 5.5], (5.5, 10.5], etc., and those are called bins.
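
Binning is also what turns continuous data into the whole-number counts a chi-squared test needs. A small sketch using the intervals from the example above; the data values and the name `count_in_bins` are made up for illustration.

```python
def count_in_bins(data, edges):
    """Count values in the half-open bins (edges[i], edges[i + 1]]."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            if edges[i] < x <= edges[i + 1]:
                counts[i] += 1
                break
    return counts

# Hypothetical measurements; the edges follow the (.5, 5.5], (5.5, 10.5] pattern.
data = [1.2, 3.7, 5.5, 6.1, 9.9, 10.5, 12.0]
print(count_in_bins(data, [0.5, 5.5, 10.5, 15.5]))
```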

The Excel functions you'll need for your current purpose are NORM.DIST and CHISQ.TEST. I'm sure you can find the necessary documentation online to learn to use them; if not, perhaps you could post a subset of the data and I can show you how it's done.
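
The same arithmetic can be sketched outside Excel. The block below computes expected bin counts from the normal CDF and then the Pearson chi-squared statistic by hand. The bins are distances from the mean (0 to 0.5, 0.5 to 1, 1 to 2, and beyond 2 standard deviations), and the observed counts are reconstructed from the percentages in the first post (1/3, 63.5%, and 97.2% of 351), which is an assumption rather than the actual data. Turning the statistic into a p-value also needs a choice of degrees of freedom, which is left out here.

```python
import math

N = 351
# Cumulative counts within 0.5, 1, and 2 SD, reconstructed from the
# percentages in the first post; this reconstruction is an assumption.
cumulative = [117, 223, 341, N]
observed = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

# Expected counts for the same four bins under a normal distribution.
cum_expected = [share_within(k) for k in (0.5, 1.0, 2.0)] + [1.0]
expected = [N * cum_expected[0]] + [
    N * (b - a) for a, b in zip(cum_expected, cum_expected[1:])
]

# Pearson statistic: sum of (observed - expected)^2 / expected over the bins.
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(observed)
print([round(e, 1) for e in expected])
print(round(chi_squared, 2))
```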

