Statistics – Understanding Testing for Normality

Testing for Normality

STATA – EXPLAINING STATISTICS – Plus Command lines for STATA

Following the creation of a data set for a specific Tooth Development Stage it is necessary to determine whether or not the data are Normally distributed. This is important because the mathematical characteristics of a Normally distributed variable determine the statistical and mathematical procedures that are permissible.

The general approach is to assess whether or not the sample data are Normally distributed. The concept of the Normal distribution is covered in most statistical textbooks. A good explanation is given in Altman DG 1991 pp 132 – 145.

The first approach is to submit the data to a statistical test for Normality.

There are several such tests and most statistical software packages make them available.

The test used here is the Shapiro-Wilk Test for Normality.

See STATA handbook

. swilk ll8gf

    Variable        Obs          W            V           z       Prob>z

       ll8gf        101      0.99174        0.688      -0.830     0.79682

The last value on the right is the p value of the test: the probability of obtaining a W statistic at least as extreme as the one observed if the data were drawn from a Normal distribution. It is not the probability that the data are Normally distributed. The value of 0.79682 [ 79.68% ] is well above 0.05 [ 5% ], so there is no evidence that the data array for LL8Gf departs from Normal. If this p value is less than 0.05 [ 5% ] then the conclusion would be that the data array is significantly different from Normal. In that case, Non Parametric Statistical Tests should be used.
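The link between the z column and the final column (headed Prob>z in STATA's output) can be checked by hand. The following is a minimal Python sketch, not part of STATA, assuming that the final column is the upper-tail area of the standard Normal distribution beyond z. The function name looks_non_normal is hypothetical and only encodes the 5% rule described above.

```python
from statistics import NormalDist

# Values copied from the swilk output above.
z = -0.830
reported_p = 0.79682

# Assumption: Prob>z is the upper-tail area of the standard Normal beyond z.
p = 1.0 - NormalDist().cdf(z)

ALPHA = 0.05  # the 5% significance level used in the text


def looks_non_normal(p_value, alpha=ALPHA):
    """Return True when the test rejects Normality at the chosen level."""
    return p_value < alpha


print(f"p recomputed from z: {p:.4f} (reported: {reported_p})")
if looks_non_normal(p):
    print("Significantly different from Normal: consider non-parametric tests")
else:
    print("No evidence against Normality")
```

The recomputed value differs from the reported one only in the fourth decimal place, because z is printed to three decimals.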

The second approach is to draw a Normal probability plot.

In STATA the command is . pnorm ll8gf

[Figure: pnorm ll8gf]

This is a Normal probability plot of LL8Gf. It is clear that the points lie on a relatively straight line. The conclusion from this is that the sample data for LL8Gf are Normally distributed.

Deviations from Normal are usually apparent where the data points depart noticeably from the straight line.

[Figure: pnorm ll8gf-nn]

As can be seen, there are noticeable deviations from the straight line at the lower extreme and near the upper extreme.
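To make clear what such a plot compares, the coordinates can be computed directly. The sketch below is in Python rather than STATA and assumes the standardized form of the plot: the horizontal axis is the plotting position i/(N+1) of the i-th smallest value, and the vertical axis is the probability given by a Normal distribution fitted with the sample mean and standard deviation. The function name pnorm_points and the ages are invented for illustration and are not the LL8Gf data.

```python
from statistics import NormalDist, mean, stdev


def pnorm_points(values):
    """Return (expected, observed) pairs for a standardized Normal probability plot."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))  # Normal with the sample mean and SD
    points = []
    for i, x in enumerate(xs, start=1):
        expected = i / (n + 1)    # plotting position (horizontal axis)
        observed = fitted.cdf(x)  # fitted Normal probability (vertical axis)
        points.append((expected, observed))
    return points


# Hypothetical ages in years, invented for illustration only.
ages = [17.2, 18.1, 18.4, 18.9, 19.0, 19.3, 19.5, 19.8, 20.2, 20.6, 21.1]

points = pnorm_points(ages)
max_gap = max(abs(obs - exp) for exp, obs in points)

for exp, obs in points:
    print(f"{exp:.3f}  {obs:.3f}")
print(f"largest departure from the diagonal: {max_gap:.3f}")
```

If the data are close to Normal, each pair has nearly equal coordinates, so the points lie near the diagonal. Large gaps at either end correspond to the deviations at the extremes described above.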

The third approach is to graph a Probability Density Function. This is a histogram composed of the whole of the data for LL8Gf, plotted so that all the data in each age interval, usually 0.25 of a year, are shown as a single vertical bar. These intervals are usually called bins.

In STATA the command is . histogram ll8gf, frequency normal width(0.20)

[Figure: normdistr_ll8gf]

The number of data points within each age range is counted and then a histogram is created with the bins plotted consecutively from the lowest age (on the left) to the highest age (on the right). As can be seen, a Normal distribution curve (Gaussian curve) is superimposed on the histogram. Below, the age intervals (bins) are set at 0.5 of a year (instead of 0.2 as in the above histogram). The superimposed curve is identical.

[Figure: normdistr_ll8gf0.5bin]

This is a classic (almost) Normal Distribution plot and Normal distribution curve.

It has three important features:

1. The curve is bell shaped and exhibits left-right symmetry (in this case, almost)

2. The majority of the values are heaped up near the middle of the curve

3. The left and right tails contain only a few of the data points, at the low and high ends of the range of the data for LL8Gf [ the summary statistics for this data array are given below ]

[Figure: normdistr_ll8gf_summ_data]
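The counting step behind the histogram can be sketched in a few lines. This is a Python illustration, not STATA output. It assumes that bins start at multiples of the bin width, which may differ from where STATA places the first bin. The function name bin_counts and the ages are hypothetical.

```python
from collections import Counter
import math


def bin_counts(values, width):
    """Count the values falling in each interval [k*width, (k+1)*width)."""
    counts = Counter(math.floor(v / width) for v in values)
    # Key each bin by its lower edge, from lowest age to highest age.
    return {k * width: counts[k] for k in sorted(counts)}


# Hypothetical ages in years, invented for illustration only.
ages = [17.2, 18.1, 18.4, 18.9, 19.0, 19.3, 19.5, 19.8, 20.2, 20.6, 21.1]

fine = bin_counts(ages, 0.25)   # bins of 0.25 of a year
coarse = bin_counts(ages, 0.5)  # bins of 0.5 of a year

for lower_edge, count in coarse.items():
    print(f"{lower_edge:5.2f}  {'#' * count}")
```

Changing the bin width changes the heights of the bars but not the data, so every value is counted exactly once whichever width is used.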

The fourth way of looking at the distribution is to create a ‘Box and Whisker’ plot.

In STATA the command is  . graph hbox ll8gf, nooutsides outergap(200) alsize(50)

[Figure: box_&_whisker_llgf]

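The vertical bars on the Left and Right of this plot indicate the lower and upper extremes of the data. These, with the attendant horizontal lines extending from the box, are the “whiskers”. Strictly, the whiskers end at the most extreme values lying within 1.5 times the interquartile range of the box, and the nooutsides option suppresses any points beyond them.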

The grey box in the middle indicates the middle 50% of values for the LL8Gf dataset. There is a separate page on Box and whisker plots and their interpretation.
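The quantities that a box and whisker plot displays can be computed directly. The Python sketch below assumes the usual Tukey convention, in which each whisker ends at the most extreme value within 1.5 times the interquartile range of the box. The quartiles use Python's "inclusive" interpolation, which may differ slightly from the percentiles STATA reports. The ages are invented for illustration, with one deliberately extreme value.

```python
from statistics import quantiles

# Hypothetical ages in years, invented for illustration only.
ages = [17.2, 18.1, 18.4, 18.9, 19.0, 19.3, 19.5, 19.8, 20.2, 20.6, 21.1, 25.5]

# The box spans the lower quartile to the upper quartile (the middle 50%).
q1, median, q3 = quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box decide where the whiskers stop.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

lower_whisker = min(v for v in ages if v >= lower_fence)
upper_whisker = max(v for v in ages if v <= upper_fence)
outside = [v for v in ages if v < lower_fence or v > upper_fence]

print(f"box: {q1:.3f} to {q3:.3f}, median {median:.3f}")
print(f"whiskers: {lower_whisker} to {upper_whisker}")
print(f"outside values: {outside}")
```

In this invented sample the upper whisker stops short of the largest value, which is classed as an outside value. A plot drawn with outside values suppressed would not show that point.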

Taken together, these four different methods indicate whether or not the data are Normally distributed. Many colleagues just draw the Normal probability plot (the second approach) and assess it visually to determine whether or not the data array can be regarded as Normally distributed.

It is reassuring to use a test for Normality and, given the thrust of statistical scrutiny of the DAE technique, this is our preferred method. It is important to be aware that one or two outliers may result in a probability value of less than 0.05 [ 5% ] and lead to the conclusion that the data are not Normally distributed. In general, investigators give greatest credence to Approach 2 and / or Approach 1.

~~~~~||~~~~~