Testing for Normality
STATA – EXPLAINING STATISTICS – Plus Command lines for STATA
Following the creation of a data set for a specific Tooth Development Stage it is necessary to determine whether or not the data are Normally distributed. This is important because the mathematical characteristics of a Normally distributed variable determine the statistical and mathematical procedures that are permissible.
The general approach is to assess whether or not the sample data are Normally distributed. The concept of the Normal distribution is covered in most statistical textbooks. An example of a good explanation is in Altman DG 1991 pp 132 – 145.
The first approach is to submit the data to a statistical test for Normality.
There are several such tests, and most statistical software packages provide at least one of them.
The test used here is the Shapiro-Wilk Test for Normality.
See STATA handbook
. swilk ll8gf
    Variable |   Obs         W         V         z    Prob>z
       ll8gf |   101   0.99174     0.688    -0.830   0.79682
The last value on the right (Prob>z) is the p value of the test. It is the probability of obtaining a result at least as extreme as the one observed if the data really were drawn from a Normal distribution. It is not the probability that the data are Normal. The value of 0.79682 [ 79.68% ] is far above 0.05 [ 5% ], so there is no evidence that the data array for LL8Gf departs from a Normal distribution. If this p value is less than 0.05 [ 5% ] then the conclusion would be that the data array is significantly different from Normal. The consequence of this is that Non Parametric Statistical Tests should be used.
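The decision rule above can be written out as a short do-file sketch. It assumes the dataset containing ll8gf is already in memory; the local macro names W and p are chosen here for illustration only. After swilk runs, Stata leaves the W statistic in r(W) and the p value in r(p).

```stata
* Sketch only: assumes ll8gf is already loaded in memory.
swilk ll8gf

* Copy the stored results straight away, before another command replaces them.
local W = r(W)
local p = r(p)
display "W = " %7.5f `W' "   p = " %7.5f `p'

* Apply the conventional 5% threshold described in the text.
if `p' < 0.05 {
    display "Significantly different from Normal: consider non-parametric tests"
}
else {
    display "No significant departure from Normality detected"
}
```

A non-significant result does not prove Normality; it only means the test found no evidence against it.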
The second approach is to carry out a Normal distribution plot.
In STATA the command is . pnorm ll8gf
This is a Normal probability plot of LL8Gf. The points lie on a relatively straight line, which indicates that the sample data for LL8Gf are consistent with a Normal distribution.
Deviations from Normal are usually apparent where the data points depart noticeably from the straight line.
As can be seen, there are noticeable deviations from the straight line at the lower extreme and near the upper extreme.
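As a possible complement to the plot described above, Stata also provides qnorm, which plots the quantiles of the variable against the quantiles of a Normal distribution. The Stata documentation describes pnorm as more sensitive to departures in the middle of the data and qnorm as more sensitive to departures in the tails, so drawing both can help. The sketch assumes ll8gf is in memory; the graph names pp and qq are illustrative.

```stata
* Sketch only: assumes ll8gf is already loaded in memory.

* Standardized Normal probability plot, as used in the text.
pnorm ll8gf, name(pp, replace)

* Quantile-Normal plot of the same variable; departures in the tails
* tend to show up more clearly here.
qnorm ll8gf, name(qq, replace)

* Show the two plots side by side for visual comparison.
graph combine pp qq
```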
The third approach is to graph a histogram, which approximates the Probability Density Function of the data. The histogram is composed of the whole of the data for LL8Gf, plotted in such a way that all the data in specific age intervals, usually 0.25 of a year, are shown as a vertical bar. These intervals are usually called bins.
In STATA the command is . histogram ll8gf, frequency normal width(0.20)
The number of data points within each age range is counted and then a histogram is created with the bins plotted consecutively from the lowest age (on the left) to the highest age (on the right). As can be seen, a Normal distribution curve (Gaussian curve) is superimposed on the histogram. Below, the age intervals (bins) are set at 0.5 of a year (instead of 0.2 as in the above histogram). The superimposed curve has the same shape, because it depends only on the mean and standard deviation of the data and not on the bin width.
The Normal curve has three important features:
1. The curve is BELL shaped and exhibits Left-Right symmetry (in this case, almost!)
2. The majority of the values are heaped up near the middle of the curve
3. The Left and Right tails have only a few of the data points at the low and high ends of the range of the data for LL8Gf [ the summary statistics for this data array are given below ]
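The two histograms discussed above differ only in bin width, so they can be drawn from the same command with a different width() value. The following sketch assumes ll8gf is in memory; the graph names bins020 and bins050 are illustrative.

```stata
* Sketch only: assumes ll8gf is already loaded in memory.

* Bins of 0.20 of a year, with the Normal curve overlaid (command from the text).
histogram ll8gf, frequency normal width(0.20) name(bins020, replace)

* The same data with wider bins of 0.50 of a year.
histogram ll8gf, frequency normal width(0.50) name(bins050, replace)

* Place the two histograms side by side to compare the effect of bin width.
graph combine bins020 bins050
```

Wider bins smooth the histogram but hide detail; narrower bins show more detail but look more ragged with about 100 observations.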
The fourth way of looking at the distribution is to create a ‘Box and Whisker’ plot.
In STATA the command is . graph hbox ll8gf, nooutsides outergap(200) alsize(50)
The vertical bars on the Left and Right of this plot indicate the lower and upper extremes of the data, excluding any outside values (which the nooutsides option suppresses). These, with the attendant horizontal lines extending from the box, are the “whiskers”.
The grey box in the middle indicates the middle 50% of values for the LL8Gf dataset. There is a separate page on Box and whisker plots and their interpretation.
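The numbers behind the box can be obtained with summarize and its detail option, which reports the quartiles together with skewness and kurtosis. This is a sketch that assumes ll8gf is in memory; the local macro name iqr is illustrative.

```stata
* Sketch only: assumes ll8gf is already loaded in memory.

* Detailed summary: percentiles, mean, SD, skewness and kurtosis.
summarize ll8gf, detail

* The box spans the 25th to the 75th percentile; its width is the
* interquartile range (IQR).
local iqr = r(p75) - r(p25)
display "Median = " %6.2f r(p50) "   IQR = " %6.2f `iqr'

* Box plot with outside values shown (the default), for comparison
* with the nooutsides version in the text.
graph hbox ll8gf
```

For Normally distributed data the median sits near the centre of the box and the two whiskers are of similar length.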
Taken together, these four different methods indicate whether or not the data are Normally distributed. Many colleagues just draw the Normal probability plot (the second approach) and assess it visually to determine whether or not the data array can be regarded as Normally distributed.
It is reassuring to use a test for Normality and, given the thrust of statistical scrutiny of the DAE technique, this is our preferred method. It is important to be aware that one or two outliers may result in a probability value of less than 0.05 [ 5% ] and lead to the conclusion that the data are not Normally distributed. In general, investigators give greatest credence to Approach 2 and / or Approach 1.
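The sensitivity of the test to one or two outliers can be explored with simulated data. The sketch below does not use the LL8Gf data: the variable x, the seed, the mean of 20, the standard deviation of 2 and the outlier value of 40 are all arbitrary assumptions chosen for illustration.

```stata
* Sketch only, using simulated data (not the LL8Gf data set).
* Note: clear discards whatever is in memory, so save your data first.
clear
set seed 2024
set obs 101

* 101 values drawn from a Normal distribution (mean 20, SD 2: arbitrary).
generate double x = rnormal(20, 2)
swilk x
local p_clean = r(p)

* Replace the first two observations with an extreme value and test again.
replace x = 40 in 1/2
swilk x
local p_outlier = r(p)

display "p without outliers = " %7.5f `p_clean'
display "p with two outliers = " %7.5f `p_outlier'
```

Comparing the two p values shows how much the conclusion can depend on a very small number of observations, which is why the plots are worth inspecting alongside the test.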