Lesson #4 - Normality
Normal distributions, percentile, z calculations
In the previous lesson, we compared two data sets against each other to see which had a higher correlation. Here, we will be comparing multiple data points within a single data set using percentiles and z scores. We will also be diving into a key concept that is used throughout statistics: normality. This is a criterion that a distribution must meet for the calculations in this lesson to be valid.
A. Normal Distributions
Having a normal distribution in statistics regardless of the field is crucial for making sound judgments and conclusions. Here's what the perfect normal distribution should look like:
There are many other illustrations of normal distributions online that may be clearer, but this is just a quick model I made on Google Slides.
Anyways, the 1st important thing to learn about normal distributions is the shape. We call this a bell curve, where the majority of the data falls in the middle and we get fewer data points toward the far left and right.
The 2nd thing to note is the percentages. According to the normal distribution, approximately 68% of our data should fall within 1 standard deviation of the mean. We get this because 34% between -1 sd and the mean + 34% between the mean and +1 sd = 68%. Likewise, approximately 95% falls within 2 standard deviations, approximately 99.7% within 3 standard deviations, and approximately 100% within 4 standard deviations.
The 3rd thing is that the middle line is the 50% mark, where 50% of the data will lie on the left side of this mark, and 50% of the data will lie on the right side of this mark.
The 4th thing is that in real sets of data, we are probably never going to get this exact picture-perfect normal distribution curve. So, we can just aim for our distribution to be approximately normal.
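To see where the 68%, 95%, and 99.7% figures come from, here is a minimal Python sketch. It assumes Python 3.8+ and uses the standard library's `statistics.NormalDist`; the helper name `share_within` is my own and is only for illustration.

```python
from statistics import NormalDist

# The standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3, 4):
    print(f"within {k} sd: {share_within(k):.2%}")
```

The printed values should land very close to the rounded percentages quoted above, with the 4 sd share just short of 100%.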
B. Percentile
Percentile, which may be a term you have heard before, can only be safely estimated from a normal distribution, the way we do in this lesson, if the data is approximately normally distributed. In terms of the actual definition, let's give an example. Say you have your SAT score, and Collegeboard says you are in the 50th percentile. This means that you scored the same as or better than 50% of others who took the SAT. If you are in the 99th percentile, then you've scored the same as or better than 99% of other SAT takers.
The normal distribution is helpful in finding the percentile since it has approximate values already given. For example, if we know that a point is exactly one standard deviation to the right of the mean, then it is at approximately the 84th percentile: 50% of the data lies below the mean, plus the 34% between the mean and +1 sd. Approximately 84% of the data is the same as or below it.
Now Let's Apply Fantasy Football!
Example #1
With the same exact list of 50 WRs in 2023 and their respective points per game, I put the data into a plot and this is what it looks like:
Because it is an approximately normal distribution, we can continue on with our example. But because it is not perfect by any means, our answer must include the word "approximately". Then, with the values properly inputted into a normal distribution given the mean and standard deviation, we will get this:
Prompt #1: Of this data set, what percent of the top 50 WRs averaged between 10.895 PPR points per game and 17.613 PPR points per game?
Work/Explanation: The question gave us two values, and they happen to be labeled on our distribution. To visualize this, we can scribble in the area that we are working with:
Now, we can rephrase the prompt to the percent of the data that falls in the blue area. So judging by the distribution, 14.254 is where the 50% mark is, and one standard deviation to the left of that is 10.895. One standard deviation to the right of that is 17.613. As a basic rule of normality that we discussed above, approximately 68% of our data fall within 1 standard deviation.
Answer: Approximately 68% of the top 50 WRs in 2023 averaged from 10.895 PPR Points per game to 17.613 PPR Points per Game.
Prompt #2: What percent of the top 50 WRs in 2023 averaged between 7.536 PPR points per game and 20.972 PPR points per game?
Work/Explanation: Taking the same steps as the previous question, let's first visualize the area that we are targeting since both of these numbers are labeled on our distribution.
Now, we can rephrase the prompt to what percent of the data falls in the blue area. If 17.613 marks our +1 standard deviation, then 20.972 marks our +2 standard deviation, and if 10.895 marks our -1 standard deviation, then 7.536 marks our -2 standard deviation. As a basic rule of normality that was stated above, approximately 95% of our data should fall within 2 standard deviations of the normal distribution.
Answer: Approximately 95% of the top 50 WRs in 2023 averaged from 7.536 PPR points per game to 20.972 PPR points per game.
Prompt #3: What percent of the top 50 WRs in 2023 averaged between 4.177 PPR points per game and 24.331 PPR points per game?
Work/Explanation: Let's visualize the area.
If 20.972 marks our +2 standard deviation, then 24.331 marks our +3 standard deviation, and if 7.536 marks our -2 standard deviation, then 4.177 marks our -3 standard deviation. And as a basic rule of normality that was stated above, approximately 99.7% of our data should fall within 3 standard deviations of the normal distribution.
Answer: Approximately 99.7% of the top 50 WRs in 2023 averaged from 4.177 PPR Points per game to 24.331 PPR Points per Game.
Note: If 20.972 is our +2 standard deviation, then 24.331 is our +3 standard deviation. But that's where it stops. There is no 4th standard deviation in the plot. Why? Well, in real-life cases, we can have values that fall beyond plus or minus 4 standard deviations, perhaps 5 or even 6 standard deviations, so always treating the 4th standard deviation as the edge of the data can be inaccurate. It can also be unnecessary: because only approximately 0.3% of the data falls beyond ±3 standard deviations, stopping our markings there is common practice in statistics.
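The labeled marks in this example can be rebuilt from just the mean (14.254) and the standard deviation (3.359). Here is a small Python sketch of that arithmetic; the names `MEAN`, `SD`, `sd_mark`, and `share_between` are my own, and the normal model is an approximation of the real data, not the data itself.

```python
from statistics import NormalDist

MEAN = 14.254  # mean PPR points per game of the top 50 WRs (from the lesson)
SD = 3.359     # standard deviation (from the lesson)

wr_model = NormalDist(mu=MEAN, sigma=SD)


def sd_mark(k: int) -> float:
    """Points per game that sit k standard deviations away from the mean."""
    return round(MEAN + k * SD, 3)


def share_between(low: float, high: float) -> float:
    """Approximate share of WRs between two values under the normal model."""
    return wr_model.cdf(high) - wr_model.cdf(low)


# Marks from -3 sd to +3 sd, matching the labels on the plot.
marks = {k: sd_mark(k) for k in range(-3, 4)}
for k, value in marks.items():
    print(f"{k:+d} sd -> {value}")

# Share of the model inside each pair of marks.
for k in (1, 2, 3):
    print(f"within {k} sd: {share_between(marks[-k], marks[k]):.1%}")
```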
Example #2
We are going to use the same data for this second example.
Prompt #1: If a WR averaged 14.254 PPR points per game then what is their percentile?
Work/Explanation:
According to our visualization, the area that we are working with runs from all the way on the left (the lowest value) to the middle of the distribution. The middle of the distribution is where sd = 0, meaning that 50% of the data are less than or equal to it, and 50% of the data are more than or equal to it.
Answer: A WR who averaged 14.254 PPR points per game would be at the 50th percentile, meaning that they averaged more points than or equal to 50% of the top 50 WRs in 2023.
Prompt #2: If a WR averaged 7.536 PPR points per game then what is their percentile?
Work/Explanation:
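Obviously, this WR is on the lower end in terms of fantasy production. 7.536 is at the -2 standard deviation mark. Since approximately 95% of the data falls within 2 standard deviations, the remaining 5% is split evenly between the two tails, which leaves approximately 2.5% of the data at or below this point.
Answer: A WR who averaged 7.536 PPR points per game would be at approximately the 2.5th percentile, meaning that they averaged more points than or equal to approximately 2.5% of the top 50 WRs in 2023. From a Fantasy Football perspective, this is not a very good WR.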
Prompt #3: If a WR averaged 17.613 PPR points per game, what percent of the WRs in this population averaged more points than them?
Work/Explanation: Different from the past prompts, this is a 2-step problem. First, we just do what we have been doing, which is finding what percentile they are.
We already figured out that 17.613 marks the +1 sd. Approximately 84% of the data falls at or below this point (50% below the mean plus 34% between the mean and +1 sd). This means that a WR who averaged 17.613 PPR points per game averaged more points than or equal to approximately 84% of the top 50 WRs. Therefore, they are at the 84th percentile. Step 1 is complete.
Now for Step 2, we need to find the percent of WRs who averaged more points than or equal to this WR. Because this WR is at the 84th percentile, all we need to do is 1 - .84 = .16 = 16%. This makes sense because a percentile measures the share of players you scored better than or equal to. So to reverse this and find the share that scored equal to or better than you, we subtract the percentile (as a decimal) from 1.
Answer: Approximately 16% of WRs in the top 50 WRs in 2023 averaged more than or equal to 17.613 PPR points per game.
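The two steps in Prompt #3 (find the percentile, then subtract from 100%) can be sketched in Python as follows. This assumes the same normal model as the example (mean 14.254, sd 3.359); `percentile` and `percent_above` are illustrative names of my own. Because the code uses the exact normal curve rather than the rounded rule-of-thumb percentages, its results may differ slightly from the hand answers.

```python
from statistics import NormalDist

# Normal model of the top 50 WRs, using the lesson's mean and sd.
wr_model = NormalDist(mu=14.254, sigma=3.359)


def percentile(ppg: float) -> float:
    """Approximate percent of WRs who averaged this many points or fewer."""
    return wr_model.cdf(ppg) * 100


def percent_above(ppg: float) -> float:
    """Approximate percent of WRs who averaged more than this many points."""
    return 100 - percentile(ppg)


for ppg in (14.254, 7.536, 17.613):
    print(f"{ppg} ppg: percentile ~{percentile(ppg):.1f}, "
          f"above ~{percent_above(ppg):.1f}%")
```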
End of Example
C. Z Calculations
Earlier, our points were very convenient. For example, our prompts asked us about 17.613 points, which is exactly where the +1 sd was, and 10.895 points, which is exactly where the -1 sd was. But what if we wanted to find the percentile of someone who averaged 15 points? It's somewhere between 14.254 (0 sd) and 17.613 (1 sd), and we aren't completely sure where it is. And even if we do find how many sd away it is, how do we know what percentile it is if it's not a "basic rule"?
However, we can solve this problem with 2 methods.
Method 1. Use the formula: Z = ( x - u ) / σ
where Z = standard score (how many sd away from the mean) , x = observed value, u = mean, and σ = standard deviation.
So using the same data set, let's now plug in our values:
Z = ( x - u ) / σ
Z = ( 15 - 14.254 ) / 3.359
Z = .22
Now, we know that 15 points is .22 standard deviations away from the mean, 14.254.
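The formula above is one line of arithmetic, so it is easy to check in Python. This is just a sketch; the function name `z_score` is my own.

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """How many standard deviations x sits above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# The lesson's example: a WR averaging 15 points, mean 14.254, sd 3.359.
z = z_score(15, 14.254, 3.359)
print(round(z, 2))  # 0.22
```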
As part of our basic rules, we know that 1 standard deviation above the mean marks the 84th percentile, but 0.22 standard deviations isn't a part of our basic rules. So to get around this, we have to use an AP Statistics reference sheet called Table A, provided by Collegeboard. It looks like this:
For our example, we want to find a z score of positive 0.22. So first, we get to the sheet that displays positive z scores. Then, I like to go to the left column and read from top to bottom until I find the first part of the z score, which in this case is 0.2. Then, along that row, I go from left to right until I reach the column for the second decimal place, which would be 0.02. The value where that row and column meet gives us our percentile.
A WR who averages 15 points per game would approximately be at the 58.71th percentile, or if we want to round, then the 59th percentile.
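If you don't have Table A in front of you, the lookup can be imitated in Python. This sketch assumes Table A lists the area to the left of z under the standard normal curve, rounded to four decimal places, for z values given to two decimal places; `table_a_lookup` is a name I made up, not something from Collegeboard.

```python
from statistics import NormalDist


def table_a_lookup(z: float) -> float:
    """Imitate a Table A lookup: round z to two decimals, return the area to its left."""
    z_rounded = round(z, 2)  # the table only has entries for two-decimal z scores
    return round(NormalDist().cdf(z_rounded), 4)


area = table_a_lookup(0.22)
print(area)                   # 0.5871
print(f"{area * 100:.2f}th percentile")
```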
In Fantasy Football point production, the higher percentile our players are, the better. We want our team to be above as many teams as possible in z scores.
Method 2. Using a calculator
We are going to do the same problem where we want to find the percentile of a WR who averages 15 points per game. However, you should first check if your calculator has this feature since not all do. Since TI-84s are a very popular calculator, I will show the directions using this type, but feel free to search for other directions if you don't have a TI-84.
Step 1: To begin, we will press "2nd" and then quickly follow that with "vars."
Step 2: In the distribution tab, we will select "normalcdf (" by either pressing the number 2, or by using the arrow keys to drop 1 slot down and then pressing enter.
Once you do that, the screen should appear like this:
lower:
upper:
u:
σ:
Paste
Step 3: Plug in numbers. In the lower slot, also called the lower boundary, we want the lowest number possible such as negative infinity. But not all calculators have this feature, so something like -1000 is safe. In the upper slot, also called the upper boundary, we would put 15. In u, we put the mean, 14.254. And in σ, we put the standard deviation, 3.359. After everything is inputted, we go down to "Paste" and press enter on our calculator.
To show our work ----> normalcdf (lower = -1000, upper = 15, u = 14.254, σ = 3.359) = 0.5878
We get a slightly different answer compared to the 58.71th percentile, mainly because the table makes us round the z score to two decimal places (0.22), while the calculator works with the unrounded value. The calculator is marginally more accurate, but these answers are very close to each other and both of them are correct.
Step 4: Move the decimal point. To get our percentile, we move the decimal point two places to the right to get an answer of approximately the 58.78th percentile. If we round up, the 59th percentile.
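For readers without a TI-84, the same calculation can be sketched in Python. The function below is my own stand-in that mirrors the calculator's inputs (lower, upper, mean, sd); it is not the calculator's actual implementation, and it assumes Python 3.8+ for `statistics.NormalDist`.

```python
from statistics import NormalDist


def normalcdf(lower: float, upper: float, mean: float, sd: float) -> float:
    """Area under a normal curve between lower and upper."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(upper) - dist.cdf(lower)


# Same inputs as the calculator steps: -1000 stands in for negative infinity.
area = normalcdf(lower=-1000, upper=15, mean=14.254, sd=3.359)
print(f"{area:.4f}")                   # decimal form of the percentile
print(f"{area * 100:.2f}th percentile")
```

The result should agree with the calculator to about three decimal places; the last digit can differ depending on how each tool rounds.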
Now Let's Apply Fantasy Football!
Example #1
We are going to bring back our list of the top 35 RBs in terms of PPR ppg with min 7 games played in 2023. And as a reminder, this population is smaller because our attention doesn't span across 50 RBs, unlike how our attention is with the top 50 WRs. This decision is to make our population of RBs more generalizable to what we really are interested in.
And again, before proceeding, we must check to see if this list is normally distributed by shape.
And as you can see, this is unfortunately not normally distributed! It looks skewed right instead.
Note: In an AP Stats FRQ, normally if the data is not normally distributed, we still have to proceed with the problem and complete it as if it were normal. However, we must make it clear that we need to interpret the results with caution because the data is not normally distributed. So that is what we are going to do with this problem.
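Besides eyeballing the plot, one rough numeric check for skew is to compare the mean with the median and to compute a skewness value. The sketch below is only an illustration: the list `sample_ppg` is made-up placeholder data, NOT the actual RB list from this lesson, and `skewness` is my own helper using the population formula (the average cubed z score).

```python
from statistics import mean, median, pstdev


def skewness(values: list) -> float:
    """Population skewness: the average of the cubed z scores.

    Roughly 0 for symmetric data, positive for right skew, negative for left skew.
    """
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        raise ValueError("all values are identical, skewness is undefined")
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Hypothetical placeholder numbers for illustration only (not the real RB data).
sample_ppg = [10.1, 10.4, 10.9, 11.2, 11.8, 12.5, 13.9, 16.0, 19.7, 24.3]

print(f"mean   = {mean(sample_ppg):.2f}")
print(f"median = {median(sample_ppg):.2f}")
print(f"skew   = {skewness(sample_ppg):.2f}")
```

A mean noticeably above the median together with a positive skewness value points to a right skew, which is a sign to treat normal-based percentiles with caution.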
Prompt: You are currently drafting a team and come across a hard choice to make: Puka Nacua or Breece Hall? Both are 6th in their respective positions in terms of averaged PPR Points per game, so who do you go with? In 2023, Nacua averaged 17.6 PPR points per game whereas the adjusted WR population average was 14.25 and SD of 3.36. Breece Hall averaged 17.1 PPR points per game whereas the adjusted RB population average was 14.41 and an sd of 3.04. Given this data, which player was better at their respective position?
Work/Explanation: For this problem, we will be using the z-score formula and our Table A Reference sheet.
Step 1. Let's find out Puka Nacua's Z score first:
Z = ( x - u ) / σ
Z = ( 17.6 - 14.25 ) / 3.36
Z = 0.997, which rounds to 1.00
Because a z-score of 1.00 matches to .8413, Puka Nacua was approximately at the 84th percentile in our Fantasy Football adjusted WR population.
Step 2. Now let's do Breece Hall's Z score
Z = ( x - u ) / σ
Z = (17.1 - 14.41) / 3.04
Z = .88
Because a z-score of .88 matches to .8106, Breece Hall was approximately at the 81st percentile in our Fantasy Football adjusted RB population.
Answer: Puka Nacua was at the 84th percentile in his position, and Breece Hall was at the 81st percentile. Therefore, Puka Nacua is the better player respective to his position. However, because the distribution of the top 35 RBs was not normal, we must be cautious with these answers.
Note: Also, it depends on what we define as the perfect adjusted position population. For example, if we adjusted our populations to maybe the top 45 WRs or top 40 RBs, we would get different answers.
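The Nacua versus Hall comparison can be double-checked with a short Python sketch. It uses the means and standard deviations given in the prompt and the unrounded z scores, so the percentiles may differ slightly from the Table A values; `position_percentile` is an illustrative name of my own. The same caution about the non-normal RB population applies to the code's output.

```python
from statistics import NormalDist


def position_percentile(ppg: float, pos_mean: float, pos_sd: float):
    """Return (z score, approximate percentile) of a player within their position."""
    z = (ppg - pos_mean) / pos_sd
    return z, NormalDist().cdf(z) * 100


# Values from the prompt: adjusted WR population and adjusted RB population.
nacua_z, nacua_pct = position_percentile(17.6, pos_mean=14.25, pos_sd=3.36)
hall_z, hall_pct = position_percentile(17.1, pos_mean=14.41, pos_sd=3.04)

print(f"Puka Nacua:  z = {nacua_z:.2f}, percentile ~{nacua_pct:.0f}")
print(f"Breece Hall: z = {hall_z:.2f}, percentile ~{hall_pct:.0f}")
print("Better at his position:",
      "Puka Nacua" if nacua_pct > hall_pct else "Breece Hall")
```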
Example #2
Prompt: We are going to keep our populations for both WRs and RBs and look at some more players to compare to each other. WR Nico Collins, who averages 17.4 PPR points per game, and RB Travis Etienne, who averages 16.6 PPR points per game, are both 7th in their position in terms of averaged PPR points per game. But who is better in their respective position?
Work/Explanation: This time, we are going to use our calculator to solve this problem.
Nico Collins --> normalcdf (lower = -1000, upper = 17.4, u = 14.25, σ = 3.36) = 0.8257
Nico Collins was approximately at the 82.57th percentile, or the 83rd percentile to round up.
Travis Etienne --> normalcdf (lower = -1000, upper = 16.6, u = 14.41, σ = 3.04) = 0.7643
Travis Etienne was approximately at the 76.43rd percentile, or the 77th percentile to round up.
Answer: Nico Collins was approximately at the 83rd percentile in our adjusted WR population, and Travis Etienne was approximately at the 77th percentile of our adjusted RB population. Therefore, Nico Collins was better in his position than Travis Etienne was at his. But again, because our RB population was not normally distributed, we have to be cautious with these results and cannot be fully confident in them.
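When there are several players across positions to compare, it can help to rank them all at once. Here is a rough Python sketch using the position means and standard deviations from this lesson; the dictionary and variable names are my own, and the RB results carry the same caution about non-normal data.

```python
from statistics import NormalDist

# Normal models of each adjusted position population (mean, sd from the lesson).
POSITION_MODELS = {
    "WR": NormalDist(mu=14.25, sigma=3.36),
    "RB": NormalDist(mu=14.41, sigma=3.04),
}

# (name, position, average PPR points per game) from Example #2.
players = [
    ("Nico Collins", "WR", 17.4),
    ("Travis Etienne", "RB", 16.6),
]

# Percentile of each player within their own position, best first.
ranked = sorted(
    ((name, pos, POSITION_MODELS[pos].cdf(ppg) * 100) for name, pos, ppg in players),
    key=lambda row: row[2],
    reverse=True,
)

for name, pos, pct in ranked:
    print(f"{name} ({pos}): ~{pct:.2f}th percentile")
```

Adding more players or positions only means adding entries to the two collections at the top.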
End of Example
2 Takeaways for Fantasy Football
1. Make sure the data is normally distributed
Making statements based on data that is not normally distributed can be very misleading. This can cause you to make poor Fantasy decisions, like trading for a player who is at the 75th percentile but is really more at the 50th percentile based on how skewed the data is. This is what we did with our RB population. We still calculated our answer but were cautious.
2. Greater averaged points is not always the best metric
Now let's say you are deciding between a QB who averages 18 points per game and a WR who averages 17 points per game. Many newcomers into Fantasy Football may select the QB because Fantasy Football is about winning by getting more points, and 18 points is greater than 17.
However, this is where we have to look at the percentiles. How does that QB compare to other QBs, and how does that WR compare to other receivers? The WR who averages 17 ppg is most likely better than a higher percent of his WR peers. For the QB, there may be a lot of other QBs who average 18 ppg or more. So, while raw numbers like averaged PPG can be important, comparing players to other players at the same position is critical as well.