Lesson #1- Basic Measures
Shape, Outliers, Center, Variability
These are the fundamentals of statistics. Here, we will characterize sets of data and introduce terms that you all will see in this section, the next sections, and most likely in your future studies regardless of what you pursue.
A. Shape
The shape is the first thing we have to mention when we characterize a distribution, or in other words, a set of data or numbers. When we look at a distribution, we can recognize the shape by its curves and peaks. It may seem confusing now, but everything will be clear once we dive into some of the different types of shapes.
Note: I used histograms in this section, but there are many other ways to show distributions, like using dot plots or stem plots as well.
1. Symmetrical/Normal. This distribution shape means that if we were to split the distribution in half, then both sides would look the same. Now, it doesn't necessarily have to peak at the middle like this image though. Although most symmetrical distributions have peaks in the middle, at the end of the day, to qualify as symmetrical it just has to look the same on both ends.
For example, a distribution can be symmetrical AND bimodal (will get to this later on).
2. Skewed Right. This distribution shape means that the majority of the values fall towards the left side and that there aren't so many values on the right side. We call the side that does not have a lot of values the "tail." So since the tail is on the RIGHT side, we call this skewed RIGHT. In other words, there are many low values and not many high values.
Visually, think of it as the bars getting shorter from left to right.
3. Skewed Left. As you can tell, this is the opposite of skewed right. So now, there aren't many low values, or values on the left side. Therefore, we have a LEFT tail, making this skewed LEFT. And of course, now we have many more high values, or in other words, more values on the right side.
Visually, think of it as the bars getting taller from left to right.
Note: Skewed left distributions almost never reasonably appear in Fantasy Football. I'll be demonstrating this later on below.
4. Bimodal. Bimodal distributions are not very common in Fantasy Football, but there are definitely cases of them. Bi means two, so think of bimodal as two peaks. At 3.00-4.00, we have a peak, and at 7.00-8.00, we have a peak.
Remember how I said that a distribution can be symmetrical and bimodal? This is a clear example: if we cut this distribution in half, the left and right sides would be identical.
Note: Not all symmetrical distributions are bimodal. For example, a symmetrical distribution with only one peak is not bimodal. And, not all bimodal distributions are symmetrical. For instance, we can have two peaks that are not the same height, so the two sides are not identical to each other.
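To make the "symmetrical vs. bimodal" idea concrete, here is a minimal sketch in Python. The bar heights are made up for illustration (they are not read off the images in this lesson), and the helper names is_symmetric and count_peaks are my own. It only counts a bar as a peak if it is strictly taller than both neighbors, so flat-topped peaks would need a looser rule.

```python
# Hypothetical histogram bar heights, read from left to right.
# These counts are invented for illustration only.
bimodal_counts = [1, 4, 2, 2, 4, 1]
skewed_right_counts = [2, 6, 5, 3, 1, 1]


def is_symmetric(counts):
    """True if the bars read the same left-to-right and right-to-left."""
    return counts == counts[::-1]


def count_peaks(counts):
    """Count bars that are strictly taller than both of their neighbors."""
    padded = [0] + counts + [0]  # treat the area outside the plot as empty
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


print(is_symmetric(bimodal_counts), count_peaks(bimodal_counts))            # True 2
print(is_symmetric(skewed_right_counts), count_peaks(skewed_right_counts))  # False 1
```

So the first set of bars is both symmetrical and bimodal, while the second has one peak and a right tail.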
Now Let's Bring in Fantasy Football!
Example #1- Analyzing Runningback Point Productions
With this distribution, I selected RBs 11-20 in terms of PPG in STD (Standard) scoring with a minimum of 7 games played in the 2023 NFL season. With this group, I tracked the number of points scored in each of their games played.
Prompt: What is the shape of this distribution?
Work/Explanation: Let's start at the left side. Although there aren't many values from 0-2 ppg, it does peak at 4-6 ppg, which is still on the left side. And as we go from left to right, we see that the overall quantity of values is decreasing more and more. The bars are getting shorter and shorter. In other words, there are many values on the left side of the distribution, but not many values on the right side, meaning it has a right tail.
Answer: The shape of the distribution is roughly skewed right.
Note: Notice how I said "roughly." This is because, in real life, we rarely get picture-perfect distributions that match the shape exactly. This is by no means a perfect skewed right, as we would want many more values from 0-4 ppg. So, saying "about" or "roughly" skewed right is perfectly acceptable.
Example #2- Skewed Lefts in Fantasy Football Are Rare
So for this one, I TRIED to get a skewed left distribution, which meant high-scoring games only since the goal was to get a peak on the right side. Therefore, I selected the Top Overall FF Leaders and tracked the number of points they scored in each game in PPR format. Some of these were historic offensive seasons like 2021 Cooper Kupp and 2019 CMC. But still, this isn't quite a skewed left distribution since the peak is in the middle, not on the right side. And, if we split this distribution in half, we can see that the left and right sides are similar to each other.
Therefore, the shape of this distribution is roughly symmetrical. This is all to show how hard it is to get skewed left distributions in Fantasy Football as even exclusively taking the best offensive seasons with the most points scored failed to form one.
End of Example
B. Center and Outliers
When we have a distribution, it is extremely helpful to know what the center point is. It not only gives us the context of our set of numbers but also can summarize it into one single value. The two most common forms of centers we use are means and medians.
1. The mean is the same thing as the average. It is super easy to calculate, as we simply add up all of our values and divide by how many values we have. So for example, let's find the average age for RBs 1-35, ranked by PPG with a minimum of 7 games played. After some research, here's the list: (32, 28, 27, 24, 27, 29, 23, 25, 25, 29, 28, 25, 28, 25, 24, 22, 29, 29, 28, 23, 27, 25, 25, 26, 24, 26, 30, 27, 26, 23, 25, 22, 26, 22, 25). To clarify, all ages are as of September 2023, since that is when most of us are drafting and considering age the most.
To calculate a mean: (32 + 28 + 27 + 24 + 27 + 29 + 23 + 25 + 25 + 29 + 28 + 25 + 28 + 25 + 24 + 22 + 29 + 29 + 28 + 23 + 27 + 25 + 25 + 26 + 24 + 26 + 30 + 27 + 26 + 23 + 25 + 22 + 26 + 22 + 25)/ 35 = the average age of 25.97.
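If you would rather not add 35 numbers by hand, the same calculation can be sketched in a couple of lines of Python. The variable names are my own; the ages are the list above.

```python
# RB ages from the list above (RBs 1-35 by PPG, ages as of September 2023).
rb_ages = [32, 28, 27, 24, 27, 29, 23, 25, 25, 29, 28, 25, 28, 25, 24, 22, 29, 29,
           28, 23, 27, 25, 25, 26, 24, 26, 30, 27, 26, 23, 25, 22, 26, 22, 25]

total_age = sum(rb_ages)            # add up all of the values
mean_age = total_age / len(rb_ages) # divide by how many values we have

print(total_age, len(rb_ages))  # 909 35
print(round(mean_age, 2))       # 25.97
```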
2. Medians are the other measure of center. These are also very simple to find. The first step is to know how many values our distribution has, and in this case, since there are 35 RBs, we have 35 values. The second step is to order our distribution from least to greatest, and because we already made a bar chart here that did it for us, we don't have to worry about that. Third, in our ordered list of numbers, we take the middle value. So in this case, since we have 35 values, we would take the 18th value (17 values sit below it and 17 above it). Our 18th value is in the 26 age bar.
With these three steps, we would find our median to be 26 years old.
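The three steps above can be sketched in Python like this. This shortcut of grabbing one middle value only works when the number of values is odd, as it is here; Python's built-in statistics.median handles both cases and is used as a cross-check.

```python
import statistics

rb_ages = [32, 28, 27, 24, 27, 29, 23, 25, 25, 29, 28, 25, 28, 25, 24, 22, 29, 29,
           28, 23, 27, 25, 25, 26, 24, 26, 30, 27, 26, 23, 25, 22, 26, 22, 25]

# Step 1: how many values do we have?
n = len(rb_ages)

# Step 2: order from least to greatest.
ordered_ages = sorted(rb_ages)

# Step 3: take the middle value. With 35 values that is the 18th one,
# which is index 17 because Python starts counting at 0.
middle_position = (n + 1) // 2
median_age = ordered_ages[middle_position - 1]

print(middle_position, median_age)  # 18 26
print(statistics.median(rb_ages))   # 26
```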
But Which is Better?
The rule of thumb is that when we have outliers or any strong skews, we use medians. When we don't, we use means.
Think of outliers as values in a distribution that essentially shouldn't be there. They shy away from everyone else too much and are thus outliers. So let's look at our previous list of Runningback ages: (32 28 27 24 27 29 23 25 25 29 28 25 28 25 24 22 29 29 28 23 27 25 25 26 24 26 30 27 26 23 25 22 26 22 25)
Skews are easily seen through histograms, dot plots, or bar graphs, but the best way to find out if we have outliers is by making a box plot.
- How to read a box plot:
- Minimum: First line
- Quartile 1: Where the blue starts
- Median: Middle line between Q1 and Q3
- Quartile 3: Where the blue ends
- Maximum: Last line
Box plots tell us something very important: IQR. IQR is the distance from the 1st Quartile to the 3rd Quartile, telling us how far the middle 50% of values span across. Visually, this is how far the blue portion spans. Given by the box plot, the IQR can be found by doing 28-24=4.
So now that we know the IQR, we can see if we have any outliers by the formula:
lower outliers ≤ Q1 - 1.5(IQR)
upper outliers ≥ Q3 + 1.5 (IQR)
Now, let's see if in our data set, ages 22 and 32 are outliers or not.
1. Testing for Lower Outliers
22 ≤ 24 - 1.5(4)
22 ≤ 18
22 is not less than 18, so 22 is not an outlier.
2. Testing for Upper Outliers
32 ≥ 28 + 1.5(4)
32 ≥ 34
32 is not more than 34, so 32 is not an outlier.
So because this distribution has no outliers and no strong skews, using a mean would be the better choice for the center.
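Here is a minimal Python sketch of the same outlier check. One assumption to flag: I find Q1 and Q3 as the medians of the lower and upper halves of the sorted data (leaving out the middle value when the count is odd), which matches the 24 and 28 read off the box plot. Other tools, such as spreadsheet or NumPy percentile functions, may use slightly different quartile rules and give slightly different numbers. The helper name quartiles is my own.

```python
import statistics

rb_ages = [32, 28, 27, 24, 27, 29, 23, 25, 25, 29, 28, 25, 28, 25, 24, 22, 29, 29,
           28, 23, 27, 25, 25, 26, 24, 26, 30, 27, 26, 23, 25, 22, 26, 22, 25]


def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves (assumed method)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    # Skip the middle value itself when there is an odd number of values.
    upper_half = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return statistics.median(lower_half), statistics.median(upper_half)


q1, q3 = quartiles(rb_ages)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Same rule as the formulas above: at or beyond a fence counts as an outlier.
outliers = [age for age in rb_ages if age <= lower_fence or age >= upper_fence]

print(q1, q3, iqr)               # 24 28 4
print(lower_fence, upper_fence)  # 18.0 34.0
print(outliers)                  # []
```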
Now Let's Bring in Fantasy Football!
Example #1- DJ Moore
Let's look at WR DJ Moore's 2023 Game Log for Receiving Yards: (64, 159, 18, 52, 68, 114, 96, 58, 44, 55, 54, 51, 230, 131, 41, 104, 25)
Prompt: Would we use mean or median for the center of receiving yards? And what would they be?
Work/Explanation: From the histogram, we can see the distribution to be skewed right. And because we have a strong skew, we should use the median.
But also, let's find out if we have any outliers, especially since DJ Moore's game where he had 230 yards seems to be very far from the rest.
Also notice that there is a dot on the far right side of the box plot. Dots like this mark outliers. When we aren't given a visual, we need to check for outliers by hand using the IQR formulas. But if we are given a box plot, we just have to look for dots to find outliers.
So from this box plot, our Q1 is 47.5 and Q3 is 109. Therefore, our IQR is the difference between these, which is 61.5.
1. Testing Lower Outlier
18 ≤ 47.5 - 1.5(61.5)
18 ≤ -44.75
This is not true, so there are no lower outliers.
2. Testing Upper Outlier
230 ≥ 109 + 1.5(61.5)
230 ≥ 201.25
This is true, so we have an upper outlier.
Answer: So because this distribution has an outlier and is skewed right, using a median would be the better choice for the center.
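To answer the "what would they be?" part of the prompt, here is a short Python sketch run on the game log above. As an assumption, the quartiles are found as the medians of the lower and upper halves of the sorted data, which reproduces the 47.5 and 109 from the box plot; the helper name quartiles is my own.

```python
import statistics

dj_moore_yards = [64, 159, 18, 52, 68, 114, 96, 58, 44, 55, 54, 51, 230, 131, 41, 104, 25]


def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves (assumed method)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return statistics.median(lower_half), statistics.median(upper_half)


mean_yards = statistics.mean(dj_moore_yards)
median_yards = statistics.median(dj_moore_yards)

q1, q3 = quartiles(dj_moore_yards)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [y for y in dj_moore_yards if y <= lower_fence or y >= upper_fence]

print(round(mean_yards, 2), median_yards)  # 80.24 58
print(q1, q3, iqr)                         # 47.5 109.0 61.5
print(lower_fence, upper_fence)            # -44.75 201.25
print(outliers)                            # [230]
```

Notice how the single 230-yard game pulls the mean (about 80) well above the median (58), which is exactly why the median is the safer center here.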
End of Example
C. Variability
Variability describes the spread of our data. Are all of our numbers packed up into one tight area? Are they dispersed? Are they nowhere near each other? And just like measuring a center, we have multiple forms of variability.
1. IQR
IQR, which is what we used previously for finding outliers, is indeed a measure of variability. It tells us how clumped together the middle 50% of values are. In the first picture, we see that the middle values are fairly evenly spread out. Here, the IQR is 4, (27-23).
In the second picture, there is much less blue. This means that the middle 50% of values are very compact and dense. Here, the IQR is less than 2.
In the third picture, we can see that there is a lot of blue. This means that the middle 50% of values are very spread out. Here, the IQR is 6.
Note: We can also estimate IQR by looking at histograms and dot plots. To do this, we first need the total number of values in our distribution. In this case, 1+7+6+4+4+1 = 23. The median will be the 12th number (lies in 20-30), our Q1 will be our 6th number (lies in 10-20), and Q3 will be our 18th number (lies in 30-40). So, our IQR is estimated to be 20 (either 40-20 or 30-10). Because a histogram doesn't tell us exactly what each value is, we have to estimate our IQR.
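The counting in this note can be sketched in Python. The bar heights (1, 7, 6, 4, 4, 1) come from the note; the bin edges are my assumption that the bars cover 0-10, 10-20, and so on up to 50-60, which is consistent with the bins named above. The helper name bin_of_position is hypothetical.

```python
# Bar heights from the note above, left to right.
counts = [1, 7, 6, 4, 4, 1]
# Assumed bin ranges, one per bar.
bins = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]


def bin_of_position(position, counts, bins):
    """Return the bin that holds the value at a given 1-based position."""
    running_total = 0
    for count, bin_range in zip(counts, bins):
        running_total += count
        if position <= running_total:
            return bin_range
    raise ValueError("position is larger than the number of values")


total = sum(counts)               # 23 values
median_position = (total + 1) // 2  # 12th value
half_size = total // 2              # 11 values on each side of the median
q1_position = (half_size + 1) // 2  # 6th value
q3_position = median_position + q1_position  # 18th value

print(total, median_position, q1_position, q3_position)  # 23 12 6 18
print(bin_of_position(median_position, counts, bins))    # (20, 30)
print(bin_of_position(q1_position, counts, bins))        # (10, 20)
print(bin_of_position(q3_position, counts, bins))        # (30, 40)
```

Since Q1 sits somewhere in 10-20 and Q3 somewhere in 30-40, an estimated IQR of about 20 is reasonable.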
2. Range
The range is a measure that tells us the total spread of our data from the smallest value to the biggest value. Here, we get a dot plot that tells us exactly what each value is, so we don't have to estimate. We calculate the range by subtracting the smallest value from the largest value. In this case, our range is 8-1=7.
We can also find the range using box plots. As we have discussed before, our minimum is demonstrated by the first line (all the way left) and our maximum by the last line (all the way right.) In this example, the range is 29-20=9.
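When we have the raw numbers instead of a plot, the range is a one-liner. As a quick sketch, here it is in Python using the RB ages from the Center section (not the data behind the plots above).

```python
# RB ages from the Center section of this lesson.
rb_ages = [32, 28, 27, 24, 27, 29, 23, 25, 25, 29, 28, 25, 28, 25, 24, 22, 29, 29,
           28, 23, 27, 25, 25, 26, 24, 26, 30, 27, 26, 23, 25, 22, 26, 22, 25]

# Range = largest value minus smallest value.
age_range = max(rb_ages) - min(rb_ages)

print(max(rb_ages), min(rb_ages), age_range)  # 32 22 10
```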
3. Standard Deviation
Standard deviation isn't restricted to the 50% like IQR or the least and greatest number like range but rather is influenced by every single value in a data set. An easy definition of standard deviation is simply the overall spread. Low standard deviation means low spread, as in the numbers clump up together, like in the first picture here:
A high standard deviation on the other hand means that the data set is more spread out, like picture 2.
Standard deviations are more tedious to calculate, so in AP Statistics, we rarely hand-calculate them. We either plug our numbers into a calculator or compare which standard deviation is greater or lower simply by looking at the distributions.
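For the curious, here is a rough Python sketch of what the calculator is doing, using the RB ages from the Center section. It shows both the sample version (divide by n - 1) and the population version (divide by n), since this lesson doesn't say which one a given calculator reports. The by-hand lines are only there to show that every single value feeds into the result.

```python
import math
import statistics

rb_ages = [32, 28, 27, 24, 27, 29, 23, 25, 25, 29, 28, 25, 28, 25, 24, 22, 29, 29,
           28, 23, 27, 25, 25, 26, 24, 26, 30, 27, 26, 23, 25, 22, 26, 22, 25]

# By hand: how far is each value from the mean, squared?
mean_age = statistics.mean(rb_ages)
squared_gaps = [(age - mean_age) ** 2 for age in rb_ages]
sample_sd_by_hand = math.sqrt(sum(squared_gaps) / (len(rb_ages) - 1))

# The standard library does the same work for us.
sample_sd = statistics.stdev(rb_ages)       # divides by n - 1
population_sd = statistics.pstdev(rb_ages)  # divides by n

print(round(sample_sd_by_hand, 2), round(sample_sd, 2), round(population_sd, 2))
```

The sample version always comes out a little larger than the population version, and the gap shrinks as the data set grows.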
Now Let's Bring in Fantasy Football!
Example #1- Michael Pittman
Here is a box plot of WR Michael Pittman's number of receiving yards each game in 2023. Due to some values being obscure, here is the 5 number summary:
Min: 10 Q1: 48.5 Median: 68 Q3: 96 Max: 134
Prompt: What is the range of his yardage in 2023? What is the IQR?
Work/Explanation: Because there are no circles in the far left or right areas, there are no outliers in this distribution. So, we can calculate the range by subtracting the minimum, 10, from the maximum, 134. We get 134 - 10 = 124. Therefore, the range of Michael Pittman's receiving yards per game in 2023 was 124 yards.
To get the IQR, we subtract Q1 from Q3. So, 96 - 48.5 = 47.5.
Answer: The middle 50% of Michael Pittman's receiving yards per game fell between 48.5 yards and 96 yards, making his IQR 47.5 yards.
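Both answers come straight from the 5 number summary, so the arithmetic can be checked with a tiny Python sketch. The dictionary keys are just labels I chose.

```python
# Michael Pittman's 5 number summary from above (receiving yards per game, 2023).
pittman_summary = {"min": 10, "q1": 48.5, "median": 68, "q3": 96, "max": 134}

# Range = maximum minus minimum.
yard_range = pittman_summary["max"] - pittman_summary["min"]

# IQR = Q3 minus Q1.
iqr = pittman_summary["q3"] - pittman_summary["q1"]

print(yard_range)  # 124
print(iqr)         # 47.5
```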
Example #2- Comparing Michael Pittman and Gabe Davis
Prompt: By the looks of these two distributions, which receiver has a higher standard deviation in terms of their full PPR points scored in each game played?
Work/Explanation: From the first distribution, we see a peak in the middle as everything else is relatively short. This is not very spread out. In Gabe Davis's distribution, we see two peaks. Logically, the peaks mean clustered data, so this should mean that Gabe Davis's data is more compressed and thus has a lower standard deviation right?
However, look at the massive gap that lies between the peaks. Although this distribution is not evenly spread out, it is overall more spread out than Michael Pittman's PPR points distribution.
So from an eye test, Gabe Davis looks to have a higher standard deviation in his PPR points scored in each game played when compared to Pittman's. And when we actually happen to calculate the standard deviations via a formula, we get 9.3 for Gabe Davis and 5.87 for Michael Pittman.
Answer: Supported by an eye test and completed calculations, Gabe Davis's distribution has a higher standard deviation.
End of Example
Now Let's Combine Everything
For the last problem of this section, I tracked the age of each of the Top 5 QBs in each season (ranked in ppg with min 7 games played) from the years 2013-2023. I also recalibrated their ages to fit their respective seasons. So for example, when tracking QBs in the 2019 NFL season, I put Lamar Jackson's age as 22 years old since that was his age back then, even though Lamar Jackson is currently 27.
Prompt: Describe the distribution of the ages of the Top 5 QBs in each season from 2013-2023. Make sure to include the shape, center, any outliers, and the variability. Although there are multiple measures of center and variability, one of each is sufficient.
Work/Explanation:
Shape: We see a peak towards the left side of the distribution in ages 23-26. After the peak, there is a dip. From left to right, we see the height of the distribution decrease, meaning that high values, or older quarterbacks, are not as abundant as low values, or relatively young quarterbacks. Therefore, the shape is skewed right.
Center: Because the distribution is skewed right, the median would be better. We have 55 values in our data set, so the median would be the 28th value. The 28th value lies in 26. Therefore, the median is 26 years old.
Outliers: 44 looks like an outlier, but we can find out for sure by using calculations. Q1 is our 14th value, which lies in 24. Q3 is our 42nd value, which lies in 32. So, our IQR is 32-24=8.
Lower Outliers:
Minimum ≤ Q1 - 1.5(IQR)
22 ≤ 24 - 1.5(8)
22 ≤ 12
This is not true, so we have no lower outliers.
Upper Outliers:
Maximum ≥ Q3 + 1.5(IQR)
44 ≥ 32 + 1.5(8)
44 ≥ 44
This is true, so we have an upper outlier.
For outliers, 44 is an upper outlier. 44 years old is far enough from the rest of the ages that it counts as an outlier. And of course, this was Tom Brady.
Variability: For variability, the IQR (previously calculated by Q3 - Q1, or 32 - 24) is 8 years. This means that the middle 50% of Top 5 QB ages in each season from 2013-2023 had an 8-year overall spread.
Answer: In the distribution of the ages of the Top 5 QBs in each season from 2013-2023, the shape is skewed right, the median is 26 years old, there is an upper outlier, and the IQR is 8 years.
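Since the full list of 55 QB ages isn't printed here, this Python sketch only re-checks the outlier arithmetic from the summary values quoted in the work above (minimum 22, Q1 24, Q3 32, maximum 44). Notice that 44 lands exactly on the upper fence, so it only counts as an outlier because this lesson's rule uses "greater than or equal to."

```python
# Summary values quoted in the worked solution above.
qb_min, qb_q1, qb_q3, qb_max = 22, 24, 32, 44

iqr = qb_q3 - qb_q1
lower_fence = qb_q1 - 1.5 * iqr
upper_fence = qb_q3 + 1.5 * iqr

# Same rule as the lesson: at or beyond a fence counts as an outlier.
has_lower_outlier = qb_min <= lower_fence
has_upper_outlier = qb_max >= upper_fence

print(iqr, lower_fence, upper_fence)         # 8 12.0 44.0
print(has_lower_outlier, has_upper_outlier)  # False True
```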
End of Example
2 Takeaways for Fantasy Football
1. Pay attention to skewed data
When comparing players to each other, we often only use means. For example, many apps only give us average points per game or average yards per game. But when we have skewed data, we should use medians instead, since medians are barely affected by outliers or strong skews and are thus a better indicator of a player's typical performance.
For example, DJ Moore has a mean of 80 receiving yards per game vs. DK Metcalf's 69 yards per game. But when we use medians, we find that DJ Moore's median is 58, which is less than DK Metcalf's 78. Pay attention to your distributions and don't always rely on means.
2. Evaluate ages wisely
The data told us that the ideal age for quarterbacks in Fantasy Football from 2013-2023 was 26 years old.
So yes, we should look at quarterbacks' ages and hope that they are around 26 years old, but we should also note that age isn't all that matters and that there are outliers. For example, Tom Brady led in ppg in the 2021 NFL season at 44 years old, which is very far from 26 years old. Overall, ages are very important, but they aren't everything.