Lesson #3 - Linear Regression Part #2
Prediction calculations, Residuals, Extrapolation
Now that we know what the data tells us, what's next? This lesson continues the previous linear regression lesson: we will take the concepts learned in lesson #2 and push them further. That includes making predictions, analyzing the differences between predicted and actual outcomes, and clearing up common misconceptions in statistics.
A. Predictions
With each scatterplot, we can make an equation, or a formula, in the form of ŷ = a + bx. This is very similar to y = mx + b, where m is the slope and b is the y-intercept. The difference is that ŷ = a + bx is solely for predictions, where a = y-intercept, b = slope, and ŷ = predicted outcome. This is also the equation for each LSRL (Least Squares Regression Line), or in other words the line of best fit in our scatterplots. There are many types of lines of best fit, but the LSRL is what is studied in AP Statistics. Its goal is to find the straight line that minimizes the sum of the squared vertical distances between the points and the line.
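To make "minimizes the sum of the squared distances" a bit more concrete, here is a minimal Python sketch of how a and b can be computed from raw data. The function names (fit_lsrl, sum_squared_residuals) and the five data points are made up for illustration; they are not the WR data used in this lesson.

```python
def fit_lsrl(xs, ys):
    """Return (a, b) for the least squares line y_hat = a + b*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: sum of (x - x_mean)(y - y_mean) divided by sum of (x - x_mean)^2
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the LSRL always passes through the point (x_mean, y_mean)
    a = y_mean - b * x_mean
    return a, b


def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical distances between each point and the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, NOT the real WR data set
xs = [20, 40, 60, 80, 100]
ys = [9.0, 14.5, 16.0, 22.5, 25.0]

a, b = fit_lsrl(xs, ys)
best = sum_squared_residuals(xs, ys, a, b)
# A line with a slightly different slope fits worse (larger squared sum)
nudged = sum_squared_residuals(xs, ys, a, b + 0.01)
print(a, b, best < nudged)
```

In practice a calculator or software does this step for us, which is why the examples in this lesson start from values of a and b that were already calculated.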
With this equation, we can predict the outcome by plugging in our a, b, and x values. However, as previously mentioned, this is only a prediction, not the actual outcome.
B. Residuals
The residual measures how far the actual outcome is from the outcome predicted by the LSRL. A positive residual means that the actual outcome is greater than the predicted outcome. Visually, the LSRL is under the actual point. A negative residual means that the actual outcome is lower than the predicted outcome. Visually, the LSRL is over the actual point. To calculate a residual, we take the actual outcome minus the predicted outcome.
Now Let's Apply Fantasy Football!
Example #1
We're going to bring this scatterplot back! So, if you look at the top right and inside the right circle, you'll see some values that we will input for our equation. So using this information that was already calculated for us, we can make our equation:
ŷ = a + bx
ŷ = 5.81 + 0.196x
Prompt #1: Using the LSRL equation that was given above, in our population of the Top 50 WRs (ranked in points per game w/min 7 games played) what is the predicted Avg PPR Points per Game for a Wide Receiver who had an Avg of 60 Yards Before Catch per Game?
Work/Explanation:
ŷ = a + bx
ŷ = 5.81 + 0.196(60)
ŷ = 17.57
Answer: A wide receiver in our population who had an avg of 60 yards before catch per game was predicted to average 17.57 PPR points per game.
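The same calculation can be sketched in a few lines of Python. The coefficients come from the equation in this example; the function name predict_ppr is just a name chosen for this sketch.

```python
# Coefficients from the LSRL equation in this example
A = 5.81   # y-intercept
B = 0.196  # slope


def predict_ppr(yards_before_catch, a=A, b=B):
    """Predicted Avg PPR Points per Game: y_hat = a + b*x."""
    return a + b * yards_before_catch


y_hat = predict_ppr(60)
print(round(y_hat, 2))  # 17.57
```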
Note: However, we can also find our predicted outcome simply by tracing our fingers, or our eyes, to the spot. In this case, I went from left to right on the x-axis to reach 60 and then went upwards until I reached the line to find the predicted point. This point is marked by the green circle.
Although this method is much quicker, it won't give us as accurate a value, because our eyes can't tell the difference between, say, 17.4 and 17.57.
Prompt #2: With this same population in the scatterplot, find the residual of Avg PPR Points per Game for a WR who averaged 60 yards before catch per game.
Work/Explanation: Remember, the residual = actual outcome minus predicted outcome. We already got our predicted outcome above, which was 17.57. So now, all we have to do is get the actual outcome.
All the blue circles represent our actual real outcomes. So to look for the actual outcome for our prompt, we want to go from left to right on the x-axis until we reach 60, and then go upwards until we see not the blue line, but an actual blue circle.
But as you can see, we don't know the EXACT y-value of this point unless provided with the data set. We can't calculate this y-value though, so we just have to estimate. Let's just estimate that this WR averaged 16 points per game for the y value.
Answer: Because the formula of the residual is the actual outcome minus the predicted outcome, we simply do 16 - 17.57. This gives us a residual of -1.57. This means that the predicted outcome of the LSRL, given that the WR averaged 60 yards before catch per game, overshot the actual outcome by 1.57 PPR Avg Points per Game.
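Here is the residual calculation as a small Python sketch. The actual value of 16 is the eyeballed estimate from this example, not an exact data point, and the function name residual is simply a name chosen here.

```python
def residual(actual, predicted):
    """Residual = actual outcome minus predicted outcome."""
    return actual - predicted


predicted = 5.81 + 0.196 * 60  # LSRL prediction at x = 60
actual = 16.0                  # estimated by eye from the scatterplot

r = residual(actual, predicted)
print(round(r, 2))  # -1.57

# The sign tells us which side of the line the actual point is on
if r > 0:
    print("Actual point is above the LSRL (the line underpredicted)")
elif r < 0:
    print("Actual point is below the LSRL (the line overpredicted)")
else:
    print("Actual point is exactly on the LSRL")
```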
End of Example
C. Interpolation
The word interpolation might seem complex, but its definition is actually very simple. Interpolation refers to using an LSRL to make predictions for x values that fall within the relative boundaries of our existing, known data points. The main point to remember is that our predictions are more accurate with interpolation.
For instance, let's use the same scatterplot we have been using. Note that the dots look more condensed, but this is just because we widened the ranges of the x- and y-axes. The data itself is still the same.
To exemplify interpolation, the LSRL between the orange dotted boundaries would be accurate since the x values are not grossly far from the data. We can safely make predictions with our LSRL equation as long as our x value is within those boundaries.
D. Extrapolation
Extrapolation is the opposite of interpolation. This is where our predictions are beyond, or outside, the safe interval in terms of the x-axis. As a result, predictions made with the LSRL there are less accurate, and it is unwise to rely on them. Note that within the scope of this page, we aren't going to calculate strict boundaries as in College Board's AP Statistics. We can simply use the eye test and estimate to the best of our ability.
The green points represent our predicted values for each of their respective x values (yards before catch per game). However, they all fall outside the safe boundaries for accurate predictions. Therefore, we treat these predictions as unreliable.
For example, according to the LSRL, a WR who averaged 0 yards before catch per game should still average about 5.81 points per game (the y-intercept). This is extremely unlikely, and what is even more unlikely is a WR averaging negative yards before the catch yet still having positive fantasy points.
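The difference between interpolation and extrapolation can be sketched as a simple range check. This is only an illustration: the boundary values X_MIN and X_MAX below are assumed numbers, not read from the real data set, and using a fixed range like this is a stand-in for the "eye test" described above.

```python
def prediction_type(x, x_min, x_max):
    """Label x as interpolation or extrapolation, given the observed x range."""
    return "interpolation" if x_min <= x <= x_max else "extrapolation"


# ASSUMED boundaries for yards before catch per game (hypothetical values);
# in practice, read these off your own scatterplot
X_MIN, X_MAX = 20.0, 110.0

for x in (60, 0, -10):
    y_hat = 5.81 + 0.196 * x  # LSRL from the WR example
    print(x, round(y_hat, 2), prediction_type(x, X_MIN, X_MAX))
```

The equation happily returns a number for any x, including 0 and negative values. The range check is what tells us whether that number deserves any trust.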
Now Let's Apply Fantasy Football!
Example #1
We are going to bring this scatterplot back that shows the association between Avg Points per Game and PCT for the Top 32 QBs (ranked in ppg w/ min 7 games) in 2023.
Prompt #1: With interpolation and extrapolation in mind, would it be reasonable to make a prediction for a QB who has a PCT of 65%? Why?
Answer: Yes, it would be reasonable to make a prediction for a QB who has a PCT of 65% because the x value of 65% falls within the relative boundaries of the known data points.
Prompt #2: Would it be reasonable to make a prediction for a QB who has a PCT of 25%?
Answer: No, it would not be reasonable. The LSRL cannot reliably make predictions with x values that severely fall out of the boundaries. 25% is very far from the rest of the known data points in the plot.
Example #2
(Four scatterplots, labeled A, B, C, and D, each drawn with a different set of boundaries.)
Prompt: Which of these answer choices has boundaries within which we can safely make our predictions? (Interpolation)
Work/Explanation: Let's go back to some of the criteria for interpolation. First, we want to place our boundaries along the x values of our points. Knowing this, B is incorrect because its boundaries are along the y values of the known data points. Second, we want the boundaries to surround our known data points. Therefore, C is incorrect because its boundaries cut into the data points when they're supposed to sit outside them. Third, we want to make sure that our boundaries are not too wide, as they should be relatively close to our data points. Therefore, D is incorrect because its boundaries are too far from the data. For instance, with option D's logic, an RB with a snap share of 0% is somehow still predicted to average nearly 5 ppg, which is almost impossible, and is thus not a trustworthy prediction.
Answer: Option A is the best answer as it follows the criteria of interpolation. The boundaries are along the x axis, around our known data points, and not too far from them. We can make accurate predictions about the RBs in this population within these boundaries.
End of Example
2 Takeaways for Fantasy Football
1. Target players with positive residuals
A positive residual means that a player overperformed. We want these players because their production was better than the LSRL predicted, given the overall correlation.
So for example, Deebo Samuel is great because he overperforms his predicted amount of points based on yards before catch per game by about 5 PPR points.
2. Don't make unrealistic predictions
Don't be misled by the LSRL. The further our prediction is from the boundaries of our known data points, the worse it will be. For example, if you are in a dynasty league where your attention is on very young and inexperienced players, you cannot safely use an LSRL to predict what they will do in their 7th year. Those predictions will not be as accurate as predictions for their 2nd year, for any metric in Fantasy Football.
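As a rough illustration of the first takeaway, the sketch below ranks players by residual using the WR equation from earlier in this lesson. The players and their stats are entirely hypothetical placeholders, not real 2023 numbers.

```python
A, B = 5.81, 0.196  # LSRL from the WR example: y_hat = 5.81 + 0.196x

# HYPOTHETICAL players: (name, yards before catch per game, actual PPR ppg)
players = [
    ("Player A", 50, 20.5),
    ("Player B", 70, 17.0),
    ("Player C", 60, 17.6),
]


def residual(x, actual):
    """Actual PPR ppg minus the LSRL's predicted PPR ppg."""
    return actual - (A + B * x)


# Biggest overperformers (largest positive residuals) come first
ranked = sorted(players, key=lambda p: residual(p[1], p[2]), reverse=True)
for name, x, actual in ranked:
    print(name, round(residual(x, actual), 2))
```

Keep the second takeaway in mind when using something like this: a residual is only meaningful if the player's x value sits within the boundaries of the known data.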