Lesson #2- Linear Regression
Correlation, R^2 value, Scatterplots
Linear regression is all about making predictions based on the relationship between two variables, specifically the explanatory and response variables. Data here is typically shown in a scatterplot. We will analyze this data, including assessing the traits of the relationship as well as familiarizing ourselves with scatterplots.
A. Correlation
The explanatory variable will be your x variable on a graph, as it is what may help predict changes in the response variable y. In statistics, we call the relationship between the two the correlation. It is represented by the correlation coefficient "r," which always falls between -1 and +1. Higher absolute values mean stronger associations, positive r values mean positive associations, and negative r values mean negative associations. Here are 4 characteristics of a correlation:
1. Direction
1a. Positive- As the explanatory variable x increases, the response variable y increases. For example in Fantasy Football, the more TDs a QB has, the more fantasy points generated. 0 < r ≤ 1
1b. Negative- As the explanatory variable x increases, the response variable y decreases. For example in Fantasy Football, the more interceptions a QB throws, the fewer fantasy points generated. -1 ≤ r < 0
1c. No Correlation- As the explanatory variable x increases, the response variable y does not consistently increase or decrease. r ≈ 0 (an r of exactly 0 is very rare)
2. Form
2a. Linear- Usually with forms in general, we can just use the eye test. In this graph, we can imagine a straight line running through the middle of the data. There is a roughly even number of points above and below our line, which is what we want to see in a linear form.
2b. Nonlinear- These can be exponential or logarithmic. The point is that with nonlinear forms, we can see obvious curves. This one is exponential, as we see a nearly flat stretch that suddenly rises quickly. To make a nonlinear plot linear, we would apply a log transformation to the data and then check the result using a residual plot.
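To make the log idea concrete, here is a minimal sketch in Python. The data is made up for illustration (it is not from the lesson's plots), and pearson_r is a helper written for this example. It shows how taking the log of an exponential y can straighten the pattern and push r toward 1.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up exponential data: flat at first, then rising quickly.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 ** x for x in xs]

# Log-transform y only; x stays as it is.
log_ys = [math.log(y) for y in ys]

r_raw = pearson_r(xs, ys)      # curved pattern, so r is noticeably below 1
r_log = pearson_r(xs, log_ys)  # straightened pattern, so r is about 1

print(f"r before log: {r_raw:.2f}")
print(f"r after log:  {r_log:.2f}")
```

A higher r after the transformation is only a hint. The residual plot is still the real check for whether the straightened data is truly linear.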
3. Strength
3a. Strong Association- Strength is a measure of how close our values are to our line of best fit. So with a strong association, like this one, our points remain near our line of best fit, barely deviating from it.
Another way we can see it is how skinny our ray of points is. The skinnier the ray, the stronger our association.
Strong associations will have a correlation coefficient of |r| > 0.6 (that is, r > 0.6 or r < -0.6)
3b. Moderate- We see that our points are deviating from our line of best fit more, as the ray is getting wider.
Moderate associations will have a correlation coefficient of 0.4 ≤ |r| ≤ 0.6
3c. Weak- Our points are now very far from the line of best fit and our ray is undeniably even wider now.
Weak associations will have a correlation coefficient of |r| < 0.4
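The strength cutoffs above can be sketched as a small Python function. The name classify_strength is made up for this example, and the 0.4 and 0.6 cutoffs are this lesson's rule of thumb rather than a universal standard.

```python
def classify_strength(r):
    """Label a correlation coefficient using this lesson's cutoffs."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and +1")
    size = abs(r)  # strength ignores the sign; direction is a separate trait
    if size > 0.6:
        return "strong"
    if size >= 0.4:
        return "moderate"
    return "weak"

print(classify_strength(0.75))   # strong
print(classify_strength(-0.57))  # moderate
print(classify_strength(0.2))    # weak
```

Notice that a negative r is judged by its absolute value, so r = -0.8 is just as strong as r = 0.8.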
4. Unusual Features
Unusual features are points that fall out of order or the pattern of our scatterplot. On this website, we won't go over the calculations of how to find unusual features, but there are other methods to do so.
4a. Eye Test
With the eye test, we want to pay attention to the y values, or the heights of our points. In this scatterplot, there are no obvious and clear points that fall out of order as they all relatively obey the positive association. None of them seem too high or too low.
Here, we do have an unusual point, marked by the red circle. The association is positive, meaning that from left to right our points should be increasing in y value. But this one is severely under. With an x value of about 70, it should have a y value around 15-24, yet it is only at 5. Therefore, this is an unusual point.
4b. Removal of Suspected Unusual Point
After the removal of a suspected unusual point, our r value should increase significantly in absolute value. This is because the unusual point was weakening the correlation simply by being so out of place. But if we remove our suspected unusual point and the correlation coefficient barely changes, then that point is not unusual.
With the unusual point, the correlation coefficient was .57. This would classify as a moderate association. But after we removed this point, we now get a correlation coefficient of .75, making this association strong, which is better. This was a big jump in the right direction for r, meaning that the circled point was in fact unusual.
But let's say we thought this point in the top right was unusual instead. So, after removing that point, our correlation coefficient decreased by .06. Our strength is still at a moderate level. Not only is this not a major change, but this also actually made our correlation worse. Removing unusual points should improve our correlation. So, therefore, because removing this point in the top right corner insignificantly affected our r value AND decreased it, it was not unusual.
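The removal check can be sketched in Python as follows. The points are made up for illustration (they are not the data behind the scatterplots above), and pearson_r and r_without are helper names invented for this example. One point, at x = 70 and y = 5, sits far below an otherwise positive pattern.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_without(xs, ys, index):
    """Recompute r after leaving out the point at the given position."""
    kept_x = xs[:index] + xs[index + 1:]
    kept_y = ys[:index] + ys[index + 1:]
    return pearson_r(kept_x, kept_y)

# Made-up data: position 6 is the suspected unusual point (70, 5).
xs = [10, 20, 30, 40, 50, 60, 70, 80]
ys = [4, 7, 9, 12, 14, 17, 5, 23]

r_all = pearson_r(xs, ys)
r_drop_low = r_without(xs, ys, 6)        # remove the low point
r_drop_top_right = r_without(xs, ys, 7)  # remove the top-right point instead

print(f"all points:          r = {r_all:.2f}")
print(f"without (70, 5):     r = {r_drop_low:.2f}")
print(f"without (80, 23):    r = {r_drop_top_right:.2f}")
```

With this made-up data, dropping the low point raises r sharply, while dropping the top-right point lowers it. That mirrors the reasoning above for deciding which point is actually unusual.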
Now Let's Bring in Fantasy Football!
Example #1 - Do Targets Matter for WRs?
We all can probably figure out that the more yards and catches a WR has, the more fantasy points they will generate. But what about the metric that makes catches and yards possible? We are talking about targets.
1. The first thing I did was collect the top 50 WRs ranked by their Full PPR points per game in 2023, with a minimum of 7 games played. Even though there are definitely more than 50 WRs in the NFL, most of our attention is on the top 50, which is why I chose this size. The reason for a minimum of 7 games played is that 7 is a reasonable sample size in an 18-week season. This ensures that the wide receiver's performance really does match up to their average PPG, avoiding any "one-hit wonders" in our data set.
2. Next, for all of the players that I had on my list, I tracked their average targets per game. Then, I made a table that had two columns, with one being "targets per game" and the other being their "ppg." I also made an additional column with their names so that I wouldn't accidentally count one twice or miss anyone. At last, I input the numbers in.
It is crucial that the right "targets per game" value matches with the right "ppg" value based on the specific wide receiver, as doing so incorrectly would make our correlation inaccurate.
Note: Here's a snippet of the table I made. All tables and scatterplots in this section are made with Google Sheets. It is very easy to work with, so you can find your own correlations for any metrics you are curious about.
3. After our table is ready, we simply highlight our spreadsheet cells, hover over insert, and press chart. If it comes up with something other than a scatterplot, simply go to the 3 dots, press edit, go to chart type, and select scatter chart. After those three steps, I got this:
Let's now analyze the overall correlation between Full PPR PPG and Avg Targets per Game for the Top 50 WRs (with min 7 games played) in the 2023 NFL Season.
Prompt: What is the direction, form, and strength of this correlation? Visually, without removing suspected points, are there any unusual points on this plot here?
Work/Explanation:
Direction - The direction of the correlation between full avg PPR points per game and avg targets per game is positive. This is because as the number of targets per game increases, the avg points per game for these wide receivers also increase.
Form- The form of the correlation is linear, as there is no obvious curve pattern.
Strength- To find strength, we need the value of the correlation coefficient r. We see that R^2 = .645. So, to get r, we simply take the square root and keep the positive root because the direction is positive, thus getting r = .80. Because .80 > .6, this is a strong association.
Unusual Points- There are no unusual points on this scatterplot.
Final Answer: Full PPR Avg Points per Game and Avg Targets per Game for Fantasy Football's 2023 top 50 WRs have a positive, linear, and strong correlation with no unusual points.
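As a quick check of the strength step, here is a minimal Python sketch of going from R^2 back to r. The variable names are made up for this example. The one assumption to watch is the sign: a square root alone is never negative, so the direction of the association has to supply it.

```python
import math

r_squared = 0.645  # R^2 reported on the chart in Example #1
direction = +1     # +1 for a positive association, -1 for a negative one

# The square root gives the size of r; the direction gives its sign.
r = direction * math.sqrt(r_squared)

print(round(r, 2))  # 0.8
```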
Example #2 - Using R^2 for Comparisons
Even though we haven't fully discussed R^2 yet, think of it as the correlation coefficient r's older brother. The r value is merely a measure of the correlation, and nothing more. But once we square r and get R^2, we now have a measure of the variability accounted for by the regression model. In easier terms, we have a % measurement of how effective the indicator, x, is for the results, y. The higher the R^2, the better.
Circled here is the R^2 value. Because it is .645, we can interpret it as 64.5% of the variability in the Top 50 WRs' Full PPR Point Averages being accounted for by the line of best fit, with the explanatory variable being their average targets per game. Therefore, from this data, targets are a good indicator of WR point production.
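For anyone curious where an R^2 value comes from, here is a minimal Python sketch under stated assumptions. The six data points are made up (they are not the actual 2023 WR numbers), and fit_line and r_squared are helper names invented for this example. It fits a least-squares line and measures how much of the variability in y the line accounts for.

```python
def fit_line(xs, ys):
    """Least-squares line of best fit: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys):
    """Share of the variability in y accounted for by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Variability left over after using the line to predict y.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variability of y around its own average.
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

# Made-up numbers for illustration only.
targets_per_game = [5.0, 6.5, 7.0, 8.5, 9.0, 10.5]
ppg = [9.0, 12.5, 11.0, 15.0, 14.0, 18.5]

value = r_squared(targets_per_game, ppg)
print(f"{value:.1%} of the variability in PPG is accounted for by the line")
```

Spreadsheet tools report this same kind of number when a trendline's R^2 is displayed, so the sketch is only meant to show what the figure represents.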
Example #3- Let's See Some More Scatterplots:
3A:
Note: We had 50 WRs in our previous example, but only 35 RBs here. I slimmed the group down because, as fantasy owners, our eyes aren't on as many runningbacks as wide receivers. We typically have more WRs on our team than RBs, since there are fewer relatively high-scoring RBs. So in this case, to keep the data focused on the players that matter to fantasy owners, I reduced the size to only the top 35 RBs.
Prompt: Which was the better indicator of Fantasy point production for the top 35 Runningbacks of the 2023 NFL Season? Rushing Yards Before Contact or Rushing Yards After Contact?
Answer: Because the R^2 value of the "Before" scatterplot is higher, .538 > .285, this metric accounts for 25.3 percentage points more of the variability of RB average points per game, thus making it the better indicator for point production.
3B.
With the same population of RBs, let's now see the roles of Touches per Game and Snap Share on point production.
Prompt: Which was the better indicator of Fantasy point production for the top 35 Runningbacks of the 2023 NFL Season? Touches or Snap Share?
Answer: Because the R^2 value of the "Touches per Game" scatterplot is higher, .383 > .343, this metric accounts for 4 percentage points more of the variability of RB average points per game, thus making it the slightly better indicator for point production.
3C.
Prompt: Let's go back to our population of WRs. In 2023, out of Dropped Passes per Target, Redzone Target Percents, and Avg Yards Before Catch per Game, which of these three metrics is the best indicator for WR point production?
Answer: Out of these three, the relationship with the greatest R^2 value is between Avg PPR Points per Game and Avg Yards Before Catch per Game, making Avg Yards Before Catch per Game the best indicator of WR point production here.
3D.
Now let's go into Quarterbacks. Following the same logic as with the RBs, I reduced the population size to the top 32 QBs, since there are fewer relatively high-performing QBs than WRs.
Prompt: Drops by receivers and interceptions are both QB metrics that decrease point production. But which of these was the better indicator in 2023?
Answer: Because the R^2 value of both scatterplots is equivalent, there is no better indicator in this scenario.
3E.
Prompt: Which was the better indicator of Fantasy point production for the top 32 Quarterbacks of the 2023 NFL Season? IAY/PA or PCT (%)?
Answer: Because the R^2 value of the "PCT (%)" scatterplot is higher, .246 > .022, this metric accounts for 22.4 percentage points more of the variability of QB average points per game, thus making it the better indicator for point production.
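The comparisons in Example #3 all follow the same arithmetic, which can be sketched in Python. The function name compare_indicators is made up for this example, and the R^2 values are the ones quoted in the answers above. The gap is a simple subtraction of the two R^2 values, expressed in percentage points.

```python
def compare_indicators(name_a, r2_a, name_b, r2_b):
    """Return (better_name, gap_in_percentage_points).

    better_name is None when the two R^2 values are equal.
    """
    if r2_a == r2_b:
        return None, 0.0
    gap = round(abs(r2_a - r2_b) * 100, 1)
    better = name_a if r2_a > r2_b else name_b
    return better, gap

# R^2 values quoted in Examples 3A, 3B, and 3E.
print(compare_indicators("Yards Before Contact", 0.538, "Yards After Contact", 0.285))
print(compare_indicators("Touches per Game", 0.383, "Snap Share", 0.343))
print(compare_indicators("PCT (%)", 0.246, "IAY/PA", 0.022))
```

A gap of only a few points, like the one in 3B, is a much weaker case than a gap of 20 or more, so it helps to look at the size of the gap and not just which value is higher.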
End of Example
2 Takeaways for Fantasy Football
1. Great Metrics
For Wide Receivers, don't just look at how many yards or catches they have, but also their targets. As discussed above, targets have a strong and positive correlation with fantasy points, so go for receivers with high amounts of targets.
For Runningbacks, we found that yards before contact were a great indicator of fantasy point production, so aim for RBs high in this category too. This isn't something that's often shown in fantasy apps, but it is easily accessible on many websites.
2. Correlation vs Causation
This wasn't previously discussed, but it is a crucial concept to know about analytics and statistics. Even though a certain metric may be strongly correlated with another, that doesn't mean there is a causation happening.
These correlations may be purely coincidental, which would make them misleading. So for example, even though targets are highly correlated with fantasy points for WRs, we can't go as far as to say that they cause more fantasy points. To wrap up, correlations between two variables are very helpful for grading players, but they certainly don't tell the whole story.