7. Significance Tests for Z and T
Statistical Significance, Method for significance tests, P value
Legit or coincidence? This is the question that significance tests seek to answer, with z tests used for proportions and t tests used for means. Dig into how p-values are calculated and what statistical significance really means, then reconsider your view on Fantasy Football.
A. Statistical Significance and P-value
When we get a result, we don't know whether it reflects something real or whether it happened by chance. To solve this mystery, we perform a significance test to see if the result is statistically significant. A result is statistically significant when the probability of getting a result at least that extreme by chance alone is so low that coincidence is not a convincing explanation.
As an example, let's say that you suspect a factory that manufactures special coins is flawed. The special coins are said to land on tails 30% of the time. To test that, you randomly select 100 coins from the factory and flip each of them once. With a 30% tails rate, we would expect about 30 of them to land on tails. However, out of the 100 coins that you flipped, 70 landed on tails. This is a sample proportion of p̂ = 0.70. If the true probability of tails really is 30%, getting 70 or more tails out of 100 is extremely rare. When calculated, the p-value (short for probability value) is about 1.3 × 10^-18, which is super low. The general rule is that for results to be statistically significant, the p-value should be less than 0.05. As a reminder, statistical significance means that our results are hard to explain by random coincidence alone. We want to limit the role of luck in our conclusions, and when there's far less than a 0.0000001% chance of getting results like these by random chance, that's a good sign that they are legit. So, the factory is likely to be malfunctioning.
On the flip side, let's say that when we flipped the 100 coins, 33 of them landed on tails, giving a sample proportion of p̂ = 0.33. This is more than the stated population proportion of 0.30 (30 tails out of 100), so does this mean that these special coins are flawed? Well, the p-value here is about 0.26. This is much higher than 0.05, meaning that these results are not statistically significant. A p-value of about 26% means there is a high chance that we could have gotten 33 or more tails purely by random coincidence, even if the coins are fine. So, we cannot state that the factory is malfunctioning.
The bottom line is that the lower the p-value, the stronger the evidence that something other than chance is at work, and the more convincing the case for our hypothesis.
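One way to get a feel for what a p-value measures is to simulate the coin factory. The sketch below is only an illustration: the function name, the 10,000 trials, and the fixed seed are assumptions, not part of the original procedure. It repeatedly "flips" 100 coins that truly land on tails 30% of the time and reports how often a batch shows at least as many tails as we observed.

```python
import random

def simulated_p_value(observed_tails, n_coins=100, p_tails=0.30,
                      trials=10_000, seed=1):
    """Share of simulated batches with at least `observed_tails` tails,
    assuming every coin really lands on tails with probability `p_tails`."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    hits = 0
    for _ in range(trials):
        # Flip n_coins fair-to-spec coins and count the tails.
        tails = sum(rng.random() < p_tails for _ in range(n_coins))
        if tails >= observed_tails:
            hits += 1
    return hits / trials

p_33 = simulated_p_value(33)
p_70 = simulated_p_value(70)
print(f"Share of batches with 33+ tails: {p_33:.3f}")
print(f"Share of batches with 70+ tails: {p_70:.3f}")
```

The share for 33 tails should come out well above 0.05, in the same ballpark as the p-value quoted above (a simulation will not match a calculator's Normal-based figure exactly), while 70 or more tails should essentially never show up.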
B. Method for Significance Test for Proportions
We have to perform a methodical procedure that consists of 4 main steps.
1. The first step is "State." Here, we state 2 hypotheses: the null hypothesis, H0, and the alternative hypothesis, Ha. The "0" and "a" are supposed to be subscripts, but the idea is the same. The null hypothesis is the claim that is already given to us, while the alternative hypothesis is what we are investigating. For example, this is the "State" step for our special coin example. (For learning purposes, we are using different values: 80 of the coins landed on tails this time.)
State:
H0: p = 0.30
Ha: p > 0.30
where p = the true population proportion of the special coins in this factory that land on tails.
2. The second step is "Plan." The first half of the "Plan" is to name the type of test. We won't go over every significance test in statistics, but for our purposes, this one is called a 1 proportion Z Test. The second half is to check three conditions. #1 is called random: the sample must be randomly selected. #2 is called independence: the sample size must be at most 10% of the population. #3 is called large counts: the sample size multiplied by the probability from the null hypothesis, and by (1 - probability), must each be at least 10. This can be written as n(p) ≥ 10, n(1-p) ≥ 10.
Plan: We will use a 1 proportion Z Test for this significance test.
Random condition: We randomly selected 100 of the special coins in the factory. This is met.
Independence condition: We selected 100 special coins, so the population of coins must be at least 10 times that, which is 1000. There are surely more than 1000 special coins in the factory. This is met.
Large Counts condition: n(p) ≥ 10, n(1-p) ≥ 10 --> 100(0.3) = 30 ≥ 10, 100(0.7) = 70 ≥ 10. This is met.
All conditions are met.
3. The third step is called "Do." We now calculate our standardized test statistic, which we then use to get our p-value. For this problem, a 1 proportion Z test, we will use this formula:
z = (p̂ − p0) / √(p0(1 − p0) / n)
p̂ = the sample proportion, which in our case is .80. p0 = the population proportion stated in the null hypothesis, which for this example is .30. n is the size of our sample, which is 100.
Do:
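Plugging in our values gives a standardized test statistic of z = (0.80 − 0.30) / √(0.30(0.70) / 100) ≈ 10.91. Because Ha is p > 0.30, the p-value is the area to the right of z = 10.91 under the standard Normal curve, which we can get from a Normal cdf function (normalcdf on many calculators). It comes out to about 5.29 × 10^-28, which is a very low number.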
4. The 4th step is "Conclude." Here, we explain our results by comparing our p-value with the significance level given to us. If no significance level is given, our baseline is 0.05. If the p-value is below the significance level, we reject the null hypothesis and say we have convincing evidence for the alternative. Otherwise, we fail to reject the null hypothesis and say we do not have convincing evidence. In this case, the null hypothesis is that the true proportion of special coins that land on tails is equal to 0.30.
Conclude: Because the p-value of 5.29 × 10^-28 is less than a significance level of 0.05, we reject the null hypothesis H0. We have convincing evidence that the true proportion of all special coins in this factory that land on tails is more than 0.30. Overall, it is very likely that these special coins are incorrectly manufactured.
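The Plan, Do, and Conclude steps above can be sketched in a few lines of Python. This is a minimal sketch, not a full statistics routine: the function name and the returned dictionary are illustrative, and the population size of 10,000 coins is an assumed figure (the example only says the factory surely holds more than 1000). The random condition cannot be checked from numbers, so it is left out.

```python
import math

def one_proportion_z_test(tails, n, p0, population_size, alpha=0.05):
    """Upper-tail 1 proportion Z test (Ha: p > p0) from summary counts."""
    # Plan: the two conditions that can be checked from the numbers alone.
    ten_percent_ok = n <= 0.10 * population_size
    large_counts_ok = n * p0 >= 10 and n * (1 - p0) >= 10

    # Do: standardized test statistic, then the area to its right
    # under the standard Normal curve.
    p_hat = tails / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z)

    # Conclude: compare the p-value with the significance level.
    return {
        "ten_percent_ok": ten_percent_ok,
        "large_counts_ok": large_counts_ok,
        "z": z,
        "p_value": p_value,
        "reject_h0": p_value < alpha,
    }

# 80 tails out of 100 coins, H0: p = 0.30.
# population_size=10_000 is an assumption made only for this illustration.
result = one_proportion_z_test(tails=80, n=100, p0=0.30, population_size=10_000)
print(f"z = {result['z']:.2f}, p-value = {result['p_value']:.3g}")
print("Reject H0" if result["reject_h0"] else "Fail to reject H0")
```

The Normal tail is computed with `math.erfc` rather than `1 - cdf` because subtracting from 1 would round a probability this tiny down to zero.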
C. Method for Significance Tests for Means
Context: We randomly select 100 7th graders and, for each of them, record the maximum number of pushups they can perform. Three years later, when they are 10th graders, we once again record the maximum number of pushups for each of them. The sample mean difference (10th grade minus 7th grade) is +6.2 pushups, meaning that on average they did more pushups as they got older. The sample standard deviation of the differences is 2.3 pushups. We want to find out if there is convincing evidence that students can really do more pushups as they get older.
Likewise, we also utilize 4 steps with a few adjustments.
1. State
H0: μd = 0
Ha: μd > 0
where μd = the population mean difference in maximum pushups performed (10th grade minus 7th grade).
2. Plan
Matched Pairs T-test (different). (Matched Pairs means that our 7th graders and 10th graders are the same people. Instead of simply sampling 100 7th graders and 100 10th graders at the same time, we waited for our 7th graders to grow up and then measured them again. This is crucial to measure growth in performance. It is a "before and after" design.)
Random condition (same): We randomly selected 100 7th graders. This is met.
Independence condition (same): We selected 100 7th graders, so the population of 7th graders must be at least 10 times that, which is 1000. There are surely more than 1000 7th graders in the world. This is met.
Large Sample condition (different): n ≥ 30 --> 100 ≥ 30. This is met, so by the central limit theorem the sampling distribution of the mean difference is approximately Normal.
All conditions are met.
3. Do (different formula):
t* = (x̄d − μd) / (sd / √n)
t* is our test statistic, x̄d (the sample mean difference) is 6.2, μd (the mean difference under H0) is 0, sd (the sample standard deviation of the differences) is 2.3, and n is 100. This gives t* = (6.2 − 0) / (2.3 / √100) ≈ 26.96, with n − 1 = 99 degrees of freedom.
Our p-value is about 1.1 × 10^-47, which is a very low number.
4. Conclude (same concept): Because the p-value of about 1.1 × 10^-47 is less than a significance level of 0.05, we reject the null hypothesis H0. We have convincing evidence that the true population mean difference for pushups between 10th graders and 7th graders is more than 0. Individuals are likely to increase their maximum pushup counts from 7th grade to 10th grade.
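For readers without a "tcdf" button, the matched pairs calculation can be sketched with Python's standard library alone. The function names are illustrative, and the tail area is approximated numerically with Simpson's rule, which is an assumed stand-in for whatever a calculator does internally. Substituting x = √df · tan(θ) turns the right-tail area of the t distribution into a finite integral of cos(θ) raised to the power df − 1.

```python
import math

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for a t distribution with integer `df`.

    Uses Simpson's rule on the integral of cos(theta) ** (df - 1)
    from atan(t / sqrt(df)) to pi / 2. `steps` must be even.
    """
    # Constant in front of the integral: Gamma((df+1)/2) / (sqrt(pi) * Gamma(df/2)).
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(math.pi))
    a, b = math.atan(t / math.sqrt(df)), math.pi / 2
    h = (b - a) / steps
    total = 0.0
    for i in range(steps + 1):
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        # max(..., 0.0) guards against a tiny negative cosine at pi / 2.
        total += weight * max(math.cos(a + i * h), 0.0) ** (df - 1)
    return math.exp(log_c) * total * h / 3

def matched_pairs_t_test(mean_diff, sd_diff, n, mu0=0.0):
    """Upper-tail matched pairs t test from summary statistics."""
    t = (mean_diff - mu0) / (sd_diff / math.sqrt(n))
    df = n - 1
    return t, df, t_upper_tail(t, df)

# Pushup example: mean difference 6.2, standard deviation 2.3, n = 100.
t_stat, df, p_value = matched_pairs_t_test(6.2, 2.3, 100)
print(f"t = {t_stat:.2f}, df = {df}, p-value = {p_value:.3g}")
```

A quick sanity check on the helper: `t_upper_tail(2.228, 10)` should come out close to 0.025, the familiar two-and-a-half percent tail for 10 degrees of freedom.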
Now Let's Apply Fantasy Football!
There's a lot of talk about drafting 2nd-year players fresh off their rookie seasons, mainly because of the increased production they have supposedly shown throughout history. Here, we will actually dive into that. Is the jump between players' rookie and sophomore seasons in Fantasy Football real, or is it just a myth?
To find this out, I first looked at every single NFL draft from 2006 to 2022. The reason I didn't go through 2023 is that we wouldn't have second-year data for those players. With each draft, I listed the first 20 Fantasy Football offensive players selected (exclusively QBs, WRs, RBs, and TEs), ordered by overall draft pick from lowest to highest. The reason I limited it to the top 20 players instead of all players in a draft is that in Fantasy Football, we mostly care about the NFL rookies drafted in rounds 1-2 with lots of hype, not the late-rounders. Then, I randomly took 2 of these 20 players (by running an RNG with boundaries 1-20 twice) and added them to my spreadsheet. If a player I selected didn't play at least 6 games out of the 16 or 17 possible, their per-game stats would not be reliable, so in that case I randomly chose a different player. I repeated this procedure with each NFL draft class through 2022.
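The selection procedure described above could be sketched roughly as follows. Everything in the data is a placeholder: the player labels, the games-played numbers, and the seed are made up for illustration, and the rule that the same player cannot be picked twice is an assumption the description does not spell out.

```python
import random

def sample_draft_class(top20, games_played, rng, picks=2, min_games=6):
    """Randomly pick `picks` players from one draft class, redrawing
    whenever the RNG lands on an ineligible or already-chosen player."""
    eligible = [p for p in top20 if games_played[p] >= min_games]
    if len(eligible) < picks:
        raise ValueError("not enough eligible players in this class")
    chosen = []
    while len(chosen) < picks:
        slot = rng.randint(1, len(top20))  # RNG with boundaries 1-20 (inclusive)
        player = top20[slot - 1]
        if player in chosen or games_played[player] < min_games:
            continue  # redraw: duplicate pick or too few games played
        chosen.append(player)
    return chosen

rng = random.Random(2006)  # arbitrary seed, only for repeatability
sample = []
for year in range(2006, 2023):  # draft classes 2006 through 2022
    # Hypothetical labels standing in for the real top 20 offensive rookies.
    top20 = [f"{year}-pick{i:02d}" for i in range(1, 21)]
    # Made-up games played: every fifth player "missed" most of a season.
    games = {p: (4 if idx % 5 == 0 else 16) for idx, p in enumerate(top20)}
    sample.extend(sample_draft_class(top20, games, rng))

print(len(sample))  # 17 classes x 2 players = 34
```

Drawing 2 players from every class, rather than 34 players from the whole pool, guarantees that each draft year is represented equally in the sample.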
Once all of my players were added to my spreadsheet, I recorded their average PPR fantasy points per game for both their rookie year and their 2nd year. Then, I calculated the difference in PPR points per game between their 2nd season and their 1st season (2nd minus 1st). Here's a snippet of what it looks like:
Prompt: Is it really worth it to draft 2nd-year players in Fantasy Football? From the population of the top 20 Fantasy offensive rookies (ranked by overall pick) in each NFL draft from 2006 to 2022, a sample of 34 players was taken out of the 340 in the population. The sample mean difference between their 2nd season and 1st season in average PPR Points per Game was 2.1576, and the sample standard deviation of the differences was 4.008. Is there convincing evidence that 2nd year PPR points per Game is higher than 1st year PPR points per Game for the NFL's top rookies?
Work/Explanation:
State:
H0: μd = 0
Ha: μd > 0
where μd = the population mean difference in average PPR Fantasy Points per Game (2nd year minus 1st year) for the top 20 fantasy offensive rookies from each draft from 2006 to 2022.
Plan:
We will be performing a Matched Pairs T-test.
Random Condition: A stratified random sample was taken, with 2 of the 20 players randomly chosen from each draft class. Met.
Independence Condition: Our sample contains 34 players out of a population of 340 players. 34 ≤ .10(340) = 34. Met.
Large Sample Condition: sample size of 34 ≥ 30. Met.
Do:
With these values, we get t* = (2.1576 − 0) / (4.008 / √34) ≈ 3.14 with 33 degrees of freedom, which gives a p-value of about .0018.
Note: If your calculator does not have "tcdf," we can still find the p-value from a reference sheet as long as we have our t* and "degrees of freedom." (Degrees of freedom are calculated by subtracting 1 from the sample size n. Here, since 34 players were in our sample, our df is 33.)
This is Table B, an alternative way to find our p-value if we don't have a calculator to do it for us. Here, we start with df and look for the row closest to 33, which is 30. Then, we go across that row to find the value closest to our t*, which is 3.030. Finally, we go up to the top of that column to read the tail probability, .0025. Because our actual t* of 3.14 is slightly bigger than 3.030, we land slightly to the right of that column, so our p-value is somewhere between .0025 and .001.
With Table B, we don't get numbers that are as accurate, but it is still a proper way of finding p-values for significance tests if we don't have access to higher-end calculators.
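The table lookup can be mimicked in code as a small sketch. The row below holds the df = 30 critical values as they appear in standard t tables (only the .0025 column value of 3.030 is quoted in the walkthrough above; the other entries are filled in from the standard table), and the function name is illustrative.

```python
import math
from bisect import bisect_right

# df = 30 row of a standard t table: (critical value, upper-tail probability).
DF30_ROW = [(1.697, 0.05), (2.042, 0.025), (2.457, 0.01), (2.750, 0.005),
            (3.030, 0.0025), (3.385, 0.001), (3.646, 0.0005)]

def bracket_p_value(t, row=DF30_ROW):
    """Return (lower, upper) bounds on the upper-tail p-value for a positive t."""
    critical_values = [c for c, _ in row]
    i = bisect_right(critical_values, t)  # how many critical values t has passed
    upper = row[i - 1][1] if i > 0 else 0.5       # column just left of t
    lower = row[i][1] if i < len(row) else 0.0    # column just right of t
    return lower, upper

# Fantasy Football example: mean difference 2.1576, sd 4.008, n = 34.
t_star = (2.1576 - 0) / (4.008 / math.sqrt(34))
bounds = bracket_p_value(t_star)
print(f"t* = {t_star:.2f}, p-value between {bounds[0]} and {bounds[1]}")
```

Using the df = 30 row for a sample with 33 degrees of freedom errs on the cautious side, since the critical values in a lower-df row are slightly larger.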
Conclusion:
Because the p-value of .0018 is less than a significance level of 0.05, we reject the null hypothesis H0. We have convincing evidence that the true population mean difference in average PPR Fantasy Points per Game between the same players' 2nd seasons and rookie seasons is more than 0. The increase in fantasy point production from the 1st to the 2nd year among top-drafted NFL rookies is statistically significant.
Answer: There indeed is convincing evidence that 2nd year PPR points per Game are higher than 1st year PPR points per Game for the NFL's top rookies.
End of Example
2 Takeaways for Fantasy Football
1. The 2nd years are legit
For early-selected rookies, our data shows a statistically significant improvement in Fantasy Football production in players' 2nd years. Some of the players going into their 2nd years this upcoming season are Bijan Robinson, CJ Stroud, Bryce Young, and Jahmyr Gibbs. So, I would be excited about those players and expect more points from them than they scored the previous year.
2. Don't be misled by headlines. Look for statistical significance.
If you are presented with data that isn't statistically significant, you shouldn't put too much weight on it. You don't know if those results happened by random chance or if there is actually any merit in them.
For instance, your friend may say that older RBs are better in Fantasy Football since Derrick Henry (30 years old) and CMC (28 years old) are very productive. But for his claim to have real merit, he needs more than just 2 RBs in his sample, and he needs to calculate a test statistic, interpret a p-value, and carry out the other steps of a proper significance test.
The bottom line is that there are many tips in the media that are not supported by statistical analysis, so be aware of what kind of information you are getting. Look for statistical significance!