I know, I know. You probably don’t want more number crunching right about now,
but I’m sorry, readership, this is just who I am! I’ve hidden this post from
the front page, though, as it’s really not that important. Since I discussed
the Hershey course in my last post, it was interesting to hear some of the
theories regarding the faster times. Naturally, I wanted to dig deeper into the
analysis and see if I could put together somewhat scientific (but still very
rough) calculations that would help to further examine State Championship
performances.
I took a look at some of the major meets from the Pennsylvania landscape
between 2008 and 2015 (the years of the Poop-Out design) to try and get a sense
of how times were progressing from year to year. At each course, I tracked what
I thought of as key statistics. I considered three “tiers” of elite individual
times and two “tiers” of elite team average times and counted the number of
runners/teams that fell into each bucket. For example, at Hershey I set the
tiers for individuals at Sub 16, Sub 16:30 and Sub 17. For Team Average I used
Sub 16:30 and Sub 17 minutes. From there, I could build out a table such as this:
Year | Sub 16 | Sub 16:30 | Sub 17 | Team Avg Sub 16:30 | Team Avg Sub 17
2008 | 0 | 6 | 43 | 0 | 1
2009 | 2 | 17 | 52 | 0 | 4
2010 | 5 | 24 | 65 | 0 | 2
2011 | 0 | 25 | 75 | 0 | 4
2012 | 4 | 33 | 108 | 2 | 5
2013 | 3 | 23 | 70 | 1 | 3
2014 | 10 | 42 | 118 | 1 | 4
2015 | 12 | 46 | 125 | 2 | 9
As you
can see, there is a fairly consistent increase in almost all performances. The
major outlier is 2013 which, perhaps due to weather, perhaps due to just simple
variation (or perhaps due to the fact that 2013 was the one season in the past
8 I was retired for?), is much lower than the years on either side across the
board.
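For anyone who wants to reproduce the bucket counts, here is a minimal Python sketch of the counting step. The finishing times in `sample_times` are made up purely for illustration (they are not real results), and the helper names are hypothetical. It assumes a tier means strictly faster than the cutoff and that the tiers are cumulative, so a sub 16 also counts as a sub 17.

```python
def to_seconds(mark: str) -> float:
    """Convert an "MM:SS" (or "MM:SS.s") finishing time to seconds."""
    minutes, seconds = mark.split(":")
    return int(minutes) * 60 + float(seconds)


def count_under(times: list[str], cutoff: str) -> int:
    """Count finishing times strictly faster than the cutoff."""
    limit = to_seconds(cutoff)
    return sum(1 for t in times if to_seconds(t) < limit)


def tier_counts(times: list[str], cutoffs: list[str]) -> dict[str, int]:
    """Build one row of the table: a cumulative count for each tier."""
    return {f"Sub {c}": count_under(times, c) for c in cutoffs}


# Hypothetical finishing times, invented for this example only.
sample_times = ["15:48", "16:02", "16:29", "16:30", "16:51", "17:05"]

# Individual tiers used for Hershey in this post.
hershey_cutoffs = ["16:00", "16:30", "17:00"]

counts = tier_counts(sample_times, hershey_cutoffs)
print(counts)
```

Note that the 16:30 in the sample does not count toward the Sub 16:30 tier, since it is not strictly under the cutoff. The same functions would work for the team tiers by passing in team average times instead of individual times.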
Looking
at this table, and the other tables I created, it appeared to me that the best
measure to really focus on was the “Tier 3” performances (for Hershey on the
table above, this would be the “Sub 17” group). It provided a larger year to
year sample that was less likely to be influenced by the elite front running
talent in a given year.
For example, in 2010, we had 5 sub 16s, the third-highest total of any season
in this stretch and more than any year from 2011 through 2013. However, 2011
still produced higher Tier 2 and Tier 3 results than 2010 (as did 2012). In
2010, we had three different national qualifiers individually (Zach Hebda,
Chris Campbell, and Wade Endress). We also had multiple top 20 finishers in the
region and a former Footlocker Finalist, state champ Ryan Gil. In 2011, we had
only one national qualifier (Dustin Wilson), and he did not participate in the
PIAA. Looking back at the results between 2008 and 2013, the link between
national qualifiers and sub 16 runs at Hershey is fairly consistent (things get
pretty wild for ’14 and ’15). Special talents like these will perform well on
any course, almost regardless of specific training, strategy, peaking, etc., so
I decided not to focus on these numbers in quite the same way.
So
using the “Sub 17” counts as my “Y” variable and the years as my “X”, I built
out a simple graph and did a regression. Basically, I tried to come up with a
formula that best represented what we were seeing.
As you can see, the trend is fairly linear and the line of best fit has an R
squared value of 0.8023 (this measures how much of the year to year variation
the line explains; for reference, a perfect fit would be a 1.0). If you remove
2013 from the data, you actually get an R squared of 0.9534. The fit in that
adjusted model is almost perfectly linear and would imply about 142 runners
would have broken 17 on the course in 2016, all else equal. Of course, you
can’t just throw out minor outliers because it makes your data prettier, but
seeing as this isn’t exactly a scientific paper, I did it anyway.
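The fit described here can be checked with a few lines of Python. This is a sketch of an ordinary least squares fit, assuming the calendar year is used directly as X and the Hershey Sub 17 counts as Y. The function and variable names are my own for illustration, and this is not necessarily how the original graph was built.

```python
def linear_fit(xs, ys):
    """Least squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-variable linear fit, R squared is the squared correlation.
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Sub 17 counts at the state meet, copied from the Hershey table.
sub17 = {2008: 43, 2009: 52, 2010: 65, 2011: 75,
         2012: 108, 2013: 70, 2014: 118, 2015: 125}

years = list(sub17)
slope_all, _, r2_all = linear_fit(years, [sub17[y] for y in years])

# Same fit with the 2013 outlier dropped.
kept = [y for y in years if y != 2013]
slope_adj, intercept_adj, r2_adj = linear_fit(kept, [sub17[y] for y in kept])

print(f"all years:    slope={slope_all:.3f}  R^2={r2_all:.4f}")
print(f"without 2013: slope={slope_adj:.3f}  R^2={r2_adj:.4f}")
print(f"2016 projection without 2013: {slope_adj * 2016 + intercept_adj:.1f}")
```

Run as is, this reproduces the R squared values quoted for both versions of the fit, and the 2016 extrapolation without 2013 lands in the low 140s.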
So the
next logical question was, how does this compare to other courses? Take for
example one of the deepest and most competitive running districts in
Pennsylvania, the District One Championships at Lehigh. This meet is run at the
same site each year with the same selection of teams. Just like with the state
course, I added all classification results together to try and eliminate the
influence the move from 2 to 3 classifications would have had. I also adjusted
the time thresholds as a sub 17 at Hershey is not the same as a sub 17 at
Lehigh. That produced this table:
Year | Sub 15:30 | Sub 16 | Sub 16:30 | Team Avg Sub 16 | Team Avg Sub 16:30
2008 | 4 | 34 | 83 | 2 | 16
2009 | 0 | 5 | 19 | 0 | 2
2010 | 2 | 11 | 44 | 0 | 3
2011 | 1 | 11 | 39 | 0 | 3
2012 | 5 | 23 | 54 | 2 | 6
2013 | 6 | 32 | 60 | 1 | 9
2014 | 4 | 11 | 38 | 0 | 4
2015 | 3 | 18 | 49 | 1 | 5
And
this graph:
Now 2008 was a very bizarre year. It’s easily the fastest in the course’s
history over that stretch and it’s not exactly clear to me why. Yes, it is
believed that the course in 2006, 2007 and 2008 was shorter (the start line was
farther forward, although I thought they made up for this by running farther
around the statue; I can’t remember for sure), but 2006 and 2007 were two of
District 1’s most dominant years (in 2006 the District took 6 of the top 7 team
spots and in 2007 the District took the first 6 individual spots) and 2008 was,
quite frankly, a tough one to swallow. Maybe they just left it all on the
course at Lehigh that year or maybe it was better paced (a bit more even rather
than the wild low 4:40s of previous years?). Regardless, I thought it was
definitely worth taking a look at the graph without 2008 in there.
That gets you here, but we are still looking at a much less linear progression. If you normalize this graph so the average from 2009 to 2015 is the same as the average for Hershey, the slope of the regression line is about 6.698 vs. the 11.333 (12.594 without 2013) that we saw for Hershey Parkview. That means the number of people in Tier 3 at Hershey has been growing roughly 1.7 to 1.9 times as fast as the number at Lehigh. So although both are increasing (due to the expected natural progression of runners over time), there seems to still be something extra in the water at Hershey.
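The normalization step can be sketched in Python as well. The idea is to multiply every Lehigh count by one constant so that the Lehigh average matches the Hershey average, then fit a line to the rescaled counts. One assumption to flag loudly: the Hershey average here is taken over all eight years (2008 through 2015), since that is the choice that reproduces the 6.698 slope. The names below are hypothetical.

```python
def slope_of(xs, ys):
    """Slope of the least squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Tier 3 counts copied from the two tables in this post.
hershey_sub17 = {2008: 43, 2009: 52, 2010: 65, 2011: 75,
                 2012: 108, 2013: 70, 2014: 118, 2015: 125}
# Lehigh Sub 16:30 counts with the odd 2008 season left out.
lehigh_sub1630 = {2009: 19, 2010: 44, 2011: 39, 2012: 54,
                  2013: 60, 2014: 38, 2015: 49}

# Assumption: Hershey's average covers all eight seasons, 2008-2015.
hershey_mean = sum(hershey_sub17.values()) / len(hershey_sub17)
lehigh_mean = sum(lehigh_sub1630.values()) / len(lehigh_sub1630)

# One constant factor applied to every year keeps the shape of the trend.
scale = hershey_mean / lehigh_mean
years = list(lehigh_sub1630)
scaled = [lehigh_sub1630[y] * scale for y in years]

raw_slope = slope_of(years, [lehigh_sub1630[y] for y in years])
scaled_slope = slope_of(years, scaled)

print(f"raw Lehigh slope:    {raw_slope:.3f}")
print(f"scaled Lehigh slope: {scaled_slope:.3f}")
```

Scaling by a constant changes the slope but not the R squared, which is why the normalization only matters when comparing how steep the lines are.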
Worth noting, I also did a quick check for the District 7 Championships at
Cooper’s Lake. Their numbers have actually been decreasing over time, with
their peak stretch coming from 2008 to 2010. Removing 2011 (a dreadful day
where no one broke 17 minutes in any race), the progression has been strongly
negative (slope of -3.726 with 0.8659 R squared). I checked the PCL
championships at Belmont Plateau as well. That one showed almost no linear
trend at all, as the R squared was 0.0081. Conversely, the Mid Penn
Championships at Big Spring had the best positive correlation out of any
non-Hershey table I put together, with an R squared of 0.6938. When I adjusted
their numbers to the same average as Hershey, I got a positive slope of 20.007
(Hershey’s, you might remember, was about 11).
Now all of these non-state meets I listed have one thing in common: they bring
the same set of teams every year. That’s a good thing in my opinion, as you
don’t have to worry as much about variables for program differences, school
size differences, etc. But I wanted to get an invitational in the mix as well
just for fun. So I took a look at the Carlisle Reebok Challenge results over
time and compared them to the merged Foundation Invitational (held at Hershey)
results.
Here's
Carlisle’s graph of Sub 17s:
And here is Foundation:
Most of the time these invites fall at about the same point in the season and
attract similar talent. The Hershey course is still showing much steeper and
much more linear improvement on a year to year basis. Now it’s impossible to
deny all the outside variables that are interfering with this calculation. For
example, the Foundation Invite was relatively new in 2008 and the competition
there in 2009 and 2010 wasn’t particularly jaw dropping. Carlisle was the
premier Invite in PA for a long time, especially during that same 2008-2010
window. So it is only natural that the talent at Foundation would progress
better over time than that at Carlisle, regardless of how effective runners
become at navigating the Hershey hills. But all the same, I found it
interesting to see that even the Foundation Invite has an encouraging R squared
value of 0.7359 (which is higher than every improving meet I checked out
besides the state meet).
One
last chart to share and then I’ll be done.
Ok so let me explain this one. This is based on the sum of all the “Tier 3”
counts for the meets I analyzed outside of Hershey (that would be D1, D7, PCLs,
Mid Penns and Carlisle). I also scaled the totals by the same factor in every
year so that the average would equal the average Tier 3 total for the state
meet during the same period (so we could compare the slope in addition to the
correlation). Now obviously this is far from an exact statistical process
(throwing together totals from a variety of meets is probably not the most
efficient way to do things, but it was certainly the easiest), but I just
wanted to get a feel for how things looked.
If you look at this graph, you can see the R squared is just 0.0118. That’s
fairly random. And there doesn’t seem to be any real trend up or down in the
numbers (slope of just 0.8692). As a reminder, the State Graph had a slope of
about 11. That would imply that the number of fast times at the state meet is
growing significantly faster than the average across the other meets in the
state.
Again,
this is far from any type of scientific statistical exercise. Feel free to
point out some or all of those flaws in the comment section after you have had
a chance to read the post. But I just figured I might as well share what I saw
with you guys because I found it kind of interesting.
So I
ask again … is the Hershey Course … Too Easy?