etrain11: the aftermath: The Official Unofficial State Course Analysis Post

By Jarrett Felix

I know, I know. You probably don’t want more number crunching right about now, but I’m sorry, readership, this is just who I am! I’ve hidden this post from the front page, though, as it’s really not that important. After I discussed the Hershey course in my last post, it was interesting to hear some of the theories regarding the faster times. Naturally, I wanted to dig deeper into the analysis and see if I could put together somewhat scientific (but still very rough) calculations that would help to further examine State Championship performances.

I took a look at some of the major meets from the Pennsylvania landscape between 2008 and 2015 (the years of the Poop-Out design) to try and get a sense of how times were progressing over time. At each course, I tracked what I thought of as key statistics: I considered three “tiers” of elite individual times and two “tiers” of elite team average times and counted the number of runners/teams that fell into each bucket. For example, at Hershey I set the tiers for individuals at Sub 16, Sub 16:30 and Sub 17. For Team Average I used Sub 16:30 and Sub 17 minutes. From there, I could build out a table such as this:

Year	Sub 16	Sub 16:30	Sub 17	Team Avg Sub 16:30	Team Avg Sub 17
2008	0	6	43	0	1
2009	2	17	52	0	4
2010	5	24	65	0	2
2011	0	25	75	0	4
2012	4	33	108	2	5
2013	3	23	70	1	3
2014	10	42	118	1	4
2015	12	46	125	2	9

As you can see, there is a fairly consistent increase in almost all performances. The major outlier is 2013 which, perhaps due to weather, perhaps due to just simple variation (or perhaps due to the fact that 2013 was the one season in the past 8 I was retired for?), is much lower than the years on either side across the board.

Looking at this table, and the other tables I created, it appeared to me that the best measure to really focus on was the “Tier 3” performances (for Hershey on the table above, this would be the “Sub 17” group). It provided a larger year to year sample that was less likely to be influenced by the elite front running talent in a given year.
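The tier counting described above can be sketched in a few lines of Python. This is only a rough illustration: the finishing times below are made up for the example (they are not real results), and the helper names `to_seconds` and `tier_counts` are my own. The one design point worth noting is that the tiers are cumulative, so a sub 16 runner also counts toward Sub 16:30 and Sub 17, which is how the table reads.

```python
def to_seconds(mark):
    """Convert a 'MM:SS' or 'MM:SS.s' mark into seconds."""
    minutes, seconds = mark.split(":")
    return int(minutes) * 60 + float(seconds)


def tier_counts(times, cutoffs):
    """Count how many times fall strictly under each cutoff (cumulative tiers)."""
    secs = [to_seconds(t) for t in times]
    return {c: sum(1 for s in secs if s < to_seconds(c)) for c in cutoffs}


# HYPOTHETICAL finishing times, invented purely to show the bucketing.
example_times = ["15:48", "16:05", "16:29", "16:30", "16:51", "17:02"]

# Hershey individual tiers from the post: Sub 16, Sub 16:30, Sub 17.
hershey_cutoffs = ["16:00", "16:30", "17:00"]

counts = tier_counts(example_times, hershey_cutoffs)
print(counts)
```

Note that a 16:30 flat is not “sub 16:30” here, so it only lands in the Sub 17 bucket. The team average tiers would work the same way, just fed with team averages instead of individual times.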

For example, in 2010, we had 5 sub 16s, the 3rd highest total of any season in this stretch and more than any year from 2011 through 2013. However, 2011 still produced higher Tier 2 and Tier 3 results than 2010 (as did 2012). In 2010, we had three different national qualifiers individually (Zach Hebda, Chris Campbell, and Wade Endress). We also had multiple top 20 finishers in the region and a former Footlocker Finalist, state champ Ryan Gil. In 2011, we had only one national qualifier (Dustin Wilson), and he did not participate in the PIAA. Looking back at the results between 2008 and 2013, the link between national qualifiers and sub 16 runs at Hershey is fairly consistent (things get pretty wild for ’14 and ’15). Special talents like these will perform well on any course, almost regardless of specified training, strategy, peaking, etc., so I decided not to focus on these numbers in quite the same way.

So using the “Sub 17” counts as my “Y” variable and the years as my “X”, I built out a simple graph and did a regression. Basically, I tried to come up with a formula that best represented what we were seeing.

As you can see, the trend is fairly linear and the line of best fit has an R squared value of 0.8023 (this measures how closely the line fits the data; for reference, a perfect fit would be a 1.0). If you remove 2013 from the data, you actually get an R squared of 0.9534. The adjusted model is almost perfectly linear and would imply about 142 runners would have broken 17 on the course in 2016, all else equal. Of course, you can’t just throw out minor outliers because it makes your data prettier, but seeing as this isn’t exactly a scientific paper, I did it anyway.
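For anyone who wants to check the fit, here is a minimal sketch of the regression in plain Python, using the Sub 17 column from the Hershey table. I’m assuming an ordinary least squares line with the year as X; the function name `linear_fit` is just something I made up for the example, and a spreadsheet trendline should give the same thing.

```python
# Sub 17 counts at the state meet, copied from the Hershey table above.
hershey_sub17 = {2008: 43, 2009: 52, 2010: 65, 2011: 75,
                 2012: 108, 2013: 70, 2014: 118, 2015: 125}


def linear_fit(points):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    xs, ys = list(points.keys()), list(points.values())
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # squared correlation for a one-variable fit
    return slope, intercept, r_squared


# Fit on all eight seasons, then again with the 2013 outlier dropped.
slope_all, intercept_all, r2_all = linear_fit(hershey_sub17)
no_2013 = {yr: n for yr, n in hershey_sub17.items() if yr != 2013}
slope_adj, intercept_adj, r2_adj = linear_fit(no_2013)

# Extrapolate the adjusted line one season forward (all else equal).
projected_2016 = intercept_adj + slope_adj * 2016

print(f"all years:    slope={slope_all:.3f}  R^2={r2_all:.4f}")
print(f"without 2013: slope={slope_adj:.3f}  R^2={r2_adj:.4f}")
print(f"projected 2016 Sub 17 count: {projected_2016:.1f}")
```

Run as written, this should reproduce the slopes and R squared values quoted in the post, with the 2016 projection landing in the low 140s.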

So the next logical question was, how does this compare to other courses? Take, for example, the championship meet of one of the deepest and most competitive running districts in Pennsylvania: the District One Championships at Lehigh. This meet is run at the same site each year with the same selection of teams. Just like with the state course, I added all classification results together to try and eliminate the influence the move from 2 to 3 classifications would have had. I also adjusted the time thresholds, as a sub 17 at Hershey is not the same as a sub 17 at Lehigh. That produced this table:

Year	Sub 15:30	Sub 16	Sub 16:30	Team Avg Sub 16	Team Avg Sub 16:30
2008	4	34	83	2	16
2009	0	5	19	0	2
2010	2	11	44	0	3
2011	1	11	39	0	3
2012	5	23	54	2	6
2013	6	32	60	1	9
2014	4	11	38	0	4
2015	3	18	49	1	5

And this graph:

Now 2008 was a very bizarre year. It’s easily the fastest in the course’s history over that stretch and it’s not exactly clear to me why. Yes, it is believed that the course in 2006, 2007 and 2008 was shorter (the start line was farther forward, although I thought they made up for this by running farther around the statue; I can’t remember for sure). But 2006 and 2007 were two of District 1’s most dominant years (in 2006 the District took 6 of the top 7 team spots and in 2007 the District took the first 6 individual spots), while 2008 was, quite frankly, a tough one to swallow. Maybe they just left it all on the course at Lehigh that year, or maybe it was better paced (a bit more even rather than the wild low 4:40s of previous years?). Regardless, I thought it was definitely worth taking a look at the graph without 2008 in there.

That gets you here, but we are still looking at a much less linear progression. If you normalize this graph so the Lehigh average from 2009 to 2015 is the same as the average for Hershey, the slope of the regression line is about 6.698 vs. the 11.333 (12.594 without 2013) that we saw for Hershey Parkview. That means the number of people in Tier 3 at Hershey has been growing at roughly 1.7 to 1.9 times the rate of those at Lehigh. So although both are increasing (due to the expected natural progression of runners over time), there seems to still be something extra in the water at Hershey.
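The normalization above can be sketched like this. I’m making two assumptions here: that “normalizing” means multiplying every Lehigh count by one constant so the averages match, and that the Hershey average in question is the full 2008 to 2015 Sub 17 mean (82). With those assumptions the numbers line up with the 6.698 quoted in the post. The helper name `ols_slope` is mine.

```python
# Tier 3 counts from the two tables (Lehigh with 2008 removed, as discussed).
lehigh_sub1630 = {2009: 19, 2010: 44, 2011: 39, 2012: 54,
                  2013: 60, 2014: 38, 2015: 49}
hershey_sub17 = {2008: 43, 2009: 52, 2010: 65, 2011: 75,
                 2012: 108, 2013: 70, 2014: 118, 2015: 125}


def ols_slope(points):
    """Least squares slope of count against year."""
    xs, ys = list(points.keys()), list(points.values())
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# ASSUMPTION: scale Lehigh by a single factor so its mean equals Hershey's mean.
hershey_mean = sum(hershey_sub17.values()) / len(hershey_sub17)
lehigh_mean = sum(lehigh_sub1630.values()) / len(lehigh_sub1630)
scale = hershey_mean / lehigh_mean

# Scaling every count by a constant scales the fitted slope by the same constant.
raw_slope = ols_slope(lehigh_sub1630)
normalized_slope = raw_slope * scale
ratio = ols_slope(hershey_sub17) / normalized_slope

print(f"raw Lehigh slope:        {raw_slope:.3f}")
print(f"normalized Lehigh slope: {normalized_slope:.3f}")
print(f"Hershey / Lehigh ratio:  {ratio:.2f}")
```

Because the adjustment is a single multiplier, it changes the slope but not the R squared, which is why the two courses can be compared on slope after normalizing while the correlation stays whatever it was.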

Worth noting, I also did a quick check for the District 7 Championships at Cooper’s Lake. Their numbers have actually been decreasing over time, with their peak stretch coming from 2008 to 2010. Removing 2011 (a dreadful day where no one broke 17 minutes in any race), the progression has been strongly negative (slope of -3.726 with an R squared of 0.8659). I checked the PCL championships at Belmont Plateau as well. These showed almost no linear trend at all, as the R squared was 0.0081. Conversely, the Mid Penn Championships at Big Spring had the best correlation out of any non-Hershey table I put together, with a strongly positive relationship and an R squared of 0.6938. If you adjust their numbers to the same average as Hershey, you get a positive slope of 20.007 (Hershey’s, you might remember, was about 11).

Now all of these non-state meets I listed have one thing in common: they bring the same set of teams every year. That’s a good thing in my opinion, as you don’t have to worry as much about variables for program differences, school size differences, etc. But I wanted to get an invitational in the mix as well just for fun. So I took a look at the Carlisle Reebok Challenge results over time and compared them to the merged Foundation Invitational (held at Hershey) results.

Here's Carlisle’s graph of Sub 17s:

And here is Foundation:

Most of the time these invites are held at about the same point in the season and attract similar talent. The Hershey course is still showing much steeper and much more linear improvement on a year to year basis. Now it’s impossible to deny all the outside variables that are interfering with this calculation. For example, the Foundation Invite was relatively new in 2008 and the competition there in 2009 and 2010 wasn’t particularly jaw dropping. Carlisle was the premier Invite in PA for a long time, especially during that same 2008-2010 window. So it is only natural that the talent at Foundation would progress better over time than that at Carlisle, regardless of how effective runners become at navigating the Hershey hills. But all the same, I found it interesting to see that even the Foundation Invite has an encouraging R squared value of 0.7359 (which is higher than every meet I checked out besides the state meet).

One last chart to share and then I’ll be done.

Ok so let me explain this one. This is based on the sum of all the “Tier 3” counts for the meets I analyzed outside of Hershey (that would be D1, D7, PCLs, Mid Penns and Carlisle). I also scaled the totals by the same weight across all years so that the average would equal the average Tier 3 total for the state meet during the same period (so we could compare the slope in addition to the correlation). Now obviously this is far from an exact statistical process (throwing together totals from a variety of meets is probably not the most efficient way to do things, but it was certainly the easiest), but I just wanted to get a feel for how things looked.

If you look at this graph, you can see the R squared is just 0.0118. That’s fairly random. And there doesn’t seem to be any real trend up or down in the numbers (slope of just 0.8692). As a reminder, the State Graph had a slope of about 11. That would imply that the number of fast times at the state meet is climbing significantly faster than the average results across the state.

Again, this is far from any type of scientific statistical exercise. Feel free to point out some or all of its flaws in the comment section after you have had a chance to read the post. But I just figured I might as well share what I saw with you guys because I found it kind of interesting.

So I ask again … is the Hershey Course … Too Easy?
