Do The Thrashers Have Large Talons?

Wednesday, April 11, 2007

Playoff Prediction Model

In a previous post I discussed how I gathered up data from 75 playoff series from the five previous playoff years. In that post I provided some basic descriptive statistics of which factors were the best predictors.

I've taken the next step and put all of those factors into a multivariate regression model in order to assess the relative value of each of these factors. For the non-statistical readers, regression is a great tool because it allows us to see which factors are most important in relation to other factors in explaining something (in this case, which teams won the playoff series).

My overall model has an R-squared of .31 and three variables are significant (Shot %, Power Play Opportunities and Goals Against Average). In plain English, the model indicates that a) there is a lot of random, crazy, unpredictable stuff that happens in the playoffs, but b) there is also a predictable element. Three things (Shot %, GAA, and PP chances) are the core of that predictable element.

After running my model on recent playoff history I then used the coefficients to try to predict each playoff series. Keep in mind the model can only tell us about the predictable part of the playoffs; there will always be a significant random/chance/luck element. To the extent that regular season numbers can guide us, the model makes the following predictions.

Who Will Win Each Series?
Eastern Conference
ATL 95% NYR 5%
BUF 100% NYI 0%
NJD 51% TBL 49%
OTT 46% PIT 53%

Western Conference
DET 41% CGY 58%
ANA 100% MIN 0%
VAN 0% DAL 100%
NAS 31% SJS 69%

Quarter Finals
BUF 70% PIT 30%
NJD 37% ATL 63%
ANA 42% CAL 58%
SJS 73% DAL 27%

Conference Finals
BUF 91% ATL 9%
SJS 65% CAL 34%

Stanley Cup Finals
BUF 48% SJS 52%


Unfortunately the model is not much help in predicting how long each series will go. When I ran a regression on number of games in a series nothing came up significant and the model explained zero variance.
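
For readers curious how regression coefficients turn into percentages like the ones listed above, here is a minimal Python sketch. Everything in it is an assumption for illustration: the actual coefficients are not given in this post, so the intercept and weights below are invented, and the indicator names are hypothetical labels for the three significant variables. One thing the sketch does show reliably is that a straight-line model fitted to a 0/1 outcome is not confined to the 0-1 range, so its raw output has to be clipped, which is one way figures such as 100% and 0% can arise.

```python
# Invented numbers for illustration only; NOT the coefficients of the model
# described in this post (those are not published here).
INTERCEPT = -0.0625
COEFS = {
    "better_shot_pct": 0.5,
    "more_pp_opportunities": 0.375,
    "better_gaa": 0.25,
}


def win_probability(indicators):
    """Linear prediction from 0/1 indicators, clipped to the range [0, 1]."""
    raw = INTERCEPT + sum(
        weight * indicators.get(name, 0) for name, weight in COEFS.items()
    )
    return min(1.0, max(0.0, raw))


# A team that is better on all three measures, and its opponent (worse on all).
sweep = win_probability(
    {"better_shot_pct": 1, "more_pp_opportunities": 1, "better_gaa": 1}
)
swept = win_probability({})

# A closer matchup: one team leads on two measures, the other on one.
favorite = win_probability({"better_shot_pct": 1, "better_gaa": 1})
underdog = win_probability({"more_pp_opportunities": 1})

print(f"lopsided series: {sweep:.0%} vs {swept:.0%}")
print(f"closer series:   {favorite:.0%} vs {underdog:.0%}")
```

With these made-up weights the lopsided matchup has a raw score above 1 for one team and below 0 for the other, so clipping reports it as a certainty. A logistic model would avoid that by construction.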

Edit: How well did it work in the past?
OK, I went back and ran the model to see how well it did in the past 5 playoff seasons. Of course, since I used these years to create the model, it had better do something, even with the small R-squared.

What is a good point of comparison? Well, if we picked playoff series by a coin toss we would expect to be right just 50% of the time. If we went with the home team every time we would get 68% of the series right.

Total Playoff Series Correct Predictions
2001 11/15 series 73% Using Home Ice 10/15 66% Difference +7%
2002 13/15 series 87% Using Home Ice 13/15 87% Difference 0%
2003 09/15 series 60% Using Home Ice 09/15 60% Difference 0%
2004 11/15 series 73% Using Home Ice 11/15 73% Difference 0%
2006 11/15 series 73% Using Home Ice 08/15 53% Difference +20%
Total 55/75 series 73% Using Home Ice 51/75 68% Difference +5%

So the model is about the same as simply using home ice until last year's playoffs, where it performed much better. Why? Perhaps it is because the playoffs were called much more like the regular season, thus making regular season statistics more useful. I noticed that scoring declined from regular season levels by roughly 15% in previous years, but last year playoff scoring barely dropped at all (about 3%).
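
As a quick arithmetic check, here is a minimal Python sketch that adds up the per-season rows listed above. The variable names are illustrative; the only inputs are the correct-pick counts given for each year, with 15 series per playoff season.

```python
# Correct picks per season: (model, home ice), copied from the per-year rows.
seasons = {
    2001: (11, 10),
    2002: (13, 13),
    2003: (9, 9),
    2004: (11, 11),
    2006: (11, 8),
}
SERIES_PER_SEASON = 15


def hit_rate(correct, total):
    """Share of series picked correctly, as a whole-number percentage."""
    return round(100 * correct / total)


model_total = sum(model for model, _ in seasons.values())
home_total = sum(home for _, home in seasons.values())
total_series = SERIES_PER_SEASON * len(seasons)

for year, (model, home) in seasons.items():
    print(
        f"{year}: model {hit_rate(model, SERIES_PER_SEASON)}%, "
        f"home ice {hit_rate(home, SERIES_PER_SEASON)}%"
    )

print(f"model:    {model_total}/{total_series} = {hit_rate(model_total, total_series)}%")
print(f"home ice: {home_total}/{total_series} = {hit_rate(home_total, total_series)}%")
```

Summed this way, the model gets 55 of 75 series and home ice gets 51 of 75, which is where the 73% and 68% figures come from.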

Additional Edit:
The Dependent Variable is won/lost playoff series coded 1/0.
Independent Variables are:
Which team has home ice?
Which team has better offense?
Which team has better defense?
Which team has better PP%?
Which team has better PK%?
Which team has better SV%?
Which team has better Shot %?
Which team has more PP Opportunities?
Which team has fewer Times Shorthanded?
Which team has better goal differential?
Which team has better special teams goal differential?
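
To make the "which team has better..." coding concrete, here is a minimal Python sketch of how one series could be turned into two rows of 0/1 indicators, one row per team. The stat names and the numbers for the two teams are invented for illustration and are not taken from the data set; only three of the eleven variables are shown. Note that for some stats (GAA, times shorthanded) the lower value is the better one.

```python
# Stats where a LOWER value counts as better (hypothetical key names).
LOWER_IS_BETTER = {"gaa", "times_shorthanded"}


def better_flags(team, opponent):
    """Return 1 for each stat where `team` beats `opponent`, else 0."""
    flags = {}
    for stat, value in team.items():
        other = opponent[stat]
        if stat in LOWER_IS_BETTER:
            flags[stat] = int(value < other)
        else:
            flags[stat] = int(value > other)
    return flags


# Invented regular-season numbers for two hypothetical teams.
team_a = {"shot_pct": 10.1, "pp_opportunities": 420, "gaa": 2.60}
team_b = {"shot_pct": 9.4, "pp_opportunities": 445, "gaa": 2.85}

# Each series contributes two rows: one from each team's point of view.
flags_a = better_flags(team_a, team_b)
flags_b = better_flags(team_b, team_a)

print("Team A:", flags_a)
print("Team B:", flags_b)
```

A tie on a stat gives both teams a 0 in this sketch; how ties were handled in the real data is not stated.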


  • Something tells me that Islander and Canuck fans aren't too excited about those numbers. I'll be fascinated to see how accurate this is.

    By Blogger Jennifer, at 1:34 PM  

  • Have you tried generating these numbers based on previous years' regular season data and seeing how accurate they came out? I am gonna go ahead and guess the computer is wrong on Edmonton, but I am interested in where Carolina, Buffalo, Ottawa, and Detroit were supposed to end up.

    By Blogger edgordon, at 2:57 PM  

  • Hey this is pretty cool analysis (this coming from someone who uses statistics daily). Any chance you'd be willing to share your model? Is it an excel file or something more advanced? If you don't mind sharing, my email is tharptx at yahoo. If you don't want to share, thanks anyways for the post. -TH

    By Anonymous tharp, at 4:15 PM  

  • BUF 100% NYI 0%
    ANA 100% MIN 0%
    VAN 0% DAL 100%

    You have 3 100% series in that set. I'm not sure about you, but I'd be willing to bet a lot that at least one of those will go the other way.

    If you include
    ATL 95% NYR 5%
    You're stating the probability that BUF, ANA, DAL and ATL win their series is 95%!

    Would it be possible for me to take a look at your data table?
    e-mail them to me if you like:

    By Blogger JavaGeek, at 4:42 PM  

  • I was just wondering, like, are you fucking retarded?



    By Blogger Sens, at 4:51 PM  

  • javageek:

    I agree that those probabilities seem a bit high to me too, but that's what the output shows.

    Sens: Wow, and I thought Buffalo fans were annoying and classless.

    By Blogger The Falconer, at 5:32 PM  

  • What's the dependent variable? I don't get this. There are three predictors (S%, GAA, and PPA). For any series, there are two teams, so all 6 stats (3 per team) are included as predictors. But I can't figure out what the dependent variable is? Is it simply "who won/lost" coded 0/1 or something like that? Can you explain?

    By Blogger Scott, at 7:31 PM  

  • I'll post the Dep. Var and Ind. Var in the thread.

    By Blogger The Falconer, at 7:38 PM  

  • Did you do a weighted regression? For a 0/1 d.v. that would be appropriate.

    In coding your i.v.s you've actually thrown away information that might be useful. You've used "which team has better...," variables; the actual (quantitative) differences between the teams should be more informative. This would force an analysis in the vein of a logistic regression (which is more appropriate for this circumstance).

    Finally: Based on what I read, your model is built solely on playoff data. Yet you use regular season i.v. values to make this season's predictions for the playoffs. That switcheroo requires statistical justification.

    Hate to be annoying. I'm classless too I guess - at least that's what you've told me!

    By Blogger Scott, at 8:18 PM  

  • Scott:
    A) I probably will use logit in the future, but for the moment I'm more familiar with regression and thankfully it is very robust.

    B) I could rerun the model using variable values instead of dummies, but it introduces some other problems such as variation in league offense, which changes from year to year, so it might require standardization to make the years comparable. A 2.65 GAA might be great one year but not in another year. I might recode the data to express the values in relative terms such as: Is Team A's defense 10% or 15% superior to Team B's?

    C) My model is NOT built on playoff data but on regular season data. I'm asking the question: To what degree does the regular season data predict post-season winners? For example, I use regular season data from 2006 and look at how it relates to playoff winners in 2006. Once I've established that there is some connection between the two, I then use the coefficients to forecast playoff winners for 2007, which is an appropriate forecasting application.

    Hope this answers your questions. I didn't explain very much in the post because I didn't want to scare off all my readers with technical stuff.

    By Blogger The Falconer, at 9:42 PM  

  • You mean you didn't use a logistic or logit model? I would suspect these predictions are off. I agree with the previous poster...have you tried using last season's reg. season data to predict the SC champ?

    I would also like to see the data.

    You can send to:


    By Blogger John, at 11:48 AM  

  • I have a strange feeling a good percentage of your p-values in this regression are greater than 0.5 (and all but one are probably greater than 0.05). This means you cannot tell whether those variables are helping explain the variation or not. Are you chasing randomness?

    I did the same regression with data from 1995-1996 to 2002-2003 (7 seasons - 105 data points). I looked at all possible variable combinations (automatically with software).

    I then chose the option with the smallest Mallows' Cp; it had an adjusted R-sq of 8.7%.
    It included 3 very standard variables that you would expect and that makes sense: offense, defense and SV%.
    Home win % =
    Offense was worth 30%
    Defense was worth 20%
    Goaltending was worth 16%
    Home ice advantage was worth 19%.
    The nice thing is that this matches what many analysts who have watched hockey for decades believe.

    If you prefer logistic regressions:
    Hpct = HomeGF^2/(HomeGF^2+HomeGA^2)
    Apct = AwayGF^2/(AwayGF^2+AwayGA^2)
    home team odds =

    These models are still poor as they have about 25 data points per variable. (Yours had about 6 data points per variable by my count).

    By Blogger JavaGeek, at 3:52 PM  

  • java:
    Sorry, I meant to email you back, but I've been terribly busy the last several days. I'll post this for any interested parties who are lurking out there.

    re: variables in the model. The adjusted R^2 is lower, of course, because I have so many variables. If I drop everything except the significant variables, the adjusted and non-adjusted R^2 become much closer.

    My training in statistics comes from the social sciences and we were lectured never to use step-wise regression--theory should always guide your selection of variables for your model so I presented the model with everything thrown in as I've been trained.

    However, if you're trying to estimate/forecast it is more reasonable to prune down the model to just the relevant variables.

    Honestly, I'm not too worried about over determination in the model because I have 150 cases (two teams involved in each series, needed to measure the effect of home ice). I had 11 variables and 150 cases, so roughly 13 cases per variable. Of course I'd love to retest on more cases when I get the time to gather data.

    I would agree that the predictive value of the model is rather modest at this point. Home Ice alone predicts 68% of series winners and the combined model only gets 73% so an improvement of just 5%. Still it is better than just using home ice.

    Maybe all I've done is curve fitting--which is one of the reasons I put this up before the playoffs to see how well the model performs. If I've just been curve fitting a few series that home ice advantage didn't predict then I expect that my model will crash and burn this playoff season. But half the fun of trying to predict the future is taking risks. Should the model fare poorly this playoff season I might just learn something in the process, which is valuable in and of itself.

    By Blogger The Falconer, at 5:48 PM  
