Thursday 19 July 2012

Compiling Match Odds (Part III)

Thought it was about time that I pulled my finger out of my fat arse and carried on with this (occasional) series of posts on compiling match odds. I promise that I have indeed washed said finger before addressing the keyboard.

I suppose the bad weather has helped me to move this series on a little, as I would normally be out barbecuing a low-grade sausage to within an inch of its life at this time of year, but the incessant rain has driven me indoors instead.

For those who are not familiar with my previous posts on compiling match odds, I have also written these:
I have effectively also shown how to generate match odds using Poisson with these two posts:

If you haven't read any of these, then don't worry as this post today can be read in isolation - although this one does follow on from the discussion about ratings in general in my previous post, which you may want to view first before diving into the stuff below.

Right, last week, during my “How To Rate the Ratings” post, I discussed a difficulty with ratings methods which tend to plop a number out at the end of the algorithm that is then often interpreted to be a home win, away win or a draw - but without any reference to the actual match odds.

So, for example, if a rating method arrives at a figure of +250 for a match, then we could say it is going to be a home win, while -300 would be an away win and anything in between would probably be a draw. At least that’s the theory.

This may or may not be correct, but either way, we really need to be sure (or as sure as we can be), don't we? And we also need to know how to convert such rating values into match odds. In this way we can compare our rating with the odds currently on offer and decide whether we have found a value bet or not.

So today, we'll try to move such ratings away from their original abstract numbers like +250 or -300 and towards more concrete and usable percentage probabilities. From there, of course, we can then derive some odds.



Rate Form (Elo):
In the last post, I stated (rather piously) that I wasn't going to bother detailing exactly how the Rate Form method was implemented; but I've changed my mind now as, if I'm going to show various methods for compiling match odds, then it might be a good idea to take Rate Form from the beginning right through to the end. It gives a better, overall view.




Sorry, that's the wrong Elo



Before I do so, however, I should point out that the version I'm detailing here is perhaps the most basic version you can get. There is no consideration in this example for the status of the match, how many goals are scored, league points carried over from previous seasons or previous leagues or anything like that. Some people also add in things like shots on goal and corners. This version has no additional weightings or filtering at all; it’s just the bare bones method to demonstrate how to generate match odds from it. If you want to go on and enhance this version further with some of the things I've mentioned, or some of your own ideas, then do go right ahead.

To recap, the original, eponymously-named, Elo system was developed for rating chess players, but this was turned into the Rate Form system for football by Tony Drapkin and Richard Forsyth. Here it is:

  • At the beginning of the season, assign each team in the league 1,000 points.
  • For any given match, both teams give a percentage of their points towards a shared pot. The home team gives 7%, the away team gives 5%.
  • The winner gets the whole pot which they then add to their overall points tally.
  • Should the match end in a draw, the teams share the pot, often meaning the home team will lose a little bit and the away team will gain a little bit.

That's essentially it. The smart thing about this rating method is that big teams like Man Utd and Chelsea will accrue more points than the smaller teams, and so when these smaller teams do get to play the bigger teams, they get a chance to win a bigger pot than normal. In this way, the quality of the team is accounted for.
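For anyone who prefers code to bullet points, here is a minimal Python sketch of a single match update. The 7% and 5% stakes come from the rules above; the function name, the "1"/"X"/"2" result codes and the equal split of the pot on a draw are assumptions made for illustration.

```python
def rate_form_update(home_pts, away_pts, result,
                     home_stake=0.07, away_stake=0.05):
    """Return the new (home, away) points after one match.

    result is "1" (home win), "X" (draw) or "2" (away win).
    """
    home_bet = home_pts * home_stake   # home team stakes 7% of its points
    away_bet = away_pts * away_stake   # away team stakes 5% of its points
    pot = home_bet + away_bet

    # Both teams pay into the pot first...
    home_pts -= home_bet
    away_pts -= away_bet

    # ...then the pot is handed out according to the result.
    if result == "1":
        home_pts += pot
    elif result == "2":
        away_pts += pot
    elif result == "X":
        # Assumption: a drawn pot is split equally between the two teams.
        home_pts += pot / 2
        away_pts += pot / 2
    else:
        raise ValueError("result must be '1', 'X' or '2'")
    return home_pts, away_pts
```

With both teams on 1,000 points, a home win moves them to 1,050 and 950, while a draw moves them to 990 and 1,010, so the home team loses a little and the away team gains a little.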

Once these ratings have matured (and again, I'll leave the definition of "matured" up to you), we arrive at a final Rate Form value for a match by subtracting the away team's rating from the home team's (which will have a home advantage value added to it). That value can then be used to determine the likely final outcome.

Some people say that +100 means a home win and -200 means an away win; but I've also heard other people say that +250 or even +500 is a home win – although I suspect that with such a high score, only teams like Man Utd, City and Chelsea will pop-out of the ratings.

My view, however, is that what "some people say" is completely irrelevant. I don't know about you, but I don't really care what "some people say", I'd be more interested in what people know. I'd be more interested in concrete facts such as "This probability is greater than that probability".

Okay, as you can see I've started waffling and moaning now, so I'll stop that and continue on. The real question is how do we turn a Rate Form rating figure, such as +858 points, into match odds? Well, as usual, there are many answers to this question, some of which are more involved and complex than others. But if any of you have read this blog before, then you’ll know that I tend to favour the line of least resistance, so happily we don't need to be brain surgeons to sort this out. No, we only need access to some historical data from which we can run the Rate Form system, and then we can compare the ratings with actual results. This way, we can see exactly how often a particular Rate Form value results in a home win, a draw or an away win.

For example, if we ran our Rate Form system on the historical data and found 1,000 matches where we had a rating value of +180, then we might end up with something like this:



Rate Form    Total Matches    1      X      2
+180         1,000            655    152    193


Do note that this is a fake example, but this shows us that, when a Rate Form rating of +180 was calculated on 1,000 matches, 65.5% ended in a home win, 15.2% in a draw and 19.3% in an away win. If we did this for all valid and likely Rate Form values with a reasonably large data set, then we could start to imply a set of odds for each rating value, the odds simply being 1 divided by the probability.


Rate Form    Total Matches    1       X       2
+180         1,000            655     152     193
Match Odds                    1.53    6.58    5.18


Obviously there is no overround included in these figures.
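As a rough sketch of that arithmetic in Python: each probability is the count divided by the total, and the fair decimal odds are 1 divided by the probability. The function name and the dictionary layout are assumptions made for illustration.

```python
def counts_to_odds(home_wins, draws, away_wins):
    """Turn 1/X/2 result counts into probabilities and fair decimal odds."""
    counts = {"1": home_wins, "X": draws, "2": away_wins}
    total = sum(counts.values())
    if total == 0 or min(counts.values()) == 0:
        # An outcome that never happened has no usable odds from this method.
        raise ValueError("every outcome needs at least one match")

    probs = {k: n / total for k, n in counts.items()}
    odds = {k: 1 / p for k, p in probs.items()}   # no overround added
    return probs, odds


probs, odds = counts_to_odds(655, 152, 193)
for outcome in ("1", "X", "2"):
    print(outcome, f"{probs[outcome]:.1%}", f"{odds[outcome]:.2f}")
```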

Hmm, that’s all well and good, but there are obvious problems. First off, when using Rate Form, the system can produce this kind of rating:

           Arsenal (1919.11)  v Liverpool (1537.26)

This means, the calculation to arrive at a Rate Form figure for the match is:

(Arsenal (1919.11) + home advantage (100)) - Liverpool (1537.26) = 481.85


These are very precise values indeed, and even with 10,000 matches our results are going to be spread far too thinly to make any sense of. Taking the rating above, how many matches are we going to find that have produced a rating of exactly 481.85, so that we can populate our home win/draw/away win percentages? Not too many, I suspect. One or two at most.

So what should we do about this? Well, I suppose we could just lop-off the fractional part and see where we get but, even then, the spread of potential rating values is most likely still going to be too wide.

You'll have to make your own decisions here, but one other option you might consider is to group some rating numbers together, perhaps in clumps of 25 so that we get fewer of them. Therefore instead of +100, +101, +102, etc., we would have +100, +125, +150, etc. If a match is rated with an example value of +115, then we’d place it in the +100 group (less than +125 but greater than or equal to +100), and so on.
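Here is a small Python sketch of the match rating and the grouping step. The home advantage of 100 and the group width of 25 are the figures used in this post; the function names are made up, and rounding negative ratings down (so -115 lands in the -125 group) is an assumption that simply extends the "greater than or equal to" rule above.

```python
import math


def match_rating(home_pts, away_pts, home_advantage=100):
    """Rate Form value for a match: (home + advantage) - away."""
    return (home_pts + home_advantage) - away_pts


def group_rating(rating, width=25):
    """Round a rating down to the start of its group of `width` points."""
    return math.floor(rating / width) * width


rating = match_rating(1919.11, 1537.26)   # the Arsenal v Liverpool example
print(round(rating, 2), group_rating(rating))
print(group_rating(115))                  # falls in the +100 group
```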

Frankly, I’ve not done myself any favours here, as the Rate Form system is harder to create match odds from than, say, Goal Supremacy or Game Form, whose rating methods don't create fractional numbers and, more importantly, don’t have such a large spread of rating values. Anyway, for the moment, we’ll persist with our grouping idea and see how we get on.

It's at this point that we should dispense with the fake, made-up data and start dealing with real historical data. I have run the Rate Form system on 7,481 matches and, after grouping Rate Form by each 25 points, the headline results are:

  
RF Values    Total Matches    1         X         2
161          7,481            3,416     2,001     2,064
                              45.66%    26.75%    27.59%


Okay, this isn’t looking too bad, and the home win/draw/away win percentages are not a million miles away from the long-term averages - but what about the breakdown of actual points?

I don't know how to display a large table of Excel data in Blogger, so I'm going to leave what I've done as a download which you can find HERE. If you look at the "Summary" tab in this downloadable workbook, you can see each individual rating value (161 of them), the number of times that value was rated in a match and the 1X2 spread. And here we can still see some anomalous results.


For example, if you look at the +200 rating, this shows a 51.46% probability of a home win, and yet a stronger rating of +275 only shows a 47.49% probability of a home win. That cannot be right. Why would that be?

Well, this is due to insufficient data, which is effectively causing some “noise”, or kinks in the data. No need to fret though, as we can overcome this problem and smooth out the results by running some basic regression analysis on our results.
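One way to see the thin-data problem for yourself is to keep the match count next to each percentage. The Python sketch below tallies results per 25-point group from a list of (rating, result) pairs. The function name, the result codes and the handful of sample matches are all invented for illustration; they are not taken from the workbook.

```python
from collections import defaultdict


def summarise(matches, width=25):
    """Tally 1/X/2 results per rating group.

    matches is an iterable of (rating, result), result in "1", "X", "2".
    Returns {group: {"n": matches, "1": pct, "X": pct, "2": pct}}.
    """
    counts = defaultdict(lambda: {"1": 0, "X": 0, "2": 0})
    for rating, result in matches:
        group = int(rating // width) * width   # round down to the group start
        counts[group][result] += 1

    summary = {}
    for group in sorted(counts):
        n = sum(counts[group].values())
        summary[group] = {"n": n}
        for outcome in ("1", "X", "2"):
            summary[group][outcome] = counts[group][outcome] / n
    return summary


# Invented sample: four matches in the +200 group, only one in +275.
sample = [(210, "1"), (215.5, "X"), (201, "1"), (224.99, "2"), (280, "1")]
for group, row in summarise(sample).items():
    print(group, row)
```

A group with only a handful of matches can swing wildly on one or two results, which is exactly the sort of kink described above.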


Sit back and relax...



Regression Analysis:
Okay, don’t run away scared by this rather technical-sounding title. This doesn't involve a psychiatrist's couch, you'll be pleased to know. Regression analysis is nothing more than a fairly boring statistical method. Personally, I took the time and trouble to learn the basic maths behind regression analysis for myself, but I won’t trouble you with them. Instead, we can relax and let Microsoft Excel take all the strain. It’s really very simple indeed this way.

If you look at the "Regression" tab of the workbook, you'll see I have grouped all the Rate Form values from +1475 all the way down to -1400 along with the corresponding home win percentage for each, acquired from running the system against the 7,481 matches (columns B and C in the worksheet). I've also done exactly the same thing for the away wins (columns H and I).

Then I highlighted the data in the B & C columns (B4:C119) and then I went to the "Insert" ribbon within Excel and selected the little arrow underneath "Scatter". I then selected the first picture (top left). This plops a diagram onto the currently open worksheet, showing all the rating values against the actual percentages attained. As you can see, there seems to be an upward line in the data (which is what we're looking for).  If I right-click on the mass of dots within the diagram, a context menu appears, from which we can select "Add Trendline...".


Within the subsequent box that opens, I ticked the option marked "Display equation on chart" and also the one marked "Display R-squared value on chart". I then closed that box. This now gives me a nice trendline on the chart itself - but more importantly I also have two other items on the diagram. The first one is a ready-made equation that I can use to create my probabilities for the home win, and the second one (the R² value) is a measurement of how closely this data matches the ideal trend line. An absolutely perfect relationship would be a value of 1, although anything above 0.60 is probably okay. If the R² value is around the 0.50 mark, then you should perhaps go back to the drawing board.
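If you would rather not use Excel, a straight-line trendline and its R-squared value can be worked out with ordinary least squares. The Python sketch below is one way to do it, assuming a plain linear fit; the function name and the sample points are invented for illustration and are not taken from the workbook.

```python
def fit_line(xs, ys):
    """Least-squares straight line through (xs, ys).

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("need some variation in both x and y")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R-squared: 1 minus (unexplained variation / total variation).
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Invented sample: rating groups against observed home win percentages.
ratings = [-200, -100, 0, 100, 200, 300]
home_pct = [0.40, 0.44, 0.45, 0.49, 0.48, 0.52]
slope, intercept, r2 = fit_line(ratings, home_pct)
print(f"P = ({slope:.4f} * x) + {intercept:.4f}, R-squared = {r2:.3f}")
```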

Using the given equation, we simply need to substitute the x value with our Rate Form value, so if we have a +150 rating value, then the equation becomes:

     P = (0.0002 * 150) + 0.4521.

This gives us a probability of 0.4821 (or 48.21%). Therefore a +150 rating is equivalent to home odds of 2.07. If we truly believe that this rating is accurate and the bookies are offering odds of 2.20, then perhaps we have a value bet on our hands.
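The same sum as a short Python sketch. The slope and intercept are the ones from the equation above; the function names are made up, and the value test (bookie's odds multiplied by our probability is greater than 1) is just another way of saying the bookie's odds are bigger than our own.

```python
def home_probability(rating, slope=0.0002, intercept=0.4521):
    """Home win probability from the trendline equation in this post."""
    return slope * rating + intercept


def fair_odds(probability):
    """Decimal odds with no overround."""
    return 1 / probability


def is_value(bookie_odds, probability):
    """True when the bookie's odds beat our own estimate of the fair odds."""
    return bookie_odds * probability > 1


p = home_probability(150)
print(f"{p:.4f}", f"{fair_odds(p):.2f}", is_value(2.20, p))
```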

Right, so we do exactly the same for the away data as we did for the home data. Select it all, and create a scatter plot based on that data. Do note that when creating a trendline, you also have the option of choosing a different type of trendline. Linear looks to be the correct type for the data I have shown, but this may not be so in all cases. Do experiment. Remember, we are looking for the highest R² value we can get.


Once we have the away odds created, we can then either do the same for the draw values, or alternatively we could just subtract the home and away probabilities for each rating value from 1, giving the remaining amount for the draw. It might be good, however, to actually create the scatter plot for the draw as we can then see how efficient the trendline is. Don’t be surprised if it’s not very efficient at all! Anyway, I'll leave that part up to you (as an exercise perhaps).
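The subtraction route looks like this as a Python sketch. The home probability of 0.4821 is the +150 example from earlier; the away probability of 0.25 is a purely hypothetical placeholder, since the away equation isn't quoted in this post, and the function names are made up.

```python
def draw_probability(p_home, p_away):
    """Whatever is left over once the home and away probabilities are taken from 1."""
    p_draw = 1.0 - p_home - p_away
    if p_draw <= 0:
        # Two separate trendlines can overshoot, so check before using the result.
        raise ValueError("home and away probabilities leave nothing for the draw")
    return p_draw


def match_odds(p_home, p_away):
    """Fair 1X2 decimal odds (no overround) from home and away probabilities."""
    p_draw = draw_probability(p_home, p_away)
    return {"1": 1 / p_home, "X": 1 / p_draw, "2": 1 / p_away}


# 0.25 for the away win is a hypothetical figure, used only to show the sum.
for outcome, price in match_odds(0.4821, 0.25).items():
    print(outcome, f"{price:.2f}")
```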



Improvements:
Okay, so what can we do to improve the efficiency of what we have done here today? Well, there are a number of things you may want to look at:
  • Decide whether the basic Rate Form method that I’ve outlined is really the best one to use. As mentioned above, it’s possible to look into other ways of producing Rate Form ratings by introducing other variables and by refining the approach. This is well worth pursuing.
  • Look again at whether the 25 point grouping is the best approach for shrinking the data set down. 
  • Cut the outer ranges that we’ve included in our ratings. Presently I have shown extreme ranges in the data such as +1475 and -1400. These will not occur that often and will be skewing the results we’re getting. If we just concentrated on the more common values, not only will we be shrinking our data set  down (which is still too big) but we should also increase the efficiency of our regression.  

All of these should help to improve things and help to bring your R² value up further towards 1. Do keep an eye on that value.
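As a sketch of the last idea, here is one way to cut the thinly populated or extreme groups before running the regression, in Python. The function name, the row layout and the cut-off of 30 matches are arbitrary assumptions made for illustration; you would need to pick limits that suit your own data.

```python
def trim_groups(rows, min_matches=30, max_abs_rating=None):
    """Keep only rating groups with enough matches behind them.

    rows is a list of (group, n_matches, home_pct) tuples.
    min_matches is an arbitrary cut-off; max_abs_rating optionally
    drops groups further from zero than that value.
    """
    kept = []
    for group, n_matches, home_pct in rows:
        if n_matches < min_matches:
            continue
        if max_abs_rating is not None and abs(group) > max_abs_rating:
            continue
        kept.append((group, n_matches, home_pct))
    return kept


# Invented rows: the two extreme groups have almost no matches behind them.
rows = [(-1400, 1, 0.0), (0, 400, 0.45), (200, 300, 0.51), (1475, 2, 1.0)]
print(trim_groups(rows))
```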

Okay, well, I’m going to leave it here. Hopefully I haven’t confused too many of you, either through not explaining all this properly or by making it too complex. Apologies if I have done either of those things.

However, hopefully I have shown at least a few of you who didn't know before just how you can create match odds from seemingly unrelated rating values. From there you should now have a fighting chance of being able to find some value bets for yourself out there.

Good luck.


1 comment:

  1. Can you please explain how you came up with the home advantage figure, in this case 100? I don't understand this part.

    The line is (Arsenal (1919.11 ) + home advtge (100)) - Liverpool (1537.26)

    Why is the home advantage 100 in this case, and how can we calculate it with new data?

    Thanks a lot, the article is very interesting indeed
