{"id":4352,"date":"2020-10-15T17:50:05","date_gmt":"2020-10-15T17:50:05","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/10\/15\/predicting-horse-racing-outcomes\/"},"modified":"2020-10-15T17:50:05","modified_gmt":"2020-10-15T17:50:05","slug":"predicting-horse-racing-outcomes","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/10\/15\/predicting-horse-racing-outcomes\/","title":{"rendered":"Predicting Horse Racing Outcomes"},"content":{"rendered":"<div>\n<h2>Background<\/h2>\n<p>The horse racing community has been using quantitative data to develop betting algorithms for decades.\u00a0 Indicators including horse bodyweight, age, and previous lap times are all utilized along with the domain-specific Speed Index to predict future race outcomes. Our team was asked to answer a new question in horse racing: Can subjective data taken by a subject-matter expert (SME) improve the predictive ability of horse racing models?<\/p>\n<p>Our partner in this task was RaceQuant, a horse racing syndicate that makes proprietary wagers on races at Hong Kong\u2019s two venues: Happy Valley and Sha Tin. Prior to each race, participating horses are walked around an open area next to the track known as the \u2018paddock.\u2019 Racing insiders stand by and relay their observations of each horse to their respective betting syndicates. 
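<\/p>
<p>Such paddock observations are naturally categorical. A minimal sketch of one-hot encoding them with pandas, using hypothetical feature names and values (the real feature set is proprietary):<\/p>

```python
import pandas as pd

# Hypothetical paddock observations; real feature names and values are proprietary.
paddock = pd.DataFrame({
    'horse_id': [101, 102, 103],
    'bandages': ['front-leg single', 'rear-leg double', 'all'],
    'demeanor': ['confident', 'nervous', 'confident'],
})

# One-hot encode each categorical feature into binary indicator columns.
encoded = pd.get_dummies(paddock, columns=['bandages', 'demeanor'])
print(encoded.columns.tolist())
```

Each categorical column becomes one indicator column per observed value, which is the representation the models below consume.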
RaceQuant had contracted an SME to record specific features, and the observations were then one-hot encoded.\u00a0 These observations can include subjective takes like \u2018confident\u2019 or \u2018nervous,\u2019 and seemingly span the entire English vocabulary.\u00a0 For the study, features were broken down into between 5 and 7 distinct values.\u00a0 As an example, some horses will have bandaged legs.\u00a0 This feature would then be broken down into categories of [front-leg single, front-leg double, rear-leg single, rear-leg double, all] and subsequently one-hot encoded for data privacy.\u00a0 The observations spanned one season of racing and 800 individual races.<\/p>\n<p>In addition to the features provided by the SME, all observations also included which of the 23 primary trainers trained the horse. Features indicating the rank of each horse by bodyweight in the race were given, as well as columns indicating how the horse\u2019s bodyweight had changed since its prior race.\u00a0 As with the data provided from the Paddock, the trainer and bodyweight data were one-hot encoded.<\/p>\n<p>In order to judge the relevance of the provided data, RaceQuant established two benchmarks: the Tips Indexes provided by the Hong Kong Jockey Club and the Odds data for each horse in each race.\u00a0 The Tips Indexes are published on the Club\u2019s website at www.hkjc.com before 4:30 pm on the day preceding a race day and before 9:30 am on each race day morning.\u00a0 For the analysis, the team utilized the race day measure of the Tips Index.\u00a0 \u201cA Tips Index is not a &#8220;tip&#8221; itself but a reference figure based on selections by the racing media and generated by applying a statistical formula. Each starter&#8217;s Tips Index represents the degree of favor it has among tipsters.\u00a0 For easy comprehension, a Tips Index is expressed in the form of an odds-like figure. 
The smaller the figure, the higher the degree of favor a starter enjoys among tipsters.\u201d\u00a0 Historically, the Tips Index favorite has predicted ~24% of the winning horses.<\/p>\n<p>The Odds data was taken five minutes prior to the start of the race.\u00a0 Odds are subject to change and are determined entirely by the amount of money wagered on each horse, as the betting is pari-mutuel.\u00a0 Additionally, the true odds payout is not set until the start of the race. Consequently, the value of a ticket purchased can move up and down accordingly.\u00a0 Lastly, the racetrack collects a commission on total wagers, giving the betting pool an overround of ~22%.\u00a0 Therefore, the implied win probabilities derived from the listed odds sum to roughly 1.22, or 122%.\u00a0 Historically, the Odds favorite wins ~33% of the time.<\/p>\n<p>RaceQuant had two primary questions regarding the data: First, can the Paddock data, alone or in conjunction with the Tips Index and Odds, be used to predict a higher percentage of winning horses relative to the established benchmarks on an \u201cunseen\u201d set of races?\u00a0 Second, would a model using this data provide a higher ROI than the benchmarks after making one-unit bets on the model\u2019s favored horse over a given period of time?\u00a0<\/p>\n<p>With two benchmarks to compare results against, the team focused on probabilistic classification models to decide whether or not a horse is a winner.\u00a0 A motivating factor in this decision was that RaceQuant desired full transparency in the models so future efforts could focus on features with statistically significant results.\u00a0 Given this, along with the limited data, the agreed-upon choice was to utilize Logistic Regression models in the research.\u00a0 Bolton and Chapman had already demonstrated the usefulness of a multinomial logit model for handicapping horse races in their 1986 paper.\u00a0 However, given the highly subjective features within the dataset, the team also developed a Random Forest Classifier.\u00a0<\/p>\n<p>An obstacle 
was that the models lacked the ability to recognize the \u2018race-by-race\u2019 structure of the data. This meant each horse would effectively be racing against all of the horses in the dataset, including itself if it appeared in more than one observation. To solve this problem, we pulled the win probabilities each model assigned a given horse, and then ranked the horses in a given race based on the model-assigned probability of winning.<\/p>\n<p>To summarize, two models, a Logistic Regression and a Random Forest classifier, were trained on both the Paddock data alone and the combined Paddock and Tips\/Odds dataset. For the test set, the probabilities of winning for each horse were determined from the models, and the expected winner of each race was selected as the horse with the highest win probability. The total number of correct predictions was calculated, and the picks from our models on the test set were compared with our two benchmarks: betting on the horse with the most favorable odds and betting on the horse with the most favorable Tips Index.<\/p>\n<h2>Results<\/h2>\n<p>When the Logistic Regression and Random Forest were trained and tested on only the Paddock dataset, these models underperformed their benchmarks. 
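<\/p>
<p>The per-race selection procedure described above, ranking horses within each race by model-assigned win probability and picking the top horse, can be sketched as follows (column and variable names are illustrative, not from the actual codebase):<\/p>

```python
import pandas as pd

# Hypothetical model output: P(win) for every horse observation,
# grouped back into races so horses only compete within their own race.
preds = pd.DataFrame({
    'race_id':  [1, 1, 1, 2, 2, 2],
    'horse_id': [11, 12, 13, 21, 22, 23],
    'p_win':    [0.35, 0.80, 0.10, 0.55, 0.60, 0.05],
})

# Rank horses within each race by model-assigned win probability.
preds['rank'] = preds.groupby('race_id')['p_win'].rank(ascending=False)

# The model's pick per race is the rank-1 horse.
picks = preds.loc[preds['rank'] == 1, ['race_id', 'horse_id']]
print(picks.to_dict('records'))
```

Grouping by race before ranking is what prevents a horse from effectively competing against the whole dataset, including its own other observations.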
Moreover, the return (loss) from utilizing strictly the Paddock data was no better than random selection.<\/p>\n<p><img loading=\"lazy\" width=\"297\" height=\"200\" src=\"https:\/\/lh4.googleusercontent.com\/686ZqIfU1Kgar-6wz11hHnx6RuezYUvBOicIqKf1zzImT32hCPd0GMWPyQUZhOcWLWRP6-dWV54aQsZX4whQ4UWEhV7QYfPKSFDhQSAfuXUtsWOiCdHz7GBkae-IzcAkG53uL9pL\"><\/p>\n<p>While all four strategies seem to track closely for the first two months, eventually both models diverge and never recover.\u00a0 Between November and December of 2019, the Random Forest outperforms on the test set.\u00a0 These results may simply be noise, as this deviation is not statistically significant and could easily have appeared at random.\u00a0 Nevertheless, there does appear to be structure within the results that tracks in line with the Tips and Odds dataset, specifically in January, late March, and early May.<\/p>\n<p>One of the first decisions on how to proceed with the analysis was whether to use the Tips Index, the Odds favorite, or both as comparative indexes.\u00a0 To answer this, the first step was to determine whether or not there was a statistical difference in the performance of the Odds favorite as compared to the Tips favorite.\u00a0 Using Fisher\u2019s exact test, the team determined that performance results are dependent on the proxy used, with a p-value of 0.00032 and a higher return for the Tips Index.<\/p>\n<p>While strictly utilizing the Tips Index in the model may have led to enhanced initial performance, the decision was made to include both; the team hoped that any divergence between the Tips Index and the Odds favorite might be of value to the model.\u00a0\u00a0<\/p>\n<p>Promising results were observed when the Paddock data was supplemented with the Odds and Tips Index of each horse.<\/p>\n<p><img loading=\"lazy\" width=\"332\" height=\"224\" 
src=\"https:\/\/lh3.googleusercontent.com\/N4B3LbnW1eGiyGahfSFSW5G4FkyC-dgFXg9nlMBarEUjuX9vth32I2NfW8EshNyJpfYtB7hNyO3IbYCDNYYf5B6ZvrPsPU3RmcIcWP1eLWa4UpTRyNvLyCQy0LjpfmMkdFT5cmqS\"><\/p>\n<p>In the same timeframe, both the Logistic Regression and Random Forest classifiers made selections that resulted in smaller losses than the benchmarks. Additionally, the win rate of each model increased significantly. Single-unit bets made only on the Tips favorite accurately predicted the winner in 22.6% of the test races, while betting only on the Odds favorite accurately predicted the winner in 26.4% of the test races. Logistic Regression, however, predicted 28.3% of the winners while Random Forest predicted 28.9%.\u00a0 Additionally, the models were able to track the better-performing index, rising as Tips returns diverged from Odds in November, and increasing again in February as Tips returns turned upwards.<\/p>\n<p>Based on Fisher\u2019s exact test, it can be concluded that performance is, in fact, dependent on the selection strategy of the models as measured against both the Tips Index and the Odds favorite.\u00a0 While additional data will be needed for a higher level of confidence, it appears that the Paddock data contributes complementary signal that enhances prediction accuracy and reduces overall loss under a one-unit betting strategy.<\/p>\n<p>To determine the efficacy of the model\u2019s strongest predictions, four thresholds [60%, 70%, 80%, 90%] were implemented, and the model was run against each. 
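<\/p>
<p>This thresholding reduces to a simple filter over each race\u2019s top pick: wager only when the pick\u2019s predicted win probability meets the cutoff. A minimal sketch with hypothetical probabilities (the real pipeline differs):<\/p>

```python
# Hypothetical per-race top picks: (race_id, predicted P(win) of the pick).
top_picks = [(1, 0.95), (2, 0.72), (3, 0.88), (4, 0.61)]

def bets_at_threshold(picks, threshold):
    # Keep only the races whose top pick meets or exceeds the cutoff;
    # races with no qualifying horse receive no bet at all.
    return [race for race, p in picks if p >= threshold]

for threshold in (0.60, 0.70, 0.80, 0.90):
    print(threshold, bets_at_threshold(top_picks, threshold))
```

Raising the cutoff shrinks the set of races wagered on, which is why each threshold produces a different betting record.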
Each test required the model\u2019s choice to have a predicted win probability at or above the threshold in order to place a wager.\u00a0 If no horse in a race met the threshold, then no bet was placed.\u00a0<\/p>\n<p><img src=\"https:\/\/lh5.googleusercontent.com\/DobRbCJbISh8cJ0XtdLChdISsYHmtw3AAUP78McUVZwpeJsdg3vruBH_I0GAeQVt_YtiR9yjooPqrQUn6Wk5tEJkc92yYRooD_F2TyeOdYDIQDf6Qq1NU7sYt-fjoeDcjY2hN9_R\"><\/p>\n<p><img loading=\"lazy\" width=\"225\" height=\"145\" src=\"https:\/\/lh5.googleusercontent.com\/kK6W9wVE_dRD4FvdqkU6hRg-Aaaz46dgsaJYLUqJhgwJxp2U81v-ffpxnmX1njtbtinWrj_-_yPRWyWveJI2qQ5BZ3Pb6BitjIkUrR2rvS3l7LR06je5RlB1CAUVj9pyTdBaGpwn\"><\/p>\n<p>While changing the threshold did not result in a profitable model, the &gt;90% model\u2019s only losses were due to the racetrack\u2019s commission on each bet placed.\u00a0 Another interesting takeaway was that the &gt;70% model actually underperformed the base model.\u00a0 This highlights the value of correctly selecting horses in well-matched races as well as long-shots to drive ROI.<\/p>\n<p>A major concern from this analysis was the distribution of predicted win probabilities for the horses the model selected to win their respective races.\u00a0 That is, how frequently the model\u2019s top pick in a race carried a predicted win probability of x.\u00a0 The most frequent observations occurred between ~80% and ~87.5%.\u00a0 Working backwards, this yields break-even odds (at 85%) of 3\/17 (1.18 in decimal and -566.67 American\/moneyline).\u00a0\u00a0<\/p>\n<p>It is clear that the model\u2019s win percentage predictions are substantially higher than in reality. 
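<\/p>
<p>The break-even conversion above follows directly from the predicted probability p: fractional odds of (1 &#8722; p)\/p, decimal odds of 1\/p, and the equivalent American moneyline. A small sketch reproducing the 85% figures:<\/p>

```python
def break_even_odds(p):
    # Convert a win probability into break-even fractional, decimal,
    # and American (moneyline) odds.
    fractional = (1 - p) / p                 # 0.15 / 0.85 = 3/17
    decimal = 1 / p                          # stake plus profit per unit wagered
    american = -100 * p / (1 - p) if p > 0.5 else 100 * (1 - p) / p
    return fractional, decimal, american

frac, dec, ml = break_even_odds(0.85)
print(round(frac, 4), round(dec, 2), round(ml, 2))
```

At p = 0.85 this yields 3\/17 fractional, ~1.18 decimal, and ~-566.67 on the moneyline, matching the figures quoted above.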
\u00a0 The primary cause of this problem was the binary, win-or-lose framing of the modeling.\u00a0 When the win probabilities from the Logistic model are graphed for all horses, a large number of horses are given virtually no chance to win the race.\u00a0 As the expected win probability increases past 5%, a peak is seen at ~35% followed by a steady drop-off.\u00a0 However, even with the steady drop-off, enough horses exhibit the features of a \u201cwinning\u201d horse to receive a high probability.\u00a0 As the project moves forward and a multinomial model is substituted, these probabilities should decline.<\/p>\n<p><img loading=\"lazy\" width=\"208\" height=\"126\" src=\"https:\/\/lh6.googleusercontent.com\/QIfR2QGsfVVUTaBfRjqNaPk2SdQdS__s6ARJjiH_7OKdZRP5RwDxLRMQU442dIofd08qABjIV6kjErBYtsZDBHjI1NTza6RzFHeLN2oVQ36LJqA3m1sZr1_M0dyCRtuI4jY1OlwY\"><\/p>\n<p><img loading=\"lazy\" width=\"217\" height=\"134\" src=\"https:\/\/lh4.googleusercontent.com\/g0ltT1I87Wfv7OB1pOA0PqXjpE7fpo2emKc9Tc9XY6pLAWC0NXSgL3pk_jpo-6g4RKESViRiFFnfIVOipasjsP7jx1Yrieri15OobpmEQLd4GLO_aAn5zQ1lCEPccrqjX6D9Oy09\"><\/p>\n<p>Interestingly, the Random Forest classifier had more difficulty in identifying strictly losing horses.\u00a0 The Random Forest classifier also had fewer ambiguous values (those between 40% and 60%).\u00a0 While neither of the models produced promising distributions based on implied odds, the logistic model captured more distinctly \u201closing\u201d characteristics.\u00a0 Moving forward, the Kelly Criterion will take these probabilities as inputs and allocate funds accordingly.\u00a0 Until these probabilities are brought in line, no asset allocation can be completed.<\/p>\n<h2>Takeaways<\/h2>\n<p><span>The initial challenge concerned the data itself.\u00a0 While the Tips Index dataset spanned five years, the Paddock data only included observations for one season of racing.\u00a0 This limited our observations to only 800 races, or roughly 8,000 
observations in total, across 1,414 distinct horses.<\/span><\/p>\n<p><span>Given the focus on strictly predicting horses that finished first, the team elected to develop a baseline model using only the data provided and compare the results to strategies of selecting winners using the Tips Index and predictions made by the \u201cwisdom of crowds\u201d inherent in the Odds favorite.\u00a0 This choice was made largely so that the results would be clear to the client and the team could present options for remedying issues encountered in the data.\u00a0 As this project would eventually be moved to an internal team, transparency around decisions was vital.\u00a0\u00a0<\/span><\/p>\n<p><span>While the models did outperform both indexes, their sensitivity required significant improvement before they could be implemented. The need for more observations was clear.<\/span><\/p>\n<p><span>Several techniques could have been used to increase observations of the minority class in this situation. 
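<\/span><\/p>
<p>A generic example of one such technique is simple random oversampling: duplicating winning (minority-class) observations, sampled with replacement, until the classes balance. A minimal NumPy sketch with stand-in features, illustrative only and not the domain-specific approach the team pursued:<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels: roughly one winner per ten starters, 800 races of data.
y = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0] * 80)
X = rng.normal(size=(len(y), 5))  # stand-in feature matrix

# Resample winner rows (with replacement) until they match the loser count.
winner_idx = np.flatnonzero(y == 1)
loser_idx = np.flatnonzero(y == 0)
extra = rng.choice(winner_idx, size=len(loser_idx) - len(winner_idx), replace=True)

X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
print(int(y_bal.sum()), int((y_bal == 0).sum()))
```

The duplicated rows are not truly independent observations, which is one reason a domain-specific alternative can be preferable.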
However, within the horse racing community there are domain-based resampling practices with origins in the ranking choice theorem developed by Luce and Suppes (1965).\u00a0 More recently, Chapman and Staelin have described how \u201cthe extra information inherent in such rank ordered choice sets may be exploited\u201d and provided an \u201cexplosion process\u201d that applies to a class of models of which the stochastic utility model is a member.\u00a0 By applying the Chapman and Staelin explosion process, it is possible to decompose rank ordered choice sets into \u2018d\u2019 statistically independent choice sets, where \u2018d\u2019 is the depth of explosion.\u00a0 Bolton and Chapman\u2019s \u201cSearching For Positive Returns At The Track\u201d provides the example of three such sets: \u201c[horse 4 finished ahead of horse 1, 2, and 3], [horse 2 finished ahead of horse 1 and horse 3], and [horse 1 finished ahead of horse 3], where no ordering is implied among the \u201cnon-chosen\u201d horses in each \u201cchoice\u201d set.\u201d<\/span><\/p>\n<p><span>These exploded choice sets are statistically independent, so they can be added to the dataset as if they were completely independent horse races.\u00a0 This increases the number of independent choice sets available for analysis and, ultimately, provides more precise parameter estimates.\u00a0 \u201cExtensive small-sample Monte Carlo results reported in Chapman and Staelin document the improved precision that can accrue by making use of the explosion process.\u201d\u00a0 This becomes tremendously valuable given both the time and cost of obtaining a sufficiently robust dataset for modeling.<\/span><\/p>\n<p><span>Due to the limited amount of quantitative data, feature engineering appeared to be a promising path to take in order to bolster prediction accuracy.\u00a0 Average lengths behind the winner appeared to be a good starting point.\u00a0 For horses that won, this 
number would be the negative of the lengths between the horse and second place.\u00a0 However, the frequency distributions of races run by a horse are heavily right-skewed, as many have run only one race.\u00a0 Additionally, this metric would be biased against horses that have never run before.\u00a0 In a random forest model, this issue could be slightly mitigated by using a dedicated large number to represent such horses, but for the logistic model, a solution could not be so easily implemented.\u00a0 However, there were several remedies available:<\/span><\/p>\n<ul>\n<li><span>Utilize available metrics such as Speed Index (when available) to predict the horse\u2019s finish time at the race distance and ensemble the two models<\/span><\/li>\n<li><span>Identify the horse in the current dataset with &gt; n finishes most similar to the horse in question using KNN and assume the same value for the variable<\/span><\/li>\n<li><span>Average the horse\u2019s past n performances during training over the same distance<\/span><\/li>\n<\/ul>\n<p><span>Given that the data did not include quantitative metrics other than the number of lengths that a horse placed behind the winner, the only applicable solution was to use KNN.\u00a0 However, implementing such a technique over all the variables in the dataset might not actually be necessary in this situation.\u00a0 It has already been discussed that bodyweight, trainer, and jockey are important variables from the Paddock dataset.\u00a0 The team\u2019s hypothesis is that adding readily available information such as Speed Index, which is known to have a strong correlation to finish time, might be enough to estimate the average range of time that a horse would take to finish a race.\u00a0 Testing this theory using AIC or BIC over the aforementioned variables and additional variables from the Paddock dataset would be a very informative next step.<\/span><\/p>\n<p><span>After analyzing the average lengths behind the 
winner data, the team observed frequent high variance in horses that finished below fourth place.\u00a0 Even winning horses would have runs that had a drastically negative impact on their average distance to the winner.\u00a0 The thesis behind this variability was that once a jockey knew his\/her horse would not place, the jockey would ease off the horse and conserve as much of the horse\u2019s energy as possible for another race or for training.\u00a0 When the filter was applied so that only the top four horses were accounted for in each race, the standard deviation fell by over 62%.\u00a0 Because of this, the team was back to not only the first question of how to manage horses that have not run, but also how to manage horses that have not placed in the top four.\u00a0 Several approaches were presented on how to manage this issue.\u00a0 The first was to one-hot encode various distance gaps and add an additional label to signify a placing below fourth.\u00a0 Accomplishing this task would require an SME who could validate the methodology, plus additional statistical analysis.<\/span><\/p>\n<p><span>Before delving into such an undertaking, the team decided to track whether win-to-race and place-to-race ratios were strong indicators of future performance.\u00a0 The team did this by separating the training and testing datasets by race date and then calculating total races won to total races for all horses in the training set.\u00a0 These numbers were then applied to each horse in the testing set, with unseen horses receiving a value of zero.\u00a0\u00a0<\/span><\/p>\n<p><span>The result of this approach was a higher predicted win probability for horses with a high index, but a drop-off in total accuracy due to the bias against unseen horses.\u00a0 Additionally, because of the rules of racing in Hong Kong, dominant horses may have additional weight added in subsequent races. 
The model was unable to account for these changes.<\/span><\/p>\n<p><span>Overall, the team concluded that additional quantitative information is necessary to effectively utilize feature engineering for the model.\u00a0 While the Paddock data alone performed no better than a random choice approach, when coupled with the Tips and Odds data, the Paddock observations appear to contribute complementary signal that substantially increases overall model returns.\u00a0 Nevertheless, due to the lack of observations and the subsequent variance observed in the model, the team would need to gather more data to establish a definitive stance.<\/span><\/p>\n<p>Sources:<\/p>\n<p>&#8220;Searching For Positive Returns At The Track: A Multinomial Logit Model For Handicapping Horse Races,&#8221; Bolton &amp; Chapman, 1986<\/p>\n<p><a target=\"_blank\" href=\"http:\/\/www.hkjc.com\/\" rel=\"noreferrer noopener\">www.hkjc.com<\/a>\u00a0The Hong Kong Jockey Club<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/nycdatascience.com\/blog\/student-works\/capstone\/predicting-horse-racing-outcomes\/<\/p>\n","protected":false},"author":0,"featured_media":4353,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/4352"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=4352"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/4352\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/4353"}],"wp:attachment":[{"href":"https:\/\
/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=4352"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=4352"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=4352"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}