Predictive Analytics in French Trot Horse Racing

Goal

The goal was to develop a predictive model for forecasting outcomes in French trot horse racing. The focus was on an accurate, efficient, and interpretable model for predicting race winners, built through data cleaning, feature engineering, and machine learning.

Dataset

The dataset comprised over 1.2 million data points covering horses, races, jockeys, and trainers, with variables such as demographics, race conditions, horse performance history, and trainer and jockey statistics. It required rigorous cleaning and pre-processing to ensure data quality and relevance, with particular attention to redundant variables, type casting, value updates, and missing data.

What We Did

Data Cleaning:

Removed overlapping variables and cast columns to appropriate types.
Updated values for clarity and consistency, including handling missing values by assigning NaN or categorical labels as appropriate.
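The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: apart from RaceID, every column name, the "unknown" sentinel, and the "Unknown" label are assumptions made for the example.

```python
import pandas as pd

# Toy rows; column names and sentinel values are hypothetical.
raw = pd.DataFrame({
    "RaceID": ["101", "101", "102"],
    "HorseAge": ["5", "unknown", "7"],
    "TrackCondition": ["Good", None, "Heavy"],
    "Distance": ["2100", "2100", "2850"],
    "DistanceMeters": ["2100", "2100", "2850"],  # duplicates "Distance"
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop an overlapping column, cast types, and label missing values."""
    out = df.drop(columns=["DistanceMeters"])      # remove overlapping variable
    out["RaceID"] = out["RaceID"].astype("int64")   # type casting
    out["Distance"] = out["Distance"].astype("int64")
    # Unparseable numeric entries become NaN instead of a made-up value.
    out["HorseAge"] = pd.to_numeric(out["HorseAge"], errors="coerce")
    # Missing categorical entries receive an explicit label.
    out["TrackCondition"] = out["TrackCondition"].fillna("Unknown").astype("category")
    return out

cleaned = clean(raw)
```

Keeping NaN for numeric gaps and a separate label for categorical gaps lets a tree-based model treat "missing" as its own signal rather than as an imputed value.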

Feature Engineering:

Developed key race variables like 'Nb_Participants' and 'Season', and outcome variables including 'Placed' and 'WIN Variable'.
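A rough sketch of how such variables might be derived is shown below. The input columns (RaceDate, FinishPosition), the season boundaries, the column name "Win" for the WIN variable, and the reading of 'Placed' as a top-three finish are all assumptions for illustration; the original definitions may differ.

```python
import pandas as pd

# Toy results table; RaceDate and FinishPosition are hypothetical column names.
results = pd.DataFrame({
    "RaceID": [1, 1, 1, 2, 2],
    "RaceDate": pd.to_datetime(["2021-01-10"] * 3 + ["2021-07-04"] * 2),
    "FinishPosition": [1, 2, 4, 3, 1],
})

def month_to_season(month: int) -> str:
    # Meteorological seasons, northern hemisphere (an assumption).
    seasons = {12: "Winter", 1: "Winter", 2: "Winter",
               3: "Spring", 4: "Spring", 5: "Spring",
               6: "Summer", 7: "Summer", 8: "Summer"}
    return seasons.get(month, "Autumn")

features = results.copy()
# Number of runners in each race, repeated on every row of that race.
features["Nb_Participants"] = features.groupby("RaceID")["RaceID"].transform("size")
features["Season"] = features["RaceDate"].dt.month.map(month_to_season)
# Outcome variables: winner flag, and "placed" assumed to mean top three.
features["Win"] = (features["FinishPosition"] == 1).astype(int)
features["Placed"] = (features["FinishPosition"] <= 3).astype(int)
```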

Past Performance Calculations:

Calculated average past performance scores for horses, jockeys, and trainers. This involved aggregating historical race data to derive performance metrics, providing a comprehensive view of each participant's track record.
Combined these performance scores with current race conditions to create predictive variables. Pairing past performance with current conditions gave a more nuanced picture of potential race outcomes.
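One way to compute an average past performance score without leaking the current race into its own feature is an expanding mean shifted by one race. The sketch below assumes a per-race "Score" column and the identifiers HorseID and RaceDate, none of which are named in the original; the same helper would apply to jockeys and trainers by changing the key.

```python
import pandas as pd

# Toy history; HorseID, RaceDate, and Score are hypothetical column names.
history = pd.DataFrame({
    "HorseID": ["A", "A", "A", "B", "B"],
    "RaceDate": pd.to_datetime(
        ["2021-01-01", "2021-02-01", "2021-03-01", "2021-01-01", "2021-03-01"]
    ),
    "Score": [1.0, 0.0, 1.0, 0.0, 0.5],
}).sort_values(["HorseID", "RaceDate"])

def prior_mean(df: pd.DataFrame, key: str, value: str) -> pd.Series:
    """Mean of `value` over each participant's earlier races only."""
    # shift(1) drops the current race, so the first race of each group is NaN.
    return df.groupby(key)[value].transform(
        lambda s: s.shift(1).expanding().mean()
    )

history["Horse_AvgPastScore"] = prior_mean(history, "HorseID", "Score")
```

A participant's first recorded race has no history, so its feature is NaN, which matches the missing-value handling described under Data Cleaning.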

Field Condition Combinations:

Analyzed various field conditions, including track type, weather, and race characteristics, and combined them with performance metrics. This approach allowed for a more dynamic and context-sensitive prediction model.
Engineered complex variables like 'Field Competitiveness' and 'Horse's Preferred Surface Type', which integrated various aspects of race conditions and participant history.
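The exact definitions of these engineered variables are not given here, so the following is only a sketch under assumed definitions: field competitiveness is taken as the mean past score of the other runners in the race, and a horse's preferred surface as the surface with its highest mean past score. All column names and surface labels are hypothetical.

```python
import pandas as pd

# Current-race rows with a precomputed past-performance score (hypothetical).
runs = pd.DataFrame({
    "RaceID": [1, 1, 1, 2, 2],
    "HorseID": ["A", "B", "C", "A", "B"],
    "Surface": ["Sand", "Sand", "Sand", "Grass", "Grass"],
    "AvgPastScore": [0.6, 0.3, 0.9, 0.6, 0.3],
})

# Assumed definition: mean past score of the other runners in the same race.
# A race with a single runner would divide by zero and needs separate handling.
grp = runs.groupby("RaceID")["AvgPastScore"]
runs["Field_Competitiveness"] = (
    (grp.transform("sum") - runs["AvgPastScore"]) / (grp.transform("count") - 1)
)

# Historical results used to infer each horse's preferred surface.
past = pd.DataFrame({
    "HorseID": ["A", "A", "A", "B", "B"],
    "Surface": ["Sand", "Grass", "Sand", "Grass", "Grass"],
    "Score": [1.0, 0.2, 0.8, 0.5, 0.7],
})
mean_by_surface = past.groupby(["HorseID", "Surface"], as_index=False)["Score"].mean()
best_rows = mean_by_surface.loc[mean_by_surface.groupby("HorseID")["Score"].idxmax()]
preferred = best_rows.set_index("HorseID")["Surface"]

# Flag whether the current race is on the horse's preferred surface.
# Horses without history (here "C") get 0.
runs["On_Preferred_Surface"] = (
    runs["HorseID"].map(preferred) == runs["Surface"]
).astype(int)
```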

Model Training:

Selected XGBoost for model training, based on a balance of performance and computational feasibility.
Employed feature selection and parameter tuning focused on log loss optimization.

Output Post-Processing:

Scaled model probabilities at the RaceID level and processed outputs for win predictions.
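A minimal sketch of this post-processing, assuming that "scaling" means normalizing the raw probabilities so they sum to 1 within each race and that the runner with the highest scaled probability is the predicted winner. The column names other than RaceID are hypothetical.

```python
import pandas as pd

# Raw per-runner win probabilities from the model (toy values).
preds = pd.DataFrame({
    "RaceID": [1, 1, 1, 2, 2],
    "HorseID": ["A", "B", "C", "D", "E"],
    "RawProb": [0.2, 0.1, 0.1, 0.3, 0.1],
})

# Normalize within each race so the probabilities sum to 1.
race_total = preds.groupby("RaceID")["RawProb"].transform("sum")
preds["ScaledProb"] = preds["RawProb"] / race_total

# Exactly one predicted winner per race: the row with the highest probability.
# idxmax returns the first maximum, so ties resolve to the earliest row.
winner_rows = preds.groupby("RaceID")["ScaledProb"].idxmax()
preds["PredictedWin"] = 0
preds.loc[winner_rows.values, "PredictedWin"] = 1
```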

Performance Metrics:

Evaluated the model using accuracy, precision, recall, and log loss score.
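The four metrics can be computed from row-level labels and predictions as in the sketch below, which uses toy values rather than the project's data. With one predicted winner and one actual winner per race, every missed race contributes one false positive and one false negative, so precision and recall come out equal.

```python
import numpy as np

def evaluate(y_true, y_pred, y_prob, eps: float = 1e-15) -> dict:
    """Accuracy, precision, recall, and log loss for binary win labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "log_loss": float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))),
    }

# Two toy races of three runners each: the first is called correctly,
# the second is not.
y_true = [1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0]
y_prob = [0.5, 0.25, 0.25, 0.5, 0.3, 0.2]
metrics = evaluate(y_true, y_pred, y_prob)
```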

Results

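Achieved an accuracy of 0.893, precision and recall of 0.295, and a log loss of 0.231. Because most runners in any race do not win, accuracy is dominated by correctly identified non-winners. Precision and recall coincide, which is consistent with one predicted winner and one actual winner per race.

Correctly predicted the winner in 29.5% of races (631 of 2,140).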

The results point to the value of the engineered features, with the model's top pick winning nearly three races in ten.