On hearing what I do, people tend to ask me the same question: Can you develop a system that predicts football match results? Or Olympic medal outcomes? Personally, I don’t put much faith in predictions. Still, if we had a large amount of historical data and relevant indicators, we could certainly design a system to help us come up with more accurate assumptions. In this article, we’ll consider a model that can store the results of matches and tournaments.
This model is primarily focused on European football (soccer) matches, statistics, and results, but it could easily be tweaked to accommodate many other sports. My main motivation for this article was this year’s two big football events: the UEFA Euro 2016 Championship that just happened, and the 2016 Summer Olympic Games that are happening right now.
What Do We Know Before the Tournament Starts?
Before the tournament starts, we know almost everything about it — except the most important thing: who will win. Let’s briefly state exactly what we already know:
- The dates the tournament starts and ends
- The locations where the matches will take place
- The exact times the matches will start
- Which teams have qualified for the tournament
- The players on each of these teams
- Each player’s past performance and their current form
What Match Details Do We Want to Store?
Tournaments consist of multiple matches. Before we store any match details we need to:
- Relate each match with the tournament
- Record the tournament stage when the match was played (e.g. group stage, semi-finals)
We also need to store details for single matches, including:
- The teams involved in the match
- Starting lineups and substitutions
- Match events (in football these are: goal, penalty, foul, yellow card, etc.)
- Final score
- Players’ actions during the match
We’ll use this data to capture all important match events. Comparing a player’s performance before and during the match could lead to certain conclusions. Maybe we wouldn’t be able to predict the final results of their performance (i.e. a win or a loss), but statistics certainly could help us make assumptions with a degree of reliability.
Introducing the Model
The model is divided into four main areas:
Indicators and Performance
The tables outside these areas are dictionaries (
position), catalogues (
player) and a single many-to-many relation (
We’ll describe the uncategorized tables first, and then take a close look at each area.
The Uncategorized Tables
These tables are important because tables from all four areas use them as dictionaries or catalogues.
sport table lists all the sports we’ll store in our database. We’ll probably have only one sport here, men’s football, but this table gives us the flexibility to add similar sports (e.g. women’s football) if needed.
sport_event table, we’ll store the events connected with our sport(s). One example would be the “2016 Olympic Games”.
phase table is a dictionary that holds all possible tournament stages. It contains values like “group stage”, “round of 16”, “quarter-finals”, “semi-finals”, “final”.
team table is, as you’d guess, a simple list of all teams. Possible values are “Croatia”, “Poland”, “USA” etc. If we use the database to store information about club or league competition, we would also have values like “Barcelona”, “Real Madrid”, “Bayern”, “Manchester United” etc.
player table, we’ll store records for all players belonging to the relevant teams.
plays table is our sole many-to-many relation, and it relates players and teams. A player can belong to more than one team at the same time ( e.g. the national team and a club), but during a tournament they will obviously play for only one team.
Finally, we have the
position table. This simple dictionary will store a list of all the required positions. In football, these include goalkeeper, center-half, striker, etc.
Note: If you just want to store the results of single matches, you don’t need to use this section.
A tournament consists of more than one match; both UEFA Euro 2016 and the football events in the 2016 Summer Olympics are tournaments. As we said before, we can store a single match in our database, but we can also relate matches to their relevant tournaments. The tables in the Tournament section are:
tournament– This contains all the basic tournament data: the sport, start date, end date, etc. We also need to store the tournament name and a description of where it’s taking place. The
sport_event_idattribute is optional because a tournament doesn’t have to be associated with a larger event (such as the Olympics).
group– This lists all the groups in that tournament. UEFA Euro 2016 had six groups, A to F.
participant– These are the teams playing in the tournament; each participant can be assigned to a group. Most tournaments begin with a group stage and then continue to a knockout stage (e.g. UEFA Euro, UEFA World Cup, Olympic football). Some tournaments will have only a group stage (e.g. national leagues), while others will have only a knockout stage (e.g. national cups).
in_team– This table provides a many-to-many relation that stores information about the players registered for that tournament and their expected positions.
tournament_schedule– In my opinion, this is the most interesting table in this section. The list of all games played during this tournament is stored here. The
tournament_idattribute denotes which tournament each match belongs to, and the
phase_idattribute defines the phase during which the match will take place. We’ll also store the match location and the time when it begins. Both participants will be described by text fields. When the group stage finishes, we’ll know all the matchups for the elimination round. For example, at the beginning of UEFA Euro 2016, we knew that winner of Group E (1E) will play against the Group D runner-up (2D). After all three rounds in the group phase were played, this pair was Italy vs. Spain.
Match details area is used to store data for single matches. We’ll use two tables:
match– This contains all the details about a single match; this match can be related to a tournament, but it could also be a single game. So the
tournament_schedule_idattribute is optional, and we’ll store the
locationattributes again here. If the match is part of a tournament, then
tournament_schedule_idwill be assigned a value. The
team_2_idattributes are references to the teams involved in the match. The
goals_team_2attributes contain the result of the match. They are mandatory and should have “0” as the default value for both.
in_match– This table is a list of all the players who are registered for that match; players who don’t participate will have a NULL in the
started_atattribute, while players that came in as substitutions will have
started_at> 0. If a player was replaced, they’ll have an
ended_atattribute that matches the
started_atattribute of the player who replaced them. If the player stayed in for the entire match, their
ended_atattribute will have the same value as the
This section is intended to store all the details or events that happened during the game. And the tables are:
event– This is a dictionary that lists all the events we want to store. In football, these are values like “foul committed”, “foul suffered”, “yellow card”, “red card”, “free kick”, “penalty”, “goal”, “offside”, “substitution”, “player ejected from match”.
match_event– This relates events with the match. We’ll store the
event_timeas well as player information related to that event (
related_event– This is what brings event information together. To explain, let’s look at an example when Player A fouls Player B. We’ll insert a record in the
match_eventtable that indicates that Player A committed a foul and another one that indicates that Player B suffered a foul. We’ll also add a record to the
related_eventtable, where the ‘committed foul’ will be the parent and the ‘suffered foul’ will be the child. We’ll also record the results of the foul: a yellow card, a free kick or a penalty kick, and maybe a goal.
Indicators and Performance
This section should help us analyze players and teams before and after the match.
indicator table is a dictionary with a predefined set of indicators for each player before each match. These indicators should describe the player’s current form. This list could contain values like: “number of goals in last 10 matches”, “average distance covered in last 10 matches”, “number of saves for GK in last 10 matches”.
performance dictionary is very similar to
indicator, but we’ll use it to store only values that are related to the single match: “distance covered”, “accurate passes”, etc.
performance_indicator tables share an almost identical structure:
in_match_id– refers to the player taking part in a certain match
performance_id– references the
indicatoror ”performance dictionaries
value– stores the value for that indicator (e.g. a player covered 10.72 km distance)
description– holds an additional description, if needed
What Happened During the Match?
With all this data entered in, we could easily get match details, events, and statistics for every match in our database.
This simple query would return basic details for an upcoming match:
SELECT team_1.`team_name`, team_2.`team_name`, `match`.`start_time`, `match`.`location` FROM `match`, `team` AS team_1, `team` AS team_2 WHERE `match`.`team_1_id` = team_1.`id` AND `match`.`team_2_id` = team_2.`id`
To get a list of all the in-play events during a certain match, we would use the query below:
SELECT `event`.`event_name`, `match_event`.`event_time`, `player`.`first_name`, `player`.`last_name` FROM `match`, `match_event`, `event`, `in_match`, `player` WHERE `match_event`.`match_id` = `match`.`id` AND `event`.`id` = `match_event`.`event_id` AND `in_match`.`id` = `match_event`.`in_match_id` AND `player`.`id` = `in_match`.`player_id` AND `match`.`id` = @match ORDER BY `match_event`.`event_time` ASC
There are numerous additional queries that I can think of; it’s easy to do an analysis when you have the data. If you’ve measured and stored a large number of indicators and player performance data, you might be able to relate these parameters with a final result. I personally don’t believe in such predictions; there’s the luck factor during matches, plus numerous other factors you can’t know until the game starts. Still, if you have a large dataset and a lot of parameters, your chance of making more accurate predictions increases.
The model presented in this article allows us to store matches, match details, and a history of each player’s performance. We can also set form indicators for each player before the match. Storing enough details should provide us with more parameters on which to base our assumptions. I’m not saying that we could predict the result of the game, but we could have some fun with it.
We could also easily tweak this model to store data for other sports. These changes shouldn’t be too complex. Adding a
sport_idattribute to the dictionaries should do the trick. Still, I think it would be wise to have a new instance for each different sport.