Walkthrough of data wrangling project with Python
Objective of the project:
- Transform the JSON files to dataframes to facilitate easy exploration and analysis.
- Automate the data wrangling steps so that they can be applied to multiple datasets.
- Develop query methods to help producing aggregated data by team or player.
- Adopting object oriented programming to help managing the codes.
I used python for the project. The codes are available in the following GitHub repository.
https://github.com/MuhammadMarufSazed/Data_wrangling_soccer_with_Python
Overview of the dataset: The spatio-temporal data set for soccer competitions (https://springernature.figshare.com/articles/dataset/Metadata_record_for_A_public_data_set_of_spatio-temporal_match_events_in_soccer_competitions/9711164) is a dataset that incudes several JSON files representing several European soccer competitions. The datasets consist information related to different events, foul, goal, pass, assist, etc., in soccer matches in these competitions. For demonstration of data wrangling steps, I used the dataset that has events data for Spanish league (file name events_Spain.json).
Interested reader can also go through the paper (link below) for understanding the methodology and analysis. I would like to thank the authors Luca Pappalardo, Paolo Cintia, Alessio Rossi, Emanuele Massucco, Paolo Ferragina, Dino Pedreschi, Fosca Giannotti for the wonderful paper and making the dataset publicly available.
Paper: https://www.nature.com/articles/s41597-019-0247-7
Structure of the given datatset: The json file is stored as a list in python. The list contains 628659 dictionaries that represent a single event.
with open("events/events_Spain.json") as f:
data_spain = json.load(f)
Each of the event dictionary contains several information about the event. The eventName identifies the nature of the event: whether it is a pass or foul or something else. The subEventName narrows further by specifying the nature of the Pass. In this example it is a cross. The tags represent comment about the event. In this case the tags mean that it was a left footed accurate high cross. For other events, e.g. foul, we have a completely different set of tags (not necessarily length of 3) with different values of the ids of these tags.
Data Wrangling-First Step: Since each of these dictionaries have the same set of keys: eventId, subEventName, tags, etc., we will combine these multiple dictionaries to make a single dictionary.
The purpose of combining the dictionaries into a single dictionary is to make it easy for us to convert it into a pandas dataframe. At this stage, apart from the list corresponding to tags, all the other lists contain one element per event.
Data Wrangling-Second Step: We convert the dictionary into a pandas dataframe. However, the resulting datafrme does not allow us utilize the information in tags column that easily. So, we go to the third step.
Data Wrangling-Third Step: Then added columns for each tag name to make sure that we get to use the useful information that is available inside tags. To implement these steps, I developed an event class (a python class) to make it easy to manage the codes and use it for datasets for rest of the other competitions.
At this stage we have a dataframe of shape (628659, 72). Now we have a dataframe that we can use to extract data in a much more easier way.
Exploration of the dataframe with query methods: I am assuming that a lot of the curiosity at this point will be around the team level or player level data aggregated over all the matches. For this, we will have to do a lot of filtering and aggregation (sum/count) over multiple columns. It also might be that after doing a filtering for a certain set of events, we might be interested to check how teams or players did with respect to some other event or subevent. To make this whole process easier, I developed a few “query” methods inside the events class that can be used to get those results easily and quickly. We can search for different events and subevents along different sets of tags. All these queries will produce dataframes corresponding to a specific query. Then we can combine these different dataframes to make a large dataframe with information that we are interested in at team level or player level. However, if we do not want to work with the aggregated data, say we want to analyze the sequence of events related to Lionel Messi, we can certainly do that by using the dataframe that we got after the third step.
Conclusion: Each of the datasets contain rich set of information that can be used for machine learning, cluster analysis, data exploration and visualization, and statistical modelling. Data wrangling is an important step for working with these kinds of datasets. The query methods allow user to query for different information according to their need. Adopting an object oriented programming approach helped manage the codes better.
Appendix:
Demonstration of query methods:
spain_object = event(data_spain, )
#transforming the given list of dictionary to dataframe
spain_object = events(data_spain, ‘Spain’, tags_name, events_label)
spain_object.process(teams_df, players_df)
Question: How many goals did different teams scored from Shot?
#query
by = ‘team’
tags = [‘Goal’]; events = [‘Shot’]
spain_object.query_eventsub_tag(tags, events = events, subevents = None, by = by, output = True, combine = False)
Question: How many assists did different teams produced from Pass and Free Kicks separately?
#query
by = ‘team’
tags = [‘assist’, ]; events = [‘Pass’, ‘Free Kick’]
spain_object.query_eventsub_tag(tags, events = events, subevents = None, by = by, output = True, combine = False)
#Setting combine = False will produce a different dataframes for different events.
Question: How many assists did different teams produced in total from Pass and Free Kicks together?
#query
by = ‘team’
tags = [‘assist’, ]; events = [‘Pass’, ‘Free Kick’]
spain_object.query_eventsub_tag(tags, events = events, subevents = None, by = by, output = True, combine = True)#Setting combine = true will produce a dataframe that adds the individual dataframes for Pass and Free Kick.
Task: Generate a single dataframe that has team-wise aggregate information for different events, subevents, and tags.
#Multiple queries
by = ‘team’
tags = [‘Goal’, ‘assist’, ‘keyPass’, ‘opportunity’, ‘counter_attack’]; events = [‘Pass’, ‘Shot’, ‘Free Kick’]
spain_object.query_eventsub_tag(tags, events = events, subevents = None, by = by, output = False, combine = True) #1st query
tags = [‘lost’, ‘won’, ‘sliding_tackle’]; events = [‘Duel’]
spain_object.query_eventsub_tag(tags, events = events, subevents = None, by = by, output = False, combine = False)
tags = [‘accurate’]; subevents = [‘Corner’]
spain_object.query_eventsub_tag(tags, events = None, subevents = subevents, by = by, output = False, combine = False) #2nd query
spain_object.query_eventOrSub([‘Air duel’, ‘Cross’, ‘Free Kick’], output = False, level = ‘subevent’) #3rd query#create a combined dataframe from the list generated in the previous step
spain_team_data = spain_object.concat_df()