Walkthrough of data wrangling project with Python

Muhammad Maruf Sazed
Analytics Vidhya
Published in
6 min readDec 27, 2020

--

Photo by Sai Kiran Anagani on Unsplash

Objective of the project:

  1. Transform the JSON files to dataframes to facilitate easy exploration and analysis.
  2. Automate the data wrangling steps so that they can be applied to multiple datasets.
  3. Develop query methods to help producing aggregated data by team or player.
  4. Adopting object oriented programming to help managing the codes.

I used python for the project. The codes are available in the following GitHub repository.

https://github.com/MuhammadMarufSazed/Data_wrangling_soccer_with_Python

Overview of the dataset: The spatio-temporal data set for soccer competitions (https://springernature.figshare.com/articles/dataset/Metadata_record_for_A_public_data_set_of_spatio-temporal_match_events_in_soccer_competitions/9711164) is a dataset that incudes several JSON files representing several European soccer competitions. The datasets consist information related to different events, foul, goal, pass, assist, etc., in soccer matches in these competitions. For demonstration of data wrangling steps, I used the dataset that has events data for Spanish league (file name events_Spain.json).

Interested reader can also go through the paper (link below) for understanding the methodology and analysis. I would like to thank the authors Luca Pappalardo, Paolo Cintia, Alessio Rossi, Emanuele Massucco, Paolo Ferragina, Dino Pedreschi, Fosca Giannotti for the wonderful paper and making the dataset publicly available.

Paper: https://www.nature.com/articles/s41597-019-0247-7

Structure of the given datatset: The json file is stored as a list in python. The list contains 628659 dictionaries that represent a single event.

with open("events/events_Spain.json") as f:
data_spain = json.load(f)
A python list of length 628659 representing events (each in dictionary format) occurred during all the matches in the league.
The dictionary represents a single event, which is a Duel, and the SubEventName identifies it as a ground attacking duel. The tag IDs indicate that it was a take on left (id: 503), the duel was won (id: 703), and was accurate (id: 1801).
Example of an event in python dictionary format. The dictionary represents a single event, which is a Pass, and the SubEventName identifies it as a Cross. The tag IDs indicate that it was a left footed (id: 401), high cross (id: 801), and was accurate (id: 1801).

Each of the event dictionary contains several information about the event. The eventName identifies the nature of the event: whether it is a pass or foul or something else. The subEventName narrows further by specifying the nature of the Pass. In this example it is a cross. The tags represent comment about the event. In this case the tags mean that it was a left footed accurate high cross. For other events, e.g. foul, we have a completely different set of tags (not necessarily length of 3) with different values of the ids of these tags.

Data Wrangling-First Step: Since each of these dictionaries have the same set of keys: eventId, subEventName, tags, etc., we will combine these multiple dictionaries to make a single dictionary.

First step: we want to combine all the dictionaries into single dictionary. Each key of the resulted dictionary has a list of length 628659, which is the length of the number of event in data_spain list.

The purpose of combining the dictionaries into a single dictionary is to make it easy for us to convert it into a pandas dataframe. At this stage, apart from the list corresponding to tags, all the other lists contain one element per event.

Data Wrangling-Second Step: We convert the dictionary into a pandas dataframe. However, the resulting datafrme does not allow us utilize the information in tags column that easily. So, we go to the third step.

Second step: the combined dictionary is converted to a pandas dataframe. As we can see, the tags column has elements that have multiple dictionaries representing the tags associated with that specific event. Moreover, the length and values of these dictionaries are different. Shape of the dataframe: (628659, 12)

Data Wrangling-Third Step: Then added columns for each tag name to make sure that we get to use the useful information that is available inside tags. To implement these steps, I developed an event class (a python class) to make it easy to manage the codes and use it for datasets for rest of the other competitions.

Third step: Combined all the possible tag name across different events and converted them into columns. The columns take values 0 or 1, where 1 means the tag is associated with the event and 0 means not. So, if for a specific event, “Pass”, the Goal column has 1 as value, that would mean that the shot resulted in a goal. This dataframe was combined with the dataframe that we got from the previous step. Shape of the datarame: (628659, 72)

At this stage we have a dataframe of shape (628659, 72). Now we have a dataframe that we can use to extract data in a much more easier way.

Exploration of the dataframe with query methods: I am assuming that a lot of the curiosity at this point will be around the team level or player level data aggregated over all the matches. For this, we will have to do a lot of filtering and aggregation (sum/count) over multiple columns. It also might be that after doing a filtering for a certain set of events, we might be interested to check how teams or players did with respect to some other event or subevent. To make this whole process easier, I developed a few “query” methods inside the events class that can be used to get those results easily and quickly. We can search for different events and subevents along different sets of tags. All these queries will produce dataframes corresponding to a specific query. Then we can combine these different dataframes to make a large dataframe with information that we are interested in at team level or player level. However, if we do not want to work with the aggregated data, say we want to analyze the sequence of events related to Lionel Messi, we can certainly do that by using the dataframe that we got after the third step.

Conclusion: Each of the datasets contain rich set of information that can be used for machine learning, cluster analysis, data exploration and visualization, and statistical modelling. Data wrangling is an important step for working with these kinds of datasets. The query methods allow user to query for different information according to their need. Adopting an object oriented programming approach helped manage the codes better.

Appendix:

Demonstration of query methods:

spain_object = event(data_spain, )

#transforming the given list of dictionary to dataframe

spain_object = events(data_spain, ‘Spain’, tags_name, events_label)
spain_object.process(teams_df, players_df)

Question: How many goals did different teams scored from Shot?

#query

by = ‘team’
tags = [‘Goal’]; events = [‘Shot’]
spain_object.query_eventsub_tag(tags, events = events, subevents = None, by = by, output = True, combine = False)

Question: How many assists did different teams produced from Pass and Free Kicks separately?

#query

by = ‘team’
tags = [‘assist’, ]; events = [‘Pass’, ‘Free Kick’]
spain_object.query_eventsub_tag(tags, events = events, subevents = None, by = by, output = True, combine = False)

#Setting combine = False will produce a different dataframes for different events.

Number of assists from Pass and Free Kicks for different teams aggregated for the whole season

Question: How many assists did different teams produced in total from Pass and Free Kicks together?

#query

by = ‘team’
tags = [‘assist’, ]; events = [‘Pass’, ‘Free Kick’]
spain_object.query_eventsub_tag(tags, events = events, subevents = None, by = by, output = True, combine = True)

#Setting combine = true will produce a dataframe that adds the individual dataframes for Pass and Free Kick.

Number of total assists from Pass and Free Kicks for different teams aggregated for the whole season

Task: Generate a single dataframe that has team-wise aggregate information for different events, subevents, and tags.

#Multiple queries
by = ‘team’
tags = [‘Goal’, ‘assist’, ‘keyPass’, ‘opportunity’, ‘counter_attack’]; events = [‘Pass’, ‘Shot’, ‘Free Kick’]
spain_object.query_eventsub_tag(tags, events = events, subevents = None, by = by, output = False, combine = True) #1st query
tags = [‘lost’, ‘won’, ‘sliding_tackle’]; events = [‘Duel’]
spain_object.query_eventsub_tag(tags, events = events, subevents = None, by = by, output = False, combine = False)
tags = [‘accurate’]; subevents = [‘Corner’]
spain_object.query_eventsub_tag(tags, events = None, subevents = subevents, by = by, output = False, combine = False) #2nd query
spain_object.query_eventOrSub([‘Air duel’, ‘Cross’, ‘Free Kick’], output = False, level = ‘subevent’) #3rd query

#create a combined dataframe from the list generated in the previous step
spain_team_data = spain_object.concat_df()

Combining dataframes resulted from different queries.

--

--