Muhammad Maruf Sazed
4 min read · Sep 4, 2021

Can we tell whether the performances of two teams or players are really different? Is there a way to use a scientific approach to get the answer? In the era of big data, it is easy to crunch numbers, create great visualizations, and come up with KPIs that show the difference between teams or players. Websites and television channels are experts at that. However, the interpretation can be subjective, and in some cases it might not be possible to interpret the numbers or visualizations correctly. What if there were ways to interpret the numbers that leave less room for subjectivity? Here I will try a method that was first introduced to address problems in astronomy: the Peacock test. The idea is to compare two multivariate samples and check whether they come from the same distribution.

Comparing Barcelona and Levante’s passing: Can we come to an objective conclusion based on visualization?

The basic question: is there any difference between the distributions of passes of the two teams? For demonstration, I am going to compare Barcelona’s and Levante’s passing styles for the 2017–18 season. To be specific, the comparison is based on the location of passes: the origin of each pass (in xy-coordinates) and its destination (in xy-coordinates). So a pass is represented by two points, e.g., point A = (50, 50) and point B = (55, 70), where A is the origin of the pass and B is the destination. To simplify the analysis, I subtracted the coordinates of point A from those of point B, which gives the point (5, 20). This is equivalent to shifting every pass so that it originates from (0, 0): in the example, we pretend that the pass started at (0, 0) and reached (5, 20). The shift discards where on the pitch the pass started but keeps its direction and length, so hopefully we do not lose much information. We plot this data for Barcelona and also calculate the regions that contain most of the passes.
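The translation step can be sketched in a few lines of Python. The function name `to_displacement` and the tuple layout are illustrative choices, not taken from the original analysis.

```python
def to_displacement(origin, destination):
    """Shift a pass so that it starts at (0, 0) and return its end point."""
    ox, oy = origin
    dx, dy = destination
    return (dx - ox, dy - oy)

# The example from the text: a pass from A = (50, 50) to B = (55, 70).
passes = [((50, 50), (55, 70))]
displacements = [to_displacement(a, b) for a, b in passes]
print(displacements)  # [(5, 20)]
```

Each team then becomes a two-dimensional sample of such displacement points, which is the form the test below works on.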

The red regions contain the majority of the passes. The smallest red region contains 90% of the passes, the next larger one contains 95%, and the largest contains 99%. The blue points represent all the passes by Barcelona in that season.

Then we do the same for the Levante data. We can see that Levante’s distribution is slightly more dispersed than Barcelona’s (looking at the region that contains 99% of the data). But what if somebody says that this difference is not really noticeable? Or what if the difference is due to chance alone? The bottom line is that our conclusion should rest on objective reasoning. This leads us to the Peacock test.

The Peacock Test: Can we conclude that Barcelona’s and Levante’s passing is significantly different?

At this point we have turned our problem, which is related to soccer, into a statistical/scientific problem. If we do the analysis well, the conclusion is going to be more objective than one based on the data visualizations alone. We acknowledge that two samples will always differ somewhat due to randomness. The question is whether there is strong enough evidence in the data to suggest that the two teams differ in their passing. Since we do not know the distributions, we have to use a non-parametric method. Essentially, we are going to compare two two-dimensional samples without assuming a parametric form for their distributions. We will do that with the help of the Peacock test. We first use the R package Peacock.test (function: peacock2) to calculate the test statistic D. Then we calculate Z = sqrt(n1*n2/(n1+n2))*D, where n1 and n2 are the numbers of passes in the two samples. The p-value is then computed from Z.
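To make D and Z concrete, here is a minimal brute-force sketch in Python. It is not the implementation used by the Peacock.test package. It assumes the usual description of the two-dimensional statistic: at every grid point formed from the pooled x and y values, compare the fractions of each sample falling in the four quadrants around that point, and take the largest absolute difference. The function names and the treatment of points lying exactly on a quadrant boundary are assumptions made for illustration.

```python
import math

def quadrant_fractions(points, x, y):
    """Fractions of points in the four quadrants around (x, y)."""
    n = len(points)
    lower_left = sum(1 for px, py in points if px <= x and py <= y)
    upper_left = sum(1 for px, py in points if px <= x and py > y)
    lower_right = sum(1 for px, py in points if px > x and py <= y)
    upper_right = n - lower_left - upper_left - lower_right
    return (lower_left / n, upper_left / n, lower_right / n, upper_right / n)

def peacock_d(sample1, sample2):
    """Largest quadrant-fraction difference over the pooled coordinate grid."""
    pooled = list(sample1) + list(sample2)
    xs = sorted({p[0] for p in pooled})
    ys = sorted({p[1] for p in pooled})
    d = 0.0
    for x in xs:
        for y in ys:
            f1 = quadrant_fractions(sample1, x, y)
            f2 = quadrant_fractions(sample2, x, y)
            d = max(d, max(abs(a - b) for a, b in zip(f1, f2)))
    return d

def scaled_statistic(d, n1, n2):
    """Z = sqrt(n1*n2/(n1+n2)) * D, as in the text."""
    return math.sqrt(n1 * n2 / (n1 + n2)) * d

# Toy data (made up): two clearly separated clouds of displacement points.
team_a = [(0, 0), (1, 1)]
team_b = [(10, 10), (11, 11)]
d = peacock_d(team_a, team_b)
z = scaled_statistic(d, len(team_a), len(team_b))
```

This version checks every grid point against every pass, so its cost grows roughly with the cube of the sample size. It is only meant for small toy inputs; for a full season of passes a dedicated implementation such as the R function mentioned above is the practical choice.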

In our case, the p-value is 8.218657e-57, which is practically zero. Therefore, we can conclude that there is a statistically significant difference between the distributions of passes of Barcelona and Levante.
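One way to turn Z into a p-value is the large-sample approximation from Peacock’s paper, P(>Z) ≈ 2 exp(−2 (Z − 0.5)^2). Treat this as an assumption about the calculation rather than a confirmed description of it; the approximation is meant for the small-p tail, and the function name below is hypothetical.

```python
import math

def peacock_p_value(z):
    """Approximate tail probability for the scaled statistic Z.

    Assumes the large-sample form 2 * exp(-2 * (Z - 0.5)**2), capped at 1
    because the raw expression exceeds 1 for small Z.
    """
    return min(1.0, 2.0 * math.exp(-2.0 * (z - 0.5) ** 2))

# The p-value shrinks very quickly as Z grows.
for z in (1.0, 2.0, 4.0, 8.0):
    print(z, peacock_p_value(z))
```

Because the exponent grows with the square of Z, even moderately large values of Z push the p-value extremely close to zero, which is the regime we are in here.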

Conclusion

The test result suggests that the two teams’ pass distributions are significantly different. The same technique can be used to compare Barcelona’s performance from one season to the next (for example, under a different manager with a different playing style). It can also be used to see whether a team’s or player’s performance is different against certain opponents.

We always have tools and techniques at our disposal. The important question is how to set up a problem so that it becomes suitable for the tool or technique that we want to use. This is also an example of how a technique developed in one field can be applied in a completely different field if the necessary adjustments are made.
