Seaborn Visualizations Tutorial

Andrew Cole
Mar 22, 2020

A walkthrough of Seaborn’s toolbox using NHL Statistics

If you’re like me, a world without sports is basically no world at all. However, enter 2020 and the time of COVID-19, and here we are, watching replays of the 2003 NCAA Tournament’s second round pretending that we are just as invested as if it were the 2020 tournament (which should be happening as I type this). This unfortunate pandemic also means that we are missing my personal favorite time of the year, the NHL playoffs. So, to make up for the lack of life which sports brings to so many of us, I decided to put together an overview of something which brings life to a select few of us, NHL statistics data.

Seaborn is one of Python’s most powerful and essential visualization packages, and there are endless possibilities for telling visual stories through your data. All NHL data was gathered from MoneyPuck.com. The GitHub repository for this notebook can be found here.

The Data

We begin by cleaning the data a little. We select skater data from all situations (5v5, man advantage, shorthanded, etc.). Next, because 31 NHL teams is a lot to handle for instructional purposes, we limit the data to teams in the Central Division: Chicago Blackhawks, Nashville Predators, St. Louis Blues, Colorado Avalanche, Minnesota Wild, Winnipeg Jets, and Dallas Stars.

The DataFrame we will be left working with looks like this:

We have the statistics from 200 players with 153 statistic features. We will only be focusing on basic statistics like goals, points, penalties, etc.
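
The cleaning step described above might look roughly like this in pandas. This is only a sketch: the column names (`situation`, `team`), the `"all"` label, and the team abbreviations are assumptions for illustration and have not been checked against the MoneyPuck file, and the tiny hand-made DataFrame stands in for the real download.

```python
import pandas as pd

# Stand-in rows; the real notebook would read the MoneyPuck CSV instead.
# Column names, labels, and values here are assumed placeholders.
raw = pd.DataFrame({
    "name": ["Skater A", "Skater A", "Skater B", "Skater C"],
    "team": ["CHI", "CHI", "BOS", "DAL"],
    "situation": ["all", "5on5", "all", "all"],
    "points": [30, 21, 12, 8],
})

central = ["CHI", "NSH", "STL", "COL", "MIN", "WPG", "DAL"]

# Keep the all-situations rows, then keep only Central Division teams.
skaters = raw[(raw["situation"] == "all") & (raw["team"].isin(central))]
skaters = skaters.reset_index(drop=True)
print(skaters)
```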

Seaborn

To import the library:

import seaborn as sns

We import seaborn under the alias ‘sns’ to make it shorter to call in the visualizations later on.

Scatter Plots — sns.relplot()

As with any dataset, we want to take a look at statistical relationships. Perhaps the best way of looking at a bivariate relationship is the scatter plot. Each point represents one observation, positioned by its value in one feature along the x-axis and its value in a second feature along the y-axis. Let’s first take a look at how points (1 goal = 1 point, 1 assist = 1 point; 1 goal + 2 assists = 3 points) relate to games played.

We can see that, as you’d intuitively expect, more games played results in more points being scored. This is just your basic scatter plot, so how can we make this graph give us even more information? Let’s add a hue.
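
A basic points-versus-games scatter plot could be sketched like this. The column names `games_played` and `points` are placeholders, and the data is randomly generated just so the snippet runs; it is not the NHL data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; omit in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data, not real NHL statistics.
rng = np.random.default_rng(0)
games = rng.integers(1, 71, size=100)
skaters = pd.DataFrame({
    "games_played": games,
    "points": rng.binomial(games, 0.4),  # roughly 0.4 points per game
})

# relplot draws a scatter plot by default (kind="scatter").
g = sns.relplot(data=skaters, x="games_played", y="points")
g.set_axis_labels("Games played", "Points")
```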

Hue Plot — sns.relplot()

By adding the ‘hue’ argument to our sns.relplot() code, we are able to see how the points/games played distribution looks per team. Can we go even further? This plot only tells us about the points and games played per team, but there’s more information to be learned! What about positions? Who is scoring more points on those individual teams? Let’s add some subplots to find out.
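
Adding the hue is a one-argument change. In this sketch the `team` column and its three abbreviations are placeholders, and the numbers are randomly generated rather than taken from the NHL data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; omit in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data, not real NHL statistics.
rng = np.random.default_rng(1)
games = rng.integers(1, 71, size=90)
skaters = pd.DataFrame({
    "team": np.repeat(["CHI", "NSH", "STL"], 30),
    "games_played": games,
    "points": rng.binomial(games, 0.4),
})

# hue colors each point by its value in the "team" column and adds a legend.
g = sns.relplot(data=skaters, x="games_played", y="points", hue="team")
```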

Now we have the same graphs as above, broken down by team AND position. Each row represents a different team, and each column represents a different position (Offense or Defense). For example, we can see that Chicago has one forward who scores over 80 points in just over 60 games played, while most of Chicago’s defensemen score fewer than 20 points regardless of how many games they play (with one exception).
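
One way to get a grid like this is through the `row` and `col` arguments of sns.relplot(). The sketch below assumes placeholder column names and uses randomly generated data with three teams and two positions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; omit in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data, not real NHL statistics.
rng = np.random.default_rng(2)
games = rng.integers(1, 71, size=90)
skaters = pd.DataFrame({
    "team": np.repeat(["CHI", "NSH", "STL"], 30),
    "position": np.tile(["F", "D"], 45),
    "games_played": games,
    "points": rng.binomial(games, 0.4),
})

# One row of subplots per team, one column per position.
g = sns.relplot(
    data=skaters, x="games_played", y="points",
    hue="team", row="team", col="position", height=2.5,
)
```

With three teams and two positions this gives a 3 x 2 grid of scatter plots that share the same axes.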

Histogram — sns.distplot()

Histograms are one of the most powerful visualizations which any analyst can create. Histograms graphically summarize the distribution of all the data. In a bit simpler terms, a histogram shows how often each value in a dataset will occur. The x-axis contains the variable metric, while the y-axis contains the relative frequency of the observation’s value.

The histogram above shows us that the overwhelming majority of the league scores between 0 and 20 points. The smoothed line is the kernel density estimate (KDE), a technique which estimates the unknown probability distribution of a variable from the samples we already have. In simpler terms, if new player data were introduced to the set, it would most likely fall under the tallest peaks of the smoothed line. The tick marks at the bottom of the graph are known as the rug. The rug simply shows us where the individual observations are located. You can remove both the KDE and the rug from the histogram by setting the kde and rug arguments to False.

Boxplot — sns.catplot(kind = ‘box’)

Another type of plot which helps give us an idea of what our data looks like is the Boxplot. Specifically, boxplots help us identify where the medians, ranges, and variabilities of data lie. The boxplot for points by team can be seen below:

Each box shows the three quartile values of the distribution: the bottom and top edges of the colored box are the first and third quartiles, and the horizontal line through the middle of the box is the median (not the mean) for that group. Points drawn beyond the whiskers are outliers. For example, Colorado has the majority of its point scorers in the range of 0–32. The point above Colorado’s box is an outlier, meaning that the Avalanche have a single scorer with significantly more points than the rest of the team.

To make this plot even more descriptive, we can again add ‘position’ as a hue to show outlier information among teams per position (offense or defense).
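
Here is a sketch of the boxplot with position as the hue, alongside the quartiles that each box summarizes. Column names are placeholders and the data is randomly generated.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; omit in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data, not real NHL statistics.
rng = np.random.default_rng(4)
skaters = pd.DataFrame({
    "team": np.repeat(["CHI", "NSH", "STL"], 30),
    "position": np.tile(["F", "D"], 45),
    "points": rng.exponential(15, size=90).round(),
})

# One box per team and position; the line inside each box is the median.
g = sns.catplot(data=skaters, x="team", y="points", hue="position", kind="box")

# The same quartiles, computed directly for each team.
quartiles = skaters.groupby("team")["points"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```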

Violin Plot — sns.catplot(kind = ‘violin’)

Violin plots are a less popular but even more descriptive visualization method. A boxplot reduces each group to a handful of summary statistics, so it does not show the shape of the data’s distribution: two groups with very different shapes can produce nearly identical boxes. A violin plot draws a kernel density estimate for each group, so changes in the shape of the data (like adding the entire league’s data instead of just the Central Division) show up directly. The violin ‘widens’ wherever there is a higher density of observations around that value.

We can see now that Chicago’s defensemen score in the 0–40 points range, whereas Minnesota’s defensemen score across a much wider range of points. The wider a violin is at a given value, the denser the data is around that value.

We can also create a more condensed version of this plot by adding a ‘split’:

The graph shows the same thing, simplified by combining the defense and offense violins into one. Chicago has forwards who score across a wide range, while its defensemen are all concentrated between 0 and 20 points.
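
A split violin plot can be sketched as follows. The `split=True` option needs a hue with exactly two levels, which is the case for the two positions here. Column names are placeholders and the data is randomly generated.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; omit in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data, not real NHL statistics.
rng = np.random.default_rng(5)
skaters = pd.DataFrame({
    "team": np.repeat(["CHI", "NSH", "STL"], 30),
    "position": np.tile(["F", "D"], 45),
    "points": rng.exponential(15, size=90).round(),
})

# split=True draws the two positions as the two halves of one violin per team.
g = sns.catplot(
    data=skaters, x="team", y="points",
    hue="position", kind="violin", split=True,
)
```

Dropping `split=True` gives the earlier version, with separate forward and defense violins side by side for each team.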

Swarm Plot — sns.catplot(kind = ‘swarm’)

A swarm plot is basically a scatter plot where the x-axis represents a categorical variable, with the points adjusted along that axis so that they do not overlap.

We can see through this swarm plot that Winnipeg has the highest goal scorer of the division, but most of the team’s goal production is clustered below 10. Let’s flip the categories and see goals by position and team.

Now we can see that forwards clearly score more goals than defensemen, and the highest goal scorer to date plays for Winnipeg.
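
Both swarm plots can be sketched with sns.catplot(kind="swarm"): first goals by team, then the flipped version with position on the x-axis and team as the hue. Column names are placeholders and the goal totals are randomly generated.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; omit in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data, not real NHL statistics.
rng = np.random.default_rng(6)
skaters = pd.DataFrame({
    "team": np.repeat(["CHI", "NSH", "STL"], 20),
    "position": np.tile(["F", "D"], 30),
    "goals": rng.poisson(8, size=60),
})

# Goals by team: one swarm of points per team.
g_team = sns.catplot(data=skaters, x="team", y="goals", kind="swarm")

# Flipped: goals by position, colored by team.
g_pos = sns.catplot(data=skaters, x="position", y="goals", hue="team", kind="swarm")
```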

Jitter Plot — sns.catplot(jitter = True)

A jitter plot is very similar to our swarm plots, but it keeps the points a bit more organized. It is a normal dot plot with ‘jitter’ added: a small random offset along the categorical axis that spreads the points apart for better visualization. Let’s use a jitter plot to look at number of penalties by position. We can see that all players who took more than 16 penalties were forwards.
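
Here is a sketch of the jitter plot. sns.catplot() draws a strip plot by default, so only the `jitter` argument is passed. Column names are placeholders and the penalty counts are randomly generated.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; omit in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data, not real NHL statistics.
rng = np.random.default_rng(7)
skaters = pd.DataFrame({
    "position": np.tile(["F", "D"], 40),
    "penalties": rng.poisson(6, size=80),
})

# The default kind is "strip"; jitter=True adds a small random horizontal
# offset so points with the same penalty count do not sit on top of each other.
g = sns.catplot(data=skaters, x="position", y="penalties", jitter=True)

# Collect the x-coordinates actually drawn, to see the effect of the jitter.
ax = g.axes[0, 0]
xs = np.concatenate([np.asarray(c.get_offsets())[:, 0] for c in ax.collections])
print(np.unique(xs).size, "distinct x positions for 2 categories")
```

Passing `jitter=False` instead would stack every point of a category on a single vertical line.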

Jointplot — sns.jointplot()

A jointplot is seaborn’s method of displaying a bivariate relationship alongside the univariate profile of each variable, essentially combining a scatter plot with histograms (without KDE). Let’s take a look at a jointplot to see how number of penalties taken is related to point production.

If we look at the main scatter plot, we can’t really make out much of a pattern. It is intuitive to think that fewer penalties would mean more time spent on the ice, which means more opportunities for scoring. However, the scatter plot itself does not show a strong relationship in either direction. The jointplot also shows each variable’s distribution along the top and right spines. By looking at those, we can see that as the number of penalties increases, fewer players populate those regions, and the same can be said about points. Most players therefore sit at low penalty and low point totals. The marginal distributions alone cannot establish a relationship between the two variables, so at most we can say that any positive relationship here is slight.

Hexplot — sns.jointplot(kind = ‘hex’)

Another way of visualizing a bivariate relationship, in particular when we have a large amount of data, is the hexplot. A hexplot splits the plotting window into several hexbins and then the number of observations which fall into each bin corresponds with a color to indicate density. A darker color hexbin means that there are more observations, or more density, within that region. The observation frequency bar graphs can be seen along the spines as an additional reference for information. We will use a hexplot to analyze how number of goals scored is related to number of shot attempts.

Again, this graph is somewhat intuitive. As the great Wayne Gretzky/Michael Scott once said, “you miss 100% of the shots you don’t take”. We would expect that as shots increase, so does the number of goals, and the hexplot supports this. The darkest hexbin is in the bottom left: it is the densest because scoring a goal in the NHL is no easy feat, so the majority of players are concentrated at low shot attempts and low goals. As we move to the right and upwards, the color slowly fades, which indicates that fewer and fewer players reach those totals, while the bins still trend upward in line with a positive relationship between shot attempts and goals scored.

Kernel Density Estimation — Jointplot — sns.jointplot(kind = ‘kde’)

A similar bivariate plot to the hexbin is the Kernel Density Estimation jointplot. A KDE jointplot also uses color to show where observations are the most dense, but instead of counting them in pre-defined hexbins, it draws a continuous density surface estimated from the observations. Let’s look at the same Shot Attempts/Goals relationship.

We can see that the same information is given to us as in the hexbin plot, but this shows a probabilistic view of where the observations are. Instead of the hexbin showing us that most players fall in the low goals/low shot attempts category, we can see an increasing positive relationship in the probability that more shot attempts will equal more goals, with the largest concentration of players falling in the 0–50 shot attempts range and the 0–5 goals range.

Correlations

A bivariate relationship can tell us a lot, but just looking at the distributions and scatter plots may not be enough to show what is happening underneath the surface numbers of the data. A correlation coefficient measures the degree to which two variables move together linearly; it does not by itself show that one variable influences the other. A perfect positive correlation (1.00) means that the two variables rise and fall together in an exact linear relationship, and -1.00 means that one falls exactly as the other rises. Let’s take a look at how strongly certain variables in the NHL are correlated. Here is the code to check correlations:
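
A correlation table of this kind can be produced with the pandas DataFrame.corr() method, which computes Pearson correlations by default. This is a sketch and not necessarily the exact code used for the table: the column names are placeholders and the data is randomly generated, with `points` built to depend on `shot_attempts` and `hits` generated independently.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data, not real NHL statistics.
rng = np.random.default_rng(11)
shots = rng.gamma(2, 40, size=200).round().astype(int)
skaters = pd.DataFrame({
    "shot_attempts": shots,
    "points": rng.binomial(shots, 0.1),   # depends on shot attempts
    "hits": rng.poisson(40, size=200),    # independent of the other columns
})

# Pairwise Pearson correlation between every pair of numeric columns.
corr = skaters.corr()
print(corr.round(2))
```

Every entry on the diagonal is 1, and the table is symmetric, so only one triangle needs to be read.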

Looking at these numbers shows us a lot. Notice that the diagonal running down and to the right shows the perfect correlation of each variable with itself. We can see that certain variables like points are heavily correlated with shots on goal, shot attempts, and ice time, as the correlation coefficients are all well above 0.5. But statistics and datasets are usually not as intuitive as sports statistics, so let’s see how we can make this correlation table friendlier to the reader.

Heatmap — sns.heatmap()

A heatmap is just a friendlier way of visualizing the correlation table which we produced above. If a correlation coefficient is higher, signaling a stronger correlation between two variables, the color will be darker. Again, look at the dark blue diagonal, which represents the perfect 1:1 correlation between each variable and itself.

This heatmap just draws our eyes in an easier way to the best & worst correlations. For example, we easily see that shot attempts and shots on goal have the strongest correlation (no duh), and hits has the least correlation with goals.
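
A heatmap of a correlation table can be sketched as follows. The "Blues" colormap is a guess based on the dark blue diagonal described above, the fixed -1 to 1 color range is my own choice, and the data is randomly generated with placeholder column names.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data, not real NHL statistics.
rng = np.random.default_rng(12)
shots = rng.gamma(2, 40, size=200).round().astype(int)
skaters = pd.DataFrame({
    "shot_attempts": shots,
    "goals": rng.binomial(shots, 0.05),
    "hits": rng.poisson(40, size=200),
})
corr = skaters.corr()

fig, ax = plt.subplots()
# annot=True writes each coefficient in its cell; vmin/vmax pin the color
# scale to the full correlation range so colors are comparable across plots.
sns.heatmap(corr, annot=True, cmap="Blues", vmin=-1, vmax=1, ax=ax)
```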

Pairplot — sns.pairplot()

Finally, perhaps one of the strongest and most useful tools for any analyst is the Pairplot. A pairplot visualizes the distribution of each single variable as well as the bivariate relationships it has with the other variables. Simply, we will be creating a bivariate scatter plot for every pair of variables in the DataFrame, and then putting them all on one screen.

This behemoth of a graph has a TON going on, but it is also extremely helpful for getting a good overall view of what we are looking for. We read it the same as we would a bivariate scatter plot. If we see that there are strong positive/negative relationships between two variables, we know that those variables and their relationships are worth investigating further.

That’s it! For now…

Seaborn is an incredibly powerful tool for making complex data into easily-digestible information. The possibilities are seemingly endless, but hopefully, this serves as a good starting place for all the possibilities. So, with that, everybody please stay safe, stay healthy, stay inside, and we’ll all turn out alright :).
