Pivoting to Efficient Data Summaries

Andrew Cole
Apr 12, 2020
4 min read

Quick guide for implementing pivot tables to your data

One of the most powerful tools across all professions and industry is the Pivot Table. In most traditional analytics, Microsoft Excel serves as a must-have skill and a pivot table is the core of data exploration. They are dynamic, relatively straight forward, and provide vital summaries at both surface and in-depth levels of the data.

https://giphy.com/gifs/UrbnbuU24p1zgyDMUH/html5

Coming from an excel-based analytical background myself, thinking in terms of pivot tables is always at the front of my thought process when working with data in Python. Using a pivot table in your Pandas DataFrame is actually relatively straightforward and can save lots of time and effort when summarizing your data. Let’s take a closer look at how to implement pivot tables directly into your data.

We will be using a data set which shows historic video game sales for multiple continents. Let’s first import our necessary data libraries and take a look at what data we have:

Importing Packages & Data

Let’s take a look at the snapshot of our DataFrame. We have the rank of the video game name, its platform, the year of record, the game’s genre, the game’s publisher, and the cumulative sales from North America, Europe, Japan, Global, and Other. The sales figures for all of the ‘sales’ are in millions. Looking at our initial DataFrame, we see that Wii Sports is the highest-ranking game with global sales equivalent to 82.74 million.

This dataset, from Kaggle, is already cleaned and full so we don’t have to worry about any data cleaning and can jump right to the pivot tables themselves.

Basic Pivot Tables Using Indexes

The syntax for creating a pivot table is pretty straightforward.

We specify the data, the index, and which columns we want to aggregate. The resulting table appears as:

Reminder: these values are in millions

This is just a snapshot of the first ten years of sales, but we can see how simply and quickly the data is aggregated for the given columns. This DataFrame is now easily digestible and can be visualized as a sales trend over time.

We easily see that, over time, Japan was the most significant purchaser of video games in the world until about 1995 when all other sectors begin to converge in sales. The amount of sales overall decreased significantly after the late 1980s, after the initial wave of video game introductions passed and they became a societal norm.

Let’s make a slight tweak in our table, changing the index from ‘Year’ to ‘Genre’.

By making this simple tweak in the code, we can get a completely different descriptive statistic. Now we can see the success of various video games by genre AND geographic location. We see that Platform & Shooter games are the highest sold video games genre globally. In Japan, Role-Playing games are the most sold.

Multi-Index Pivot Tables

If we want to aggregate across multiple indexes, the pivot function requires just a simple addition to the ‘index’ argument. The only thing we need to make sure of is that we pass in our indices in a list format. Let’s use the pivot table now to take a look at sales by Publisher, and then Genre.

We can see how easily this is done, as we first see an index of Publisher company, and then genre, followed by the respective sales in Europe, Japan, and North America. Note that the index argument takes [‘Publisher’, ’Genre’] as its argument. We specified the ‘values’ argument to only include sales from Europe, Japan, and North America. Naturally, the resulting amounts will be much lower because there are a significant amount of publishers so the sales are much more sparse.

Pivot Tables w/ Different Aggregations

By default, pivot tables will aggregate the values using a sum function. However, we don’t always want this to be the case. Perhaps we want to check averages, medians, or standard deviations. These can all be accomplished with a simple alteration to the ‘aggfunc’ argument in the pivot table syntax.

We enter the ‘aggfunc’ argument as a dictionary, specifying the aggregated column as the key and the aggregate function we want as the value (using numpy functions). For our example, we use mean for Europe sales, sum for Global Sales, and median for Japan sales. Normally, we wouldn’t mix these aggregate functions because it can make for a confusing table, but for the sake of this experiment it will suffice just fine.

Relationships within Pivot Tables

Lastly, let’s take a look at using variable indices on both the x and y axis. We want to try and look at Global Sales per genre, per year. We could create a multi-level index like the previous tables, but because there are so many years and so many genres this would make for a veeeery long and undigestable table. Instead, we specify the columns as a feature index and then declare the value which we are aggregating.

This table now has the year (index argument) on the y-axis, with the genres (columns) on the x-axis. The aggregate function is summing Global Sales. We can easily see now, for example, that Shooter games sold 99.36 million dollars globally in 2011.

Note: the ‘fill_value’ argument can be adjusted to equal 0, NaN, or any other aggregate function like the median or mean of a column. Standard NaN fill process applies.

Hopefully, this will help you summarize and explore your data in a more effective and efficient manner, or even help you pivot (see what I did there) from Excel analytics to Python!

#Statistics #PivotTables #DataAnalysis #ExploratoryDataAnalysis #datascience