
How to Teach Your Panda SQL in 10 Minutes

  • Writer: Andrew Cole
  • Apr 5, 2020
  • 2 min read

A quick overview of integrating SQL queries into your Pandas DataFrames


Managing your data in a comprehensible and organized fashion is an absolutely essential skill for any data scientist or analyst. Pandas DataFrames allow clean views of and operations on very large datasets, and pandas is perhaps one of the most widely used Python packages. SQL is an even more widely applicable tool, used for database management and for grabbing data directly from the tables within a database schema.

Any time that data is being captured, organized, or operated on, there are normally many criteria which must be accounted for. Pandas typically requires more steps, chaining various package methods and functions to clean the data in a multi-step process. SQL queries require a bit more big-picture thinking, as a constructed query grabs data and aggregates it according to your criteria all in one step.
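To make the contrast concrete, here is a minimal sketch using a hypothetical sales table (the column names and data are invented for illustration): the pandas version filters, groups, and aggregates in separate steps, while the SQL version expresses the same criteria in one query run through Python's built-in sqlite3 module.

```python
import pandas as pd
import sqlite3

# Hypothetical sales data for illustration
df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "amount": [100, 250, 80, 300],
})

# Pandas: multiple steps -- filter, then group, then aggregate
filtered = df[df["amount"] > 90]
result_pd = filtered.groupby("region", as_index=False)["amount"].sum()

# SQL: the same criteria expressed in a single query
conn = sqlite3.connect(":memory:")
df.to_sql("sales", conn, index=False)
result_sql = pd.read_sql_query(
    "SELECT region, SUM(amount) AS amount FROM sales "
    "WHERE amount > 90 GROUP BY region",
    conn,
)

print(result_pd)
print(result_sql)
```

Both results contain the same totals; which style reads better depends on how many criteria you are juggling at once.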

As someone who started with Python and then made my way over to SQL, I tend to lean toward pandas when cleaning data, but integrating SQL queries into my data cleaning process has smoothed out several rough edges in my workflow. Let’s take a look at integrating queries into your Pandas DataFrames.

Importing Libraries (sqlite3)

Within the Jupyter Notebook environment, my preferred engine for SQL querying is ‘sqlite3’, which pandasql uses under the hood. Let’s import the necessary libraries for data cleaning with both pandas and SQL:

import pandas as pd
import sqlite3
from pandasql import sqldf

If your system does not already have pandasql installed, you can install it from the terminal:

pip install pandasql

Basic Pandas Query

Pandas already supports a very basic query option. Normally we must slice and index through a pandas DataFrame, but we can easily get the same results with a simple query.

Let’s start by using traditional slicing syntax:

df2 = df1[(df1['Column1'] != df1['Column2']) | (df1['Column1'] == 'X')]

Note how this expression is fairly complex: it requires multiple layers of slicing and a seemingly endless number of brackets and parentheses.

This can easily be simplified using a ‘.query()’ method:

df2 = df1.query("Column1 != Column2 | Column1 == 'X'")

Note how we condensed the necessary syntax significantly and can now clearly see what criteria we are querying on. Also, note the use of ‘|’: when writing query syntax, ‘|’ stands in for OR, while ‘&’ stands in for AND.
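As a quick sketch (the frame and its values are hypothetical), pandas also accepts the spelled-out keywords ‘or’ and ‘and’ inside a query string, and both forms return the same rows:

```python
import pandas as pd

# Hypothetical frame matching the column names used above
df1 = pd.DataFrame({
    "Column1": ["A", "B", "X"],
    "Column2": ["A", "C", "X"],
})

# '|' and '&' behave like OR and AND inside a query string;
# the spelled-out keywords work as well
with_pipes = df1.query("Column1 != Column2 | Column1 == 'X'")
with_words = df1.query("Column1 != Column2 or Column1 == 'X'")

assert with_pipes.equals(with_words)
print(with_pipes)
```

Here the query keeps the row where the columns differ ('B' vs 'C') and the row where Column1 is 'X'.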

Querying using pandasql

The above query was pretty straightforward as there were only two conditions which we were querying on. In real database management, there will be tons of criteria, filled with aggregate functions, table joins, and subqueries. These can get a little messy to follow if we just use the .query() method, so we will instead create a function to run our queries in a more organized manner.

We begin by creating a lambda function which will pass in global variables to your query object so that we don’t have to do this every single time we run a query.

pysqldf = lambda q: sqldf(q, globals())

This function takes a query, q (which we will write in a moment), along with globals(), which lets sqldf look up any DataFrame defined in the global namespace by name so that we don’t have to pass our tables in explicitly every single time we run a query.

Now all we have to do is write the actual query we want to use:

q = """SELECT c.Name, c.Age, c.Address, o.Income, c.Height
       FROM customers c
       JOIN occupation o
       USING(CustomerId)"""
 
 
 


