Getting Data with pyODBC

Moving data from a structured database, via either a SQL query or a HiveQL query, to a local machine is often desirable for deeper exploratory analysis. For small and medium-sized data sets, transferring the data straight to RAM is ideal, without the intermediate step of saving the query results to a file. pyODBC is a Python library that enables database connections; it requires that the ODBC driver be installed, along with any other required database-specific drivers.

Below is an example of using the pyODBC library with SQL Server:

import pyodbc

db_cnx = pyodbc.connect(r'''Driver={SQL Server};
                            Server=server_address;
                            Database=database_name;''')

db_cursor = db_cnx.cursor()
query = '''select * from database.table limit 100;'''
db_cursor.execute(query)
query_results = db_cursor.fetchall()

First we create a database connection object by calling the pyodbc.connect function with all the connection details. Here server_address is the actual address of your server, and database_name is the actual name of the database for the connection.

Then we create a cursor object and use its execute method to run our queries. To return the query results, we call the fetchall method, which hands the rows back to Python as a tuple of tuples, i.e. query_results = ((row0), (row1), (row2), …).
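To make the shape of those results concrete, here is a small illustration using a hand-built sample in place of real query output (the values are made up; only the tuple-of-tuples structure matters):

```python
# Hypothetical sample shaped like fetchall() output: rows of (id, name, score).
query_results = (
    (1, 'alice', 34.5),
    (2, 'bob', 28.0),
)

first_row = query_results[0]               # one whole row
first_name = query_results[0][1]           # a single field: row 0, column 1
names = [row[1] for row in query_results]  # one column across every row
```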

Warning: it is possible to crash your computer by completely filling the RAM. When working with new data, I tend to watch RAM utilization in the Activity Monitor during data pulls, killing the process if necessary.
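One way to reduce that risk is to pull rows in fixed-size batches with the cursor's fetchmany method instead of fetchall. The generator below is a sketch of that idea (the chunk_size default is an arbitrary choice):

```python
def fetch_in_chunks(cursor, chunk_size=10000):
    """Yield rows one at a time, fetching them from the database in
    batches of chunk_size so only one batch is in memory at once."""
    while True:
        batch = cursor.fetchmany(chunk_size)
        if not batch:  # an empty batch means the result set is exhausted
            break
        for row in batch:
            yield row
```

Iterating over fetch_in_chunks(db_cursor) lets you process or discard rows as they arrive, keeping memory usage bounded by the batch size rather than by the total result size.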

Simple Database Interface Class

Below is a database interface class that I personally used for SQL Server queries before finding out that pandas has built-in support for SQL queries. Nearly all my exploratory work is done in Python using pandas, so this class was written to convert the tuple of tuples into a pandas DataFrame object by default. The current version parses the query to get the correct column names, and does not support 'select * …' queries at the moment if the data_frame flag is set to True.

import pandas as pd
import pyodbc

class DatabaseInterface(object):
    def __init__(self, database, server):
        self.database = database
        self.server = server
        self.driver = '{SQL Server}'
        self.connection = pyodbc.connect('''Driver=%s;
                                            Server=%s;
                                            Database=%s;
                               ''' % (self.driver, self.server, self.database))
        self.cursor = self.connection.cursor()

    def execute(self, query, data_frame=True):
        self.cursor.execute(query)
        results = self.cursor.fetchall()

        if not data_frame:
            return results
        column_names = self._parse_query(query)
        return self._build_data_frame(results, column_names)

    def _build_data_frame(self, data, column_names):
        dictionary = {str(column): [] for column in column_names}
        for data_row in data:
            for i, data_point in enumerate(data_row):
                dictionary[str(column_names[i])].append(data_point)
        return pd.DataFrame.from_dict(dictionary)

    def _parse_query(self, query):
        parsed_string = query.split()
        column_names = []
        for string in parsed_string:
            string = string.strip().replace(',', '').replace('[', '').replace(']', '')
            if string.upper() == 'SELECT':
                continue
            elif 'FROM' in string.upper():
                if '*' in column_names:
                    raise Exception('cannot build dataframe with arbitrary column names...')
                return column_names
            else:
                # split removes table names from joined queries...
                column_names.append(string.split('.')[-1])
        return column_names

Example usage…

db = DatabaseInterface(my_database, my_server)
sql_query = '''select column_a, column_b from database.table'''
sql_results = db.execute(sql_query)
