May 13, 2016

Database CRUD Operation in Python- SQLite and Python

In this post we will use Python to work with data in SQLite database files. We will create table, insert records into it, update table and finally perform SELECT based on some condition.

Problem statement
: - Process this text file and count the number email messages per organisation (i.e. Domain name of the email address consider as unique organisation name)

Database Schema:-
Table with two columns ORGANISATION and COUNT of Emails.
< CREATE TABLE EmailCounts (ORGANISATION TEXT, COUNT INTEGER) >

SQLite database:- SQLite is a light weight relational database management system. In contrast to many other database management systems, SQLite is not a client–server database engine. Rather, it is embedded into the end program.(Source: Wiki)

How to connect SQLite database ? :-

Python provides support for SQLite database in sqlite3. First, we need to import this in and followed by connect method is used to connect database. The connect operation makes a “connection” to the database stored in the file "emaildb.sqlite" in the current directory. If the file does not exist, it will be created.
Using conn.cursor() we get a hadle of file (like a file handle we obtain using open() method). Once cursor(handle) is obtained, we can use to perform operations on the data stored in the database using execute() method. Below sample code demonstrate the same in sequence.

import sqlite3

conn = sqlite3.connect('emaildb.sqlite')
cur = conn.cursor()

cur.execute('DROP TABLE IF EXISTS EmailCounts ')
cur.execute('CREATE TABLE EmailCounts (ORGANISATION TEXT, COUNT INTEGER)')

conn.close()

Now we will process the this text file which contains raw data and insert data into database. Consider following code lines for processing the raw data in file.
Raw data:-
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
Return-Path: <postmaster@collab.sakaiproject.org>
Received: from murder (mail.umich.edu [141.211.14.90])
by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
Sat, 05 Jan 2008 09:14:16 -0500
X-Sieve: CMU Sieve 2.3
... ......   . ..
......  . ..............

We need to open file mbox.txt and read each line, when we get Line starting from From ...  process it and filter out domain name. If That domain name is not in Database Insert into database else update the count(increment count by 1).

Sample program to count emails corresponding to organisation

#http://www.pythonlearn.com/code/mbox.txt

import sqlite3
import sys

conn = sqlite3.connect('emaildb.sqlite')
cur = conn.cursor()

#drop table in database, if exists
cur.execute('DROP TABLE IF EXISTS EmailCounts ')

#create table EmailCounts
cur.execute('CREATE TABLE EmailCounts (ORGANISATION TEXT, COUNT INTEGER)')

if ( len(sys.argv) < 2 ) :
    print 'Invalid arguments, Input file missing, exiting !!'
    exit
filename = sys.argv[1]
filehandle = open(filename)
print "Processing input data file......"
for line in filehandle:
    if not line.startswith('From: ') : 
  continue
    emailPart = line.split()
    #stephen.marquard@uct.ac.za
    email = emailPart[1]
    org1=email.split("@")
    
    #stephen.marquard
    orgVal = org1[1]
    #print org1[1]
    cur.execute('SELECT COUNT FROM EmailCounts WHERE ORGANISATION = ? ', (orgVal, ))
    row = cur.fetchone()
    if row is None:
        cur.execute('''INSERT INTO EmailCounts (ORGANISATION, COUNT) 
                VALUES ( ?, 1 )''', ( orgVal, ) )
    else : 
        cur.execute('UPDATE EmailCounts SET COUNT=COUNT+1 WHERE ORGANISATION = ?', 
            (orgVal, ))
    # This statement commits outstanding changes to disk each 
    # time through the loop - the program can be made faster 
    # by moving the commit so it runs only after the loop completes
    conn.commit()

# https://www.sqlite.org/lang_select.html
sqlstr = 'SELECT ORGANISATION, COUNT FROM EmailCounts ORDER BY COUNT DESC LIMIT 10'

print "\n-------------------------"
print "Organization---Counts"
print "-------------------------"
for row in cur.execute(sqlstr) :
    print str(row[0]), row[1]

#close cur/handle 
cur.close()

Sample Output:-
[zytham@s158519-vm backup]$ /usr/bin/python emailCounts.py mbox.txt
Processing input data file......

-------------------------
Organization---Counts
-------------------------
iupui.edu 536
umich.edu 491
indiana.edu 178
caret.cam.ac.uk 157
vt.edu 110
uct.ac.za 96
media.berkeley.edu 56
ufp.pt 28
gmail.com 25
et.gatech.edu 17

Location: Hyderabad, Telangana, India