Getting even more complex – many-to-many relationships:
We’ve learnt that there are several different kinds of relationships in a database (depending on what data we’re trying to model). When we map out our data model, we should try to capture each of the relationships between the tables in the model. In database terminology, the nature of each table-to-table relationship is referred to as the cardinality of the relationship.
Previously we looked at databases with one-to-many (and their converse, many-to-one) relationships using a Primary Key as the unique, auto-incrementing ID number in the One-Table (e.g. Recipe Type) and linking this through to the Foreign Key ID number in the Many-Table (e.g. Recipes).
Now we look at what happens when things get more complex, with many-to-many relationships. An example might be a sales history database where many customers may all buy many different kinds of products. In each case, the customer is always unique, but may have bought many different products, AND each product is unique, but may have been purchased by many different customers. In this way we can see that the many-to-many relationship can be broken down into two constituent parts: many-to-one (many customers buying one product) and one-to-many (one customer buying many products).
To construct the many-to-many relationship within our data model, and hence in our database, we will break the relationship out into the two constituent parts by using a separate connecting table – often called a Junction Table. This table performs one function only: to link together both sides of the relationship. For example, to link a many-to-many relationship between Customer and Product tables, we would create a Junction Table which contained two Foreign Keys: Customer_ID and Product_ID.
Note that there is no need for a separate auto-incrementing Primary Key in this table; it consists solely of the two connected Foreign Keys. At times we may wish to add an additional non-indexed, non-unique data field such as a category type, but this is not mandatory.
By using this Junction Table, we can construct a database that models a complex many-to-many relationship using the functionality available within SQL. When we set up the Junction Table, we can define a composite Primary Key which brings both Foreign Keys together into one single, unique key. Setting this composite Primary Key on the Junction Table ensures that only unique combinations of the two Foreign Keys can be entered into the table.
Assume we have two tables, Customer and Product; the Junction Table connecting them is created as follows:
CREATE TABLE Purchases ( Customer_id INTEGER, Product_id INTEGER, PRIMARY KEY (Customer_id, Product_id) )
Populate the database as normal using INSERT INTO. We now have data in the Customer and Product tables, with Primary Key ID numbers being created for each data row in each table. Using these ID numbers, we can construct the data to populate the Purchases table, again using the INSERT INTO command. With all data in the database, we can now use standard SELECT commands to query and extract data as required, for example:
SELECT Customer.name, Product.name FROM Customer JOIN Purchases JOIN Product ON Purchases.Customer_id = Customer.id AND Purchases.Product_id = Product.id ORDER BY Customer.name, Product.name
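As a quick illustration of what the composite Primary Key buys us, here is a minimal sketch using Python’s sqlite3 module and a throwaway in-memory database (the IDs are made up): a second attempt to insert the same Customer_id/Product_id pair is rejected, while INSERT OR IGNORE simply skips it.
import sqlite3

conn = sqlite3.connect(':memory:')    # throwaway in-memory database, just for this demo
curs = conn.cursor()
curs.execute('''CREATE TABLE Purchases (
    Customer_id INTEGER, Product_id INTEGER,
    PRIMARY KEY (Customer_id, Product_id) )''')

curs.execute('INSERT INTO Purchases VALUES ( ?, ? )', (1, 10))
try :
    curs.execute('INSERT INTO Purchases VALUES ( ?, ? )', (1, 10))    # same pair again
except sqlite3.IntegrityError as err :
    print('Duplicate rejected:', err)

# INSERT OR IGNORE skips the duplicate silently instead of raising an error
curs.execute('INSERT OR IGNORE INTO Purchases VALUES ( ?, ? )', (1, 10))
print(curs.execute('SELECT * FROM Purchases').fetchall())    # [(1, 10)]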
Putting it all into practice:
The program below creates a database, parses a file of data (could be XML, could be JSON) and populates the data into the database – first appending to the source Table(s), then taking the relevant Primary Key IDs from the source tables, and finally creating the many-to-many connection in a third Junction Table using Foreign Keys which link back to the Primary Keys in the source tables.
# import libraries
import json
import sqlite3
# create database file and connection
conn = sqlite3.connect('purchases.sqlite')
curs = conn.cursor()
# drop database tables, OPTIONAL (useful while building/testing)
curs.executescript('''
DROP TABLE IF EXISTS Customer;
DROP TABLE IF EXISTS Product;
DROP TABLE IF EXISTS Purchases
''')
# create database tables/fields if not already created
curs.executescript('''
CREATE TABLE IF NOT EXISTS Customer (
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
name TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS Product (
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
product TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS Purchases (
Customer_id INTEGER,
Product_id INTEGER,
label INTEGER,
PRIMARY KEY (Customer_id, Product_id)
)
''')
fname = input('Enter file name: ')
if len(fname) < 1 : fname = 'filename.json'
# JSON data format: list of lists [ name-string, product-string, label-integer ]
text = open(fname).read()
data = json.loads(text)
for item in data :
    name = item[0]
    product = item[1]
    label = item[2]

    # add the customer and product to the source tables (if not already there)
    curs.execute('''INSERT OR IGNORE INTO Customer (name)
        VALUES ( ? )''', ( name, ) )
    curs.execute('''INSERT OR IGNORE INTO Product (product)
        VALUES ( ? )''', ( product, ) )

    # look up the Primary Key IDs for this customer and product
    curs.execute('SELECT id FROM Customer WHERE name = ? ', (name, ))
    Customer_id = curs.fetchone()[0]
    curs.execute('SELECT id FROM Product WHERE product = ? ', (product, ))
    Product_id = curs.fetchone()[0]

    # link the two Foreign Keys (plus the label) in the Junction Table
    curs.execute('''INSERT OR REPLACE INTO Purchases (Customer_id, Product_id, label)
        VALUES ( ?, ?, ? )''', ( Customer_id, Product_id, label ) )
conn.commit()
# run a select query on resulting database (purchases)
curs.execute('''
SELECT Customer.name, Product.product, Purchases.label
FROM Customer JOIN Purchases JOIN Product
ON Purchases.Customer_id = Customer.id and Purchases.Product_id = Product.id
ORDER BY Customer.name, Product.product
''')
select = curs.fetchall()
name = ''
for item in select :
    if item[0] != name :
        name = item[0]
        print('\n', name)
    print(item[1], item[2])
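For reference, a hypothetical filename.json in the expected format (the names, products and label values are made up) might contain:
[
  ["Alice", "Widget", 3],
  ["Alice", "Gadget", 1],
  ["Bob", "Widget", 2]
]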
Databases for data analysis and visualisation:
Having covered the basics of Python and retrieving data from web sources (whether in HTML, XML or JSON), and now that we know how to set up, write to and query/extract from an SQL database, we’re in a great position to start doing some really useful real-world stuff. The examples which follow are as taught by Dr. Chuck on his Using Databases with Python MOOC (links below) but, as with almost all these examples, I’ve re-written them (or rather, written them myself from scratch, then perfected them with reference to the course examples) and occasionally extended them too.
First up, data analysis and visualisation.
If we remember the data science roadmap, as taught on the Leek/Johns Hopkins University Data Scientist’s Toolbox course, a good data analysis process includes several steps:
Define the question – Define the ideal data set – Determine what data is accessible – Obtain the data – Clean the data – Explore/analyse the data – Use statistical prediction/modelling (if applicable) – Interpret the results – Challenge the results! – Report on the results (including appropriate visualisation/presentation) – Create reproducible code/process – Publish the results and code/process to others.
While that’s an idealised roadmap, here we’re going to focus on a simpler multi-step process:
Identify the requirements and data source – Gather the data – Clean and process the data – Analyse and visualise the data – Present to others.
This process puts together various elements we have learned so far, such as gathering data from web sources (including dealing with data in various formats, and taking account of rate limits and the need for API Keys, OAuth tokens, etc.), extracting the relevant data and storing it in an SQL database in its raw, unprocessed format. We can then add further Python program(s) to the process which clean and process the raw data, extracting it from our raw_data database and possibly storing the cleaned/processed data in a new clean_data database. From there we can write further Python program(s) to analyse and visualise the cleaned data and report the results as required.
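As a minimal sketch of that two-database pattern (the file names, table layouts and sample records here are all assumptions for illustration, not code from the course), the skeleton might look like this:
import sqlite3

# two separate database files: one for raw data, one for cleaned data
raw_conn = sqlite3.connect('raw_data.sqlite')
clean_conn = sqlite3.connect('clean_data.sqlite')
raw_conn.execute('CREATE TABLE IF NOT EXISTS raw (payload TEXT)')
clean_conn.execute('CREATE TABLE IF NOT EXISTS clean (value TEXT)')

# stage 1: gather - store whatever the source returned, untouched
for payload in [' Example Record One ', 'example record two'] :    # stand-in for web data
    raw_conn.execute('INSERT INTO raw (payload) VALUES ( ? )', (payload, ))
raw_conn.commit()

# stage 2: clean - read the raw rows, tidy them up, store in the second database
for (payload, ) in raw_conn.execute('SELECT payload FROM raw') :
    clean_conn.execute('INSERT INTO clean (value) VALUES ( ? )', (payload.strip().lower(), ))
clean_conn.commit()

# stage 3: analysis and visualisation then query clean_data.sqlite only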
Example 1.
Retrieving location data using Google GeoCoding API with visualisation (Google Maps)
Repository: https://github.com/debkr/geo_loc
This example shows how to extract data using Google’s Geocoding API, save it in a database (which allows us to extract raw data across several days to take account of the daily rate limits), then return the data visualised on a Google Map. (Note: The Geocoding T’s and C’s require that this free service may only be used where the data is going to be displayed on a Google Map, not visualised through some other format.)
Here’s a simple walkthrough of the process: (1) start with a file of location data in the form of place names, perhaps a list of holiday destinations, locations (town/city, country) of visitors to a website, or planned en route stop-offs during a road trip; (2) run a Python program to look up each location from the file, check whether it’s already in the SQL database and, if not, look up its geodata, extract the lat/long and append the address, raw geodata, location and lat/long to the SQL database; (3) run a Python program to output the cached data to a JavaScript file; and (4) run an HTML file to query the JavaScript file, plot all locations on a Google Map and return that to a web browser or web page.
The programs I actually wrote were slightly different from the process mapped out in the course. My first program takes addresses from a file, retrieves geodata from the web in batches, and processes/adds the relevant data to a database (address, geodata, Google’s formatted location and lat/long data). The geodata retrieval itself runs in batches of 10, to allow for testing and error handling, and to prevent it from running for ages; this could easily be changed/increased via the loop count parameter.
Several kinds of error handling have been added to this program to account for Unicode data within the JSON data returned – SQLite does not like these and won’t allow them to be inserted into the database. To get around this problem I simply returned ‘err’ instead (to either or both of the geodata and location fields) and continued the program as normal. I’m sure there’s a way to convert the Unicode (special characters) so they read normally and can be added to SQL, but lack of time prevented me from researching and finding a solution, so it’s been added to the ‘future improvements’ list.
The code for the first program can be found here: https://github.com/debkr/geo_loc/blob/master/GeoLoc100.py
UPDATED: Using (buffer(addr), ) instead of just (addr, ) forces SQLite to accept the entry, handling any Unicode and not returning any errors. However, trying this solves some problems but creates others in the above program, so I haven’t updated my program for this yet.
NOTES: Import the time library using import time, then add the command time.sleep(1) as the last line inside the loop; this causes the program to sleep for one second before going on to the next iteration. This helps ensure the API is not called too quickly, and so helps avoid being locked out of the API for exceeding the rate limit. When running a looping program at the Command Line (in a Windows environment), use Ctrl+Z at any time to break out of the program.
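As a rough sketch of the first program’s core lookup-and-cache loop (this is not my actual geo_loc code – the database name, table layout and use of an API key here are assumptions for illustration), the idea looks something like this:
import json
import sqlite3
import time
import urllib.request, urllib.parse

api_key = 'YOUR_API_KEY'                       # assumption: your own Geocoding API key
service = 'https://maps.googleapis.com/maps/api/geocode/json?'

conn = sqlite3.connect('rawgeodata.sqlite')    # illustrative database/table names
curs = conn.cursor()
curs.execute('''CREATE TABLE IF NOT EXISTS Locations
    (address TEXT UNIQUE, geodata TEXT, lat REAL, lng REAL)''')

for addr in ['Biggin Hill, Greater London, UK', 'Westerham, Kent, UK'] :    # would come from the input file
    curs.execute('SELECT geodata FROM Locations WHERE address = ?', (addr, ))
    if curs.fetchone() is not None :
        continue                               # already cached - skip the API call
    url = service + urllib.parse.urlencode({'address': addr, 'key': api_key})
    data = urllib.request.urlopen(url).read().decode()
    js = json.loads(data)
    loc = js['results'][0]['geometry']['location']
    curs.execute('INSERT INTO Locations (address, geodata, lat, lng) VALUES ( ?, ?, ?, ? )',
        (addr, data, loc['lat'], loc['lng']))
    conn.commit()
    time.sleep(1)                              # be polite to the API, as per the note above
Error handling for empty results, Unicode issues and rate-limit responses is left out here for brevity.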
The second program is very simple. It selects all records from the database, ignores any records where the latitude and/or longitude are null, and writes the remaining records to a JavaScript file as an array of arrays in a JSON-style format:
geoData = [
[51.312059,0.032802, 'Biggin Hill, Greater London, UK'],
[51.266969,0.071827, 'Westerham, Westerham, Kent TN16, UK'],
[{latitude-string}, {longitude-string}, {location-string, contained within single quotes}]
];
The code for the second program can be found here: https://github.com/debkr/geo_loc/blob/master/GeoJS100.py
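The essence of that second program – select the cached rows and write them out as a geoData JavaScript array – can be sketched roughly as follows (again the database, table and output file names are assumptions, not my actual code):
import sqlite3

conn = sqlite3.connect('rawgeodata.sqlite')
curs = conn.cursor()
curs.execute('SELECT lat, lng, address FROM Locations WHERE lat IS NOT NULL AND lng IS NOT NULL')

with open('where.js', 'w') as fhand :
    fhand.write('geoData = [\n')
    for (lat, lng, address) in curs.fetchall() :
        # each row becomes one [latitude, longitude, 'location'] entry
        fhand.write("[%f,%f, '%s'],\n" % (lat, lng, address.replace("'", '')))
    fhand.write('];\n')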
The JavaScript file can be read and all locations displayed on a Google Map using the HTML code provided by Dr. Chuck in the course (links below), also available for download from www.pythonlearn.org. Note that I have reproduced a slightly modified version of this HTML code (on GitHub here) to reflect the variable name as per my .js file, and to centre the map on my own home location (updated variable: homeLatlng). Copyright and permissions for this HTML code remain with Dr. Chuck as per the pythonlearn.org website.
Example 2.
Web crawling and simple search engine/page rank algorithm, with visualisation (network graphs):
Repository: coming soon
This example writes a simple web crawler (based on the work we did a few lessons back), searches for the backlinks within the website, calculates a simple page ranking (relevancy) score based on those backlinks, and visualises the results in a network graph using the data visualisation tool D3.js.
The process involves: (1) building a web crawler to automate and systematise the reading or cataloguing of a given web page; (2) calculating an index or page rank for that web page, based on specified algorithms; (3) visualising the web page’s links and index values within a network graph; and [OPTIONAL] (4) searching for new search terms within the indexed data.
A basic web crawler takes a starting URL (e.g. a user-specified URL), crawls the page and collects all outbound links, perhaps also categorising them as ‘internal’ (links to other pages within the same domain) or ‘external’ (links to other domains/websites). It then appends all links (plus categories, if applicable) to a simple database. Once the starting URL has been so treated, it loads the next URL from the saved list and proceeds to crawl that URL. The process repeats over and over until all links and pages have been crawled and catalogued. This is easily done using the Python library Beautiful Soup, as we saw in a previous post.
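A bare-bones sketch of such a crawler (using Beautiful Soup, with a made-up starting URL, an assumed Links table and a deliberately small page limit while testing) might look like this:
import sqlite3
import urllib.request
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

start_url = 'https://www.example.com/'        # assumption: any starting page will do
domain = urlparse(start_url).netloc

conn = sqlite3.connect('crawl.sqlite')
curs = conn.cursor()
curs.execute('CREATE TABLE IF NOT EXISTS Links (url TEXT UNIQUE, category TEXT)')

to_visit = [start_url]
visited = set()

while to_visit and len(visited) < 10 :        # small limit while building/testing
    url = to_visit.pop(0)
    if url in visited : continue
    visited.add(url)
    try :
        html = urllib.request.urlopen(url).read()
    except Exception :
        continue                              # skip pages that fail to load
    for tag in BeautifulSoup(html, 'html.parser')('a') :
        href = tag.get('href', None)
        if href is None : continue
        link = urljoin(url, href)             # resolve relative links
        category = 'internal' if urlparse(link).netloc == domain else 'external'
        curs.execute('INSERT OR IGNORE INTO Links (url, category) VALUES ( ?, ? )', (link, category))
        if category == 'internal' :
            to_visit.append(link)
    conn.commit()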
Advanced web crawlers may store the web page’s text or content in a database, in addition to the outbound links, while the simplest crawler collects links only. In the latter case, another crawler or web bot uses the same list of links to crawl and parse/analyse content – perhaps with addition of some discriminating criteria such as ‘only parse/analyse content from the most relevant links’.
Some considerations when building more complex – or long-lived – web crawlers are:
- Selection – which URLs will you crawl and/or download;
- Revisits – how often will you revisit a webpage to check for updated links and/or content;
- Etiquette – frequency of visits (to avoid overloading a website’s servers);
- Permissions – some sites specifically allow or disallow web bots from crawling their site (or sub-sections of their site) using a robots.txt file;
- Parallelisation (advanced) – how to handle results from multiple crawlers operating in tandem across the web.
Building an index consists of analysing each web page’s text/content to allow faster and easier retrieval when key words or phrases are subsequently searched for. This kind of exercise was done previously when looking at the various iterations of the Tagging Engine (e.g. here and here). Indexing algorithms may be as simple or as complex as you like, depending on the specific needs of the program. For example, they might include a ranking algorithm based on how relevant and/or authoritative the website/web page is. Another example is where the program is crawling and indexing web pages in order to build up a knowledge-base on a particular subject area. This knowledge-base then forms a searchable database of ‘facts’ which an intelligent agent or bot (e.g. personal assistant, research assistant, learning support agent) can draw upon to answer users’ questions on the given subject or knowledge domain.
Here’s a simple walkthrough of the process: (1) enter a starting URL for the web crawler; (2) run a Python web crawler program to access the given URL, identify all links and store them (along with the web page content) in a database; (3) the Python program automatically repeats the process for all new links identified and stored in step 2, adding new links to the database as it goes; (4) run a Python page rank program to calculate the page rank for each individual page (based on a specified algorithm) and write the page rank back to the database; (5) run a Python program to output the data to a JavaScript file; (6) run an HTML file to query the JavaScript file and return the data to a web browser in the form of a data visualisation using D3.js – this takes the format of a network graph, showing connections/links between web pages, weighted by their page rankings; and [OPTIONAL] (7) run a Python program to query the database for new user-input search terms (key words, key phrases) and return a set of results ordered by page rank.
Note that step 3 may fail/fall over, or it may continue indefinitely. A kill-switch is therefore added to the crawler program to allow us to override it and shut it down at any time. We can also build in the ability to restart again at a later time, or to change the crawler’s parameters.
The program in step 4 is set to run on an ongoing loop, so that – as new pages and links are added into the database – new/updated rankings can be calculated for each web page. As the number of pages and links in the database increases, the ranking of each page reaches an equilibrium (i.e. tends to a number which no longer changes, despite more and more pages/links being added into the database). We can also add in the ability to reset all page rank values for all web pages in the database.
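As a rough illustration of the kind of iterative ranking step involved (this is a heavily simplified, in-memory toy, not the course’s database-backed page rank – the link map is made up), each round hands most of a page’s current rank on to the pages it links to:
# toy link map: page -> list of pages it links to (made-up URLs)
links = {
    'a.html': ['b.html', 'c.html'],
    'b.html': ['c.html'],
    'c.html': ['a.html'],
}

ranks = {page: 1.0 for page in links}         # every page starts with the same rank

for _ in range(20) :                          # repeat until the values settle down
    new_ranks = {page: 0.25 for page in links}            # small base value for every page
    for page, outbound in links.items() :
        share = 0.75 * ranks[page] / len(outbound)        # give away most of our rank
        for target in outbound :
            new_ranks[target] += share
    ranks = new_ranks

for page, rank in sorted(ranks.items(), key=lambda item: -item[1]) :
    print(page, round(rank, 3))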
My version of this suite of programs will be provided shortly. (In the meantime, the original code from the course can be found on the pythonlearn.org website here.)
Example 3.
Textual analysis of Gmane.org mailing list archives with visualisation (word clouds):
Repository: coming soon
The gmane.org website stores archives of old mailing lists, and it’s easy to export mailing lists from this archive (see export instructions here). This provides us with lots of open-source text files for crawling, parsing, cleaning, storing, analysis and visualisation. The most common form of visualisation for this kind of textual analysis is word clouds, which show the highest-ranking words and phrases visually. The process followed is very simple, and replicates similar ideas and code employed in my earlier examples of Tagging Engines (links as per Example 2 above).
Here’s a simple walkthrough of the process: (1) run a Python program to gather the required data from gmane.org as per their export instructions (being sure not to clog up their servers with heavy and continuous data requests) and save a copy of the raw data to a database; (2) run a Python program to clean and process the raw data into a more usable format and save it to a second, relational database – this allows for faster parsing, analysing and querying; (3) run various Python programs to analyse the cleaned data and output the results either to screen, to file or to secondary JavaScript files for data visualisation and web display. Data stored in the JavaScript files is held in a JSON format as a dictionary of key/value pairs, i.e. keyword + its frequency. Examples of the analysis performed are: (i) the top n contributor(s) to a mailing list; (ii) the top n words and/or phrases in a mailing list; (iii) changes in the frequency of the top n words over time. Finally, (4) run HTML file(s) to query the JavaScript file(s), visualise the data using D3.js (e.g. word clouds, keyword frequency timelines, etc.) and return it to a web browser or web page.
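As a minimal sketch of the word-frequency part of step (3) (the database, table and output file names, and the crude short-word filter, are all assumptions for illustration), the counting and JavaScript output could look like this:
import json
import sqlite3

conn = sqlite3.connect('gmane_clean.sqlite')  # assumption: cleaned messages stored here
curs = conn.cursor()

counts = {}
for (body, ) in curs.execute('SELECT body FROM Messages') :   # assumed table/column names
    for word in body.lower().split() :
        if len(word) < 4 : continue           # crude filter for short/stop words
        counts[word] = counts.get(word, 0) + 1

# keep the top 100 words and write them out for the D3.js word cloud
top_words = dict(sorted(counts.items(), key=lambda item: -item[1])[:100])
with open('gword.js', 'w') as fhand :
    fhand.write('gword = ' + json.dumps(top_words) + ';\n')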
My version of this suite of programs will be provided shortly. (In the meantime, the original code from the course can be found on the pythonlearn.org website here.)
This is very simple ‘web data analysis’, but it’s not ‘data mining’:
What we learned here is pretty powerful and useful: we can cut and paste or repurpose it into lots of different kinds of real-world applications and projects. But it’s not professional-grade data mining, even though the world and her dog want to lay claim to that particular buzz phrase, claiming that’s what they’re doing when they scrape a bit of web data and run it through a visualisation tool like D3.js. Data mining is far more complex and ultimately relies upon deep statistical computations and predictive models to analyse patterns within large-scale datasets – and that’s a whole different ballgame. Some important data mining technologies are currently provided by:
Read more like this:
This post follows on from earlier Coding 101 posts and records my responses and learnings from the highly-recommended Python programming book and Coursera specialisation by Charles Severance (see References below).
References:
Book: Python for Informatics: Exploring Information by Charles Severance
Course: Using Databases with Python by Univ. of Michigan. Part of the Python for Everybody specialisation.