Data analysis process:
When we encountered the data analysis process earlier in the year, we saw that the basic process consists of: gather; clean; analyse (including checking for accuracy); and finally, visualise/present. We've been doing lots of Python programming, coupled with creating SQL databases, to extract data from some source (web pages, files, XML or JSON files) and sort or store it in a database.
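To make the gather-and-store step concrete, here's a minimal sketch in Python. It assumes a hypothetical source.json file (a list of records, each with name and count fields) and a hypothetical rawdata.sqlite database; the real sources and table layout will depend on what's being gathered:

# Gather records from a (hypothetical) JSON source and store them in SQLite
import json
import sqlite3

conn = sqlite3.connect('rawdata.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS Records (name TEXT, count INTEGER)')

with open('source.json') as fh:          # hypothetical source file
    records = json.load(fh)

for record in records:
    cur.execute('INSERT INTO Records (name, count) VALUES (?, ?)',
                (record['name'], record['count']))

conn.commit()
conn.close()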
The process we've been using during the capstone course (in line with the original Page/Brin search engine process) is first to collect the raw data and store it, unprocessed, in a holding database. From there we've gone on to clean up the data and save it in a more structured way in a new, relational database. This results in a smaller database which is quicker to search and to retrieve data from. As I found when writing my own search engine application, building these first two databases takes a long time, especially when the search engine's reach is set widely.
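As a rough illustration of the two-database idea, the holding database can be a single table of raw pages, while the index database splits the cleaned data across related tables. The table names and layouts below are assumptions for the sketch, not the course's exact schema:

# Sketch of the two databases: a raw holding database and a leaner index database
import sqlite3

# Holding database: pages stored exactly as retrieved, no processing
raw = sqlite3.connect('rawdata.sqlite')
raw.execute('''CREATE TABLE IF NOT EXISTS Pages
               (id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT)''')
raw.commit()

# Index database: cleaned data in related tables, much quicker to query
idx = sqlite3.connect('index.sqlite')
idx.executescript('''
CREATE TABLE IF NOT EXISTS Pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS Words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS Hits (page_id INTEGER, word_id INTEGER, count INTEGER);
''')
idx.commit()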
Both the data gathering and data cleaning parts of the process are very time-consuming, prone to crashing, and require lots of trial and error to complete successfully. The Crawler program can be stopped and started without losing data already retrieved and stored, allowing flexibility over when data gets retrieved. The Indexer program (which cleans the data and saves it into the Index database) relies upon a lot of error handling, and will need to be modified in different ways to handle the unstructured nature of the web data we might have to deal with when gathering data from multiple sources (since HTML formatting is scrappy and messy and changes significantly from one website to the next).
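The restart-without-losing-data behaviour comes from checking the holding database before fetching a URL and committing after every page. This is a simplified sketch of that loop, using the assumed rawdata.sqlite layout above and a hard-coded to-do list standing in for the real link-following logic:

# Restartable crawler loop: skip pages already stored, commit after each fetch
import sqlite3
import urllib.request

conn = sqlite3.connect('rawdata.sqlite')
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS Pages
               (id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT)''')

todo = ['https://example.com/']      # placeholder; the real crawler follows links

for url in todo:
    cur.execute('SELECT id FROM Pages WHERE url = ?', (url,))
    if cur.fetchone() is not None:
        continue                     # already retrieved on an earlier run
    try:
        html = urllib.request.urlopen(url).read().decode(errors='ignore')
    except Exception as err:         # messy web data: log the failure and move on
        print('Could not retrieve', url, err)
        continue
    cur.execute('INSERT INTO Pages (url, html) VALUES (?, ?)', (url, html))
    conn.commit()                    # commit per page so stopping loses nothing

conn.close()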
Once we’ve cleaned the data and stored it in the Index database, we can then read that data and perform various analyses on it (such as top n results), or pull the data into formats suitable for data visualisation (word clouds, line charts, etc.). The benefit of performing these different functions as separate programs is to allow multiple data analyses to be performed quickly (and frequently) on clean, structured data. This massively speeds up the analysis part of the overall process.
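For example, a top-n analysis on the Index database is just a query over the cleaned tables, so it runs in moments even if the gathering took hours. A sketch, using the assumed Words/Hits tables from earlier:

# Quick analysis pass: print the top n most frequent words in the index database
import sqlite3

n = 10
conn = sqlite3.connect('index.sqlite')
cur = conn.cursor()
cur.execute('''SELECT Words.word, SUM(Hits.count) AS total
               FROM Words JOIN Hits ON Words.id = Hits.word_id
               GROUP BY Words.word
               ORDER BY total DESC LIMIT ?''', (n,))
for word, total in cur.fetchall():
    print(word, total)
conn.close()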
Data visualisation with D3.js:
So what is D3? It was developed by Mike Bostock (the D3 stands for Data-Driven Documents) as a JavaScript data visualisation library which uses web standards such as HTML, SVG and CSS to allow visualisations and interactions in the browser. There's a host of documentation available on the main website at https://d3js.org/, as well as lots of Tutorials on GitHub, which also includes a Gallery of styles and examples.
The first thing to do when setting up a data visualisation using D3.js is to go to the website and download the latest version of the library (the latest version at the time of writing is v4.1.1). This will need to be saved in the working directory for your data viz program.
The D3 script is called within an HTML program (see code below) which reads a JavaScript file containing the data to be visualised. The HTML file can be worked up based on any one of the many examples given on GitHub. It will also be necessary to copy across the example data input JavaScript file so you can work up a Python program to manipulate the data from your database and write it to the .js file in the required format.
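The Python side of that hand-off can be very small: query the Index database, then write the rows out in whatever shape the chosen D3 example expects. The sketch below assumes the Words/Hits tables from earlier and an output format of a single var dataset = [...] assignment in data.js; the exact variable name and structure will depend on the GitHub example you copy:

# Write query results out as a JavaScript data file for the D3 visualisation
import json
import sqlite3

conn = sqlite3.connect('index.sqlite')
cur = conn.cursor()
cur.execute('''SELECT Words.word, SUM(Hits.count) AS total
               FROM Words JOIN Hits ON Words.id = Hits.word_id
               GROUP BY Words.word ORDER BY total DESC LIMIT 25''')
rows = [{'word': word, 'count': total} for word, total in cur.fetchall()]
conn.close()

with open('data.js', 'w') as fh:     # format assumed; match your chosen D3 example
    fh.write('var dataset = ' + json.dumps(rows, indent=2) + ';\n')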
D3 (latest version) is called in the HTML program using the following code:
<script src="https://d3js.org/d3.v4.min.js"></script>
One of the visualisations I like and want to explore for the Search Engine Application is bubble charts (examples here and here). I felt that a really good way to improve on a search engine would be to provide better visualisation of the results. An example of how bubble charts are used in this way can be seen here: (word counts and their search results – NYT).
To be continued…
Read more like this:
This post follows on from earlier Coding 101 posts and records my responses and learnings from the highly-recommended Python programming book and Coursera specialisation by Charles Severance (see References below).
References:
Book: Python for Informatics: Exploring Information by Charles Severance
Capstone: Retrieving, Processing, and Visualizing Data with Python by Charles Severance, University of Michigan. Part of the Python for Everybody specialisation.