Part of the Python specialisation capstone (see References below) is to recreate a simple web search engine, modelled on the original Google search ranking algorithm (you can read the short version of Page and Brin’s 1998 Stanford paper here). The Google algorithm placed emphasis on information obtained from the HTML “link structure and link text” of all links found in all indexed web pages, and on using this information “for making relevance judgments and quality filtering”.
Google search algorithm:
The basic premise of the algorithm is a probability measure, expressed in layman’s terms as: “how likely is it that a random surfer would alight on this particular web page if they just randomly surfed through all links on all pages on the web until they got bored and gave up”. The algorithm itself includes a measure of all incoming links to a web page (i.e. the number of “citations or backlinks” to that page), enhanced by the quality-ranking of each of those incoming citation links. In this way, the search algorithm defines an objective page rank or search ranking for each web page.
Quality of search results is further improved by referencing the anchor text used to describe each link, since this often gives a more accurate description of the referenced page than the page itself. Other enhancing factors the algorithm takes into account include referencing location information (so as to provide location-specific search results) and weighting words in a larger font size (header tags) or in bold text more highly.
The search engine uses databases to hold the full raw HTML text of all crawled web pages, which it “stores … in compressed form”.
Information retrieval systems:
Information retrieval systems had been in use for some time when the Google algorithm was being developed, and worked well on smaller scales, particularly on well-curated – and often closed – collections of high-quality documents. These systems operated on a ‘standard vector space model’, which matched query results on word occurrence within both the search query and the resulting document(s). This approach was not scalable to the size, complexity and – above all – open-access nature of information published on the internet (even back in the late ’90s), as it tended to prioritise ‘simple’ few-line text pages over higher-quality and more complex documents.
Additional quality scores were developed to cope with the open-access and heterogeneous nature of web-based information, to allow for better quality-ranking of different documents and pages. This metadata included: reputation of the site/page/document, how frequently the site was updated, the quality of information offered, the site’s popularity or usage, and how many citations or backlinks it had. Not all metadata would be scored the same way across all web pages (archived academic papers would be treated differently from search engine homepages when reviewing update frequency or popularity, for example), so different categories of information could be ranked accordingly.
Search engine process walk-through:
Based on the model laid out by Page and Brin in their original paper, the search engine process has three major applications: crawling, indexing and searching.
1. Crawling: this is where all web pages are crawled, parsed and stored in a retrieval system. A starting URL must be specified. The web page at this URL is fetched and stored in a database. (The Google system compresses the full raw HTML of the web page crawled before storing.)
1A. A URL Resolver reads all links in the raw HTML from the web page, creates a list of all links on that page, and stores them together with their associated anchor text; a unique link ID is also assigned to and stored with each one. (The Google system calls this the docID.) Any relative links which are found, e.g. links to “/pagename”, will be converted to absolute links, e.g. “http://www.domainname.com/pagename”, before being added to the database.
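A rough sketch of that link-resolving step, using Beautiful Soup and Python 3’s urllib.parse (the function and variable names are illustrative only):
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def resolve_links(page_url, raw_html):
    # Return a list of (absolute_url, anchor_text) pairs found in raw_html
    soup = BeautifulSoup(raw_html, 'html.parser')
    links = []
    for a in soup.find_all('a', href=True):
        absolute = urljoin(page_url, a['href'])   # '/pagename' -> 'http://www.domainname.com/pagename'
        links.append((absolute, a.get_text(strip=True)))
    return links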
1B. Starting from the first URL, once the URL Resolver has parsed a list of new links out of that web page, a URL Server sends this list of further URLs back to the Web Crawler to fetch the new pages. The process is then repeated, meaning that both the database and the list of links being parsed out are continually extending.
1C. The URL Resolver program also matches up the outgoing links from the crawled web page with the relevant docID of that web page being linked to. In this way it creates a “database of links which are pairs of docIDs”. The pairs show where each link points from and to (in terms of unique web page URLs, stored in the form of their respective docIDs). This database of link pairs can then be used to calculate the search rankings (Google: PageRanks) for all web pages crawled, and these are stored back into the database against each individual URL ID.
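As a rough illustration of how rankings could be calculated from such a database of docID pairs, here is a minimal iterative sketch (the damping factor and the simplistic handling of dangling pages are illustrative simplifications, not the full Google algorithm):
def page_rank(pairs, damping=0.85, iterations=20):
    # pairs: iterable of (from_docID, to_docID) tuples from the link database
    pages = {p for pair in pairs for p in pair}
    out_links = {p: [] for p in pages}
    for src, dst in pairs:
        out_links[src].append(dst)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for src, targets in out_links.items():
            if not targets:
                continue                       # dangling page: rank not redistributed in this sketch
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Example: pages 1 and 2 both cite page 3, which cites page 1
print(page_rank([(1, 3), (2, 3), (3, 1)]))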
2. Indexing: this part of the process is performed first by an Indexer, and then by a Sorter. The Indexer reads from the database of web pages stored in step 1 (in Google’s case the web pages were stored in compressed format, so here the Indexer uncompresses them), parses the content and records the occurrences of all words in the document (Google calls these hits).
The hits, or word occurrence data, are stored in a database as the word hit list for that particular document. This includes the word, its frequency, and various metadata useful for weighting and relevancy-scoring: where in the document the word was found, whether it was contained within header tags, whether it was capitalised or bolded, and so on. Further metadata is included based on whether the word appears not just in the web page or document’s body text but also in one or more of the following: URL, title, anchor text, meta tag (thus increasing the word’s relevancy ranking).
2A. The second part of the indexing application is performed by the Sorter, which generates a new index sorted by unique word (Google: wordIDs), based on the lists of word counts weighted for relevancy, and stores this index back into the database.
2B. The latest weighted word list produced by Sorter in step 2A is also updated back to the lexicon (word database), which consists of “a list of words … and a hash table of pointers”. This final part of the indexing process is mediated by another program (referred to in the Google architecture as the DumpLexicon). The lexicon will be used by the search application in step 3 together with the indexed database from step 2 and the search rankings created in step 1C.
3. Searching: The final application is the Searcher, which runs through a web server and uses the lexicon (or word database) built in step 2B, along with both the search ranking (Google: PageRank) defined in step 1C and the word relevancy indexes created in step 2A. Since word hit lists are categorised as ‘advanced’ (word appears in one or more of: URL, title, anchor text, meta tag) or ‘simple’ (word appears in body text only), advanced search criteria can also be utilised.
Two-stage indexing:
The Google process outlined in the Page/Brin paper employs a two-stage indexing process. There is a lexicon (word database) of all possible words, which is split across a number of ‘barrels’ (word sub-sets). First, the Indexer parses content from the raw HTML data for the web page being processed, and counts word occurrences for all words in the lexicon (presumably adding new words to the lexicon as required). For each word in the lexicon found in the web doc (Google identifies each word in the lexicon with a wordID), the docID is recorded in the relevant barrel for that word along with the hit list + associated metadata. In this way, each unique word in the lexicon builds up a list of docIDs (i.e. crawled web pages where that word can be found), and each of those docIDs has its own hit list (word count + metadata). Here the database is ordered by docID – i.e. the web page takes precedence in the sorting order – thus representing the MANY words which occur in ONE web document.
The Sorter then adds the second stage of indexing, taking copies of all ‘barrels’ and further processing them to sort by wordID – i.e. the word takes precedence in the sorting order – thus representing the MANY web documents which contain ONE word. The second-stage sorted and indexed database allows search queries to run much faster, since a one-word search can immediately return the top n occurrences of that word, and a multi-word search can be processed that much faster by returning the top n results in which all of the queried words occur. Furthermore, the Sorter can sort each barrel copy either by ALL occurrences of words in the whole document, or just by those occurrences in the URL, title, anchor texts or meta tags (which are deemed to be of higher relevance).
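A toy illustration of the two sort orders (ignoring barrels, compression and hit metadata): build the doc-ordered forward index first, then invert it so each word maps to the documents containing it:
from collections import Counter, defaultdict

# Forward index: ONE document -> MANY words (with counts)
docs = {
    1: "the quick brown fox",
    2: "the lazy dog and the quick cat",
}
forward = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# Inverted index: ONE word -> MANY documents (with counts)
inverted = defaultdict(dict)
for doc_id, counts in forward.items():
    for word, hits in counts.items():
        inverted[word][doc_id] = hits

print(inverted["the"])   # {1: 1, 2: 2} -- every doc containing 'the', with hit counts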
Web crawling:
It’s interesting to note that the original Google web crawling application, as well as the URL Server which feeds new URLs to the crawlers, were implemented in Python. The Google system right from the start operated with several distributed crawlers, and handled hundreds of connections, each at its own stage in the request-response cycle (looking up DNS, connecting to host, sending request, receiving response).
Error-handling (crawling and indexing):
Due to the dispersed and open nature of the web, many pages (as we saw when doing some web crawling exercises a little while ago) are built in non-standard ways. This leads to a lot of unstructured data which has to be sifted through, and errors and exceptions dealt with. In the early stages, handling these is a process of trial and error for both the crawler and the indexer.
A robust event-managing system is built into the crawler, recording each process in the request-response queue and ticking each one off as it passes successfully through to the next link in the chain. Therefore, should the crawler crash at any point, there is a record of where each item is in the crawler queue.
Similarly, the indexer is built in a robust manner so it can handle a wide variety of errors thrown up by the chaotic, unstructured nature of the web data it has to parse. As Page and Brin put it: “[T]hese [errors] range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone’s imagination to come up with equally creative ones.” And it’s just a process of trial and error to find ways to handle all of these errors, and more.
Handling search queries:
A simple count-ranked search would be expected to follow a process such as: (1) parse search query; (2) convert word(s) in search query to wordIDs; (3) search within word-sorted barrels/word databases for all documents containing all wordIDs; (4) order results by word count; (5) return ordered results. Varying levels of advanced search criteria could also be built into this process (search for words in title only vs. title + body, etc.).
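A sketch of how that count-ranked query might look in SQLite, assuming an illustrative inverted-index table Hits(word_id, doc_id, hits) rather than the exact data model used further below:
def simple_search(cur, word_ids, limit=10):
    # cur: an open sqlite3 cursor; returns doc_ids containing ALL queried word_ids,
    # ordered by their total hit count
    placeholders = ','.join('?' * len(word_ids))
    sql = ('SELECT doc_id, SUM(hits) AS total '
           'FROM Hits WHERE word_id IN (%s) '
           'GROUP BY doc_id '
           'HAVING COUNT(DISTINCT word_id) = ? '   # doc must contain every queried word
           'ORDER BY total DESC LIMIT ?' % placeholders)
    cur.execute(sql, list(word_ids) + [len(word_ids), limit])
    return cur.fetchall()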
The Google search application is based on a more complex process which accounts not just for the word-count ranking but also for the relevancy ranking, in order to produce search results based on quality and relevancy. They outlined a simple seven-step process, which has been slightly adapted here for simplicity: (1) parse search query; (2) convert word(s) in search query to wordIDs; (3) search through all docIDs in the doc-sorted barrels/word databases and extract the first docID which contains all the queried wordIDs; (4) calculate the relevancy ranking of the docID returned in step 3; (5) repeat steps 3 and 4 to find all resultant docIDs and their relevancy rankings; (6) sort all docs in order of relevancy ranking; (7) return the top n results.
Relevancy ranking:
While Google has its own proprietary PageRank algorithm, alluded to above and in the referenced paper (and of course subject to modification by Google over time, as we have often seen happen), we can define our own relevancy ranking as we see fit for this exercise. Certainly we can devise something far simpler and quicker to run, but perhaps with additional end-user preferences, advanced search functionality, and so on. This would be particularly desirable when we go on to modify the simple general search engine model to deal with domain-specific search applications (relevant to specific business, educational or training use cases).
A couple of (simplified) pointers from the system employed by Google (and this list is by no means exhaustive): measure word hit counts (occurrences of a wordID) for different hit types (position, relative font size, capitalisation, bold, URL, title, anchor text). Each hit type is assigned a type weighting. The relevancy ranking for that wordID in that docID is then a weighted average of the hit counts across the different hit types. This holds true only up to a certain frequency: hit count weighting is capped beyond a certain point (hence why one should avoid the fruitless SEO practice known as keyword stuffing). In the Google system, this internal relevancy ranking is then combined with the (external) page ranking (based on numbers of high-quality backlinks) to give the overall final ranking for that web page or document.
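A minimal sketch of that capped, type-weighted scoring (shown here as a weighted sum; dividing by the total of the weights would give the weighted average described above). The weights and cap are made-up values for illustration only:
TYPE_WEIGHTS = {'body': 1, 'bold': 2, 'header': 3, 'anchor': 4, 'title': 5, 'url': 5}
HIT_CAP = 20   # hits beyond this add nothing, which is why keyword stuffing is fruitless

def word_relevancy(hit_counts):
    # hit_counts: dict of hit type -> occurrences of one word in one doc
    score = 0
    for hit_type, count in hit_counts.items():
        score += TYPE_WEIGHTS.get(hit_type, 1) * min(count, HIT_CAP)   # cap each count
    return score

print(word_relevancy({'body': 57, 'title': 1, 'bold': 3}))   # body hits capped at 20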
Multi-word search queries are processed by Google not as strict exact string matches (unless the search query is entered in double quotes), but as all queried words appearing somewhere within the document. Relevancy ranking here is weighted in favour of those documents where all queried words are clustered together in close proximity. To achieve this, all documents which match all searched wordIDs are scanned and a proximity score calculated for each doc, ranked from 0 to 9 on a scale ranging from ‘exact phrase match’ through to ‘completely random distribution’. The Google system advances this further by calculating a proximity score for each hit type, which it takes a weighted average of to produce a weighted type-proximity score. This is then averaged on a weighted basis with the word count scores as before.
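A rough sketch of such a 0-to-9 proximity bucket, based on the smallest window of word positions containing all the queried words (the bucketing thresholds are invented for illustration):
from itertools import product

def proximity_bucket(position_lists):
    # position_lists: one list of word positions per queried word, for a single doc
    # Returns 0 (adjacent, phrase-like) .. 9 (spread right across the document)
    best = min(max(combo) - min(combo) for combo in product(*position_lists))
    thresholds = [1, 2, 4, 8, 16, 32, 64, 128, 256]   # invented bucket boundaries
    for bucket, limit in enumerate(thresholds):
        if best <= limit:
            return bucket
    return 9

# 'python' at positions 3 and 40, 'search' at position 4 -> tight cluster
print(proximity_bucket([[3, 40], [4]]))   # 0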
While in development, the original Google ranking system relied upon ranking debugging, as well as on user feedback (whereby a trusted user reviews the returned results and assesses them for accuracy). A facility was built in to allow prior rankings to be saved and compared against new rankings after tweaking the ranking algorithms. An additional feature of the Google system was/is to ensure no broken links are included within the returned results.
At the very least, asking users to rate the relevancy of the returned results would be a good idea to build into a search engine we might develop, especially if we wanted to grow it beyond very simple functionality or applications. As would a broken-link flag.
Scope of search engine:
As Page and Brin themselves pointed out nearly 20 years ago, enhancements and improvements to a search engine should include working out “smart algorithms to decide what old web pages should be recrawled and what new ones should be crawled”. While their work dealt with issues of scalability, speed and storage capacity as they sought to grow their search engine to match the ever-growing web, we’ll need to go in the opposite direction of travel. We’ll need to figure out how to restrict the scope and reach of our own web crawler, most likely by adding some URL prioritisation criteria – or some other capping facility – to restrict the list of URLs being served to the crawler. We may even want to restrict the crawler to only crawl internal links within the user-specified starting URL. We should also think about building in a kill switch, to stop the crawler from running off into the ether and never returning (so to speak).
Ways to develop:
There are many ways to develop or enhance a simple and well-bounded search engine so I won’t list them all here. Key things I think should be included initially (for the kind of application I have in mind to build) would be:
- Include advanced searching functionality. This should be relatively simple, based on Boolean operators within an SQL database SELECT query. This could be extended to include dynamic user-specified ranking on screen (e.g. order by relevancy or by authority or by relevancy+authority) – see the sketch after this list;
- Return some kind of summary of the resulting page or document along with the link. This could be quite simple (returning a simple text summary, restricted to the first 100 characters, say) or could be far more complex (synthesising and summarising the most important points within the page or document – obviously this is far more advanced, and well beyond the scope of the initial search engine application);
- Review options for including more advanced forms of data visualisation of returned search results – effectively creating a data visualisation dashboard to help display the top n search results in a more relevant, engaging and easily-accessible format;
- Build in the Tagging Engine functionality so the application can double up as a word tagger for blog posts, etc. In practice this would probably mean adding a refined version of the Tagging Engine as a stand-alone web-based application executable through the same web portal as the main search engine application.
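As a sketch of the first point above, a user-selected sort order can be mapped onto the SELECT query via a whitelist (never by splicing raw user input into the SQL); the table and field names echo the DB04 data model described further down:
# Map the user's on-screen choice to a safe ORDER BY clause
ORDER_OPTIONS = {
    'relevancy': 'relev DESC',
    'authority': 'authy DESC',
    'relevancy+authority': 'relev + authy DESC',
}

def build_search_sql(order_choice):
    order_by = ORDER_OPTIONS.get(order_choice, 'relev DESC')
    return ('SELECT Doc.url, Score.relev, Score.authy '
            'FROM Score JOIN Doc ON Score.Doc_id = Doc.id '
            'WHERE Score.Word_id = ? '
            'ORDER BY ' + order_by + ' LIMIT 10')

print(build_search_sql('relevancy+authority'))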
Building a simple search engine in Python:
Here I work on the problem of building a suite of Python programs to replicate a simplified version of the above search engine process (also taking account of notes already written in a previous post). This project forms part of the Python specialisation capstone. While the tutor has provided sample code as part of the book and course (downloadable zip file), I prefer to build this from scratch myself first before viewing the provided example programs. I learn best by doing, so – despite the time this takes me – I feel I’ll get more benefit from trying this myself rather than just reading through someone else’s programs. Inevitably, my programs are approached in slightly different ways to those presented in the course. But it also allows me the flexibility to build in some of the functionality I’ll want in my search engine application, especially since this will form the foundation for other projects I’ll move on to later.
Process flow chart:
The following diagrams map out the processes involved in my own modified version of the search engine application, and form the basis of my suite of programs for the application. They should be self-explanatory. (No time to create nice neat digital versions, so hand-drawn versions will have to suffice.)
(coming soon…)
Program walk-throughs:
1. CRAWL (Crawler110.py)
- import libraries
- create SQL database DB01 RawData
- user-defined URL (start)
- crawl first web page
- parse raw HTML data
- store in database (DB01) ordered by URL (docID)
- URL Resolver lists URLs with attributes and writes to DB01
- URL Server retrieves URLs
- (OPTIONAL) apply criteria to cap and/or prioritise URLs
- feed URLs back into Crawler and repeat
- (INTERMITTENTLY) check links for link pairings and update data back to DB01
- kill-switch built in to allow program shutdown as required
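A stripped-down sketch of that crawl loop in Python 3, using urllib and Beautiful Soup against a cut-down version of the DB01 tables described below (the link-pairing, capping and kill-switch steps are omitted here):
import sqlite3
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

conn = sqlite3.connect('rawdata.sqlite')
cur = conn.cursor()
cur.executescript('''
CREATE TABLE IF NOT EXISTS Url (id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT UNIQUE, crwl INTEGER);
CREATE TABLE IF NOT EXISTS Doc (id INTEGER PRIMARY KEY AUTOINCREMENT, Url_id INTEGER, raw TEXT);''')

start = input('Enter starting URL: ')
cur.execute('INSERT OR IGNORE INTO Url (url, crwl) VALUES (?, 0)', (start,))
conn.commit()

while True:
    row = cur.execute('SELECT id, url FROM Url WHERE crwl = 0 LIMIT 1').fetchone()
    if row is None:
        break                               # nothing left to crawl
    url_id, url = row
    cur.execute('UPDATE Url SET crwl = 1 WHERE id = ?', (url_id,))
    conn.commit()
    try:
        raw = urllib.request.urlopen(url).read().decode(errors='ignore')
    except Exception:
        continue                            # skip pages that error out
    cur.execute('INSERT INTO Doc (Url_id, raw) VALUES (?, ?)', (url_id, raw))
    # URL Resolver step: queue up every link found on this page
    for a in BeautifulSoup(raw, 'html.parser').find_all('a', href=True):
        cur.execute('INSERT OR IGNORE INTO Url (url, crwl) VALUES (?, 0)',
                    (urljoin(url, a['href']),))
    conn.commit()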
2A. INDEX (Indexer110.py)
- import libraries
- create SQL database DB02 WordHits
- retrieve data from DB01 into Indexer
- parse content from raw HTML data
- where URL refers to categories or tags, extract and update to DB02
- where URL refers to a blog post, extract year/month and title and update to DB02
- (for blog posts only) list all word occurrences and their attributes
- store words and their attributes in DB02 ordered by URL (docID) [ONE docID contains MANY wordIDs]
- for each docID, read data, process Relevancy Score and write back to DB02 (hard-coded algorithm)
- for each docID, read data, process Authority Score and write back to DB02 (may be entered by user)
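A fragment illustrating the URL dispatch and the word-hit counting; the WordPress-style permalink patterns (e.g. /category/name/, /tag/name/, /2017/05/post-title/) are my own assumptions about the target blog, not part of the course material:
import re
from collections import Counter

def classify_url(url):
    # Return ('cat' | 'tag' | 'post' | 'other', extracted_details)
    if '/category/' in url:
        return 'cat', url.rstrip('/').split('/')[-1]
    if '/tag/' in url:
        return 'tag', url.rstrip('/').split('/')[-1]
    m = re.search(r'/(\d{4})/(\d{2})/([^/]+)/?$', url)
    if m:
        return 'post', {'year': int(m.group(1)), 'month': int(m.group(2)), 'title': m.group(3)}
    return 'other', None

def word_hits(body_text):
    # Count occurrences of each word in the body text (blog posts only)
    return Counter(re.findall(r'[a-z]+', body_text.lower()))

print(classify_url('http://example.com/2017/05/building-a-search-engine/'))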
2B. SORT
- import libraries
- create SQL database DB03 Lexicon (if not already exists)
- create SQL database DB04 OrderedWords
- retrieve data from DB02 into Sorter
- for each docID, read data, check for new (relevant) words not already in DB03 Lexicon (add if new)
- reorder data by wordID not docID [ONE wordID is contained in MANY docIDs]
- store data (ordered by wordID) with URLs and word lists + attributes + relevancy and authority scores
3. SEARCH
- import libraries
- user-defined search criteria entered through command line interface
- or (IDEALLY) web browser > saved as JSON file
- (OPTIONAL) allow for advanced search criteria in above
- read JSON file and build query as SQL command
- query SQL databases (DB03 Lexicon JOIN DB04 OrderedWords JOIN DB02 WordHits)
- extract short text summary for inclusion with results
- (IDEALLY) review returned results for broken links before reporting
- order results based on relevancy and authority scores
- return ordered results onscreen or (IDEALLY) via JSON file format to web browser
- (OPTIONAL) add data visualisation dashboard element on web browser reporting
- (OPTIONAL) add mechanism for user feedback on quality/relevancy of returned results and write to database DB02 WordHits (plus feedback into adaptation of the Authority algorithm employed by the Indexer)
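Since the databases are separate SQLite files, the Searcher can join across them with ATTACH DATABASE; here is a sketch of the lookup for a single-word query, using the DB03/DB04 data models below (the file names are placeholders):
import sqlite3

conn = sqlite3.connect('lexicon.sqlite')                     # DB03
cur = conn.cursor()
cur.execute("ATTACH DATABASE 'orderedwords.sqlite' AS ow")   # DB04

cur.execute('''
    SELECT d.url, s.relev, s.authy
    FROM Lex
    JOIN ow.Score AS s ON s.Word_id = Lex.id
    JOIN ow.Doc   AS d ON d.id = s.Doc_id
    WHERE Lex.word = ?
    ORDER BY s.relev DESC, s.authy DESC
    LIMIT 10''', ('python',))
for url, relev, authy in cur.fetchall():
    print(relev, authy, url)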
SQLite data models:
The following data model assumes four databases but in practice fewer would be needed if searching a well-bounded, hence smaller, part of the web. For simplicity, points specified as ‘optional’ in the above process walk-throughs have not been reflected in these data models.
DB01: RawData
Table: Doc
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
Url_id INTEGER,
fetch DATE,
err INTEGER,
raw TEXT,
read INTEGER
Table: Url
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
url TEXT UNIQUE,
home TEXT,
excl TEXT,
crwl INTEGER,
fetch INTEGER,
path INTEGER
Table: Link
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
link TEXT UNIQUE,
anchor TEXT,
Doc_id INTEGER
Junction Table: Pair
Doc_id INTEGER,
Link_id INTEGER,
PRIMARY KEY (Doc_id, Link_id)
Table: Home
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
domain TEXT UNIQUE
Table: Excl
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
domain TEXT
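The DB01 data model above translates more or less directly into an SQLite setup script; a sketch (the file name is a placeholder, and foreign-key constraints are left implicit, as in the field lists above):
import sqlite3

conn = sqlite3.connect('rawdata.sqlite')
conn.executescript('''
CREATE TABLE IF NOT EXISTS Doc (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    Url_id INTEGER, fetch DATE, err INTEGER, raw TEXT, read INTEGER);
CREATE TABLE IF NOT EXISTS Url (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    url TEXT UNIQUE, home TEXT, excl TEXT, crwl INTEGER, fetch INTEGER, path INTEGER);
CREATE TABLE IF NOT EXISTS Link (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    link TEXT UNIQUE, anchor TEXT, Doc_id INTEGER);
CREATE TABLE IF NOT EXISTS Pair (
    Doc_id INTEGER, Link_id INTEGER,
    PRIMARY KEY (Doc_id, Link_id));
CREATE TABLE IF NOT EXISTS Home (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, domain TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS Excl (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, domain TEXT);''')
conn.commit()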
Crawler: (1) The starting URL is added to Table: Url and fields Url.crwl and Url.fetch are set to 0 to indicate these actions have not yet occurred; (2) Url.path is set to 0 to indicate this is the Crawler’s starting point (level zero) in the link pathway; (3) When the URL is crawled, Url.id is added to Doc.Url_id and Url.crwl is updated to 1; (4) When data is fetched from the URL, it’s stored in Doc.raw, a timestamp is added to Doc.fetch and Url.fetch is updated to 1; (5) (DEVELOPMENT TO BE ADDED) Check all raw HTML data parsed from URLs – where an error occurs, update the err flag in Table: Doc.
URL Resolver: (1) Raw data from Doc.raw is parsed and all links resolved; (2) Each new link is added to Table: Link along with its anchor text; (3) Because there are MANY links and each link may appear on MANY web pages/docs, the Junction Table: Pair is also updated to reflect the MANY-MANY relationship: Doc.id is updated to Pair.Doc_id and Link.id is updated to Pair.Link_id; (4) Each new link URL is checked to see if it is already in Table: Url. If not, the new URL is added to Table: Url and fields Url.crwl and Url.fetch are set to 0 to indicate these actions have not yet occurred; (5) Url.path is set to (‘Url.path of referencing doc’ + 1) to indicate that the Crawler has progressed one level deeper into the link pathway.
Link Server: (1) Reviews the list of URLs in Table: Url and, where crwl = 0, returns the next URL on the list, feeds it into the Crawler and updates Url.crwl to 1. The process then repeats as before, starting from Crawler step (2) above.
Link Pairing: Link pairings (i.e. cross-citations) can be handled by querying the Junction Table: Pair, so a separate attribute or database field has not been created. Link pairings are updated by the URL Resolver, as described above.
DB02: WordHits
Table: Doc
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
DB1_Doc_id INTEGER UNIQUE,
url TEXT UNIQUE,
fetch DATE,
authy INTEGER,
year INTEGER,
mnth INTEGER,
ttl TEXT
Table: Hits
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
DB1_Doc_id INTEGER,
word TEXT,
hits INTEGER,
pos INTEGER,
size INTEGER,
caps INTEGER,
bold INTEGER,
inurl INTEGER,
inttl INTEGER,
inanch INTEGER,
relev INTEGER
Table: Cats
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
DB1_Doc_id INTEGER,
cat TEXT UNIQUE,
parent TEXT,
child TEXT
Table: Tags
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
DB1_Doc_id INTEGER,
tag TEXT
Table: Auth
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
domain TEXT UNIQUE,
auth INTEGER
Indexer: (1) Extract a new document from DB01: RawData (where DB01:Doc.read = 0 or null), copy that document’s id, url and fetch values from DB01 into Doc.DB1_Doc_id, Doc.url and Doc.fetch, and update DB01:Doc.read to 1 to indicate that the doc has now been read/processed by the Indexer; (2) If the URL relates to categories or tags, extract and update categories/tags to the database then proceed to the next URL, else (for blog posts) extract year/month and title and update to the database; (3) (Blog posts only) read and parse the content in the doc, collating all words and their total hits; (4) Update DB01:Doc.id to Hits.DB1_Doc_id, and add each of the found words and their total number of hits to Hits.word and Hits.hits respectively; (5) Count the hits of each word in various key positions in the doc (position, font size (as per header tags), capitalised, bold, within URL (or domain name if preferred), within doc title, within anchor text of referring URL) and update the counts/scores to the relevant fields in Table: Hits.
Relevancy Scoring: (1) For each word hit within the web page or document, calculate relevancy score based on weighted averages (with all hit attributes scored at pre-defined weightings, and with total hits (field Hits.hits) capped at a pre-defined level); (2) Update relevancy score to Hits.relev for each of the words in the document.
Authority Scoring: (1) Set authority scoring of the web page or document based on user-entry for that domain; (2) Update authority score to Doc.authy and update domain and authority score to Auth.domain and Auth.auth respectively. (3) Alternatively, user may enter authority score manually for each URL (currently hard-coded into program but commented out). (Program may be updated at a later date to calculate authority score based on a pre-defined scoring system.)
DB03: Lexicon
Table: Lex
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
word TEXT UNIQUE
The Lexicon is maintained as a full list of unique words relevant to the area of web, or knowledge domain, being indexed for searching. New words found by the Indexer are added to the Lexicon by the Sorter. The unique word ID (Lex.id) will be mirrored across from DB03 to DB04:OrderedWords (see below).
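A sketch of that ‘add if new, then reuse the id’ step, using SQLite’s INSERT OR IGNORE against the Lex table above (cur is an open cursor on DB03):
def word_id(cur, word):
    # Return Lex.id for word, inserting it into the Lexicon first if it is new
    cur.execute('INSERT OR IGNORE INTO Lex (word) VALUES (?)', (word,))
    cur.execute('SELECT id FROM Lex WHERE word = ?', (word,))
    return cur.fetchone()[0]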
DB04: OrderedWords
Table: Word
id INTEGER NOT NULL PRIMARY KEY UNIQUE,
word TEXT UNIQUE
Table: Doc
id INTEGER NOT NULL PRIMARY KEY UNIQUE,
url TEXT UNIQUE,
fetch DATE
Junction Table: Score
Doc_id INTEGER,
Word_id INTEGER,
relev INTEGER,
authy INTEGER,
PRIMARY KEY (Doc_id, Word_id)
Sorting: (1) For each document in DB02:WordHits, all fields in Table: Doc are updated (straight copy). Note that Doc.id is a unique integer field but is not auto-incrementing – instead the same doc ID as used in DB02:WordHits is updated here; (2) For each word in DB02:Hits, the word is checked to see if already contained in DB03:Lexicon. If not, the new word is added to DB03:Lexicon and the value of DB03:Lex.id is returned and updated to Word.id. Note this is a unique integer field but is not auto-incrementing – instead the same word ID as used in Lexicon (DB03:Lex.id) is updated here; (3) For each instance of a word appearing in a document, Doc.id and Word.id are added to the Junction Table: Score, to fields Score.Doc_id and Score.Word_id respectively, together with the relevancy and authority scores from DB02:WordHits for that word and doc; (4) The Junction Table: Score is kept sorted based on document authority scoring AND word relevancy scoring. This sorted database will form the basic database for search queries (although search functions can also be modified as required by the user, resulting in modified select commands to this database).
Application programs (GitHub):
1. CRAWL: Crawler 1.1.0 – https://github.com/debkr/Search_Eng/blob/master/Crawler110.py
2A. INDEX: Indexer 1.1.0 – https://github.com/debkr/Search_Eng/blob/master/Indexer110.py
(more coming soon…)
Handling problems during coding:
Two significant problems came up while writing the Indexing program. First, how to get at all the text in a web document. I’m using Beautiful Soup to parse the HTML, and Googling the problem turned up the following solution, which parses the string called ‘body’ (which has been sliced from the full raw HTML to edit out the headers and footers):
from bs4 import BeautifulSoup

soup = BeautifulSoup(body, 'html.parser')
text = soup.findAll(text=True)
The second problem was Python (2) not handling Unicode characters properly. Again, Googling the problem threw up a useful thread on Stack Overflow here, where user5910 offered the code for a function which converts common Unicode characters from the Python source code format (e.g. u”\u2018″) into the standard character (e.g. ‘). This function works extremely well; the dictionary included can also be edited to allow new errors to be handled. Characters not already hard-coded into the dictionary are converted to an underscore by default. Here is a useful website which catalogues all the Unicode characters and their equivalents.
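A minimal sketch along the same lines – the mapping below is illustrative, not user5910’s original dictionary – with unknown characters falling back to an underscore:
# Map common 'smart' punctuation to plain ASCII equivalents
CHAR_MAP = {
    u'\u2018': "'", u'\u2019': "'",   # curly single quotes
    u'\u201c': '"', u'\u201d': '"',   # curly double quotes
    u'\u2013': '-', u'\u2014': '-',   # en / em dashes
    u'\u2026': '...',                 # ellipsis
}

def to_ascii(text):
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                     # plain ASCII passes straight through
        else:
            out.append(CHAR_MAP.get(ch, '_'))  # unmapped characters become '_'
    return ''.join(out)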
Updates (after viewing course notes and the sample programs presented in the course):
Error Flag: It may be useful to consider adding an error field to DB01 to indicate that a web page or document was crawled but some error was returned, so no data was retrieved. In practice my data model already allows this check, since the fetch DATE indicates that the crawl process took place and ‘err’ can be entered into the raw data field instead of the raw HTML text. Adding an error field is a cleaner way of handling it though, so I decided to update my program and data model to add this improvement.
Link Pairings: The course sample program uses a Junction Table to record ‘From’ links and ‘To’ links in order to reference the direction of the link relationship. I built this into my own data model as the Junction Table: Pair, which records link pairings. Here Pair.Doc_id is equivalent to the ‘From’ link (i.e. the document where the link was found) and Pair.Link_id is equivalent to the ‘To’ link (i.e. where each one of the links from that document is pointing to).
‘Page Rank’ Algorithm vs. Authority Scoring: The course sample program replicates a simplified version of the Google PageRank algorithm; I have not included this in my application but instead replaced it with an authority scoring, the algorithm for which I will develop separately. I do like the idea of working up some kind of citation scoring, though, which would work on the same kind of basis as the page rank algorithm, but also allowing for enhanced authority scoring to weight certain web pages or documents over others. In very general terms, how I see this working is that – say a web page, paper or document on Artificial Intelligence references the book The Emperor’s New Mind by Roger Penrose* – then it would automatically score more highly for citation authority than a page or paper which referenced something written by Joe Blogs on Medium.com. This intuitively feels like the kind of thing Page and Brin were basing their ideas on when they first developed their PageRank algorithm back in the day. I’ve parked development of this idea for now but will revisit it later as time allows.
* I’m grateful to @Qwiery for the introduction to Penrose’s work, who must surely be assigned the highest authority ranking of 9 (on a scale of 0 to 9).
Restricting Spidering Range: It’s worth building in a user-defined ‘number of pages to process’, to stop the program from running on for too long. This can be done using a while True loop, as follows (Python 2 syntax, as used in the course):
num = 0
while True:
    if num < 1 :
        inp = raw_input('Enter number of pages to crawl (or Enter to exit): ')
        if len(inp) < 1 : break
        num = int(inp)
    # do the retrieval process
    num = num - 1
It’s also possible to add recognition of a user keyboard interrupt (e.g. hitting Control+C at the Command Line Interface) using a try/except structure as follows:
try:
    # doing the retrieval process
except KeyboardInterrupt:
    print 'Program interrupted by user.'
    break
(Documentation on KeyboardInterrupt is available at python.org; also see a blog post I found about it here. There are other, neater ways to handle this too, e.g. see this post on stackoverflow.)
Handling Odd Characters: I’m also reminded to use buffer(html), where html is the variable containing the raw HTML text parsed by Beautiful Soup, to avoid a traceback error when trying to insert HTML data containing unrecognised or unsupported characters into the SQL database.
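For reference, a sketch of that insert: buffer() is Python 2 only, so under Python 3 the equivalent is to insert the raw bytes (or a memoryview) so the data is stored as a BLOB. The table and field names follow the DB01 data model, and the URL and docID are placeholders:
import sqlite3
import urllib.request

conn = sqlite3.connect('rawdata.sqlite')    # assumes the DB01 Doc table already exists
cur = conn.cursor()

url_id = 1                                  # placeholder docID
html_bytes = urllib.request.urlopen('http://example.com/').read()   # undecoded bytes

# Python 2 (as in the course) wraps the text in buffer(html); in Python 3, inserting the
# raw bytes (or memoryview(html_bytes)) stores a BLOB and avoids decoding errors
cur.execute('INSERT INTO Doc (Url_id, raw) VALUES (?, ?)', (url_id, html_bytes))
conn.commit()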
Data Visualisations: One obvious improvement on my current search engine application is to add user-configurable visualisation options to the web browser displaying the user’s search results, in the manner of a dashboard rather than just an ordered list of links with text snippets. The Python course recommends, and makes use of, D3.js – a powerful yet easy-to-use and free visualisation tool in JavaScript – pulling in data dumped from Python to a JSON file.
At this point I haven’t worked with D3.js, and the visualisation aspect of my search engine application is out of scope for the time being, but it will be revisited later when time (and skill level) allows. But here’s a simple description of the process: the D3.js website offers a number of visualisation tools; download the relevant JavaScript program to create that data visualisation; ensure your JSON data file matches the required data input format; then, within the HTML for your web page, load in the D3.js library, load in the JSON data file and run the relevant JavaScript program to activate your visualisation in-browser. Note that there appear to be compatibility issues with some versions of Internet Explorer, although running in Firefox gave me no problems; this will need to be ironed out.
Read more like this:
This post follows on from earlier Coding 101 posts and records my responses and learnings from the highly-recommended Python programming book and Coursera specialisation by Charles Severance (see References below).
References:
Academic paper: The Anatomy of a Large-Scale Hypertextual Web Search Engine (short version) by Sergey Brin and Lawrence Page, Computer Science Department, Stanford University (1998)
Book: Python for Informatics: Exploring Information by Charles Severance
Capstone: Retrieving, Processing, and Visualizing Data with Python by Charles Severance, University of Michigan. Part of the Python for Everybody specialisation.