Part of the Python specialisation capstone (see Refs below) is to recreate a simple web search engine, modelled on the original Google search ranking algorithm (you can read the short version of Brin and Page’s 1998 Stanford paper here). The Google algorithm placed emphasis on information obtained from the HTML “link structure and link text” of all links found in all indexed web pages, and used this information “for making relevance judgments and quality filtering”.
Google search algorithm:
The basic premise of the algorithm is a probability measure, expressed in layman’s terms as: “how likely is it that a random surfer would land on this particular web page if they just randomly followed links from page to page across the web until they got bored and gave up”. The algorithm counts the incoming links to a web page (i.e. the number of “citations or backlinks” to that page) and weights each of those links by the rank of the page it comes from. In this way, the search algorithm defines an objective page rank or search ranking for each web page.
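To make the random-surfer idea concrete, here is a minimal sketch of an iterative page rank calculation in Python. It is not the capstone's code or Google's implementation. The function name `page_rank`, the `links` dictionary format, the fixed iteration count, and the even spreading of rank from pages with no outgoing links are all assumptions made for illustration. The damping value of 0.85 is the figure suggested in the 1998 paper, and this sketch uses the normalised form in which all ranks sum to 1.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Estimate page ranks for a small link graph.

    links maps each page name to the list of pages it links to
    (a hypothetical input format chosen for this sketch).
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with the surfer equally likely to be on any page.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # With probability (1 - damping) the surfer gets bored
        # and jumps to a page chosen uniformly at random.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Otherwise the surfer follows one outgoing link at random,
                # so the page shares its rank equally among its targets.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Assumption: a page with no outgoing links spreads
                # its rank evenly over all pages.
                share = damping * ranks[page] / n
                for target in pages:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# A tiny made-up web of four pages.
example_links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

for page, rank in sorted(page_rank(example_links).items()):
    print(page, round(rank, 3))
```

In this toy graph, page `c` receives links from three pages and ends up with the highest rank, while page `d`, which nothing links to, keeps only the small share that comes from random jumps. Page `a` has a single incoming link but still ranks well, because that link comes from the highly ranked `c`.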