Python gets networked:
Analysing data from files we already hold on the hard-drive is great, but so much data’s being created out there on the internet (especially on social media websites) that we can use for a whole variety of purposes – I’m itching to get my hands on some web data to play with. First I need to learn about how web browsers talk to websites – that is, how my query (view this website url, download that document, search for such-and-such a search term) gets communicated across the ‘net, and how it gets translated into an instruction the website at the other end can understand (in whatever server-side language it might be using: PHP, JavaScript, or whatever).
My knowledge of this area is very sketchy, so I’m happy to be able to get a potted introduction on this Python Web Data course (as always, see the links below for more info on the course I’m currently studying). The goal will be to understand how to write a Python program which can act in the same way as a web browser (i.e. using the same protocols) to interface between my computer and the internet/target website and read some data I’m interested in capturing. (This is just a brief run-through of everything covered in greater detail in Dr. Chuck’s book Introduction to Networking: How the Internet Works, which you can download for free here.)
Internet in a nutshell:
Your source web browser talks to (connects with and interacts with) the target web server using a variety of protocols. Internet Protocol (IP) addresses are unique addresses which identify who is the source and who is the target. The Transmission Control Protocol (TCP) is the way that instructions and/or data get transferred from one application to another and back again. TCP provides high-level connections which can be made between the browser (your application) and the server (the target website’s application); these peer-to-peer connections make up what’s referred to as the transport layer.
In order for a connection and transfer to happen, there needs to be some infrastructure to enable it. The connection that gets made has two end-points (one at the source = client-side, the other at the target = server-side) and some infrastructure in between which creates and maintains the connection. The whole thing (the two end-points with the connection between them) is referred to as a socket. When reading data from the web, the socket is the equivalent of the file handle we used when opening and reading data from a file on the hard-drive.
There are a variety of different applications which might run on one server to which we may want access, such as a website, email, a database, a news-server, and so on. Each one of those applications will have its own end-point for a socket, known as a TCP port (or port number). Each port is like a telephone extension within the host (the overall IP address being the unique address for a given website domain/server).
There are a variety of port numbers commonly used for certain types of applications. Port 80 is most commonly used in accessing web data as it relates to the public-access (http://) web server for the relevant domain/IP. Port numbers can also be established on a local host (on your own hard-drive) when developing new web applications and before going live with them online. Here the web address you would access would be http://localhost:nnnn/xxxx.htm where nnnn refers to the unique port number you’ve assigned for holding the development files.
Sockets in Python – establishing a connection to the web:
There’s already a handy library of ready-written scripts in Python to handle sockets, allowing us to connect with the web via a low-level networking interface. The relevant library is called socket, and more documentation is available here. To use the library, we first need to import it using the import statement. We can then use the functions within the library to do the equivalent of the open() function used when opening and reading local files. The basic code looks like this:
import socket
sockhand = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sockhand.connect( ('www.domainname.com', 80) )
Line 01: imports the socket library
Line 02: tells Python to create a new socket and give it a handle called sockhand. The syntax of this line is: {socket-handle} = {library}.{method-in-library}({parameter1},{parameter2}) where socket.socket() tells Python to use the socket constructor within the socket library, with the given parameters in brackets. The parameters in this example, socket.AF_INET and socket.SOCK_STREAM, tell Python to make the socket an internet (IPv4) connection and to make it a streaming (TCP) connection. (A streaming connection is one where requests are sent, and data comes back reliably and in order. An alternate parameter, socket.SOCK_DGRAM, would instead create a datagram (UDP) socket, where ordered delivery isn’t guaranteed.)
Line 03: tells Python the host (i.e. domain name) and port number of the connection.
Note that in the above code, line 02 creates the handle for the connection (named as required) but has not yet made the connection itself. This line effectively creates the client-side end-point of the socket. The code in line 03 then makes the actual connection over the internet (joining the client to the server-side end-point of the socket at the specified host and port).
One important point to remember – Python won’t just be able to connect to any website by using this code. There will need to be the necessary application/software at the server-side to enable the connection and interface.
Python talks to the other side:
Once Python has made the socket and established the connection (connected our web browser to the web server), we want to do something with that connection: we want to be able to talk to the other side (the server-side) and send/receive instructions and data. To do that we need to make use of some kind of web protocol – the expected procedure or rules for talking to and transferring data. Here the most common protocol is HTTP – the HyperText Transfer Protocol.
HTTP is most often known for being the protocol used by web browsers to retrieve web (HTML) pages, images, documents, etc. from web servers, but it can also be used for various web services and applications. Everyone reading this post will be familiar with HTTP as part of the web address or URL (uniform resource locator) we type into our web browser’s address bar, but probably not familiar with exactly what that means or how it does what it does. The URL is broken out as: protocol (http://) – host (www.deborahroberts.info) – document (document.htm), with the request to retrieve data from the web server being initiated by the web browser. Each link within a web page (identified by the href= attribute within an anchor tag [1]) that the user clicks generates a new request from the web browser to the target web server.
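The protocol–host–document breakdown above can be seen directly in Python. A small sketch using the standard library’s urlparse function (shown Python 3 style, where it lives in urllib.parse; in Python 2 the same function is in the urlparse module):

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

# Split a URL into the parts described above:
# protocol (scheme), host (netloc) and document (path)
parts = urlparse('http://www.deborahroberts.info/document.htm')
print(parts.scheme)   # http
print(parts.netloc)   # www.deborahroberts.info
print(parts.path)     # /document.htm
```
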
The whole cycle of request and retrieval (whether typing the url into the browser address bar, or clicking through on a link on a displayed web page) is called the request-response cycle, and it follows the specified internet protocols laid down for web browsers [2]. The basic process/syntax is:
telnet {host} {port #} {ENTER}
GET {url i.e.: http://www.domainname.com/documentname.htm} PROTOCOL/version# {ENTER}
{ENTER}
This returns the web page (header including various meta data, followed by the body of the webpage). This can all be done fairly easily from the command line using the telnet client [3].
The basic process of the cycle is to open the connection, send a GET request, retrieve the data, then close the connection. In truth, it’s a little more complex than this since the browser has to request and retrieve multiple files such as the document itself, the CSS style-sheet, and any other images, scripts, etc. which might also be at that url. You can easily see this by turning on the developer tools/console in your web browser (F12 in IE; menu option in Chrome) and viewing the ‘Network’ tab.
Python as a web browser:
We’ve already seen how to import the socket library and use it to create a socket at our (client) end, then use the connect() function to make the connection to the host at the server end. Now we can just add the relevant HTTP protocol GET request in Python as a parameter within the send() function. Python will now do exactly the same thing as we would have done using the telnet client at the command line.
Here’s the relevant code that turns Python into a web browser:
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect( ('www.domainname.com', 80) )
# The SYNTAX here is: {socket-handle} = {library}.{method-in-library}({parameter1},{parameter2})
mysock.send('GET http://www.domainname.com/documentname.htm HTTP/1.0\n\n')
# The SYNTAX here is: GET {url i.e.: http://www.domainname.com/documentname.htm} PROTOCOL/version# {ENTER} {ENTER}
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print data
mysock.close()
In this example, once the connection has been established and the GET request sent, the data which is received back is read in a continuous loop. Data is received using the recv() function. For as long as there is data coming back, the loop will continue to run, receiving and printing data in blocks of, in this case, 512 characters. Once completed – that is, when ‘data’ is empty – the loop breaks, and the socket connection is closed using close().
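The code above is Python 2 (note print used as a statement, and a plain string passed to send()). For anyone following along in Python 3, sockets send and receive bytes, so the request string must be encoded first. Here’s a minimal sketch of the same request-response loop in Python 3 – it sends just the path plus a Host header rather than the full url, and the domain in the usage comment is only the same placeholder used above:

```python
import socket

def build_request(host, path):
    # An HTTP/1.0 GET request, terminated by a blank line;
    # Python 3 sockets work in bytes, so encode before sending
    return 'GET {} HTTP/1.0\r\nHost: {}\r\n\r\n'.format(path, host).encode()

def fetch(host, path, port=80):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((host, port))
    sock.send(build_request(host, path))
    chunks = []
    while True:
        data = sock.recv(512)   # read up to 512 bytes at a time
        if len(data) < 1:       # empty result = server closed the connection
            break
        chunks.append(data)
    sock.close()
    return b''.join(chunks)

# fetch('www.domainname.com', '/documentname.htm') would return the
# raw HTTP response (headers followed by the body) as bytes.
```
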
Improving the browser using urllib:
Using the library called urllib (documentation here) we can cut out even those few lines of code we needed using the socket library above, and just pass the url directly into our program and have it act as if it were a file. Urllib produces/sends all the necessary syntax in the background: creating our end of the socket, connecting it to the host, sending the GET request in the required syntax, etc. The code we now need to read a webpage is really simple:
import urllib
fhand = urllib.urlopen('http://www.domainname.com/documentname.htm')
for line in fhand:
    print line
When you run this code on a valid url, the first thing you notice is that, as we are using the for line in {file handle} : construct, each line being read in the loop includes a closing new line (\n), so we will need to use the rstrip() function to remove those new lines when reading data in line by line. On the plus side, the loop is a finite loop which ends automatically once all the data has been read from the webpage.
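A quick illustration of that trailing-newline point, using a hard-coded list standing in for lines read from a file handle (print() shown Python 3 style):

```python
# Lines read from a file handle (or from urlopen) keep their trailing
# newline; rstrip() strips it off the right-hand end of each line.
lines = ['<html>\n', '<body>\n', 'Hello world\n']
cleaned = [line.rstrip() for line in lines]
print(cleaned)  # ['<html>', '<body>', 'Hello world']
```
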
Secondly, you immediately realise the two big benefits of using urllib over the socket library (in addition to it being shorter/easier to code): (1) you can now have the url entered by the user just as if it were the filename of a file held in the working directory, and (2) exactly the same code and types of programming we did before on files can now just be replicated directly for text within webpages (taking account of html tagging, of course).
Here’s a simple example using a test page set up on this website:
import urllib
fhand = urllib.urlopen('http://deborahroberts.info/test-text/')
text = fhand.read()
print text
The above four lines of code will read all the content (including header/meta data, footer, etc. as well as the body of the webpage) into a string called text. We can now delve further into this string, perhaps restricting it to a snipped version which concentrates only on that content held within the body html tags (<body> ... </body>) using the following code:
pos = text.find('<body')
pos1 = text.find('</body>')
snipped = text[pos:pos1+7]
print snipped
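The same find()/slice technique can be demonstrated on a small in-memory HTML string, so it runs without any network connection:

```python
# A minimal HTML document standing in for a downloaded webpage
text = '<html><head><title>Test</title></head><body>Hello</body></html>'

pos = text.find('<body')       # index where the body section starts
pos1 = text.find('</body>')    # index where the closing tag starts
snipped = text[pos:pos1 + 7]   # +7 keeps the closing </body> (7 characters)
print(snipped)  # <body>Hello</body>
```
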
There’s lots more advanced socket programming documentation at docs.python.org/2.7/howto/sockets.html.
Handling HTML tagging when parsing from the web:
I spent the best part of a day playing around with code to handle html tagging in text scraped from the web. You can see the fruits of my labours here:
- Web Scraper: https://github.com/debkr/scraper
  A simple web scraper using the urllib library which parses data including text – with HTML tags – and saves it to an output .txt file.
- HTML Parser: https://github.com/debkr/html_parser
  A program to take an HTML-tagged text file, clean/de-tag it, and save the cleaned text to an output file.
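To give a flavour of what a de-tagging program has to do, here’s a minimal sketch (this is my own illustration, not the actual code in the repositories above): walk through the text character by character and keep only what falls outside angle brackets.

```python
def strip_tags(html):
    # Naive de-tagger: drop everything between '<' and '>',
    # keep everything else. (Real HTML needs a proper parser.)
    out = []
    in_tag = False
    for ch in html:
        if ch == '<':
            in_tag = True
        elif ch == '>':
            in_tag = False
        elif not in_tag:
            out.append(ch)
    return ''.join(out)

print(strip_tags('<p>Hello <b>world</b></p>'))  # Hello world
```
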
While I realise this is so far removed from what clever, intelligent people are doing out there (see Google’s recently released-to-open-source Parsey McParseface [4] for example), I do have to start somewhere and it’s all good practice!
Next up will be a quick overview of HTML, and an intro to the BeautifulSoup library which will offer us a quick and easy way to parse HTML text from websites. (It’s probably a darn sight quicker than writing your own code, but I won’t stop doing my own thing in my own unique sledgehammer fashion, because I value both the learning opportunity and self-reliance very highly!)
In the meantime, here’s a useful website from Ohio University showing the main HTML mark-up tags.
Read more like this:
This post follows on from earlier Coding 101 posts and records my responses and learnings from the highly-recommended Python programming book and Coursera specialisation by Charles Severance (see References below).
Notes:
[1] The full format of the html link tag is: <a href="http://www.deborahroberts.info/document.htm" target="_blank">anchor text here</a>.
[2] Internet protocols are maintained by the Internet Engineering Task Force (IETF): http://www.ietf.org/
[3] The telnet client is available directly in the command line on Macs or machines running Unix. It is not preinstalled on most Windows OS versions, but is available as a download: see telnet client for Windows.
[4] Read more about Parsey McParseface:
– Google blog post: Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open Source
– Google research paper: Globally Normalized Transition-Based Neural Networks, Daniel Andor et al.
– See also Syntaxnet on Github (in tensorflow models)
References:
Book: Python for Informatics: Exploring Information by Charles Severance
Course: Using Python to Access Web Data by Univ. of Michigan. Part of the Python for Everybody specialisation.