Python and the internet data playground:
The internet is a giant data playground just waiting for us to explore it. This part of the Coursera Course I’m studying (see refs below) covers how to collect data from the web so we can easily record, manipulate and analyse it.
This is where the increasingly-common terms web scraping and parsing come in. (Scraping refers to collecting data from the ‘net, while parsing refers to reading and analysing strings of data/info from the web, just like our previous examples of reading text from files.)
We’ll get to access data using web APIs (Application Programming Interfaces), and learn how to handle data in different technical formats like HTML, XML and JSON. So this is definitely where things will start to get exciting.
Regular expressions:
Regular expressions are a unique and specialized string-matching programming language. It is separate from the main body of coding we’ve been learning in the Python language, but able to be recognised within Python (or within various other programming languages) by means of a built-in regular expression processor.
They are very powerful and useful for searching strings and extracting data from them, whether searching for characters, words or certain patterns of characters, especially where complex in nature. The great thing is they allow us to search or extract much faster and easier than if we were to do so using Python code (although we still have the functionality to do so in Python if we prefer by working through each element of the search step-by-step).
Wild-cards:
In regular expressions, marker characters (symbols like *, +, $, etc.) are used instead of tokens (words like sorted, sort, etc.). A term often used when describing regular expressions is ‘wild-cards’ as related to searching and matching functions. Most people will be familiar with wildcards from advanced searching in Excel or other office-based applications: e.g. searching for ‘A*’ or in some cases ‘A*.*’ finds anything beginning with ‘A’ (i.e. finds simple ‘A’ as well as all wordings beginning with A (‘Aardvark’ to ‘Azalia’) all the way through to alpha-numeric strings like ‘A9’ or alpha-symbol strings like ‘A$’ and so on).
Regular expressions used in computer science and programming are much more esoteric than that, so we have a bunch of new things we need to learn. Below is reproduced the quick guide from the Coursera course (also available for download from Dr. Chuck’s website here). More general info on regular expressions is available on Wikipedia and info specific to Python on the Python documentation website.
Regular expressions in Python:
To search and manipulate data and strings in Python with regular expressions, we first need to import the relevant regex processor: import re. We can then use functions like re.search() to search for a specified expression in a string and re.findall() to extract specified portions of a string. These are quivalent to using Python’s built-in find() function and string slicing name[n:m].
Searching (matching):
To search in a file to find and print all lines of text in the file which include a certain word, using find() we’d use code as follows. This finds and returns the position in the string of the target word, then checks to make sure that position is zero or greater than zero (i.e. the target word was found within the line):
fhand = open(fname)
for line in fhand :
____line = line.rstrip()
____if line.find('target') >= 0 :
________print 'Found target in line:', line
The same thing using regular expressions will look like this:
import re
fhand = open(fname)
for line in fhand :
____line = line.rstrip()
____if re.search('target', line) :
________print 'Found target in line:', line
The above string search is very simple; regular expressions are often much more complex and use a variety of special characters/symbols to hone in on very specific elements within the string.
The above examples show how to find a target within the string; the following would be used instead if wanting to search for all lines beginning with the target. Python uses the startswith() function; the regular expression adds a ‘^’ before the target.
The relevant line in Python becomes:
____if line.startswith('target') :
The relevant line in a regular expression becomes:
____if re.search('^target', line) :
Extending the regular expression (matching):
The benefit of regular expressions over normal Python search functions is that we can build up much more complex searches, or build multi-element searches into one line of code contained within the brackets of the re.search() function. Here’s where the wild-cards come into their own. A commonly used wild-card will be dot, which searches for a match on any character, so used as e.g. ‘A.’ matching any string with A followed either by nothing at all, or followed by any character at all.
The dot is often combined with the asterisk, e.g. ‘A.*’ which matches any string with A followed either by nothing at all, or followed by any character at all repeatedly. For example, ‘F.*:’ will find and match within a string a sequence of characters in the format F___: where what comes between ‘F’ and ‘:’ could be anything, or nothing. Similarly, ‘^F.*:’ will search for the same thing but at the start of the string only, while ‘F.*:$’ searches for the same thing at the end of the string.
This is particularly useful if searching for strings which take a specific format (e.g. a barcode, a nominal ledger code) where we want to extract data from relevant codes only and where the relevant codes are easily identified by having one or several characters within that code being specific characters. This could be all nominal ledger codes beginning with the character ‘A’, or any stock code ending with the character ‘2’, and so on.
The dot is the wildcard which matches any character (or none), including whitespace such as blanks spaces or new lines. We can strip out white spaces and only match for non-whitespace characters instead, using ‘\S’. This would look like ”F\S:’ to match a string in the format F___: where what comes between ‘F’ and ‘:’ could be anything except whitespace, or nothing.
Searching for none or any number of instances of a character – using ‘.*’ for any character or using ‘\S*’ for any non-whitespace character – has its uses. But instead we may want to search for #one or more instances of any character. This will be more pertinent if we are searching through text with specific known formats. Here we use ‘+’ instead of an asterisk, so ‘^A\S+0’ will match any string which begins with the format A___0 where the character(s) appearing between A and zero are any one or more non-whitespace characters.
Searching for one or more things using a set:
We don’t just have to search for a string as a partial element within a larger string, or search just for one target within a string. We might also want to search for one of several targets. To do this we employ sets – a collection of targets identified using square brackets. It can look like this:
stringlist = ['happy','sad','hopeful','angry','optimistic','fearful','unconfident']
for word in stringlist :
____if re.search('^[h,o]', word) :
________print 'Learning Python makes me', word+'!'
Here the string we are searching for, in the variable called word, begins with (denoted by ^) either ‘h’ or ‘o’ (denoted by [h,o]). When we search for anything within the set, we can also combine the set with other regular expression markers (such as ^, $, \S, *, +) which keep their usual meaning and simply applied to each on of the characters/search terms in the set, each one in turn. The set can either take string values or numeric values, as we’ll see in the numeric examples below.
Extracting data:
When searching using re.find() above we used a conditional statement which returned a True/False result dependent on whether the required search expression was found in the string or not. But we also want to be able to extract data from the string, not just search in it. To extract data we use the re.findall() function. This allows us to find and return strings or elements within strings which match our specified criteria. The found and extracted elements are returned as a list. If there are not specified elements found/extracted then an empty list is returned.
A simple match and extract using re.findall() and sets looks like this:
string = 'This is 1 example, containing 3 numbers, and 7 words.'
x = re.findall('[0-9a]', string)
print x
Note that in the above example, we asked to find and extract any instances of the integers in the range 0-9 or the character a. We don’t need to separate 0-9 and a by commas within the square brackets; if we did so, Python would also find and extract instances of the comma character as well. Whatever instances of data the code finds it automatically appends into a list (in this example we’ve called the list x). Printing out the list in the above example gives: [‘1’, ‘a’, ‘a’, ‘3’, ‘a’, ‘7’]. Don’t forget, all those data elements are extracted as string types. We can use a for .. in loop to loop through the list and print out all instances appended to the list, or perform some other executable on them as required.
Because we only asked here to extract integers in the range 0-9, only single-digit integers will be found and extracted. If there were double-digit integers, or floating point decimals, the regular expression needs to be extended to handle them. The relevant search terms to use to extract multi-digit integers will be:
x = re.findall('[0-9]+', string)
To extract either multi-digit integers or 1+ instances of the character ‘a’, we would use:
x = re.findall('[0-9a]+', string)
A more complex example of extracting data using re.findall() and sets is shown below. Here we are mimicking a line string being read in from a CSV file containing both words and numbers. The example I have in mind here is the aged debtors analysis report I’m (occasionally) working on in the background.
numstring = 'This, line, includes, both, words, and, numbers, 123.45, 45.00, 0.00, 0.00, 45.45, 33.00'
y = re.findall('[0-9.]+', numstring)
print y
for y in y :
____print y
Here the set of characters to be found and extracted are integers in the range 0-9 and the decimal point character. Note that the decimal point character needs to be contained within the square brackets to show that it is an actual character being searched for, not the dot wild-card marker. The ‘+’ after the closing square bracket indicates that any sequence of digits including the decimal point (‘+’ means one or more) should be extracted as a string element.
In the given example, printing the list y prints out: [‘123.45’, ‘45.00’, ‘0.00’, ‘0.00’, ‘45.45’, ‘33.00’]. Note these are all strings, not floating point decimals, although it is possible to convert each string value in the list to a float and append to a new list of floating point values in order to be able to perform arithmetic calculations on them. A simple bit of code to do this would be:
yfloat = []
for y in y :
____y = float(y)
____yfloat.append(y)
print yfloat
When the new list yfloat is printed it looks like this: [123.45, 45.0, 0.0, 0.0, 45.45, 33.0]. These are all now floats.
Regular expressions are greedy:
Which means that, if there are two or more possible sub-strings (string elements) the findall function can return, it will always choose to return the larger (longer) one, i.e. greedy matching. Example:
print '\n', 'Greedy greedy regex'
string = 'This string starts with this; it has quite a few semi-colons in it; each semi-colon is going to be seen; only the last one gets used by the greedy regex!'
z = re.findall('^This.+;', string)
for z in z :
____print z
This prints out the string element: “This string starts with this; it has quite a few semi-colons in it; each semi-colon is going to be seen;”.
We can overcome that greediness by using the marker characters ‘*?’ or ‘+?’ in place of ‘*’ or ‘+’ respectively; this gives non-greedy matching. This will then extract only the shortest string element, i.e. it will truncate what it finds after the first time it sees the final character (semi-colon in the above example).
print '\n', 'Regex goes on a diet'
string = 'This string starts with this; it has quite a few semi-colons in it; each semi-colon is going to be seen; only the last one gets used by the greedy regex!'
z = re.findall('^This.+?;', string)
for z in z :
____print z
Now the much shorter string element gets printed out: “This string starts with this;”.
String slicing gets an upgrade:
Before we had access to regular expressions, when we wanted to, say, extract all the email addresses from lots of text, we had to do lots of fiddling about with finding position numbers of the ‘@’ symbol then slicing the string by the required position ranges, checking sub-string lengths and slicing again. It worked, but it was very cumbersome. Now we can use regex’s to make this much cleaner and neater.
text = 'From somebody@somedomain.com Sun 8 May 2016 16:05:00'
email = re.findall('\S+@\S+', text)
print email
The regex target ‘\S+@\S+’ finds the ‘@’ symbol and all non-whitespace characters which both precede it and supersede it (the use of ‘+’ ensures that at least one non-whitespace character will be returned on either side of ‘@’). The findall function extracts and returns the resulting string element into a list called email. The result looks like this: [‘somebody@somedomain.com’].
Where there’s several kinds of line which might contain an email, and we just want to hone in on only those lines of text which begin with ‘From’ and extract the email from those lines only. We did this in Python using conditional statements and loops; again this worked fine but was a bit cumbersome. Well, once again regex’s give us a very simple and neat way to do that too. And we can put it all on one line of code (using this line in the above example instead of the previous findall line):
email = re.findall('^From (\S+@\S+)', text)
Here we’ve used ‘^From ‘ which, as we saw earlier, looks only for those strings beginning with ‘From ‘. Then the second search element ‘\S+@\S+’ does exactly as before: it finds the ‘@’ symbol and all non-whitespace characters which both precede it and supersede it. What’s different in this example is that the ‘\S+@\S+’ expression is contained within an embedded pair of brackets. This indicates to the regular expression processor that this bracketed section is to be extracted. Therefore the full line of code first ensures the string starts with ‘From ‘, second finds the relevant string element (in this case the email address), and third extracts that string element and returns it to a list. As before, if nothing is found and empty list is returned.
Another example is extracting domain names only. Again this can be done really quickly using re.findall replacing this line in the example above:
email = re.findall('@(\S+)', text)
Here the regex finds the ‘@’ symbol and extracts any and all non-whitespace characters which supersede it. Note that if we just used the regex ‘@\S+’, we will extract the ‘@’ symbol along with the domain name and extension. By using ‘@(\S+)’ the string element beginning with @ is found but only those characters within the parentheses are actually extracted.
Another way to get the same result (given by Dr. Chuck in the course) is as follows. Here the regex looks for the string element beginning with @ (as before), and then proceeds to extract whatever is indicated by the parentheses (as before). What differs is the use of ‘[^ ]*’. In this case ‘[^ ]’ is a special single-character operator meaning “not-blank-space” and the * after it indicates none or any of the same which follow, i.e. return the first and any subsequent non-whitespace.
email = re.findall('@([^ ]*)', text)
Note: [^ ] can be seen from the regex quick guide above as meaning “match a single character not listed in the set”. In this case the set has only one element ‘ ‘ (blank-space), but it could contain any number of elements. For example [^fghij] would mean “match any single character which is not f, g, h, I or j”.
It’s also possible to introduce line matching as well as the extraction – so for example to search for only those lines beginning with ‘From ‘ and extracting only the domain name from those lines, uses the following line of code:
email = re.findall('^From .*@([^ ]*)', text)
‘^From ‘ does the same as before – it matches only those lines of text beginning with ‘From ‘. The second element of the search criteria ‘.*@’ prompts the program to look for the string element consisting of any kind of character (‘.’) repeated any number of any times (‘*’) followed by the @ symbol. Thereafter the same extract criteria is used as before, as defined by the contents of the parentheses.
Searching or extracting actual characters:
One important exception when searching or extracting using regular expressions is where we want to return an actual character which is normally understood by the regex processor as being a marker character (e.g. the $ sign). We need a way to indicate that we are referring to the actual character. They way to do this is generally to put a backslash in front of the required character:
found = re.findall('\$[0-9.-]+', text)
Useful applications:
We’ve seen how using regular expressions can speed up our searching and extracting tasks. Now if we want to go back and work on the aged debtors analysis report, we could use regular expressions to pick out the decimal values in a report, convert them to floats and sum them:
import re
fhand = open('sampledata.txt')
sum = list()
for line in fhand :
____line = line.rstrip()
/snip/
## error-guard (remove dot from customer name but not from decimal values)
____pos = line.find('.')
____errdot = line[pos-1:pos+1]
____if re.findall('[0-9]', errdot) :
________corrdot = errdot
____else :
________corrdot = errdot[:1]
____line = line.replace(errdot,corrdot)
____nums = re.findall('([0-9.-]+)', line)
____print 'Customer number:', nums[0]
____print 'All numbers:', nums
____print 'Adding...'
____i = range(len(nums))
____for i in i :
________if i == 0 : continue
________nums[i] = float(nums[i])
________if sum == [] :
____________sum.append(nums[i])
##__________print 'added for first time', sum[i-1]
________elif len(sum)+1 < len(nums) :
____________sum.append(nums[i])
##__________print 'added a new one to list', sum[i-1]
________else :
____________sum[i-1] = sum[i-1] + nums[i]
##__________print 'no need to add any more to list, just summing'
____print 'Sum so far:', sum
print 'TOTAL Sum:', sum
The above is a pretty simply program for summing all like-values in like orders. There were a couple of error handling patterns I had to add in order to clean the data: (1) remove spurious dots from non-decimals (within one of the debtor categories, and within peoples’ surnames), and (2) remove dashes from peoples’ surnames. I’ve only shown the dot-in-surname guard pattern above for brevity.
This was a useful exercise to help develop my coding skills further, particularly in refining skills both in iterative development, and in debugging. I’ve left a few of my debugging print statements as comment lines in the above code as examples.
My preferred option for writing this debtors analysis program is still to use tuples though, i.e. lists of key/value pairs, where the key will be the report header and the value will be the required sum. So there’s still a fair bit more development work to do here. But I’ve decided to park this project for now because:
- (a) the full report requirement is pretty complex which will take up yet more time to get my head around building properly;
and most importantly:
- (b) I’m really impatient to go on to learn more about web scraping, which I see as far more fun, useful and massively more in line with my personal interests and growing list of projects right now.
Onwards and upwards!
Read more like this:
This post follows on from earlier Coding 101 posts and records my responses and learnings from the highly-recommended Python programming book and Coursera specialisation by Charles Severance (see References below).
References:
Book: Programming for Informatics – Exploring Information by Charles Severance
Course: Using Python to Access Web Data by Univ. of Michigan. Part of the Python for Everybody specialisation.