Quick recap:
In part 10 of this series I learnt a bit about using both the socket library and the urllib library to browse a web page or some other file on a web server, read it and return it as a text file (including any HTML tagging present). I put together two little programs that help me to (a) scrape data or a web page from the ‘net (based on a specified URL) and save it to a text file, and (b) handle the most common HTML tags in that text file. The tags handled so far are as follows:
- <h1>..</h1> tags: cleaned and saved, labelled as ‘Title’;
- <h2>..</h2> tags: cleaned and saved, labelled as ‘Header’;
- <h3>..</h3> to <h6>..</h6> tags: cleaned and saved, labelled as ‘Sub-header’;
- <em>..</em> tags (italics): cleaned and saved, labelled as ‘Para-header’ (Paragraph header);
- <p>..</p> tags: indicate text paragraphs, cleaned and saved only (no additional labels added);
- all other tags: ignored.
What this second program does is identify all the text which makes up the main body of the post, strip out the HTML tags surrounding that text, and add labels to certain lines of text which will later be used to weight the relevancy of the words/phrases contained on those lines. I’ve developed this program primarily for use in reading data from, say, a blog post and cleaning it up (removing the HTML tags), so that I can then count word/phrase frequency and produce a tag list for that particular blog post. Words/phrases which are labelled as ‘Title’ will rate as most relevant, then ‘Header’, ‘Sub-header’ and ‘Para-header’, and finally down to normal paragraph text.
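As a rough illustration of where this is heading (a sketch only – the weights are hypothetical numbers I haven’t settled on, and labelled_lines simply stands in for the labelled output of HTML Parser), the weighting could work something like this:
# hypothetical label weights - 'Title' counts most, plain paragraph text least
weights = {'Title': 5, 'Header': 3, 'Sub-header': 2, 'Para-header': 2}
counts = dict()
for label, line in labelled_lines :    # e.g. ('Title', 'About me')
____for word in line.lower().split() :
________counts[word] = counts.get(word, 0) + weights.get(label, 1)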
But the HTML Parser program in its current form just handles cleaning the first/highest-level tag in a line (i.e. the title, header or paragraph tag which contains the text). I could go on to add improvements to it to handle further HTML tags which appear within the paragraph text itself. The most common types of in-text HTML tagging I might come across are:
- <strong>..</strong> tags: bold text;
- <em>..</em> tags: italicised text;
- <span>..</span>: which generally produces some other kind of formatting such as font colour, or underlining, etc. (example: <span style="color: #000080;">..</span>);
- <a href="{some URL here}">{some anchor text here}</a>: indicating a hyperlink within the text/web page.
The last tag in that list is really important when we want to create a program to browse through various web pages from one starting point (a given URL) then spidering through all the relevant pages linked out from that initial web page. An example of embedded hyperlink HTML tagging is shown below. This is a snipped version of the tagged text file produced when parsing the URL “http://deborahroberts.info/test-text/” into Scraper followed by HTML Parser.
<h1 class="entry-title">About me</h1>
<p>I also continue to work in finance in a consultative capacity, working with individuals and small businesses, sharing my experience and skills in financial management, accounting and business/finance systems development. I’m available for consulting and interim positions in these fields, and you can <a href="http://deborahroberts.info/contact/" target="_blank">contact me here</a> or via my consultancy website <a href="http://autonomyfinance.com/" target="_blank">Autonomy Finance</a>.</p>
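As a taster of how those in-text tags might be handled (a rough sketch of my own – this strip_inline_tags helper isn’t part of HTML Parser yet), a couple of regular expression substitutions will strip the tags while keeping the visible text, including the anchor text:
import re

def strip_inline_tags(line) :
____# drop <strong>, <em>, <span ...> and opening <a href=...> tags, keep the visible text
____line = re.sub('</?(strong|em|span)[^>]*>', '', line)
____line = re.sub('<a [^>]*>', '', line)
____return line.replace('</a>', '')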
Python goes surfing:
So as well as cleaning up the text to remove the HTML tagging, we can also search and find certain tags and parse the data (URL links) contained within them. Now we can create a program which can not just parse the text from one web page, but also collect new URLs and give the user the option to (or automatically go ahead and) parse data from those other linked web pages too.
Let’s say the Scraper program has executed and returned a string called bodytext containing all the text, including tagging, parsed from a specified URL. The following code will then search for all URLs within that web page/text and save all unique addresses to an output file called tempurl.txt:
import re

# extract all links to a list (found anywhere in the text)
templist = re.findall('<a href="([A-Za-z:/.0-9-_]+)"?', bodytext)
urllist = list()
for url in templist :
____if url in urllist :
________continue
____if (url + '/') in urllist :
________continue
____urllist.append(url)
print '\n', urllist
# save to temporary file
fout = open('tempurl.txt', 'w')
for url in urllist :
____fout.write(url)
____fout.write('\n')
fout.close()
If we build another block of executable code after this – to send the next URL through the Scraper loop – we have ourselves a mini web spider or web crawler, which will scrape (read the HTML text from) a part of the web for us. This is in effect a multiple request-response cycle, as if our Python program were acting as a web browser (but reading HTML-tagged text onto our screen/into a .txt file instead of rendering it prettily as a true web browser would). It’s a great way to retrieve data from web pages where there is no built-in web application for doing so (an Application Programming Interface, or API).
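A minimal sketch of that loop might look something like this (scrape_page here just stands in for the Scraper code, and I’ve capped the run at ten pages while testing):
import re
import urllib

def scrape_page(url) :
____# stand-in for the Scraper program: read the whole page as one string
____return urllib.urlopen(url).read()

tovisit = ['http://deborahroberts.info/']
visited = list()
while tovisit and len(visited) < 10 :
____url = tovisit.pop(0)
____if url in visited :
________continue
____bodytext = scrape_page(url)
____visited.append(url)
____for link in re.findall('<a href="([A-Za-z:/.0-9-_]+)"', bodytext) :
________if link not in visited and link not in tovisit :
____________tovisit.append(link)
print '\n', visited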
Scraping is very useful for doing research around a particular subject area, retrieving data from a website, finding new content posted on a website (perhaps you’re looking for the latest news on a certain subject, or for a new property in your desired area being listed for sale), or for reviewing/researching a website for its outbound hyperlinks (to other subject-matter sites, to social media, etc.) or SEO keywords, and so on. We may even want to set up our own search engine on a particular subject/topic (maybe researching our business/products/etc.). We can use web scraping very effectively for this, by parsing the data, analysing it and storing the results in a database.
We should, of course, remember only to scrape those sites/pages/data which are open-access or for which we have the relevant copyright permissions. In any event, some sites will block spiders/web crawlers from accessing their content, or include within their Terms of Service that scripts must not be used to access content on their websites (for example: facebook.com, medium.com, google search), so we have to be careful and responsible in how we use our new-found programming abilities.
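One way to be a responsible crawler is to check a site’s robots.txt file before scraping it; Python 2 includes a robotparser module for exactly this (a minimal sketch, using my own site as the example):
import robotparser    # in Python 3 this lives at urllib.robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://deborahroberts.info/robots.txt')
rp.read()
if rp.can_fetch('*', 'http://deborahroberts.info/test-text/') :
____print 'OK to scrape this page'
else :
____print 'Blocked by robots.txt - leave it alone'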
Beautiful Soup makes web crawling easy:
While we can write the code ourselves (parse a URL, find new URLs, parse those, and so on), there are a number of problems we come up against, as I found out pretty quickly when I started doing this exercise (see the end of part 10, and above)! Firstly, there are a whole bunch of different tags we could have to deal with, and different websites will have different conventions around which tags they’ll use for different purposes (e.g. some people use a certain header tag for all paragraph headers, while I started using italics). So there’s no consistency from one website to the next (and possibly not even from one web page to the next in some cases).
Secondly, and this is a real kicker, not all web pages are coded elegantly. A quick parse of a variety of websites will show this to be true (as I found out when doing my little practices these last few days). On some sites there are indents running amok, causing lots of whitespace to be added to the start of lines, so you can’t just expect a line to start with a particular HTML tag (although we can pretty easily get around that one by using lstrip() to strip whitespace from the start of a line, in the same way we use rstrip() to strip newlines from the end of a line).
But worse than that, a lot of sites do not cleanly include an open and close tag all on the same line; this is especially true of the <p>..</p> paragraph tags which denote some text content. We’d like to be able to zoom in on the text content using the paragraph tags but it’s not always obvious. Sometimes the tag appears at the start of the line but doesn’t get closed again at the end of the line. Other times the paragraph tag opens on a line before the text and closes again on another line after the text, leaving the text floating in a sea of indents and whitespace. Again we do have a way we can deal with this – we can use regular expressions to identify a line which appears to be text content; the relevant line of code (which appears in the HTML Parser program) is:
# exception-handling: treat any line starting with a letter as text content
if re.search('^[A-Za-z]', line) :
____print '\n', 'Found a TEXT PARA:', line
____# add the text string to the text file
These are just the small handful of problems I came across when parsing from just a tiny number of sample websites and web pages. I’m sure if I did this exercise even a few more times, I’d come across a whole bunch of new formatting madnesses which I’d then have to find more workarounds to deal with.
So while it can be done, fortunately there’s already a library for that – it’s called Beautiful Soup and it makes the game of web scraping or web crawling so much easier and quicker. It’s free software, available for download from https://www.crummy.com/software/BeautifulSoup/. Instructions from the Python for Everybody course (this is the course I’m currently studying on Coursera – you can read more by following the links within the References section at the bottom of the page) are to download the software (filename: BeautifulSoup.py) and save it to the same folder as your own Python program (usually C:\Python27 or similar). The install details relate to Python 2.x, although the same software works for both Python 2.6+ and Python 3 (backward-compatible versions are also available from the same link).
However, the crummy.com website suggests installing using pip, a package installer which comes bundled with Python. Navigate to the Python27 directory and then the Scripts sub-directory to check you have pip.exe. If so, you can run the following command at the Command Line (after navigating to the C:\Python27\Scripts directory):
pip install beautifulsoup4
Now we’ve got BeautifulSoup installed and ready to go, let’s do a quick test. Here’s how we call BeautifulSoup in our own Python program – it’s similar to importing other libraries like urllib, although the syntax is slightly different:
from bs4 import *
This line tells Python to import all (*) routines contained within the library. We import from ‘bs4’ for the latest version (Beautiful Soup 4). Older versions were imported using from BeautifulSoup import *, but this returned a “module not found” traceback error with my set-up (searching Google and Stack Overflow helped me solve it).
Once we’ve imported BeautifulSoup, and read a URL as before using the urllib library, we have a few options for reading the links or any other HTML-tagged element within the web page.
Example 1 – Reading just the links and printing them out:
import urllib    # BeautifulSoup itself was imported above with 'from bs4 import *'

html = urllib.urlopen('http://deborahroberts.info/').read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
for tag in tags :
____print tag.get('href', None)
This code opens and reads the given URL (it could instead be user-specified using raw_input()) – note that here we use the read() method on top of urlopen(); this specifies we wish to read the whole file in one go (not recommended for very large files/web pages!). The web text/data is then parsed with the BeautifulSoup library using the ‘html.parser’. The library’s documentation advises it’s better to explicitly state which parser to use; if none is specified, the library will default to the best match it can find, but that may mean you get different results when running the same program in different environments. A list is then created of all the anchor (<a…>) tags BeautifulSoup finds in the text. A loop then reads through the list of found tags and prints out the href attribute (i.e. the URL link) of each one, or the default ‘None’ for any anchor tag that has no href attribute.
Example 2 – Reading all <a> anchor tags in full, including all attributes of the tag, the anchor text, etc. Note this returns every <a…> tag whatever its attributes (href, class, and so on). This version of the code is useful if we want to extract not just the URL link but also the anchor text being used (particularly relevant when reviewing SEO keywords):
soup1 = BeautifulSoup(urllib.urlopen('http://deborahroberts.info/'), 'html.parser')
for link in soup1.findAll('a'):
____print link
Example 3 – Reading various HTML-tagged text, be it headers (<h1>, <h2>, etc.), italics (<em>), bold (<strong>) or anything else we may wish to specify. This will help build an improved version of the Tagging Engine / Blog Tagger, where key words and phrases contained within the headers are to be weighted more highly than high-frequency key words/phrases in the main body of the text:
url2='http://deborahroberts.info/2016/05/coding-101-part-10-more-on-web-data/'
soup2 = BeautifulSoup(urllib.urlopen(url2), 'html.parser')
print '\n', 'All <h1> TAGS:'
for header in soup2.findAll('h1'):
____print header
print '\n', 'All <h2> TAGS:'
for header in soup2.findAll('h2'):
____print header
print '\n', 'All <em> TAGS:'
for italic in soup2.findAll('em'):
____print italic
print '\n', 'All <strong> TAGS:'
for strong in soup2.findAll('strong'):
____print strong
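One refinement worth noting for the Tagging Engine: each tag object BeautifulSoup returns also has a get_text() method, so we can pull out just the visible text without the surrounding markup. For example (a small addition of my own, not from the course):
print '\n', 'All <h1> TEXT ONLY:'
for header in soup2.findAll('h1'):
____print header.get_text()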
Moving beyond the limitations of HTML:
As we can understand from the above – and from our few little experiments and practice sessions – HTML wasn’t really built for readily extracting or retrieving data from web pages, as its primary job has always been the aesthetic or presentational aspects of web pages (how they are rendered on screen when viewed through a web browser). And this isn’t helped by the fact that any user of HTML setting up new web pages may follow any number of different coding conventions; there really doesn’t seem to be any standardisation at all. (That last point seems to be a particular OCD-style bugbear of mine; especially in the digital age, why-oh-why can’t people standardise their inputs and outputs? I just don’t get it.)
Luckily, there is a language which was developed specifically for transferring data in easily-formatted ways which aid both transfer and extraction across the ‘net. That language is XML, or eXtensible Markup Language. What makes XML special is that it creates rules around the encoding of documents and data so that they can be read both by humans and by machines. It’s actually been used a lot in accounting circles, certainly in big organisations and governmental agencies. Think: sending your tax returns electronically to the IRS or HMRC, for instance. (The latter now requires iXBRL tagging of all submissions; iXBRL stands for inline eXtensible Business Reporting Language and is based upon, and an extension of, the XML language.)
But I’m not going to talk about tax returns, you’ll be pleased to hear!
Data sources and web services:
As we develop our skills and interests in browsing and manipulating or otherwise analysing data from the web, we find ourselves wanting to spend less time on cleaning the data and getting it into a nice, friendly machine-useable format (as we tend to do when scraping HTML text) but instead spend more time on analysing the resultant data and getting use and meaning out of it.
Web services developed as a means of sending data from one application to another (client to server, or vice versa) in well-structured formats which machines/computers can read and process quickly and easily, and – in the case of XML – which can also be easily read and understood by humans. XML is just one such structured formatting protocol widely employed for this kind of data transfer. Another commonly-used protocol, which the P4E course will cover later on, is JSON. JSON stands for JavaScript Object Notation and it allows easy transfer of data pairs consisting of an attribute and a value (e.g. the key/value pairs we learnt about when studying dictionaries).
The benefit of these transfer protocols is not just to ensure clean data and efficient transfer, but also to allow different applications – written in different programming languages – to interact with one another. One way to think of this is like an interpreter allowing two people speaking different languages to converse with one another; or like a mapping exercise that allows the Chart of Accounts and nominal structure on an old accounting system to be mapped across to the new Chart of Accounts and nominal structure on a new accounting system.
The web services and data transfer protocols provide this ‘interpreter’ or ‘mapping’ function, necessary because each language has its own notation and its own way of storing data internally. For example, Python uses lists and dictionaries, whereas Java employs HashMaps (which I’ll learn more about when I get time to go on to learn Java programming). XML or JSON act as the go-between, allowing data to pass smoothly out of one language/data structure and into the other.
The transfer service being used is referred to as the Wire Protocol, and employs a structured format known as the Wire Format (named in the days when there actually were wires involved!). The process of translating and interpreting or mapping is called Serialising (converting from the source, e.g. Python, into the wire format) and De-serialising (converting back from the wire format to the destination, e.g. Java).
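A quick taster using Python’s built-in json library (JSON proper comes later in the course, so this is just to illustrate the round trip): json.dumps() serialises a Python dictionary into the wire format, and json.loads() de-serialises it back again.
import json

# serialise: Python dictionary -> JSON wire format (a plain string)
record = {'fname': 'Deb', 'lname': 'Roberts', 'website': 'www.deborahroberts.info'}
wire = json.dumps(record)
print wire

# de-serialise: wire format -> back into a Python dictionary
data = json.loads(wire)
print data['website']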
Deeper into XML (eXtensible Markup Language):
Although JSON is much more common these days, XML – although a bit less intuitive and more difficult to access – still has its uses. It’s particularly strong when dealing with documents or forms (see the tax returns example above, or for example when transferring data internally from one department or division to another in a large organisation).
XML works by tagging data in similar ways to the tags in HTML (XML is the younger cousin of HTML, developed after it and in response to it). But the tags are structured into trees or orders of data, known as Elements or Nodes. We can represent this as a parent field (a complex node) which contains many child fields (simple nodes). Each simple node consists of both the tag (which is like a field header in a table or column header in a spreadsheet) and some data value contained within it. Whereas a complex node consists of its tag, and a number of other tags and their respective values.
XML terminology:
Start tags and end tags are like the <..> and </..> open and close tags in HTML.
Content is the value contained within the tag.
Some tags may be self-closing; these carry some kind of self-contained ‘instruction’ (closing themselves with />) rather than denoting a data field with content of its own.
Tags may also take on attributes, similar to HTML tag attributes.
XML example:
Here’s an example of what some XML could look like. Here, <person>..</person> is a complex tag as it consists of several other simple (one-value) tags such as <fname>..</fname>, <lname>..</lname> and <website>..</website>.
<person>
__<fname>Deb</fname>
__<lname>Roberts</lname>
__<website type="personal">
____www.deborahroberts.info
__</website>
__<email hide="yes" />
</person>
The benefit of XML should be clear; we can create a tree structure to order the data, using our own defined structure and naming conventions, as well as including instructions (such as hide my email address) and useful attributes or metadata (the web URL is my personal website) and so on.
Indentation is not a syntactic requirement in XML, but does allow for ease of reading by humans. A tip: search for ‘pretty print XML’ for options to convert syntactically-correct but ugly/hard-to-read XML into nicely-indented, easy-to-read XML. (Other options are available for prettifying other languages too.)
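Python can even do the prettifying itself, using the built-in xml.dom.minidom module (a minimal sketch):
import xml.dom.minidom

ugly = '<person><fname>Deb</fname><lname>Roberts</lname></person>'
print xml.dom.minidom.parseString(ugly).toprettyxml(indent='  ')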
The attributes are always included within the start tag. They represent key/value pairs, with the value being enclosed in double quotes. The equivalent within Python would be a dictionary, which could look like this:
web = { 'personal': 'www.deborahroberts.info', 'work': 'www.acompany.com', 'twitter': '@wearealldatanow' }
Applications of XML:
I’ve been playing around with the idea of building a recipe and ingredient database. I trained as a chef a long time ago and, although I haven’t worked as a chef for a long time, I still love cooking, have a ton of recipe books and enjoy making up my own dishes too. So that just seems like a natural extension of my newly-developing coding skills. P4E gives us a great example of using XML which just happens to be a recipe database entry! This is what it looks like (slightly modified from the example given in the course):
<recipe title="rich sweet pastry" prep_time="20 mins" cook_time="10-15 mins" servings="6">
__<title>Rich Sweet Pastry</title>
__<ingredient amount="200" unit="g">Flour</ingredient>
__<ingredient amount="100" unit="g">Butter</ingredient>
__<ingredient amount="100" unit="g">Sugar</ingredient>
__<ingredient amount="1" unit="medium">Egg</ingredient>
__<summary>
____A paragraph of text can go here summarising the recipe.
__</summary>
__<instructions type="preparation">
____<step number="1">Cream the butter and sugar together until light and fluffy.</step>
____<step number="2">Beat the whole egg, then gently mix into the butter/sugar mix a little at a time. Add a little sifted flour if it begins to curdle.</step>
____<step number="3">Sift the flour and gently fold into the butter/sugar/egg mix. Mix into a smooth paste.</step>
____<step number="4">Roll out carefully between two sheets of greaseproof paper, to a thickness of 0.4cm. Line a well-greased flan or pie mould.</step>
__</instructions>
__<instructions type="cooking">
____<step number="1">Cook on the middle shelf of a pre-heated oven at 200°C (180°C fan) for 10-15 minutes until golden brown and cooked.</step>
____<step number="2">Leave to cool in the tin before filling with your dessert flan/pie filling of choice.</step>
__</instructions>
</recipe>
While I can’t vouch for the above recipe as I just made it up on the hoof, I can vouch for the XML structure and I’m really excited to go and try this out soon on my next project (LINK to be added shortly). But all this talk of food and recipes is making me hungry so I’m going to have to take a quick tea break now…
XML trees, paths and schema:
XML structures can be visualised as trees, with a parent that branches out into various child nodes. Any one of these child nodes may also branch out further into various grandchild nodes. At the end of a branch there will generally be some kind of value (text, numeric – integer or float, date, etc.). In the above example, recipe is the parent node, with its children: title, ingredient (one per ingredient), summary and instructions. Each instructions node is further broken out into its own child nodes called step.
It’s possible also to represent XML tree structure as if it were file paths, so a/b/c would be a path where a is the parent node, b is its child, and c is a child of b (and a grandchild of a). In the recipe example above the path could be recipe/title, recipe/ingredient, recipe/summary or recipe/instructions/step – depending on which branch of the tree you are traversing. Think of it like surfing the pages of a blog using breadcrumbs.
As mentioned above, we can set the structure (the parent/child branches of the tree) to be whatever is most relevant – both for the data we’re handling and for the application which will use it. The structure we set includes not just the parent/child branches, nodes, their attributes, etc. It also includes a clear set of rules about the data (mandatory or optional, minimum and maximum occurrences) and how it is to be represented (string, integer, float, date), as well as the order it should appear in, and so on. The structure and rules together can be thought of as a ‘contract’, and are referred to as the XML schema.
In the case of complex data exchanges between two parties, a schema is a very necessary thing. Should either party change even one tiny element within the XML data they present, such that it no longer meets the agreed schema, the data will fail validation and be rejected by the other party’s application. Third-party tools are available to validate XML data against the required schema.
There are several different versions of XML schema available and each has its own unique rules/notation. The latest to be developed, and the one recommended as possibly the easiest to understand and learn, was the XML Schema (XSD) as approved by the World Wide Web Consortium (W3C). Documentation is available from W3C here: http://www.w3.org/XML/Schema.html.
One common standardisation in computerland is the date/time format following the ISO 8601 standard: YYYY-MM-DDThh:mm:ssZ (where Z represents the time zone: GMT/UTC)*. The benefit of this format is that – since computers sort text character by character – all dates will sort correctly (year first, then month, day, hour, etc.) in ascending order (or descending order if you reverse it). It’s a bit of a pain to go converting all web-scraped dates and times into this format, but sensible to invoke this standard right from the beginning. I should write a piece of code to do the conversions for me. Another project to add to my list then!
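Here’s a first stab at that conversion using Python’s built-in datetime library (a sketch only – it assumes the scraped date arrives in one known format and is already in UTC):
from datetime import datetime

scraped = '21 May 2016 14:30'                    # hypothetical date scraped from a blog post
dt = datetime.strptime(scraped, '%d %B %Y %H:%M')
print dt.strftime('%Y-%m-%dT%H:%M:%SZ')          # prints 2016-05-21T14:30:00Z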
Python and XML:
Learning the ins and outs of XML is all well and good, but we’re really here to learn how to parse data from the web using Python – so let’s get straight back to that. We saw in previous posts how we can parse or read data in Python straight from a file in our working directory, then we went on to learn how to extract data from the web by replicating the web browser’s GET request/response cycle.
It turns out that an XML parser is already built into Python’s standard library; cool.
The following code snippet has come straight from the P4E course and from the pythonlearn.com website (see links under References below) rather than something I’ve played around with, experimented on and created my own versions of (as has been the case in earlier parts of the course). I just don’t feel as comfortable with this as it’s all so new, so I’m just being really good and sticking to what I’m being shown here!
import xml.etree.ElementTree as ET
data = '''
<person>
__<name>Chuck</name>
__<phone type="intl">
____+1 734 303 4456
__</phone>
__<email hide="yes"/>
</person>'''
tree = ET.fromstring(data)
print 'Name:',tree.find('name').text
print 'Attr:',tree.find('email').get('hide')
OK, first thing to notice is the need to import the XML library. Secondly, when we specify the data in the XML data structure, it’s preceded by the line {variablename} = ''' and closed off by a further set of triple quotes (''') at the end. (Single-quote characters are used for the triple quotes here, since the double-quote character appears within the XML itself.) The variable name in the above example is data, but it could be anything we want (except reserved words, of course). This looks like a bit of code as it has all the indents in it, but the indents are not part of any syntax – they’re just there to make it easier for us humans to read and understand.
It’s not immediately obvious, but what this is really telling Python is: let some named variable (here, called data) contain the string enclosed by the triple quotes – a string which just happens to be some XML data. Normally a string flows along a single line and wraps around (unless there’s a newline (\n) in it). Here there is a newline at the end of each line of the XML structure, and that’s fine; the newlines simply form part of the string contained within the triple quotes.
So here we’ve got our XML data. It might have been read in from a web URL or from a file, using the methods previously learnt. And anything we can do in Python on any other string or parsed data – regexes, finds, searching for tags and extracting their contents, searching for specific data values, and so on – we could do on this string too.
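For instance, just to prove the point (a quick sketch of my own, and not the recommended route, as we’re about to see):
import re
# the do-it-yourself way: pull the name straight out of the XML string with a regex
print re.findall('<name>(.*)</name>', data)      # prints ['Chuck']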
But we don’t want to do that (well maybe as a masochistic learning exercise, but not as a regular event). Instead we should use the built-in XML parser – import xml.etree.ElementTree as ET – which was called in the first line of the above code. Now we can call the XML parser within our program, such as in the lines:
tree = ET.fromstring(data)
print 'Name:',tree.find('name').text
print 'Attr:',tree.find('email').get('hide')
Of these three lines, the first, ET.fromstring(), calls the ElementTree library to parse the XML string called ‘data’ (the equivalent of reading a file or parsing data from a URL). This library performs the XML de-serialisation process for us, so that what it passes back to Python fits Python’s own way of reading and recognising data (now labelled with the variable name ‘tree’). We can now interrogate ‘tree’ and call up certain elements and data values from within the XML data string.
So, for example, tree.find('name').text finds <name>..</name> in the XML data string and returns the contents of that tag. And we can also read a tag’s attributes: tree.find('email').get('hide') finds the email tag and returns the value of its 'hide' attribute. And it does all this a whole lot quicker and more easily than if we were to write the code ourselves to pull the text out of a text string!
Going back to our recipes database:
There’s a whole bunch of different ways to extract, manipulate and loop through the data contained within the XML data structure using ElementTree: read the documentation at python.org to learn more. But just playing around a bit using the example of the recipes database, I came up with the following:
import xml.etree.ElementTree as ET
recipes = '''XML data string goes here'''
root = ET.fromstring(recipes)
print root.find('title').text
print '\nDETAILS'
print 'Preparation time:', root.get('prep_time')
print 'Cooking time:', root.get('cook_time')
print 'Serves:', root.get('servings')
print root.find('summary').text
print '\nINGREDIENTS'
for ingredient in root.iter('ingredient') :
____amnt = ingredient.get('amount')
____unit = ingredient.get('unit')
____ingred = ingredient.text
____print amnt, unit, ingred
for i in range(len(root)) :
____if root[i].tag == 'instructions' :
________instr_type = root[i].get("type")
________print '\n', instr_type.upper()
________for step in root[i].iter('step') :
____________stp = step.text
____________print step.get('number')+'.', stp.rstrip()
This code imports the ElementTree library and sets up the XML data structure in a string named 'recipes' (note I haven’t copied in all the XML here to save space – just refer to the XML data structure above in the ‘Applications of XML’ section). Using ElementTree the XML data is read, de-serialised and converted into a Python-friendly object called 'root' (or whatever name we give it) which structures the data in a format easily stored and read by Python.
(Dr. Chuck’s default suggestion for this variable name was ‘tree’, but I’ve been searching StackOverflow (e.g. here) and the python.org documentation (link as above) to help me get to grips with extracting data from XML and the default being used by others seems to be ‘root’. That makes sense since root refers to the top level, parent, node within the XML data structure. I’ll be using this notation from now on.)
During development of this code, I made full use of the following to help narrow down on relevant attributes, tags and values I wanted to extract. I’m still firmly in the learning-by-doing camp.
print root.tag [outputs the name of the top-level node: recipe]
print root.attrib [outputs a dictionary of the attributes attached to the node, together with their values: {'cook_time': '10-15 mins', 'title': 'rich sweet pastry', 'prep_time': '20 mins', 'servings': '6'} ]
print len(root.attrib) [outputs the number of attributes in that dictionary: 4]
We saw simple lines like root.find('title').text earlier; this finds the title tag and extracts the text content within it. Lines like root.get('prep_time'), on the other hand, return the value stored in the specified attribute of root (i.e. of the parent node, recipe).
ElementTree has a handy in-built iteration function iter() which allows us to iterate through any one type of child within the root – just pass the chosen child tag to the function. For example, root.iter('ingredient') will iterate through all the ingredient elements using our trusty for .. in : loop construct. We can then extract and print out whatever we want from those elements; in this case, we use ingredient.get() to extract the values contained within the 'amount' and 'unit' attributes, and ingredient.text to extract the ingredient itself.
We can also use indexing to identify and call different elements within the data structure. For example, if we execute the following line of code as a check – print root[0], root[0].tag – we find that root[0] relates to the first child element in the XML data structure, whose tag is 'title'. Repeat for root[5] and we find that relates to the 'summary' child element. And so on.
If we check the length of root using len(root) we find it has length 8, corresponding to the eight child elements in recipe, in order: title, ingredient (x4), summary, instructions (x2). (Note that this excludes the grandchild 'step' elements under 'instructions'.) Now we can call any of the child elements and extract their tags, attributes/values and text content as required, using the indexing root[0] through root[7] along with the relevant function get(), find(), .text and so on. And we can loop over range(len(root)) to work through all the child elements in turn by index.
But we can also use conditionals within the loop. So in the code above we use if root[i].tag == 'instructions' : to focus only on the 'instructions' elements. Now we can use the iter() function to loop through all the steps in each of the two instructions elements (with attribute 'type' equal to 'preparation' and 'cooking' respectively) and print out the step number and text content for every step.
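Incidentally, the path notation from the ‘XML trees, paths and schema’ section also works directly with findall(), which gives a tidier alternative to the index-and-conditional loop above (my own experiment, not from the course):
# every step in the recipe, whichever instructions block it sits in
for step in root.findall('instructions/step') :
____print step.get('number')+'.', step.text.rstrip()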
It’s pretty clear how neat this XML protocol is – as long as the structure and order of the data doesn’t change.
Notes:
* Dr. Chuck’s recommendation is to store dates internally in a database in the generally-accepted computer format YYYY-MM-DD… etc., and, if they need to be displayed to users in a different format, to handle the conversion at the point of display only (without changing what is stored internally). Top tip.
Read more like this:
This post follows on from earlier Coding 101 posts and records my responses and learnings from the highly-recommended Python programming book and Coursera specialisation by Charles Severance (see References below).
References:
Book: Python for Informatics: Exploring Information by Charles Severance
Course: Using Python to Access Web Data by Univ. of Michigan. Part of the Python for Everybody specialisation.