Recap:
As ever, these postings document my responses to and learnings from the much-recommended Python for Everybody specialisation on Coursera (links below) and follow on from earlier posts in the series (bookmarked here).
The previous post was an intro to the XML language used as a data exchange protocol between different applications, especially where data is being exchanged in the form of documents which humans need to read as well as machines. XML is particularly useful where the data has a branching tree structure (parents and children) with lots of nested elements.
Introduction to JSON:
Now we go on to look at JSON, the JavaScript Object Notation. JSON is another data exchange protocol that works smoothly and easily for machine to machine, or program to program, data exchange, but again can be simply read and understood by humans too. Documentation on the JSON format is available here: http://www.json.org/.
JSON is built around two simple data structures, which appear almost universally in modern programming languages, making it incredibly simple to learn and flexible to use. The two data structures are:
1. Object: The key/value pair (known as hashmaps in Java, dictionaries in Python, and by a variety of other names in other languages);
2. Array: An ordered list of values (known as arrays in Java, lists in Python, and by other names in other languages).
The JSON object looks exactly like a curly-bracketed Python dictionary: { 'key1': 'value1', 'key2': 'value2', 'key3': 'value3' }, while the array looks exactly like a square-bracketed Python list: [value1, value2, value3]. Just as in Python, a value contained in either an object or an array can be a string, a numeric, a Boolean (true/false in JSON, True/False in Python), can contain nothing (null), or may itself be another object (equiv. of dictionary) or another array (equiv. of list).
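To illustrate, here's a small made-up JSON snippet (my own example, not from the course) showing each value type in one place – strings, numerics, Booleans, null, a nested object and a nested array:
{
    "name" : "Debsey",
    "posts" : 42,
    "active" : true,
    "nickname" : null,
    "website" : { "url" : "www.deborahroberts.info" },
    "topics" : [ "python", "json", "apis" ]
}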
As with XML, there is already a library available in Python for encoding and decoding JSON, documentation here: https://docs.python.org/2/library/json.html
Python and JSON objects:
How we can import a JSON file or string or other data structure into Python looks a whole lot like how we imported the same into Python from XML. Remember here’s what the XML data import version looked like:
import xml.etree.ElementTree as ET
data = '''
<person>
    <name>Debsey</name>
    <website type="personal">
        www.deborahroberts.info
    </website>
    <email hide="yes"/>
</person>'''
tree = ET.fromstring(data)
print 'Name:', tree.find('name').text
print 'Attr:', tree.find('email').get('hide')
Now here’s what that same thing looks like when importing the same data into Python from a JSON data source – here structured as a JSON object:
import json
data = '''{
    "name" : "Debsey",
    "website" : {
        "type" : "blog",
        "url" : "www.deborahroberts.info"
    },
    "email" : {
        "hide" : "yes"
    }
}'''
loaded = json.loads(data)
print 'Name:', loaded["name"]
print 'Hide email:', loaded["email"]["hide"]
It’s easy to notice the differences: first, the different data structure in JSON, and second, how we subsequently call that data in Python. First, the data structure. We learnt earlier that JSON stores data in two structures: objects (key/value pairs) and arrays (ordered lists of values). While XML used specially-defined nested/indented groups of tags, e.g. <name>..</name>, to identify data elements and their values, JSON replicates these as nested sets of dictionaries (denoted by the curly brackets) containing key/value pairs. They look a lot like how we’d see them in Python. The only real difference is the convention in how JSON is presented, using indentation to highlight the nested dictionaries, i.e. the layers and levels of data within the data structure.
If we were to enter this data straight into a Python dictionary, it would look like this:
data = { 'name': 'Debsey', 'website': { 'type': 'blog', 'url': 'www.deborahroberts.info' }, 'email': { 'hide': 'yes' } }
Here the curly bracket denotes a dictionary, which contains a set of key/value pairs, each pair being separated by commas. The value assigned to a given key may be a simple string, as in the example {'name': 'Debsey'}, or a numeric. Or alternatively it may itself be an object: another dictionary of key/value pairs. So when dealing with a key like 'website' – which has a number of different 'attributes' – the value itself becomes the dictionary containing the keys 'type' and 'url', and their respective values. So we can see how the dictionaries are nested within one another.
JSON requires double quotes around its keys and string values (much as XML uses double quotes around attribute values), and we retain the convention of using three single quotes to mark the start and end of the Python string holding the data to be imported. (As before we could import data from any source – file, web page, etc. – but a string is used as an easier and clearer example.)
Also as with XML, we import the relevant library using import json, and read the data string using json.loads() to de-serialise the data into a format understandable to and useable by Python – i.e. an actual (native) Python dictionary. (Note: here loads() – say “load-s” – means load from a string. To load from a read-enabled file in JSON format, we would use load() instead.)
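Here’s a minimal sketch of the file-based version, assuming a file called contact.json (a name of my own choosing) holding the same JSON data as the string above:
import json
fhand = open('contact.json')    # a read-enabled file in JSON format
loaded = json.load(fhand)       # load() de-serialises straight from the file handle
fhand.close()
print 'Name:', loaded["name"]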
To call the relevant data value(s) we can use an indexing method based either on key names or on index positions, just as we learnt previously with Python dictionaries and lists. For example, loaded["name"] returns the value stored in the data-pair labelled ‘name’ in the de-serialised data structure called ‘loaded’. We can also step several layers deep into the nested structure to pull out a value stored at any level, e.g. loaded["website"]["url"] pulls out the value stored in the ‘url’ key/value pair in the ‘website’ dictionary.
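For instance, continuing with the loaded structure from the example above:
print 'URL:', loaded["website"]["url"]    # steps two layers into the nested dictionaries
print 'Keys:', loaded.keys()              # loaded is an ordinary Python dictionary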
Python and JSON arrays:
We can import JSON arrays in a similar fashion and they’ll end up in a Python list. This is useful where we have a list of many like-items, where each item contains some data (either another list, or a dictionary), such as the recipes in a recipe database, or all the names and contact details in an address book.
The example below shows a JSON data structure containing a list of dictionaries. Again this is imported as a string (and so would be read and extracted using the same kind of code as above), but it could come from any data source.
contacts = '''[
    { "name" : "Debsey",
      "blog" : "www.deborahroberts.info",
      "email" : "private"
    },
    { "name" : "Teressa",
      "blog" : "www.teressagreen.com",
      "email" : "hello@teressagreen.com"
    }
]'''
Now the string ‘contacts’ is a JSON array (contained within square brackets), where each item in the array – here representing the entries in an address book consisting of several key/value pairs – is stored within an object (curly brackets), with each object being separated in the [] list by commas (as we’re used to when dealing with Python lists).
But since this data structure is a list (containing dictionaries) we can create a for .. in loop to iterate through all the items in the list (i.e. all the contacts) and do something with the items in each of the dictionaries in the list.
import json
# /snip/ -- the contacts string defined above goes here
contactlist = json.loads(contacts)
for contact in contactlist :
    print 'Name:', contact['name']
    print 'Blog:', contact['blog']
    print 'Email:', contact['email']
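One caveat worth noting (my own addition, not from the course): if a contact were missing a key – say no 'name' field – then contact['name'] would raise a KeyError. As with any Python dictionary, the get() method gives a safer lookup with a default value:
    print 'Name:', contact.get('name', 'unknown')    # falls back to 'unknown' if the key is absent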
Putting it all together:
So we’ve built our web application to scrape or extract some data. But now our web application wishes to access various different services or data sources or other applications out there on the ‘net. To do this successfully, we need a set of instructions or rules around how our application can access each one of those services. This is where the Application Programming Interface (API) comes in. A service-provider will use an API to tell our web application which url(s) to use, how to log in, what format data will be made available in, and so on. It’s basically a mini instruction manual for the application-to-application interface.
The whole method of applications interacting with one another via a pre-defined communications protocol is known as the Service Oriented Approach. Here, a service layer is added which acts as a go-between for various different (and expanding numbers of) applications. Data is provided from each application into the service layer in a pre-specified and easy-to-use format. This can then be easily translated for data input back into any other application. This frees up information and data for widespread sharing to relevant parties, preventing information silos and manual duplication of workloads (e.g. rekeying data, maintaining multiple spreadsheets, etc.). A good introductory video on the Service Oriented Approach can be found on YouTube.
Python and APIs:
So now we’ve got our web application and some web service we wish to hook up to and extract data from, and we’ve got hold of the relevant API documentation so that we can implement it and extract the data. There are two possible web service technologies or architectures we can use. The first is the Simple Object Access Protocol (SOAP), a software-based approach – effectively remote program-to-program calls made over the internet. Because it relies on that extra software layer, SOAP is more complex to learn and use. The second is the REST architecture; REST stands for REpresentational State Transfer.
REST is a resource-based approach where the resources sit remotely. We can then create, read, update and delete those resources remotely as we need. Effectively, we’re viewing the web services we wish to interact with/draw data from as a set of resources. Using the REST architecture makes our interactions with web services very simple, similar to the way that web browsers operate with their GET query/response cycle.
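As a rough illustration (a generic resource-based API of my own invention, not one from the course), the four operations map naturally onto the HTTP verbs the web already uses:
- Create: POST /contacts (add a new contact);
- Read: GET /contacts/42 (retrieve contact number 42);
- Update: PUT /contacts/42 (modify contact number 42);
- Delete: DELETE /contacts/42 (remove contact number 42).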
Some very, very useful APIs:
Note: The following examples and introductory tutorial on how to access them are provided in the Accessing Web Data class on Coursera (see Refs below), with additional info/examples added by me as I play around and experiment with all this stuff.
1. Google Geo-coding API
This is a great way of taking an address and converting it into a lat/long (or Google’s best available estimate of lat/long given the info provided, which may be incomplete). This can then be plotted onto a chart or map, like Google Maps for instance. But it can also work the other way around too: take a lat/long and convert it back to a standard address format.
I won’t bother to list the many possibilities (both personal and business/commercial) to which this could be put – I’m sure your imagination is running as wild as mine is here. (Of course, we have to remember privacy rules and ethical considerations, and only use this function on data where we have the relevant permissions.)
With the correct API key (to authenticate you as a known/authorised user), you can enter a url for any address in the required format (Google’s example: https://maps.googleapis.com/maps/api/geocode/json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&key=YOUR_API_KEY) and it will return the result as JSON data (you can save, open and view it in Notepad++ or some other text editor). Google’s example JSON data:
{ "results" : [ { "address_components" : [ { "long_name" : "1600", "short_name" : "1600", "types" : [ "street_number" ] }, { "long_name" : "Amphitheatre Pkwy", "short_name" : "Amphitheatre Pkwy", "types" : [ "route" ] }, { "long_name" : "Mountain View", "short_name" : "Mountain View", "types" : [ "locality", "political" ] }, { "long_name" : "Santa Clara County", "short_name" : "Santa Clara County", "types" : [ "administrative_area_level_2", "political" ] }, { "long_name" : "California", "short_name" : "CA", "types" : [ "administrative_area_level_1", "political" ] }, { "long_name" : "United States", "short_name" : "US", "types" : [ "country", "political" ] }, { "long_name" : "94043", "short_name" : "94043", "types" : [ "postal_code" ] } ], "formatted_address" : "1600 Amphitheatre Parkway, Mountain View, CA 94043, USA", "geometry" : { "location" : { "lat" : 37.4224764, "lng" : -122.0842499 }, "location_type" : "ROOFTOP", "viewport" : { "northeast" : { "lat" : 37.4238253802915, "lng" : -122.0829009197085 }, "southwest" : { "lat" : 37.4211274197085, "lng" : -122.0855988802915 } } }, "place_id" : "ChIJ2eUgeAK6j4ARbn5u_wAGqWA", "types" : [ "street_address" ] } ], "status" : "OK" }
We can see the resulting JSON data structure consists of a top-level object (dictionary) with two keys: ‘status’, and ‘results’, whose value is an array (list). That array contains a set of nested objects and sub-arrays, with further sub-objects within them. Each of these will, on de-serialisation within Python, be converted into the corresponding set of nested Python lists/dictionaries. This nicely demonstrates the flexibility of JSON as a data exchange structure.
Once we know the required url format to submit to Google we can get Python to manipulate our data to get it into the required format, append our API key, then easily send that url to Google. Once we receive back the JSON data, we can parse this in Python using the JSON library and extract the relevant data. For example, pulling lat/long from the following extract will give us the central point of the address we specified:
"geometry" : { "location" : { "lat" : 37.4224764, "lng" : -122.0842499 },
We can extract the lat/long using the following code snippet (the url is hard-coded here for ease; this can be much-improved upon, with user-defined inputs, error checks, and so on):
import urllib
import json

url = 'http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=biggin+hill+kent'
uhand = urllib.urlopen(url)    # open the URL...
data = uhand.read()            # ...and read back the JSON response
results = json.loads(data)     # de-serialise into nested Python dicts/lists
print '\nAddress:', results['results'][0]['formatted_address']
print '\nLatitude:', results['results'][0]['geometry']['location']['lat']
print 'Longitude:', results['results'][0]['geometry']['location']['lng']
The above code example takes only the first result returned (using results[‘results’][0] – i.e. index position sub-0 in the list). This is fine if only one result is returned, or if we assume the API has returned first the result most relevant to the user’s original input. If more than one result could be required, we can add additional code to count the number of results, loop through them and either display them all, or give the user the option to narrow down the results and only print the one required.
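A minimal sketch of that looping approach (continuing from the results variable above):
print 'Results found:', len(results['results'])
for res in results['results'] :
    print 'Address:', res['formatted_address']    # print each candidate match in turn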
Another cool modification or improvement to the above code is to add a user-input for the address, and to use a url encoder (there’s one built into the urllib library: urllib.urlencode()) to make up the required url to feed into the API using the urllib.urlopen() function as before. So instead of the hard-coded url as in the above example, this would look like:
serviceurl = 'http://maps.googleapis.com/maps/api/geocode/json?'
address = raw_input('\nEnter location: ')
url = serviceurl + urllib.urlencode({'sensor':'false', 'address': address})
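We can then open and read that URL exactly as before; one worthwhile extra (my own addition, not from the course examples) is to check the ‘status’ field of the returned JSON – visible in Google’s example data above – before trying to index into the results:
uhand = urllib.urlopen(url)          # urllib and json imported as before
results = json.loads(uhand.read())
if results['status'] != 'OK' :
    print 'Lookup failed:', results['status']
else :
    print 'Address:', results['results'][0]['formatted_address']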
A similar process is involved for the reverse geo-coding option (converting lat/long back to an address). There are also all kinds of additional improvements which can be made to prettify the above code examples and make them more useful in the real world. And then we can go and integrate that into other applications, websites and so on, to make them cooler and more interactive. So let’s do lots of experimentation, and enjoy!!
A word about the Google Geo-coding API limits for personal and business usage. Usage limits, as per current documentation (accessed 28 May 2016), are 2,500 search requests per day at a rate of no more than 10 requests per second. Higher limits and access rates may be unlocked with a billing plan on a pay-for-usage basis. A further premium plan is also available. Rules exist about how the returned query results may be used and displayed: they must be used for display on a Google map only. Please check policies in force at time of use as they may have changed.
2. Twitter API
Now it gets really exciting. I’m on Twitter (@wearealldatanow) but I don’t use it a massive amount, and certainly don’t use it optimally. I do retweet an awful lot of interesting links (I’m effectively using that as a real-time, in-stream bookmarking facility). That ticks some people off, it seems, and I get a lot of followers who later un-follow (I suspect, although I can’t be sure, that’s due to my copious retweeting at certain times of the day – or it may just be that they were mock-followers in the first place).
There’s so many things I want to use Twitter for and find out more about. Here’s just a brief list of the things I’d like to do initially (I know there’ll be loads more):
a. Who follows me and why (+ who’s un-followed me and why);
b. Access all my retweeted/bookmarked links for my research assistant to follow up on;
c. Curate lists of people and bots useful to follow (I follow a lot of bots for research purposes right now). Allows me to pull out the best links (equiv. to b. above) without needing to retweet;
d. Ditto for a list of “people whose posts I’ve liked”, for similar reasons (hat-tip @Chuck_Moeller for this idea);
e. Set up a personalised version of a curated newsfeed based on my personalised lists.
The Twitter API uses a layer of security called OAuth (link to Twitter’s OAuth documentation), which is a commonly-used open standard for web authentication (refer to the OAuth.net website for more info). OAuth uses tokens issued by the service provider which link to your personal or business API account. A variety of APIs are available: here we restrict ourselves to the REST APIs. Included in the terms for REST APIs are rate limits for requests – generally 15 or 180 requests within each 15-minute window, dependent on the type of request being made. (Rate limits correct at date of accessing documentation, 28 May 2016.) There are also APIs available for e.g. Search, Streaming (over an open connection), etc.
Getting set up for Twitter API (a potted intro):
1. Set up a new application under your account (see the developer section: Application Management – read the documentation for rules around applications/auths/etc.);
2. Get an API key if needed;
3. Retrieve the OAuth codes (keys and secrets) for that application (4 in total: consumer key, consumer secret, token key, token secret);
4. Obtain the resource URL from the Twitter documentation for the relevant search request you wish to perform;
5. Add the OAuth codes into the resource URL together with your required parameters such as the user’s screen name and number of results to be returned. The URL needs to be encoded to include a unique one-time-use OAuth signature. This can be done in Python using an oauth library (one is provided with the course materials; it isn’t part of Python’s standard library). Good practice is to create a hidden or secret function which you can then import and call within your programs so that your private keys are never exposed to others (see the sketch after this list);
6. You can now use this unique one-time URL to access the Twitter API to obtain the required results.
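That “hidden” function might look something like this – a minimal sketch echoing the hidden.py helper used in the course (the module/function names are conventions from the course materials, and the placeholder strings obviously need replacing with your own keys):
# hidden.py -- keep this file private and out of any shared or public code
def oauth() :
    return { "consumer_key" : "YOUR_CONSUMER_KEY",
             "consumer_secret" : "YOUR_CONSUMER_SECRET",
             "token_key" : "YOUR_TOKEN_KEY",
             "token_secret" : "YOUR_TOKEN_SECRET" }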
Getting and dealing with the resulting Twitter data:
Using the unique one-time URL you made using the above process, based on the resource URL, you can perform your required GET request to go and access some data from Twitter, which is returned in JSON format. The example shown here is GET statuses/user_timeline. In this example, the specified user’s timeline gets returned as a list of tweets (provided you have the required access permissions for that user’s account/timeline). This is equivalent to what you would see if viewing the timeline in your browser. It’s limited to up to 3,200 tweets (including native retweets), although you will also be limited by your own rate limits. In practice you’ll want to restrict the number of results returned using the count parameter.
The resource URL element (provided by Twitter in the GET statuses/user_timeline documentation) is: https://api.twitter.com/1.1/statuses/user_timeline.json. To this will be added the following application-specific attributes:
- Parameters:
- count (the number of results (e.g. tweets, friends, etc.) you wish to return);
- screen_name (the username you wish to query);
- OAuth Signature Elements (1-4 and 6 below are encoded automatically by the oauth library):
- oauth version;
- oauth token;
- oauth nonce (a one-time key which ensures this instance of the signed request is sent only once);
- oauth timestamp (ditto);
- the four keys/secrets referred to above (consumer key; consumer secret; token key; token secret);
- oauth signature method.
Once the oauth library is available we can import it into our code along with the urllib library (since we will want to open() and read() the URL, once made, to access our required Twitter data). We should access the hidden file we created which stores our individual keys/secrets and attach them, together with the other elements of the OAuth signature and the other required parameters, to the relevant resource URL. (We could enter them manually if we’re just working with our program privately, but this is both less secure and less convenient.) The URL is then ready to be used in the GET request to retrieve the JSON data, which we can query or manipulate as needed.
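Putting that together, here’s a sketch of the URL-signing step. It follows the pattern of the twurl.py helper provided with the course, and assumes the course’s oauth.py module and the hidden.py file sketched above sit alongside it:
import oauth
import hidden

def augment(url, parameters) :
    # Build a signed, one-time-use URL from the resource URL plus our parameters
    secrets = hidden.oauth()
    consumer = oauth.OAuthConsumer(secrets['consumer_key'], secrets['consumer_secret'])
    token = oauth.OAuthToken(secrets['token_key'], secrets['token_secret'])
    request = oauth.OAuthRequest.from_consumer_and_token(consumer, token=token,
        http_method='GET', http_url=url, parameters=parameters)
    request.sign_request(oauth.OAuthSignatureMethod_HMAC_SHA1(), consumer, token)
    return request.to_url()

url = augment('https://api.twitter.com/1.1/statuses/user_timeline.json',
              {'screen_name': 'wearealldatanow', 'count': '5'})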
A useful tip is to read both the data itself (the body of the JSON data, i.e. the keys/values contained in the data structure) and also the headers (which return various status information as well as descriptions of what the JSON data structure contains and where). The body of the data is extracted as normal, using read(). The headers can be returned via info().dict. For example:
uhand = urllib.urlopen(url)
data = uhand.read()
headers = uhand.info().dict
The body data is returned as a JSON array (equiv. list) of nested objects (equiv. dictionaries), while the header data is returned as a simple Python dictionary. Usefully, within the header dictionary are two fields, one showing when the rate limit will expire, and another showing the limit remaining (i.e. the number of remaining requests which can be run in this time window). These can be looked up by indexing with the relevant key:
print headers['x-rate-limit-remaining']
print headers['x-rate-limit-reset']
The data file itself is a JSON data structure, and we can read/de-serialise and then interrogate this data in exactly the same way as in previous examples above.
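For example, a minimal sketch of pulling the text out of each returned tweet (assuming data holds the JSON returned by the user_timeline request above):
timeline = json.loads(data)    # a list of tweet dictionaries
for tweet in timeline :
    print tweet['text']        # the text of each tweet/retweet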
Again, plenty of opportunity to play and experiment here (within the rules laid down by Twitter) and I’ll be developing a few mini programs along the lines of the ideas I listed earlier. I’ll post links to those as I develop them.
Going public (tentatively):
I’ve decided it’s high time I start tweeting about – and linking to – the projects I’m blogging about.
Although I feel like a complete beginner still, and to be honest, a bit of a fraud for writing about such things anyway (let alone daring to tweet about them), I’ve been getting some good (and kind, complimentary) feedback from different quarters – including from people who are advanced software engineers/developers themselves. It’s been encouraging to me to receive such feedback. And kind of a validation too, I have to say.
While a part of me doesn’t want to open up this blog – and hence myself – to criticism or ridicule (and frankly, that’s what I expect since I’m clearly no expert here at all), I do recognise it’s something which has to come at some point. My mind’s buzzing with so many ideas and larger projects (not just mini programs like the ones I’ve been working on so far, but books, workshops, freelance ideas and larger software ideas too). Some of these ideas are now running in the background in private mode.
I’m finally beginning to feel confident that I could conceivably create myself a portfolio career out of all these fun, creative, exciting, experimental things I’m working on. And perhaps (dare I think it?) never have to work in accountancy again.
Sitting here contemplating that possibility, I just can’t help grinning like a Cheshire Cat. :D
Notes:
* Dr. Chuck’s recommendation is to store dates internally in a database in the generally-accepted computer format YYYY-MM-DD… etc., and if they need to be displayed to users in a different format, to handle the conversion at point of display only (without changing what is stored internally). Top tip.
Read more like this:
This post follows on from earlier Coding 101 posts and records my responses and learnings from the highly-recommended Python programming book and Coursera specialisation by Charles Severance (see References below).
References:
Book: Python for Informatics: Exploring Information by Charles Severance
Course: Using Python to Access Web Data by Univ. of Michigan. Part of the Python for Everybody specialisation.