Coding 101 (part 7)

Lists work great but they leave something on the table:
I’ve been building a Tagging Engine in Python as a little exercise to help me learn by doing, using my knowledge so far. It became clear pretty quickly that I needed a better way to handle pairs of data. In this case I was looking at a list of words and the number of times each of them appeared in a text, so that I could rank the most common words by order of significance (frequency). If I just used one list and appended both the word and its count, one value after the other, there was no way I could sort by count while keeping each word attached to its number.

I got round this problem by having two lists, one for the words and another for the word counts. I could then manipulate the data as needed. This did work fine in the simple program I wrote, but it was my usual unwieldy, sledgehammer approach again. I knew there was a way I could handle that pair of data points better – using Python’s Dictionaries functionality – but I didn’t want to rush ahead of the curve. Well now I get the chance to learn all about dictionaries.
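To make the two-list workaround concrete, here is a rough sketch of the idea. The names words and counts and the sample data are made up for illustration, not taken from the actual Tagging Engine, and print() is written with brackets so it runs on Python 3 as well as Python 2.

```python
# Hypothetical sample data: two parallel lists kept in step by position.
words = ['the', 'breach', 'once']
counts = [1, 1, 2]

# Find the position of the largest count, then use that same
# position to look up the matching word in the other list.
top_position = counts.index(max(counts))
top_word = words[top_position]
print(top_word)
```

The weak point is that the two lists only stay linked as long as every change is made to both of them in exactly the same way.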

Dictionaries – an introduction:
Dictionaries in Python are an even more powerful way to handle lots of data, i.e. multiple values of things, than lists are. While lists allow us to store a collection of values (e.g. all the words in the text, all the numbers of things, the names in a group of people, and so on), the list is very one-dimensional. In fact, the list is simply a collection of things that may not even need to be similar to one another. The values within the list variable are ordered (zero, one, two, etc.) allowing us to use indexing (listname[0], listname[1], etc.) to identify a specific item within the list.

Dictionaries, on the other hand, store data as pairs of values (e.g. all the words in a text AND how many times each word appears, all the names of people in a group AND their hair colour, etc.). In this way, dictionaries are like a little mini database, albeit a database of only two linked fields.

The values within the dictionary are not looked up by position as they are in lists; instead each value has its own identifier or label, called a ‘key’. We can think of the key as the ‘field header’ in the database, or the column header in a table of things in Excel. The key is PAIRED WITH its value. For example, ‘Sally’ has ‘brown’ hair and ‘Alice’ has ‘blonde’ hair: ‘Sally’ and ‘Alice’ are the keys, while ‘brown’ and ‘blonde’ are their respective values.
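As a small sketch, the hair colour example could be written like this. The variable name hair is just an illustrative choice, and the curly-bracket form used to create the dictionary is explained further down.

```python
# Keys are the names; values are the hair colours from the example.
hair = {'Sally': 'brown', 'Alice': 'blonde'}

# Look up a value by its key rather than by a numbered position.
sally_hair = hair['Sally']
print(sally_hair)
```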

When I learnt programming at school, the concept of dictionaries was referred to as ‘arrays’ (‘associative arrays’, to be precise, since a plain array is closer to a Python list). They are the same concept, just different names. As you’d expect, different programming languages use different names for dictionaries, such as maps or hashes, just as they do for lots of other concepts and bits of functionality.

As with lists, the multiple (paired) values are contained within one single variable, but this time the variable type is ‘dict’ instead of ‘list’.

Playing with dictionaries:
A great way to see the main differences between lists and dictionaries is just to create a list and a dictionary and then print them out to see what happens…


listname = [1,2,3,'seven','twelve']
print len(listname), type(listname), listname

Prints out: 5 <type ‘list’> [1, 2, 3, ‘seven’, ‘twelve’]


dictname = dict()
dictname['key1'] = 1
dictname['key2'] = 2
dictname['key3'] = 3
dictname['key4'] = 'seven'
dictname['key5'] = 'twelve'
print type(dictname), dictname

Prints out: <type ‘dict’> {‘key3’: 3, ‘key2’: 2, ‘key1’: 1, ‘key5’: ‘twelve’, ‘key4’: ‘seven’}

The first thing to note is that a list can be specified in one go; all the items are contained in square brackets, and separated by commas, or we can append a series of values to the list and they will be added into the list in the order given. When we print out listname, Python prints [1, 2, 3, ‘seven’, ‘twelve’]. As we learnt before, we can call each item in the list using indexing, so listname[0] is 1, while listname[3] is ‘seven’.

When we create a new dictionary, we specify the variable name and tell Python that variable is a dictionary using the dict() function. We then assign each value to each key, using the construct dictname[‘keyname’] = value. Just as we could with lists, we can include any kind of value (numeric or string) in the dictionary. If we want to return a specific value, we need to call that particular keyname. For example, print dictname[‘key1’] will print out 1, while print dictname[‘key4’] will print out seven. Note that Python displays dictionaries using curly brackets (as opposed to square brackets for lists), and shows each pair of items in the format ‘key: value’, with each pair separated by commas.
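Here is a short sketch of looking values up by key. It also shows what happens when we ask for a key that was never added: Python raises a KeyError rather than returning an empty value. The key name ‘key9’ is made up purely to trigger that error.

```python
dictname = dict()
dictname['key1'] = 1
dictname['key4'] = 'seven'

# Fetch values by naming their keys.
first = dictname['key1']
fourth = dictname['key4']
print(first)
print(fourth)

# 'key9' was never added, so the lookup raises KeyError.
try:
    dictname['key9']
    missing_raised = False
except KeyError:
    missing_raised = True
print(missing_raised)
```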

Keys within dictionaries can be strings or other unchangeable values such as numbers, although the use of strings as identifiers is most common. Wherever we have a numerical value stored under one of the keys within the dictionary, we can perform arithmetic operations on that value, for example:

dictname['key3'] = dictname['key3'] + 1
changes the value stored within key3 from 3 to 4.
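A quick sketch of both points, using made-up data: a number works as a key just as well as a string, a list does not (Python raises a TypeError because lists can change), and the increment can be shortened with +=.

```python
# A dictionary mixing a string key and a numeric key.
mixed = {}
mixed['name'] = 'Sally'
mixed[7] = 'seven'

# A list cannot be used as a key, because its contents can change.
try:
    mixed[[1, 2]] = 'a list as a key'
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

# Shorthand for dictname['key3'] = dictname['key3'] + 1
dictname = {'key3': 3}
dictname['key3'] += 1
print(dictname['key3'])
```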

One striking thing about dictionaries is that, when we print out the contents of dictname, it does not return the keys/values in the order we entered them. This is in contrast to lists, which always keep their specified order unless we reorder them, for example with the sort() function. The reason is a technique called ‘hashing’, which Python applies internally to the keys so that it can find each one quickly without searching through every entry. Hashing is used in a lot of different database programs, not just Python. Note that this behaviour belongs to the Python 2 used in these posts: from Python 3.7 onwards, dictionaries keep their keys in the order they were inserted, although values are still looked up by key rather than by position.
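If we need a predictable order when printing, one option is to sort the keys first with the built-in sorted() function and then look each value up in turn. This is only a sketch, and the sample dictionary is deliberately created out of order.

```python
dictname = {'key3': 3, 'key1': 1, 'key2': 2}

# sorted() on a dictionary returns a new list of its keys in sorted order.
ordered_keys = sorted(dictname)

for key in ordered_keys:
    print(key + ' -> ' + str(dictname[key]))
```

The dictionary itself is left untouched; only the list of keys is sorted.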

Lists vs. dictionaries:
-> Just as we can use listname = list() to create an empty list, so we use dictname = dict() to create an empty dictionary.
-> Alternatively, we can set an empty list using listname = [], and similarly we can set an empty dictionary using dictname = {}.
-> We can add values to a list using the append() function, i.e. listname.append(‘newvalue’), whereas to add a key/value pair to a dictionary, we use an assignment statement dictname[‘keyname’] = ‘newvalue’.
-> Values entered into the list remain in the same order, unless we change the order somehow (e.g. using the sort() function), whereas key/value pairs within a dictionary are accessed by key and, in the Python 2 used here, take on their own internally calculated order, which is NOT the same as the order items were entered into the dictionary.
-> Alternatively, we can specify a list up front using square brackets, e.g. listname = [1, 2, 3], and similarly we can specify a dictionary up front using curly brackets, e.g. dictname = {‘key1’:1, ‘key2’:2, ‘key3’:3}.
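The comparisons above can be put side by side in a few lines. This is just a sketch to run and poke at; the variable names are arbitrary.

```python
# Two ways to make an empty list, and two ways to make an empty dictionary.
empty_list_a = list()
empty_list_b = []
empty_dict_a = dict()
empty_dict_b = {}

# Specify the contents up front, then add one more item to each.
listname = [1, 2, 3]
listname.append(4)            # lists grow with append()

dictname = {'key1': 1, 'key2': 2, 'key3': 3}
dictname['key4'] = 4          # dictionaries grow by assignment to a new key

print(listname[3])
print(dictname['key4'])
```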

Applications – multiple counting:
Because we can use a string as a key and pair it with a numerical value which changes (or increments arithmetically), a very useful thing to use dictionaries for is counting the number of times some word appears, which takes us right back to the Tagging Engine exercise I was working on earlier. The dictionary can count, not just one value or one word, but many values or words simultaneously. This allows us to iterate through, for example, a word list and keep track of lots of separate counts all at the same time, thus speeding up the program considerably.

In order to do this in a practical application, such as in the Tagging Engine, we will need to find a way of adding a new word and its count as a new key/value pair. But first we need to check if that word exists as a key in the dictionary already. We can use the Boolean construct if .. not in .. to check this:


dictname = {'open':1, 'closed':0, 'ajar':3}
if 'slightly open' not in dictname : print 'slightly open is not a key in this dictionary'

If we check and find the value is not already a key in the dictionary, we can simply add it using the normal assignment statement dictname[‘newkeyname’] = ‘newvalue’.
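Putting the check and the assignment together might look like the sketch below. The variable newkey and the starting value of 1 are illustrative choices. Note also that ‘in’ looks only at the keys, never at the values.

```python
dictname = {'open': 1, 'closed': 0, 'ajar': 3}
newkey = 'slightly open'

# Only add the key if it is not already there.
if newkey not in dictname:
    dictname[newkey] = 1

# 'in' tests the keys only: 3 is a value here, not a key.
three_is_key = 3 in dictname

print(len(dictname))
print(three_is_key)
```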

Here’s a simple script which uses a dictionary to count and return the number of times a word appears in a pre-specified word list:


wordcounts = dict()
wordlist = ['once','more','unto','the','breach','dear','friends','once','more']
for word in wordlist:
    if word not in wordcounts:
        wordcounts[word] = 1
    else:
        wordcounts[word] = wordcounts[word] + 1
print wordcounts

This will print out: {‘dear’: 1, ‘breach’: 1, ‘the’: 1, ‘once’: 2, ‘friends’: 1, ‘unto’: 1, ‘more’: 2}.

The get() function:
The get() function is additional built-in functionality Python has for dictionaries, which allows us to do the above process far more quickly and easily. Instead of writing script to (a) check if the word is already a key in the dictionary, (b) add the word as a new key, (c) increment the word count by one for each word in the wordlist, we can use the construct dictname.get(keyname,0). This function will return the value held within the key ‘keyname’ in the dictionary ‘dictname’, or – if the key does not exist – it will return the default value 0.
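A small sketch of how get() behaves on its own, using a made-up one-word dictionary. Two details are worth seeing for ourselves: without a second argument get() returns None for a missing key, and in neither case does it add the key to the dictionary.

```python
wordcounts = {'once': 2}

existing = wordcounts.get('once', 0)    # key exists, so we get its value: 2
missing = wordcounts.get('unto', 0)     # key absent, so we get the default: 0
no_default = wordcounts.get('unto')     # no default given, so we get None

# get() only reads; 'unto' has still not been added as a key.
still_absent = 'unto' not in wordcounts

print(existing)
print(missing)
print(no_default)
print(still_absent)
```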

Using this function we can now simplify the above script to count and return the number of times a word appears in a pre-specified word list:


wordcounts = dict()
wordlist = ['once','more','unto','the','breach','dear','friends','once','more']
for word in wordlist:
    wordcounts[word] = wordcounts.get(word,0) + 1
print wordcounts

This prints out exactly the same as before: {‘dear’: 1, ‘breach’: 1, ‘the’: 1, ‘once’: 2, ‘friends’: 1, ‘unto’: 1, ‘more’: 2}.
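Since the whole point of the Tagging Engine was to rank words by frequency, here is one possible way to do that with the finished dictionary. This is a sketch rather than the approach the course takes: it passes wordcounts.get to the built-in sorted() function as the sort key, so the words are ordered by their counts, and reverse=True puts the biggest counts first.

```python
wordcounts = dict()
wordlist = ['once', 'more', 'unto', 'the', 'breach', 'dear', 'friends', 'once', 'more']
for word in wordlist:
    wordcounts[word] = wordcounts.get(word, 0) + 1

# Sort the keys by their counts, largest first.
ranked = sorted(wordcounts, key=wordcounts.get, reverse=True)

# Show the two most common words with their counts.
for word in ranked[:2]:
    print(word + ' ' + str(wordcounts[word]))
```

Words with equal counts can come out in a different order depending on the Python version, so only the grouping by count should be relied on.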

A few more things to wrap up:
There is also a way to convert a dictionary into a list. If we want to create a list of just the keys in the dictionary, we can either use list(dictname) or dictname.keys(). To create a list of just the values in the dictionary, we use dictname.values(). If we want to create a list of pairs, each holding both a key and its value, we use dictname.items().
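Here is a sketch of those conversions on a small made-up dictionary. One version difference to be aware of: in Python 2 these calls return lists directly, whereas in Python 3 they return ‘view’ objects, so wrapping them in list() as below gives a real list in either version.

```python
dictname = {'key1': 1, 'key2': 2}

keys_a = list(dictname)              # just the keys
keys_b = list(dictname.keys())       # the same keys again
values = list(dictname.values())     # just the values
pairs = list(dictname.items())       # (key, value) pairs

print(sorted(keys_a))
print(sorted(values))
print(sorted(pairs))
```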

Something cool which Python allows us to do: if we’ve got a dictionary of key/value pairs and we turn it into a list of pairs using dictname.items(), we can then iterate through both the key and the value together in a definite loop, as follows:


for key, value in dictname.items():
    print key, value

Each key/value pair in that list is known as a Tuple, and we’ll cover that in the next post. In the meantime, I’ve got enough to get on with perfecting my Tagging Engine program. But first, dinner beckons…

Read more like this:
This post follows on from earlier Coding 101 posts and records my responses and learnings from the highly-recommended Python programming book and Coursera specialisation by Charles Severance (see References below).

References:
Book: Python for Informatics: Exploring Information by Charles Severance
Course: Python Data Structures by Univ. of Michigan. Part of the Python for Everybody specialisation.