Building a Tagging Engine in Python using Dictionaries

I started playing around building a Tagging Engine in Python using Lists, but now that I’ve studied a bit more (particularly Dictionaries) I want to see how I can improve what I was working on. Here are the key things I added or changed in this program compared with the version I was working on earlier.

1. Using a dictionary instead of lists for faster counting and simpler recall. This is straightforward, using the get method as taught by Dr. Chuck (see Coding 101 part 7 for more details): words.get(word,0) returns the word’s current count, or 0 if the word hasn’t been seen yet.

words = dict()
for word in wordlist :
____if word in excluded : continue
____words[word] = words.get(word,0) + 1
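
To see the counting pattern in isolation, here is a small self-contained sketch. The sample sentence and the excluded set are made up for illustration and are not taken from the actual program. It is written to run under Python 3, though nothing in it is version-specific.

```python
# Made-up sample data, for illustration only.
wordlist = 'the cat sat on the mat and the dog sat too'.split()
excluded = set(['on', 'and'])   # hypothetical words to skip

words = dict()
for word in wordlist:
    if word in excluded:
        continue
    # get() returns 0 the first time a word is seen, so there is no KeyError
    words[word] = words.get(word, 0) + 1

print(words)
```

Each word not in the excluded set ends up as a key, with the number of times it appeared as its value.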

2. Counting and returning the most common word in the dictionary. This uses the ability to split a dictionary out into a list of (key, value) tuples using items(), then looping through each key/value pair (in this case word, count).

wordhigh = None ; counthigh = None
for word,count in words.items() :
____if wordhigh is None or count > counthigh :
________wordhigh = word
________counthigh = count
print 'Most common word and its count:'
print wordhigh, counthigh
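
For comparison, the built-in max() can find the same answer in one line when given the dictionary's get method as its key function. This is only a sketch of an alternative, not what the program above does, and the counts are made up. Note that if several words tie for the highest count, max() returns just one of them.

```python
words = {'the': 3, 'cat': 1, 'sat': 2}   # made-up counts

# max() walks over the dictionary's keys and ranks each one by its count
wordhigh = max(words, key=words.get)
counthigh = words[wordhigh]

print('%s %d' % (wordhigh, counthigh))
```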

3. Printing out all the words and their counts, in descending order. Here I’ve stripped out the keys and values from the dictionary into their own lists (wordlist and countlist). I set an iteration variable y as a list with length equal to the highest count (counthigh), so that counthigh - y steps down through every possible count from counthigh to 1. (Sizing y by the number of unique words instead would miss the low-count words whenever the highest count is larger than the number of unique words.) I then used a second iteration variable x, with length equal to the number of unique words (i.e. the length of wordlist), to allow me to scroll through each word in turn in wordlist (using x as the index).

Variable y (the integer list [0, 1, 2 … etc]) is the offset below the highest count. First it prints out the word(s) with the highest count (where y = 0, that is where counthigh - y == counthigh). Next it prints out the word(s) with a count one less than the highest count, and so on. This might be a sledgehammer, but lack of knowledge forces me to use this method. Anyway, it works so I’m happy with this for now.

I added an option for the user to specify whether to print out the descending list of words and counts or not. An improvement on this would be to add an option to list in either ascending or descending order. To list in ascending order, I would simply reverse the order of iteration list variable y using y = sorted(y, reverse=True).

# keys() and values() come back in matching order,
# so wordlist[x] and countlist[x] belong together
wordlist = words.keys()
countlist = words.values()
inp = raw_input('Do you want a complete listing (Y/N)? ')
if inp.lower() == 'y' :
____print '\n', 'All words in descending order:'
____# counthigh - y runs from counthigh down to 1
____y = range(counthigh)
____for y in y :
________x = range(len(wordlist))
________for x in x :
____________if countlist[x] == counthigh - y :
________________print wordlist[x], countlist[x]
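
For comparison, the same listing can be produced by sorting the dictionary's (word, count) pairs directly with the built-in sorted(). This is a sketch of an alternative rather than the method used above. The function name by_count and the sample counts are made up for illustration.

```python
words = {'the': 3, 'cat': 1, 'sat': 2, 'mat': 1}   # made-up counts

def by_count(words, descending=True):
    """Return (word, count) pairs ordered by count, ties in alphabetical order."""
    pairs = sorted(words.items())   # alphabetical order first
    # sorted() is stable, so words with equal counts stay alphabetical
    return sorted(pairs, key=lambda pair: pair[1], reverse=descending)

for word, count in by_count(words):
    print('%s %d' % (word, count))
```

Passing descending=False gives the ascending listing, so the same function would cover both options.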

4. As before, the user can specify how many orders of count frequency they’d like listed, e.g. top 3, top 5, etc. This is especially important when dealing with large text files, where a complete listing would be very long.

inp = raw_input('Enter a number for "n orders of word frequency":')
n = int(inp)
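
One thing to be aware of: int() raises a ValueError if the user types something that isn't a whole number. A minimal sketch of a more forgiving version follows. The helper name parse_order_count and its default of 3 are assumptions made for illustration and are not part of the program.

```python
def parse_order_count(text, default=3):
    """Turn the user's reply into a positive whole number, else use default."""
    try:
        n = int(text.strip())
    except ValueError:
        # reply was empty or not a whole number, e.g. 'abc' or '2.5'
        return default
    # zero or negative makes no sense as a number of orders
    return n if n > 0 else default

print(parse_order_count('5'))
print(parse_order_count('abc'))
```

The string passed in would be whatever raw_input (or input in Python 3) returned.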

5. Getting the user input was the easy part. The problem I had before was how to restrict the program to returning exactly the number of orders of frequency specified by the user. I knew I needed to use a series of loops, but I couldn’t quite get my head around exactly how to do it. After lots of mucking about and trial and error and not getting very far (and questioning whether I should really be doing this on a Bank Holiday Sunday when the sun’s shining outside), I realised I needed to draw a flowchart of the program flow. That really helped a lot, and my final version is as follows:

# while n is not zero,
# step through the counts from counthigh downwards
# and print all words & counts for each count that occurs
order = 1
countnum = 0
while n != 0 :
____# stop once every count down to 1 has been checked
____if countnum == counthigh : break
____if (counthigh - countnum) in countlist :
________print '\n', 'Order:', order
________print '--------'
________x = range(len(wordlist))
________for x in x :
____________if countlist[x] == counthigh - countnum :
________________print wordlist[x], countlist[x]
________order = order + 1
________countnum = countnum + 1
____else :
________# no word has this count: try the next one without using up n
________countnum = countnum + 1
________continue
____n = n - 1
print '++Finished++'

The loop is driven by the user-defined variable n (as in 4. above). I’ve used a while loop here, and each time a pass through the loop finds and prints an order, n decreases by 1. Once n reaches zero the loop ends, so exactly n orders are printed. It was important to add the line if countnum == counthigh : break right at the start. Without it, if the user asked for more orders than there are distinct counts, n would never reach zero, countnum would keep incrementing by 1 and the program would enter an infinite loop.
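
The same "top n orders" idea can also be sketched by first grouping the words under their counts and then taking the n highest counts. This is an alternative approach, not the loop above, and the name top_orders and the sample counts are made up for illustration. Because it slices a finite list, asking for more orders than exist simply returns everything.

```python
def top_orders(words, n):
    """Return [(count, [words...]), ...] for the n highest distinct counts."""
    groups = dict()
    for word, count in words.items():
        # collect every word that shares the same count
        groups.setdefault(count, []).append(word)
    highest_first = sorted(groups, reverse=True)[:n]
    return [(count, sorted(groups[count])) for count in highest_first]

words = {'the': 3, 'cat': 1, 'sat': 2, 'mat': 1}   # made-up counts
for order, (count, group) in enumerate(top_orders(words, 2), 1):
    print('Order: %d' % order)
    for word in group:
        print('%s %d' % (word, count))
```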

I know what I wrote doesn’t seem like much if you’re a seasoned programmer, but for my first proper little program, I’m pretty chuffed with it. You can find the whole thing over on my GitHub account at https://github.com/debkr/tag_eng/