This is me just mucking about with lists, testing out what I’ve learnt so far and applying it to little problems I might want to solve. I find it the best way to learn, and it’s more fun than reading books!
Project 1: Building a tagging engine (Mon 18Apr16)
1. This snippet splits each line into a list of words, then uses an iteration variable to loop through all the words in the line list and print them out. I add print statements at suitable points (both variable printouts and descriptive text) to help me test the program structure, to make sure it’s doing what I want and expect at each point through the loop.
for line in fhand :
____line = line.rstrip()
____linelist = line.split()
____i = range(len(linelist))
____print i, len(i) # just checking
____for i in i:
________print i, linelist[i]
____print 'finished the loop'
print 'finished all the lines'
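The same walk through each line can be sketched without building an index list first, by looping over the words directly. This is only a minimal sketch: it uses Python 3 syntax (print as a function) rather than the Python 2 of the snippet above, and the `lines` list is a made-up stand-in for the file handle `fhand` so that the example runs on its own.

```python
# Hypothetical sample data standing in for the lines of the file (fhand).
lines = ["the quick brown fox\n", "jumps over the lazy dog\n"]

for line in lines:
    linelist = line.rstrip().split()
    # enumerate() yields the position and the word together,
    # so there is no need for range(len(linelist)).
    for position, word in enumerate(linelist):
        print(position, word)
    print('finished the loop')
print('finished all the lines')
```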
2. Add a word count in the for i in i: loop (with count = 0 set once before the loops start) using this code:
________count = count + 1
________print count # just checking
3. Create an empty list and, within the iteration loop, append each item in the line list to it; the length of the new list grows by one as each item is added.
wordlist = []
/snip/
________wordlist.append(linelist[i])
4. Of course, I like using a sledgehammer to crack a nut. The easier way to get the same result is to just read the whole file and split it into a list. (Sometimes my brain hurts.)
# split all words in file into a list
text = fhand.read()
wordlist = text.split()
print wordlist
print len(wordlist)
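A self-contained sketch of the read-and-split approach, in Python 3 syntax. The file name and its contents are invented for the example, and the call to lower() is an assumption based on the later note about converting everything to lowercase.

```python
import os
import tempfile

# Write a small sample file so the example runs on its own;
# the file name and contents are made up for illustration.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as out:
    out.write('We do our best\nto be on the list\n')

# The with-block closes the file automatically once the text is read.
with open(path) as fhand:
    text = fhand.read()

# split() with no arguments treats spaces and newlines alike.
wordlist = text.lower().split()
print(wordlist)
print(len(wordlist))
```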
5. Define a list of excluded words, then use an iteration loop to test every word in the word list and append only the significant words (those not in the excluded list) to a new significant word list.
excludelist = ['be','it','to','our','the','on','we','do','a','is','for'] # etc etc
count = 0 ; sigwordlist = [] ; sigcount = []
i = range(len(wordlist))
for i in i:
____if wordlist[i] in excludelist : continue
____count = count + 1
____# append each word in line list to signif word list
____sigwordlist.append(wordlist[i])
print sigwordlist
print 'Number of words (signif):', len(sigwordlist)
print 'Number of words (all):', len(wordlist)
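The filtering step can also be sketched as a list comprehension. This is an alternative to the index loop above rather than a replacement for it; the sample `wordlist` is made up, and the example uses Python 3 syntax.

```python
# Hypothetical sample words; in the program these come from the file.
wordlist = ['we', 'do', 'our', 'best', 'to', 'tag', 'the', 'best', 'posts']

# A set gives quick membership tests; the words are those from excludelist.
excluded = set(['be', 'it', 'to', 'our', 'the', 'on', 'we', 'do', 'a', 'is', 'for'])

# Keep only the words that are not in the excluded set.
sigwordlist = [word for word in wordlist if word not in excluded]

print(sigwordlist)
print('Number of words (signif):', len(sigwordlist))
print('Number of words (all):', len(wordlist))
```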
6. That’s great. Now I need to know how to count how many times a significant word appears in the significant word list. I tried mucking about with nested iteration loops, with both i and n running over range(len(sigwordlist)), but Python didn’t seem to like it. So I gave up and tried a different method – the find() function. No luck. Apparently find() is a string method and lists don’t have it. Last resort – I had to read the manual (python.org – data structures).
Turns out I can use the listname.count() method on a list to return the number of times something appears in listname. This works, so I create a new list with the significant words and their counts. (Probably using a sledgehammer again here; no time to further refine this right now though.)
sigwordcount = []
i = range(len(sigwordlist))
for i in i:
____# how many times does sigwordlist[i] appear in sigwordlist?
____if sigwordlist[i] in sigwordcount : continue
____sigcount = sigwordlist.count(sigwordlist[i])
____sigwordcount.append(sigwordlist[i])
____sigwordcount.append(sigcount)
print sigwordcount
print len(sigwordcount)
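For comparison, here is a sketch of the same count done with a dictionary instead of list.count(). This swaps in a different data structure from the lists used above, so treat it as a possible next step rather than the method described here. The sample words are made up, and the syntax is Python 3.

```python
# Hypothetical sample of significant words.
sigwordlist = ['best', 'tag', 'best', 'posts', 'tag', 'best']

counts = {}
for word in sigwordlist:
    # get() returns the running count, or 0 the first time a word is seen.
    counts[word] = counts.get(word, 0) + 1

print(counts)
print(len(counts))
```

Each word is visited once, whereas calling count() inside the loop rescans the whole list for every word.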
7. The above snippet puts both the significant word and its count in a single list. That makes it harder to sort or otherwise manipulate the pair of data points (the word and its count) with the knowledge I’ve learnt so far. A better way is to append the word to one list and its count to another list, as follows:
sigwords = []; sigcounts = []
/snip/
____sigwords.append(sigwordlist[i])
____sigcounts.append(sigcount)
/snip/
8. I can now sort the count list more easily within an iteration loop (I don’t need to convert anything to numeric first to compare one value with another), and print out each word and its count, most frequent first, and/or append these to new, ordered list(s) if required.
To help me do that, I first create a reverse-ordered list of the counts of significant words.
revsigcounts = sorted(sigcounts, reverse=True)
orderedcounts = []
i = range(len(revsigcounts))
for i in i:
____# how many times does revsigcounts[i] appear in revsigcounts?
____if revsigcounts[i] in orderedcounts : continue
____sigcount = revsigcounts.count(revsigcounts[i])
____orderedcounts.append(revsigcounts[i])
print orderedcounts
print len(orderedcounts)
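A shorter way to get the distinct counts in descending order, sketched in Python 3 syntax with made-up sample counts. It relies on set(), which is not used elsewhere in this post.

```python
# Hypothetical counts, one per significant word.
sigcounts = [3, 2, 1, 2, 3]

# set() drops the duplicate counts; sorted() then orders what is left, largest first.
orderedcounts = sorted(set(sigcounts), reverse=True)

print(orderedcounts)
print(len(orderedcounts))
```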
9. Now I can order the significant words based on their frequency and either print them out and/or append them to a new ordered words list. (The new ordered words list may either be with or without the relevant frequency counts, but the latter is less useful & I’ll need to go on to study more advanced data structures like dictionaries to perfect this in future. For now though I’ll just stick with my usual sledgehammer methodology!)
orderedwords = []
orderedwordswithcounts = []
i = range(len(orderedcounts))
print 'Counts of significant words in descending order of frequency:', orderedcounts
for i in i:
____n = range(len(sigcounts))
____for n in n:
________if sigcounts[n] == orderedcounts[i] :
____________print sigcounts[n], sigwords[n]
____________orderedwords.append(sigwords[n])
____________orderedwordswithcounts.append(sigwords[n])
____________orderedwordswithcounts.append(sigcounts[n])
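The nested loops above can be sketched more compactly by pairing each word with its count and sorting the pairs. This is one possible alternative, not the method used in this post. The sample lists are made up, and the syntax is Python 3.

```python
# Hypothetical parallel lists: sigcounts[k] is the count for sigwords[k].
sigwords = ['best', 'tag', 'posts', 'blog']
sigcounts = [3, 2, 1, 2]

# zip() pairs each word with its count. Sorting on the count, largest first,
# keeps words with equal counts in their original order (sorted() is stable).
pairs = sorted(zip(sigwords, sigcounts), key=lambda pair: pair[1], reverse=True)

orderedwords = [word for word, count in pairs]
for word, count in pairs:
    print(count, word)
```

Keeping each word and its count together as a pair avoids the alternating word, count, word, count layout of orderedwordswithcounts.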
Note that I’m using really long variable names here just to help with understandability; in practice they should be a lot shorter than this!
10. The last thing I’m going to do before going to get some lunch (I’m starving) is to add a quick user-input to specify the top n words (by frequency) to be returned, e.g. the top 5 or top 10. Here I haven’t bothered with any error handling, but that’s something that should really be included.
num = raw_input('Enter a number for "top n words":')
num = int(num)
print orderedwordswithcounts[:num*2]
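A minimal sketch of the missing error handling, assuming the words and counts are held as (word, count) pairs rather than in the flat list above. The function name `top_n` and the sample data are hypothetical, and the syntax is Python 3.

```python
def top_n(pairs, raw):
    """Return the first n (word, count) pairs, where raw is the text the user typed."""
    try:
        num = int(raw)
    except ValueError:
        # Not a whole number, e.g. 'five' or an empty string.
        return []
    if num < 0:
        # A negative slice end would silently drop items from the tail instead.
        return []
    return pairs[:num]

# Hypothetical sample data, already ordered by frequency.
pairs = [('best', 3), ('tag', 2), ('blog', 2), ('posts', 1)]
print(top_n(pairs, '2'))
print(top_n(pairs, 'five'))
```

In the program itself, raw would be whatever raw_input() returns (input() in Python 3).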
Some problems I need to iron out:
- I need to work out how to get reserved words like ‘if’ or ‘in’ into the excluded words list – when I tried (putting them in either single- or double-quotes) it returned a syntax error;
- I’m currently using a text file with punctuation marks stripped out – I’ll need to find a way to strip punctuation from future text files, which will inevitably include it;
- I was able to convert everything to lowercase so problems of case didn’t affect the words returned – but I need to work out how to deal with plurals of words – is there a way I can just automatically strip ‘s’ off the end of a string when it appears? That should be enough to solve it;
- Sort out error handling and make the output prettier on the final section, ‘top n words’;
- Oh yes, and find a nicer way to post code to a WordPress blog!
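A rough sketch touching on the first three problems in the list above, in Python 3 syntax. The `normalise` helper and the sample text are hypothetical. The plural rule is deliberately naive: it still mangles words such as ‘this’ (which becomes ‘thi’), so stripping a trailing ‘s’ is probably only a starting point. On the reserved words: inside plain straight quotes, ‘if’ and ‘in’ are ordinary strings in Python, so the syntax error mentioned above most likely had some other cause.

```python
import string

def normalise(word):
    """Lowercase a word, trim punctuation from its ends, and drop one trailing 's'."""
    word = word.lower().strip(string.punctuation)
    # Naive plural handling: short words such as 'is' are left alone,
    # and so are words ending in 'ss' such as 'miss'.
    if len(word) > 3 and word.endswith('s') and not word.endswith('ss'):
        word = word[:-1]
    return word

# Reserved words are ordinary strings once they are inside quotes.
excludelist = ['if', 'in', 'the']

text = 'If the Tags, in posts, miss "words".'
words = [normalise(w) for w in text.split()]
sigwords = [w for w in words if w and w not in excludelist]
print(sigwords)
```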
Coda (Tue 19Apr16):
Sitting in the sunshine at lunch the next day, I realise two things. Firstly, a much better way to do this counting exercise is to loop through all words in the string and see if each word is in the newly-created sigwords list: if not, append it to the sigwords list, or else skip (continue). This loop can either be performed within the line loop, or perhaps better would be to do the process on words within the whole list, using range(len(wholelist)) for iteration variable i applied to wholelist[i].
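One way to read the first idea, sketched in Python 3 syntax. Increasing the count when a word has been seen before is an interpretation, since the description only says to skip such words. The sample lists are made up.

```python
# Hypothetical sample data.
wordlist = ['we', 'tag', 'the', 'best', 'posts', 'to', 'tag', 'best', 'best']
excludelist = ['we', 'the', 'to']

sigwords = []    # each significant word, once
sigcounts = []   # sigcounts[k] is the count for sigwords[k]

for word in wordlist:
    if word in excludelist:
        continue
    if word not in sigwords:
        sigwords.append(word)
        sigcounts.append(1)
    else:
        # Already seen: bump the count stored at the same position.
        sigcounts[sigwords.index(word)] += 1

print(sigwords)
print(sigcounts)
```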
Secondly, a sensible development of the tagger program would be to allow users to specify whether they are uploading simple text or html. If html, the program can use html tags (title, headers (h1, h2, etc.), strong or italic) to weight the importance of text (either as words or phrases) more highly than normal, untagged/unformatted text. This will help make the program more useful if being used, for example, to auto-tag a WordPress blog post.
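A sketch of how the html weighting might work, using the HTMLParser class from the Python 3 standard library. The tag names follow the ones mentioned above, but the weight values, the `WeightedWords` class name and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical weights: the numbers are invented, plain text counts as 1.
WEIGHTS = {'title': 5, 'h1': 4, 'h2': 3, 'strong': 2, 'em': 2, 'i': 2}

class WeightedWords(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.open_tags = []   # weighted tags that are currently open
        self.scores = {}      # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag in WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            self.open_tags.remove(tag)

    def handle_data(self, data):
        # Use the heaviest enclosing tag, or 1 for untagged text.
        weight = max([WEIGHTS[t] for t in self.open_tags] or [1])
        for word in data.lower().split():
            self.scores[word] = self.scores.get(word, 0) + weight

parser = WeightedWords()
parser.feed('<h1>Tagging engine</h1><p>A <strong>tagging</strong> test</p>')
parser.close()
print(parser.scores)
```

Here ‘tagging’ scores 4 from the header plus 2 from the strong tag, while untagged words score 1 each.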
Further refinements:
Add an additional iteration loop to run through all words in the significant words list, reviewing the words in pairs and/or triplets. (To scroll through the words in blocks of 2 or 3, test whether i is divisible by 2 or 3: if so, execute the loop action, else continue.) Return the phrase using print sigwordlist[i-1], sigwordlist[i], sigwordlist[i+1]. Prompt the user to check/confirm whether the phrase is a significant phrase. If so, append the phrase to the significant phrases list. The user should be allowed to modify/overwrite the phrase as required before it is appended. The same functionality can be used to test titles and headers for significant phrases.
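The block-by-block scan might be sketched like this in Python 3 syntax. The `blocks` helper and the sample words are hypothetical, and slicing is used in place of the i-1, i, i+1 indexing so that a short final block does not run off the end of the list.

```python
# Hypothetical sample of significant words.
sigwordlist = ['auto', 'tag', 'wordpress', 'blog', 'post', 'engine', 'test']

def blocks(words, size):
    """Split words into consecutive, non-overlapping blocks of the given size."""
    # range() with a step jumps straight to the start of each block,
    # which replaces the "is i divisible by size?" test.
    return [words[i:i + size] for i in range(0, len(words), size)]

for phrase in blocks(sigwordlist, 3):
    print(' '.join(phrase))
```

Each printed phrase could then be shown to the user for confirmation before being appended to the significant phrases list.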
The user can be prompted for a file path and filename where the significant phrases list can be saved. This can then be reloaded when using the program multiple times on multiple web pages/articles/etc.
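Saving and reloading the phrases could be as simple as one phrase per line in a text file. This is a sketch in Python 3 syntax: the function names, the file name and the sample phrases are made up, and a temporary folder stands in for the path the user would type.

```python
import os
import tempfile

def save_phrases(path, phrases):
    """Write one phrase per line to the file at path."""
    with open(path, 'w') as out:
        for phrase in phrases:
            out.write(phrase + '\n')

def load_phrases(path):
    """Read the phrases back, skipping any blank lines."""
    with open(path) as fhand:
        return [line.rstrip('\n') for line in fhand if line.strip()]

# Hypothetical location; in the program this would come from the user.
path = os.path.join(tempfile.mkdtemp(), 'sigphrases.txt')
save_phrases(path, ['tagging engine', 'wordpress blog post'])
print(load_phrases(path))
```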
This functionality will be useful for accelerated learning, and for rapid searching/reviewing of many web sites/pages/articles.