Coding 101 (part 6)

When strings become spaghetti:
Working with strings and files, particularly when using the for {line} in {filehandle}: construct, lets us do some cool manipulation of data. We find, split and strip the data into different chunks based on some repeating factor (such as a comma separating each value in order), then sort, count and total those values through iterative loops.

Here’s an example of what that could look like (edited for relevance):

# strip text strings into variables
for line in fhand :
____line = line.rstrip()    # remove whitespace at end of line

____pos = line.find(',')
____str1 = line[pos+1:]    # text string - test using startswith('xxx')
    
____pos = str1.find(',')
____str2 = str1[pos+1:]    # text string
    
____pos = str2.find(',')
____str3 = str2[pos+1:]    # text string
  
____pos = str3.find(',')
____str4 = str3[pos+1:]
____value1 = str3[:pos]    # numeric string
____value1 = float(value1)    # convert to float
    
____pos = str4.find(',')
____str5 = str4[pos+1:]
____value2 = str4[:pos]    # numeric string
____value2 = float(value2)    # convert to float
  
____# counts & values sorted by Category 1 (text at start
____# of line) and Category 2 (text at start of str1)
____if line.startswith('1A') :
________count1A = count1A + 1
________total1A = total1A + value2
________# further Category 1 analysis as required
________if str1.startswith('2A') :
____________count1A2A = count1A2A + 1
____________total1A2A = total1A2A + value2
____________# further Category 2 analysis as required
________elif str1.startswith('2B') :
____________count1A2B = count1A2B + 1
____________total1A2B = total1A2B + value2
____elif line.startswith('1B') :
________count1B = count1B + 1
________total1B = total1B + value2
________if str1.startswith('2A') :
____________count1B2A = count1B2A + 1
____________total1B2A = total1B2A + value2

[Image: the same code in an editor (Notepad++), showing the Python syntax highlighting]

It does work, but – as you can see – it’s pretty ugly! And that’s fine if we’ve only got a small data set to work with, or a small text file to handle, and only need to loop through a small handful of variables and values.

But what about when we start working with larger files, say an aged debtors report which lists each debtor line by line and by age of debt? There could be tens of thousands of lines of data, each split out across 10-15 columns or more. That’s hundreds of thousands of values to handle across multiple variables. There has to be an easier way to handle that, and fortunately there is.

Lists make handling data structures easier:
We saw lists briefly before; they’re essentially a collection of multiple values which can be assigned to a variable (this is a multi-value variable as opposed to a single-value variable). Lists are marked by square brackets, with each item or element in the list separated by commas; a list written out in this way is known as a list constant.

Lists can contain values or they may be empty. Values within a list can be either strings or numerics (or some mixture of the two) or they can be other variables. Lists may even contain other lists.

string_list = ['mary','had','a','little','lamb']

mixed_list = ['mary','had',10,'little','lambs']

name = 'mary'
size = 'little'
list_with_variables = [name,'had',10,size,'lambs']

a = 10 ; b = 2 ; c = 3
numbers = [a,b,c]
calculation = numbers[0]*numbers[1]+numbers[2]

empty_list = []
empty_list = list() # does the same thing

list1 = [2,3,4]
list2 = [1,list1,5,6]

Lists and iteration:
Lists are great for providing iteration variables to take a program through a loop. So in our ugly debtors example above, we used if to test first for Category 1A, 1B and so on, and nested within that we tested each time for Category 2A, 2B, 2C and so on. Rather than hard-coding the required searching and sorting categories into the program, we can have them user-defined using raw_input and stored within a list. We can then iterate through that list using a definite for .. in loop, as follows:

for category1 in ['1A','1B','1C','1D','1E']:
____# do something with each element in turn (category1 holds one element per loop)
____print category1

We may wish to define the executable code as a function first (as it’s likely to be repeated for all relevant cases, for all values of Category 1). We can then call the function wherever we need to in the program to execute it.
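As a rough sketch of that idea, the repeated code could be wrapped in a function and called once per category. The function name count_and_total and the sample records below are made up for illustration; they don’t come from the debtors program. The print() call is written so that it runs under Python 3 as well as the Python 2 used in this post.

```python
# Hypothetical sample data: each record is a small list of [category, amount]
records = [['1A', 100.0], ['1B', 50.0], ['1A', 25.5], ['1C', 10.0]]

def count_and_total(category, records):
    # Count the records in one category and add up their amounts
    count = 0
    total = 0.0
    for record in records:
        if record[0] == category:
            count = count + 1
            total = total + record[1]
    return [count, total]

# Call the same function for every value of Category 1
for category1 in ['1A', '1B', '1C', '1D', '1E']:
    result = count_and_total(category1, records)
    print(category1 + ': ' + str(result))
```

Adding another category then only means adding another element to the list, rather than another block of if/elif tests.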

Lists and index positions:
We could instead use an iteration loop with a simple count 0,1,2,.. to execute a block of code for each list element in turn. The loop iteration variable picks out the elements of the list in turn using the list index notation listname[n] (we say, “listname sub-n”), where n is the index position in the list (as defined by the iteration variable 0,1,2,..). Note that the index position numbering begins at zero, not 1, just as it did with position indexing in strings.

As well as using the list index position as a look-up tool (listname[1] returns the second element in listname), the list index also acts as a handle for the list elements as ‘sub-variables’ (listname[1] is the sub-variable name for the second item in the multi-value variable called listname). This allows us to change the value contained in any list element at any time by reassigning the value in that element, including reassigning the value of one element into another element:

listname = [1,2,3,4,5]
listname[3] = listname[1]
listname[1] = 7
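To see what those two reassignments actually do, here’s the same example with the list printed after each step (print() is written so that it also runs under Python 3):

```python
listname = [1, 2, 3, 4, 5]

listname[3] = listname[1]   # copy the second element's value into the fourth
print(listname)             # [1, 2, 3, 2, 5]

listname[1] = 7             # now change the second element
print(listname)             # [1, 7, 3, 2, 5]
```

Note that the fourth element keeps the value 2: it took a copy of the number at the time of the assignment, so changing listname[1] afterwards doesn’t affect it.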

List length:
A list can be as long as we like, with the length being equal to the number of elements within the list (so listname in the example above has a length of 5, i.e. the list consists of 5 elements). We find the length of a list using the len() function:

print len(listname)
(Prints out: 5)

Where list2 contains list1 as one of its elements, len(list2) only returns the number of elements in list2 (list1 may have any number of elements itself, but it will only count as one element within list2).
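A quick check of this, using the list1 and list2 defined earlier (the variable names outer_length and inner_length are just for illustration):

```python
list1 = [2, 3, 4]
list2 = [1, list1, 5, 6]

outer_length = len(list2)      # list1 counts as a single element of list2
inner_length = len(list2[1])   # list2[1] is list1 itself

print(outer_length)   # 4
print(inner_length)   # 3
```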

The range() function is useful for returning a list of ordered integer values starting at zero and of the specified length, for example:

newlist = range(6)
print newlist
(Prints out: [0, 1, 2, 3, 4, 5])

We needn’t specify the number itself in the range function though; we can specify some variable instead, or even some other function. So using the len() function applied to listname within the range() function returns an ordered integer list starting at zero and containing the same number of elements as listname. We have just created our numbered iteration list corresponding to listname’s position indices:

category1 = ['1A','1B','1C','1D','1E']
for i in range(len(category1)):
____print i, category1[i]    # do something with category1[i]

Other cool things about lists:
Many of the things we can do to strings we can also do to lists, like adding them together (concatenating), searching and slicing. Lists also come with some really useful data manipulation functions of their own: sorting, summing the elements (if numeric), finding max’s and min’s (largest and smallest elements, either alphabetically or numerically) and so on.

# concatenating lists
list1 = [1,2,3]
list2 = [4,5,6]
list3 = list1 + list2
(As you’d expect, list3 becomes: [1, 2, 3, 4, 5, 6])

# searching within a list (boolean)
9 in list3 # returns False
5 in list2 # returns True
'mary' in string_list # returns True (see above for string_list)
'peter' not in string_list # returns True

# slicing lists to retrieve a specified sub-section
print list3[1:3]
(Prints out: [2, 3])

# sorting changes the list order, sorted alphabetically or numerically
string_list.sort()
numbers.sort()
print string_list, numbers
(Prints out: ['a', 'had', 'lamb', 'little', 'mary'] [2, 3, 10])
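Summing and finding the largest and smallest elements weren’t shown above, so here’s a minimal sketch using Python’s built-in sum(), max() and min() functions and the list index() method on the same numbers and string_list as before. The variable names on the left-hand side are just for illustration.

```python
numbers = [10, 2, 3]
string_list = ['mary', 'had', 'a', 'little', 'lamb']

total = sum(numbers)                    # 15
average = float(total) / len(numbers)   # 5.0
largest = max(numbers)                  # 10
smallest = min(numbers)                 # 2

first_word = min(string_list)           # 'a' (first alphabetically)
last_word = max(string_list)            # 'mary' (last alphabetically)
position = string_list.index('had')     # 1 (index position of 'had')

print(total)
print(average)
```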

An even cooler thing about lists:
We can also make new lists within the program itself using the append() method. Starting with an empty list [], we can run through some iteration loop calculating counts or totals or some other values across all the elements within a list. Then we append that calculated value to the new list, and keep on repeating for as long as the loop runs. The value might be either a string or a numeric, or, more likely, the calculated value in some variable. The newly-created list can then be used in some way at the end (once the loop has completed), either for further calculations, or for printing/reporting.

newlist = []
newlist.append(variable)
newlist.append('string')
newlist.append(7)

Building a new list inside an iteration loop would look like this:

# an iteration that builds a new list
cat1 = ['1A','1B','1C','1D','1E']
cat1total = []
total = 0
for i in range(len(cat1)):
____total = total + 6    # example calculation only!
____print cat1[i], total
____cat1total.append(total)
print 'cat1 total:', cat1total

(Prints out each category with its running total as the loop runs, then finishes with: cat1 total: [6, 12, 18, 24, 30])

Turning strings into lists:
We’ve seen useful ways to search inside a text string, and also to slice it based on index positions of known characters repeating in the string across multiple lines (e.g. the ‘@’ symbol in an email address, a repeating colon, or commas when data is read in from a comma-delimited file). But if the text string is just a large block of random text (words, spaces and punctuation marks), perhaps pulled from the web somewhere, how do we handle that?

We can use the split() method, which, as its name suggests, splits the block of text into individual strings (‘words’) wherever it finds whitespace, and automatically puts them in an ordered list. This is powerful, although there are a few points to be careful of:
1. Where the text is typed into the code itself and includes single quotes as punctuation marks (apostrophes), it needs to be enclosed in double quotation marks (""), otherwise a syntax error is returned
2. Case is preserved, although we can deal with this using the lower() method if we need to
3. Punctuation marks remain, although we can handle this by cleaning the data first, replacing all punctuation marks with blank spaces (except single quotes, which should just be removed)

saying = 'Mary had a little lamb'
words = saying.split()
print words
(This prints: ['Mary', 'had', 'a', 'little', 'lamb'])
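The three points of caution above could be handled along these lines. This is only a sketch: the sample sentence is made up, and the short list of punctuation marks is an assumption that would need extending for real text.

```python
# Double quotes around the text, because it contains apostrophes
saying = "Mary's lamb was little; the lamb's fleece was white!"

cleaned = saying.lower()               # deal with case
cleaned = cleaned.replace("'", '')     # remove single quotes altogether

# Replace other punctuation marks with blank spaces
for mark in [';', '!', ',', '.', '?', ':']:
    cleaned = cleaned.replace(mark, ' ')

words = cleaned.split()
print(words)
```

Because split() with no argument treats any run of whitespace as one separator, the extra blank spaces left behind by the replacements don’t create empty strings in the list.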

The split() method can be used to read data from a file line by line, where each line has multiple columns or fields which we want to perform various sorting, counting and totalling functions on. As we loop through each line, each column or field in that line becomes an element in the list, on which we can perform the required calculations or code.

Now that’s got to be a whole lot better than all those lines of code at the start of this post! Here’s the relevant code:

# reading multiple fields (space-delimited) into a list
fname = raw_input('Enter the filename:')
fhand = open(fname)
for line in fhand :
____line = line.rstrip()    # removes newline at line end
____fields = line.split()    # named 'fields' so as not to hide Python's built-in list()
____print fields

We may not always have space-delimited data, but that’s fine because we can specify what the split character should be. So we can split by commas when reading data from comma-delimited files, or maybe by colons or semi-colons. And it gives us all kinds of flexibility in splitting up strings of data, e.g. we can split out the domain from an email address by using stringname.split('@'). We just need to know what format our data is in and use the split character accordingly.
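For instance, a minimal sketch of the email case, using a made-up address:

```python
address = 'mary@example.com'   # made-up address, for illustration only

parts = address.split('@')     # split on the '@' character instead of whitespace
print(parts)                   # ['mary', 'example.com']

domain = parts[1]              # the second element is the domain
print(domain)                  # example.com
```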

Generally, we’d expect to have comma-delimited data files though, and the example below splits based on commas. (Just make sure, if you’re reading numbers from a file, that the raw data doesn’t include commas denoting thousands: split(',') would treat each of those as a field separator too, and this will really screw things up!)

for line in fhand :
____line = line.rstrip()
____fields = line.split(',')
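Putting the pieces together, here’s a rough sketch of how the counting and totalling from the start of this post might look with split() and lists. The sample lines and their column layout are invented for illustration, and the loop runs over a list of strings rather than a file handle so that it can be run on its own; with a real file, for line in fhand would take its place.

```python
# Hypothetical lines: category 1, category 2, debtor, age in days, amount
lines = [
    '1A,2A,Smith Ltd,30,100.50\n',
    '1A,2B,Jones plc,60,200.00\n',
    '1B,2A,Brown & Co,90,50.25\n',
    '1A,2A,Green Inc,30,10.00\n',
]

categories = ['1A', '1B']
counts = [0, 0]        # one count per category, in the same order
totals = [0.0, 0.0]    # one total per category, in the same order

for line in lines:
    fields = line.rstrip().split(',')
    amount = float(fields[4])          # fifth column holds the amount
    for i in range(len(categories)):
        if fields[0] == categories[i]:
            counts[i] = counts[i] + 1
            totals[i] = totals[i] + amount

for i in range(len(categories)):
    print(categories[i] + ': ' + str(counts[i]) + ' lines, total ' + str(totals[i]))
```

The three lists are kept in step by index position, so counts[0] and totals[0] both belong to categories[0].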

Real-world uses:
It’s clear from the above just how powerful and valuable lists can be in programming, especially when dealing with large data sets of multiple records, and with multiple fields per record (as is often the case in business and finance systems). I realise I’ve only just scratched the surface of lists here, but there are plenty of ideas to get me started on building a new, more elegant version of the debtors analysis program I’ve been working on. I’ll be posting more about that in due course.

Read more like this:
This post follows on from earlier Coding 101 posts and records my responses and learnings from the highly-recommended Python programming book and Coursera specialisation by Charles Severance (see References below).

References:
Book: Python for Informatics – Exploring Information by Charles Severance
Course: Python Data Structures by Univ. of Michigan. Part of the Python for Everybody specialisation.