This post follows on from earlier Coding 101 posts and records my responses and learnings from the highly-recommended Python programming book and Coursera specialisation by Charles Severance (see References below).
Opening data files:
In all our previous examples we used data as a constant (i.e. hard-coded into the program in some way) or we prompted the user to enter some data which was then manipulated by the program. We want to be able to read data from a variety of sources though, either from files or from the web, and we know these are going to be much larger data sources, so we’ll need to be able to access and save files on our hard drive somewhere.
To read a file of data into Python, we must first tell Python that we wish to open the file and read from or manipulate it. To do so, we need to tell Python where to locate the file; we do this using the open() function. This built-in function returns a file handle, which Python then uses to identify and locate the file.
Within the brackets of the function, we need to specify the filename and extension. We also have the option to specify the mode – either ‘r’ for ‘reading from’ or ‘w’ for ‘writing to’ the file. If we don’t explicitly include the mode ‘r’ Python will assume we wish to read from the file anyway. The function to open and read from a file will look like this:
file_handle = open('filename.txt','r')
This creates the file handle, but doesn’t actually begin reading the data itself; we still need to instruct Python to do that within the program by calling methods on the file handle such as read() or write().
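As a rough illustration of the two modes, the sketch below first writes a small throwaway file and then opens it again for reading. The temporary path and the file contents are made up for the example, and print is written with parentheses so that the code also runs under Python 3.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
# (the path and contents are made up for illustration).
path = os.path.join(tempfile.mkdtemp(), 'filename.txt')
out_handle = open(path, 'w')      # 'w' creates the file (or overwrites it)
out_handle.write('first line\nsecond line\n')
out_handle.close()

file_handle = open(path)          # no mode given, so Python assumes 'r'
print(file_handle.mode)           # the handle records the mode it was opened in
contents = file_handle.read()     # only now is the data actually read
file_handle.close()
print(len(contents))              # number of characters read, newlines included
```

Closing the handle once we have finished with it is good practice, particularly after writing.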
Line breaks:
Most data we’re likely to read from a file will occur on lines, such as lines of text in an email, or lines of data exported from a report. Python recognises a new line using the ‘newline’ character, written as ‘\n’. (This is equivalent to the carriage return on a typewriter, the enter key on a computer keyboard, or a line break in web HTML.) We can use \n to print the words in a string on separate lines of output:
text = 'I\nAm\nPython'
print text
Prints out the following:
I
Am
Python
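Although we type it as two keystrokes, \n is stored as a single character. A small sketch to check this, using the same string as above (print is written with parentheses so the code also runs under Python 3):

```python
text = 'I\nAm\nPython'

print(len(text))            # 11: each \n counts as one character
print(repr(text))           # repr() shows the \n escapes rather than breaking the line

words = text.split('\n')    # splitting on the newline recovers the separate words
print(words)
```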
Reading data files:
When we come to read data from a file, we first need to open it, then read it. A simple example might be:
data = open('rawdata.txt','r')
datastring = data.read()
This reads all of the data from the file into a single, long text string. The line-breaks appearing at the end of each data line will be contained within the long text string – as was shown in the simple string example above ‘I\nAm\nPython’. A word of caution: this is only really recommended where the file is small enough that all the data can be read into one string without slowing the computer down.
When we read the data into one single long string, it won’t really give us a lot of flexibility in how we can search or cut and slice the data (in order to analyse it, or manipulate it or compute other things from it).
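To see what read() gives us in practice, here is a minimal sketch. The three data lines are invented stand-ins for rawdata.txt, written to a temporary folder so the example runs on its own; print uses parentheses so it also works under Python 3.

```python
import os
import tempfile

# Hypothetical sample data standing in for rawdata.txt.
path = os.path.join(tempfile.mkdtemp(), 'rawdata.txt')
sample = open(path, 'w')
sample.write('Invoice,John Smith\nCard,Mary Jones\nInvoice,Ann Lee\n')
sample.close()

data = open(path, 'r')
datastring = data.read()          # the whole file as one string
data.close()

print(len(datastring))            # total characters, newlines included
print(datastring.count('\n'))     # one newline per line of data
print(repr(datastring[:18]))      # slicing works, but only by character position
```

We can measure and slice the string, but to pick out a particular customer we would have to know the character positions in advance, which is why working line by line is usually more convenient.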
Line breaks and loops:
Because a newline character ends every line in a multi-line file, Python can tell where each line stops. This allows us to construct a loop over the file handle that reads the data one line at a time, with each line being processed or manipulated in the same way in sequence. By using a definite loop (using for .. in .. :) we can make Python run the loop exactly the right number of times needed to execute the loop code on each successive line of data in the file.
for line in file_handle :
____print line
(This will print out each line from the file in sequence until it reaches the end.)
We can view the lines in the file as equivalent to the numbers in a list [1, 5, 67, -4, 0] or the characters in a string ‘mary had a little lamb’ and apply the same kinds of loop processes as before, the most common being to count the number of lines in the file. (In real-world terms this would be counting the number of items in the data sample, e.g. the number of email addresses in a list, or the number of customer accounts in a report.)
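The line-counting loop described above might look something like this. The sample file is made up and written to a temporary folder so the sketch is self-contained; print uses parentheses so it also runs under Python 3.

```python
import os
import tempfile

# Made-up three-line file for illustration.
path = os.path.join(tempfile.mkdtemp(), 'rawdata.txt')
sample = open(path, 'w')
sample.write('Invoice,John Smith\nCard,Mary Jones\nInvoice,Ann Lee\n')
sample.close()

file_handle = open(path)
count = 0
for line in file_handle:      # the loop runs once per line in the file
    count = count + 1
file_handle.close()

print('Line count = %d' % count)
```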
Now we have something more powerful, since we can use loops and counts (and other commands) to manipulate, compute or report on only those data lines in the file which have a certain classification. For example, in a text file of emails all the lines beginning with ‘From:’ will include an email address which we can manipulate in order to collect email addresses and/or domains. Or, if the file is a data dump of a report listing all customers line by line prefixed by some Customer Type (or other customer attribute such as the customer’s payment method) then we can report counts, and perform other manipulations, on all the lines of data relating to just one kind of customer type.
So we can see it can be very powerful. Here’s an example loop based on the latter (from a report for customers line by line based on customer payment type):
data = open('rawdata.txt','r')
count = 0
invcount = 0
for line in data :
____if line.startswith('Invoice') :
________print line
________invcount = invcount + 1
____count = count + 1
print 'Total count =', count
print 'Invoice =', invcount
(This will count the number of lines in the data file which begin with ‘Invoice’ (denoting invoice payment type), as well as the total number of lines of customer data in the file (all payment types), and print out all the ‘invoice’ data lines as well as the total count and invoice count.)
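The email case mentioned earlier can be sketched in the same way. The mailbox lines and addresses below are invented, and the slicing and find() calls are just one possible way of pulling out the address and domain; print uses parentheses so the code also runs under Python 3.

```python
import os
import tempfile

# Made-up mailbox-style lines; only the 'From:' lines carry an address.
path = os.path.join(tempfile.mkdtemp(), 'mail.txt')
sample = open(path, 'w')
sample.write('From: alice@example.com\nSubject: Hello\nFrom: bob@example.org\n')
sample.close()

addresses = []
domains = []
mail = open(path)
for line in mail:
    if line.startswith('From:'):
        # Drop the 'From:' prefix, then the surrounding whitespace and newline.
        address = line[len('From:'):].strip()
        addresses.append(address)
        # The domain is everything after the @ sign.
        domains.append(address[address.find('@') + 1:])
mail.close()

print(addresses)
print(domains)
```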
The print function and line breaks:
A quirk of Python is that print adds a line break (newline) at the end of whatever it prints out. This is in addition to the newline which is already present at the end of each data line, so the output appears double-spaced. To remove this we can call the rstrip() method at the start of the loop to strip out any blank space and newline characters at the end of each line; both are classed as whitespace and both are removed by rstrip(). (Of the related methods, strip() removes whitespace from both ends of the string and lstrip() from the start only.) The following single-indented line should be added after the for .. in .. : command and before the single-indented if .. : command:
____line = line.rstrip()
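A quick way to compare the three strip methods is to try them on one padded string. The string below is made up, and repr() is used so that the remaining spaces and \n are visible in the output (print is written with parentheses so the code also runs under Python 3).

```python
padded = '  Invoice,John Smith  \n'

stripped_right = padded.rstrip()   # removes trailing spaces and the newline
stripped_left = padded.lstrip()    # removes leading spaces only
stripped_both = padded.strip()     # removes whitespace at both ends

print(repr(stripped_right))
print(repr(stripped_left))
print(repr(stripped_both))
```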
If not .. : continue:
We can use conditionals within the loop to identify only those lines meeting the tested criteria so that a process is only performed on those lines (e.g. if line.startswith(‘Invoice’) : print line). But we can also test negative criteria, which allows us either to perform a process on a line if a certain criterion is NOT met, or to SKIP that particular line altogether and continue back at the top of the loop with the next line in the file.
data = open('rawdata.txt','r')
count = 0
for line in data :
____line = line.rstrip()
____if not line.startswith('Invoice') :
________continue
____print line[:50]
____count = count + 1
print 'Total count =', count
(This counts only those lines in the data file beginning with ‘Invoice’, prints out the first 50 characters of each of those lines, and finally prints the count of those lines. All other lines (i.e. not ‘invoice’ payment type) are skipped and are neither printed nor counted.)
Searching within a line using ‘in’:
Using ‘in’ allows us to search for a string within each line (the test returns True or False), so we can use it to set up a condition, with some command being run or not depending on whether that string is found in the line. A simple example might be searching for and printing out all customers called ‘John’:
data = open('rawdata.txt','r')
for line in data :
____if 'John' not in line :
________continue
____print line
(This prints out all lines containing the string ‘John’. Note that this search is case-sensitive, so unless we can be sure that all data follows consistent case formatting we would use the lower() method to turn all alphabetic characters to lowercase before searching, and search for ‘john’ instead.)
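A case-insensitive version of the search could be sketched as follows. To keep the example short, a list of made-up strings stands in for the lines of the file; the loop body would be the same when looping over a file handle. print uses parentheses so the code also runs under Python 3.

```python
# Made-up lines standing in for the contents of rawdata.txt.
lines = ['Invoice,JOHN SMITH\n', 'Card,Mary Jones\n', 'Invoice,john doe\n']

matches = []
for line in lines:
    # Compare in lowercase so 'JOHN', 'John' and 'john' all match.
    if 'john' not in line.lower():
        continue
    matches.append(line.rstrip())   # keep the original capitalisation

print(matches)
```

Note that lower() returns a new string and leaves the original line unchanged, so the matching lines are still stored as they appeared in the data.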
User-specified files:
Where we might want to perform the same routine on a number of different data files (but all formatted the same way, and consisting of the same basic data structures), we can prompt the user to enter the filename using a filename input variable. The routine will then run on that specified file.
filename = raw_input('Enter the filename: ')
file_handle = open(filename)
The remainder of the program will then follow as normal. Note that, since filename is a variable we do not need to include it in single quotes in the open() function. Also, we do not need to specify mode ‘r’ as Python assumes we wish to open this file in read mode.
We can also put the user input prompt inside a loop, so that the program returns to prompt the user for another filename after processing the first file, and continues until the user exits the program. Note that a traceback error will occur if the entered filename does not exist in the current working directory (or at the specified file path); this can be handled with a try/except construct.
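One way the try/except handling might look is sketched below. The helper name count_lines and the sample filenames are hypothetical, and a list of names stands in for what the user would type at the prompt so that the example runs without any input. print uses parentheses so the code also runs under Python 3.

```python
import os
import tempfile

def count_lines(filename):
    """Return the number of lines in filename, or None if it cannot be opened."""
    try:
        file_handle = open(filename)
    except IOError:
        # Raised when the file is missing or unreadable.
        print('File cannot be opened: ' + filename)
        return None
    count = 0
    for line in file_handle:
        count = count + 1
    file_handle.close()
    return count

# One real file and one missing file, both in a fresh temporary folder.
folder = tempfile.mkdtemp()
good_name = os.path.join(folder, 'rawdata.txt')
sample = open(good_name, 'w')
sample.write('Invoice,John Smith\nCard,Mary Jones\n')
sample.close()
missing_name = os.path.join(folder, 'no_such_file.txt')

results = []
for name in [good_name, missing_name]:   # stand-in for repeated user prompts
    results.append(count_lines(name))
print(results)
```

In an interactive program the list of names would be replaced by a loop around the raw_input() prompt, with the same function called on each filename entered.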
References:
Book: Python for Informatics: Exploring Information by Charles Severance
Course: Python Data Structures by Univ. of Michigan. Part of the Python for Everybody specialisation.