This post follows on from earlier posts (Coding 101 (part 1) ~ (part 2) ~ (part 3)) and records my responses and learnings from the highly-recommended Python programming book and Coursera specialisation by Charles Severance (see References below).
A quick recap on strings:
Strings are computer-speak for characters, specifically where some object or value has the ‘type’ string. Type is an attribute Python applies to any given object or value so it knows how to handle that object or value, i.e. what kinds of operations can and cannot be applied to it. String, and two numeric types – integer and float – are the most common types within Python.
A string may contain one or more characters, so ‘a’ and ‘0’ are strings, just as ‘abcdefghij’ and ‘Hello world. I am Python.’ are. When we enter data into Python using the user entry function raw_input(), whatever we enter is read by Python as a string. So even if we entered 1 or -6 or 12.25, Python would still read it as a string (i.e. as if it were letters, not numbers). Note that Python 2.7.x accepts either single quotes or double quotes to mark a string.
We can’t do mathematical computations on strings in a numerical sense, but some mathematical operators can be used on strings. For example, ‘+’ applied to two strings has the effect ‘concatenate’, i.e. it joins those two strings together. Note that concatenate does not add any spaces between the strings, it simply squashes the strings together one after another.
It’s not possible to add a string to a number, e.g. you can’t add together ‘6 apples’ and 27. If we want to input some numerical data and perform mathematical computations on it, we must first convert it to a numeric type. Here we use the function int(), which converts the contents of the brackets into an integer, or the function float(), which converts the contents of the brackets into a floating-point number.
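As a minimal sketch of the points above: the variable names are illustrative, and a fixed string stands in for whatever raw_input() would return so that the example runs without typing anything. Single-argument print(...) is used so the code should run unchanged under Python 2.7 and Python 3.

```python
# Stand-in for text returned by raw_input(); it is always a string.
typed = '12.25'

# '+' on two strings concatenates them, with no space added.
greeting = 'Hello' + 'world'
print(greeting)        # Helloworld

# Convert to a numeric type before doing arithmetic.
as_float = float(typed)
as_int = int('-6')
total = as_float + as_int
print(total)           # 6.25

# Adding a string to a number raises a TypeError.
try:
    '6 apples' + 27
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)        # False
```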
Inside strings:
Because a string is any number of characters, we can realise that a string has a length (the number of characters it contains) and also that each character appears at a certain point or position within the string. The position within the string is identified by an ‘index’, which is a numerical integer value starting at 0 and incrementing by +1. The string can be viewed as a list of characters (as opposed to a list of numbers which we saw in part 3 are denoted by the square brackets as e.g. [1, 4, 65, -3, 27].)
For example, in the string ‘mary had a little lamb’, the string length is 22 and the positions in the string are marked by index values 0, 1, 2, ... 21. The first letter has index [0] and the first l in little (the 12th character) has index [11]. Note that the index number is always contained within square brackets, which are known as the index operator (we say “sub-eleven” for [11]).
The indexing of characters within a string, beginning with index [0] or “sub-zero”, has the advantage that a numeric integer variable which increments +1 through a loop can be used to read and check/sort/do some other repetitive computation on all the characters within that string, in sequence, one after the other until it reaches the end. This is possible because the index number is able to take, not just some absolute value [1], but instead can be a variable, thus allowing the value of the index/position to change as the value of the variable changes (e.g. [n]), or can even be an expression such as a variable minus a number (e.g. [n – 1]).
A few important points to note: (1) index numbering always begins at zero; (2) if we try to read a character with a higher index value than ‘string length – 1’, Python returns an error message (traceback); (3) if we construct a loop with an indexing variable that increments sequentially, we need to make sure the value of the indexing variable within the loop never exceeds ‘string length – 1’.
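The three points above can be sketched as follows. This is only an illustration: the names last_index and error_name are made up for the example, and single-argument print(...) is used so it should run under both Python 2.7 and Python 3.

```python
saying = 'mary had a little lamb'

# The highest valid index is always 'string length - 1'.
last_index = len(saying) - 1
print(last_index)              # 21
print(saying[last_index])      # b

# The index can be a variable or an expression.
n = 12
print(saying[n - 1])           # l

# Reading one position past the end raises an IndexError.
try:
    saying[len(saying)]
    error_name = None
except IndexError as err:
    error_name = type(err).__name__
print(error_name)              # IndexError
```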
Example 1:
saying = 'mary had a little lamb'
first_letter = saying[0]
print len(saying), first_letter
(This outputs: 22 m)
Example 2:
saying = 'mary had a little lamb'
n = 11
print saying[n]
(This outputs: l)
Example 3:
saying = 'mary had a little lamb'
n = 0
while n > -1 :
    print n, saying[n]
    n = n + 1
print 'The end'
This outputs each successive index number and its associated character, until the loop reaches n = 22, at which point it returns a traceback since the loop is trying to read the non-existent 23rd character in the string.
To overcome this problem, we can change the while condition to check the variable n and make sure it hasn’t exceeded ‘string length – 1’:
saying = 'mary had a little lamb'
n = 0
while n < len(saying) :
    print n, saying[n]
    n = n + 1
print 'The end'
Just as we saw in part 3 when looping through numbers in a list, so too can we loop through characters in a string, using the for/in definite loop construct: for {variable} in {string or variable denoting string}:
saying = 'mary had a little lamb'
for n in saying :
    print n
print 'The end'
Applications of loops with strings:
All the applications of loops we saw in part 3 for numerical lists can also be applied to the characters in a string, since a string is treated much like a list of characters. So we can do all the sorting, searching, counting, and finding of largest or smallest values. When looking for the largest or smallest values in a character string, the comparison uses lexicographic values (i.e. alphabetical order: which comes first and which comes last).
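As a rough sketch of counting and of finding the smallest and largest characters with a loop (the variable names are illustrative, and skipping spaces is a choice made for this example, since a space would otherwise count as the smallest character):

```python
saying = 'mary had a little lamb'

count_a = 0        # how many times 'a' appears
smallest = None    # alphabetically first letter seen so far
largest = None     # alphabetically last letter seen so far

for ch in saying:
    if ch == 'a':
        count_a = count_a + 1
    if ch != ' ':
        if smallest is None or ch < smallest:
            smallest = ch
        if largest is None or ch > largest:
            largest = ch

print(count_a)     # 4
print(smallest)    # a
print(largest)     # y
```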
Slicing strings:
Where we want to return a chunk of the characters in a string all at once, we can do this by slicing the string. Within the index operator [] we give the first index number we want to read/return, then the index number immediately after the last one we want to read/return, separated by a colon:
saying = 'mary had a little lamb'
name = saying[0:4]
print name
animal = saying[18:22]
print animal
This outputs: mary, then lamb. Note that if the second index number is larger than the highest available index number, the slice returns all characters from the start point up to the end of the string. Where we want to slice all characters from the start of the string up to a point, or from a certain point up to the end of the string, we can omit the starting index number (e.g. saying[:4] returns mary as before) or the ending index number (e.g. saying[18:] returns lamb as before). If we omit both, as in saying[:], this returns the string from the beginning to the end, i.e. the whole string.
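The slicing rules can be checked with a short sketch like the one below. The variable names are only for illustration, and single-argument print(...) is used so it should run under both Python 2.7 and Python 3.

```python
saying = 'mary had a little lamb'

name = saying[0:4]           # indexes 0, 1, 2 and 3
animal = saying[18:22]       # indexes 18 to 21
past_end = saying[18:100]    # an end index beyond the string is allowed
from_start = saying[:4]      # start index omitted
to_end = saying[18:]         # end index omitted
whole = saying[:]            # both omitted

print(name)                  # mary
print(animal)                # lamb
print(past_end)              # lamb
print(whole == saying)       # True
```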
Using ‘in’ as a logical operator:
As well as using in with for when constructing a definite loop, we can also use in as a way of searching for something within a string, returning either True or False depending on whether it was found. For example:
'x' in saying {is False}
'lamb' in saying {is True}
This expression can be used as a conditional, for example:
if 'x' in saying : print 'Found x' {condition is False so will not print anything}
if 'lamb' in saying : print 'Found lamb' {condition is True so will print 'Found lamb'}
Comparing strings:
As we saw above, strings can be compared and sorted largest/smallest; for lower-case words this follows standard alphabetical order (note that in Python all upper-case letters come before all lower-case letters). When we check and compare strings, we are referring to this alphabetical value. The usual comparison operators are used:
'lamb' == saying[18:] {means “is exactly the same as”}
'a' < 'b' {means “less than” i.e. 'a' comes before 'b' alphabetically}
'tommy' > saying[:4] {means “more than” i.e. 'tommy' comes after 'mary' alphabetically}
Again, we can use any of these comparison operators in a conditional and specify some action dependent on whether the comparison is true or false, for example:
if 'lamb' == saying[18:] : print 'two little lambs'
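A small sketch of these comparisons, with one extra case showing how upper-case letters affect the ordering. The words 'Zebra' and 'apple' and the variable names are made up for the example.

```python
saying = 'mary had a little lamb'

a_before_b = 'a' < 'b'
tommy_after_mary = 'tommy' > saying[:4]

# Upper-case letters sort before lower-case letters, so 'Zebra'
# counts as "less than" 'apple' even though z follows a.
upper_first = 'Zebra' < 'apple'

# Converting to lower case first gives a case-insensitive comparison.
same_ignoring_case = 'Mary'.lower() == 'mary'

print(a_before_b)            # True
print(tommy_after_mary)      # True
print(upper_first)           # True
print(same_ignoring_case)    # True
```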
Built-in functions applicable to strings:
Python includes some useful built-in functions which can be applied to strings (strictly speaking these are string methods, called by adding a dot and the function name after the string), such as .lower(), which returns a new lower-case copy of the original string. This is very useful if you need to search for something while ignoring case, for example:
inp = raw_input('Enter your name')
lowername = inp.lower()
print lowername
Another example is .find() which looks for a string within another string and returns the index number of the first time it appears (or if not found it returns -1), for example:
saying = 'mary had a little lamb'
index_number = saying.find('lamb')
print index_number
(This outputs: 18)
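A brief sketch of the other behaviours of .find(): the -1 result when nothing is found, and the optional second argument that tells it where to start looking. The variable names are illustrative.

```python
saying = 'mary had a little lamb'

first_l = saying.find('l')                # first 'l' in the string
next_l = saying.find('l', first_l + 1)    # search again, starting just after it
missing = saying.find('x')                # 'x' does not appear at all

print(first_l)     # 11
print(next_l)      # 15
print(missing)     # -1

# Check for -1 before using the result as an index.
if missing == -1:
    message = 'x not found'
else:
    message = 'x found at ' + str(missing)
print(message)     # x not found
```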
As well as finding a string within another string, you can also do a find and replace, using .replace('old string', 'new string'), for example:
saying = 'mary had a little lamb'
new_saying = saying.replace('lamb', 'tiger')
print new_saying
This outputs the new string ‘mary had a little tiger’, which is stored in the new variable new_saying. Note that the original string contained within the variable saying has not been changed.
Often we need to clean data, which might include stripping out blank spaces contained at the start and/or end of a string. To strip out blanks from both the start and end of a string we can use the function .strip(). If we just need to strip out blank spaces from the start of the string we can use .lstrip(); if we just need to strip out blank spaces from the end we can use .rstrip().
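To illustrate, here is a short sketch using a made-up string padded with blanks. Square brackets are added around each printed result only so the remaining spaces are visible. When called with no arguments, these functions also strip other whitespace such as tabs and newlines.

```python
raw = '   mary had a little lamb   '

both = raw.strip()      # blanks removed from both ends
left = raw.lstrip()     # blanks removed from the start only
right = raw.rstrip()    # blanks removed from the end only

print('[' + both + ']')     # [mary had a little lamb]
print('[' + left + ']')     # [mary had a little lamb   ]
print('[' + right + ']')    # [   mary had a little lamb]

# A trailing newline, as found on lines read from a file, is stripped too.
line = 'lamb\n'.rstrip()
print('[' + line + ']')     # [lamb]
```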
Another useful function when analysing data is to check if a string begins with a particular character or string. To do this we use the function .startswith(), which is a logical and returns either True or False. Again, we can use this in a conditional statement, for example:
saying.startswith('mary') {is True}
if saying.startswith('mary') : print 'Starts with mary'
In all these functions, case matters. So if we asked saying.startswith(‘Mary’), this would be False since ‘Mary’ != ‘mary’ in Python-land. This is another reason why the .lower() function is so useful.
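One way to ignore case, sketched here with a mixed-case version of the saying invented for the example, is to convert to lower case before testing:

```python
saying = 'Mary Had a Little Lamb'

# The exact test fails because 'M' and 'm' are different characters.
exact = saying.startswith('mary')

# Converting to lower case first makes the test ignore case.
ignoring_case = saying.lower().startswith('mary')

# The same trick works with the in operator.
found = 'LAMB'.lower() in saying.lower()

print(exact)             # False
print(ignoring_case)     # True
print(found)             # True
```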
There’s a whole variety of other built-in functions which can be applied to strings, with lots more info and documentation available at the Python website: https://docs.python.org/2/library/stdtypes.html#string-methods.
Real-world application (1) Parsing text:
This real-world application extracts the domain and extension from an email address contained within a long data string, as can be found in an email header. The example is from the Python Data Structures course referenced below; line numbers have been added at left.
L1: data = 'From somebodys.name@somedomainname.com Sun 10 April 2016 10:51:34'
L2: atsymbol_position = data.find('@')
L3: space_position = data.find(' ', atsymbol_position)
L4: email_domain = data[atsymbol_position + 1 : space_position]
L5: print email_domain
L1 enters a long string of data taken from an email (this would usually be read in from a file rather than typed in as a constant as here; see later posts for how to read data from files). L2 finds the index position of the @-symbol within the email address (in the example it’s 19). L3 finds the index position of the next blank space (' ') to appear after the position of the @-symbol (in the example it’s 38). L4 slices the string, using the index slicing operation [x:y], reading from one index position after the @-symbol (atsymbol_position + 1) up to, but not including, the blank space (space_position). This returns the sub-string of the whole data string that holds the domain name only, and this is printed out in L5.
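The same steps could be wrapped in a small function that also copes with lines that don’t fit the pattern. This is only a sketch: the function name extract_domain and the choice to return None when no @-symbol is found are assumptions made for this example, not part of the course material.

```python
def extract_domain(line):
    """Return the text between '@' and the next space, or None if no '@'."""
    at_position = line.find('@')
    if at_position == -1:
        return None                       # no email address on this line
    space_position = line.find(' ', at_position)
    if space_position == -1:
        space_position = len(line)        # address runs to the end of the line
    return line[at_position + 1:space_position]

data = 'From somebodys.name@somedomainname.com Sun 10 April 2016 10:51:34'
print(extract_domain(data))               # somedomainname.com
print(extract_domain('From a@b.org'))     # b.org
print(extract_domain('no address here'))  # None
```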
References
Book: Python for Informatics – Exploring Information by Charles Severance
Course: Python Data Structures by Univ. of Michigan. Part of the Python for Everybody specialisation.