One place where the Python language really shines is in the manipulation of strings. This section will cover some of Python's built-in string methods and formatting operations. Such string manipulation patterns come up often in data science work, and are one big perk of Python in this context.
Strings in Python can be defined using either single or double quotations (they are functionally equivalent):
x = 'a string'
y = "a string"
x == y
In addition, it is possible to define multi-line strings using a triple-quote syntax:
multiline = """
one
two
three
"""
With this, let's take a quick tour of some of Python's string manipulation tools.
For basic manipulation of strings, Python's built-in string methods can be extremely convenient. If you have a background working in C or another low-level language, you will likely find the simplicity of Python's methods extremely refreshing. We introduced Python's string type and a few of these methods earlier; here we'll dive a bit deeper.
Python makes it quite easy to adjust the case of a string.
Here we'll look at the upper(), lower(), capitalize(), title(), and swapcase() methods, using the following messy string as an example:
fox = "tHe qUICk bROWn fOx."
To convert the entire string into upper-case or lower-case, you can use the upper() or lower() methods respectively:
fox.upper()
fox.lower()
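A common formatting need is to capitalize just the first letter of each word, or just the first letter of the string.
This can be done with the title() and capitalize() methods respectively. Note that capitalize() works on the string as a whole: it upper-cases the first character and lower-cases the rest, and does not detect sentence boundaries: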
fox.title()
fox.capitalize()
The cases can be swapped using the swapcase() method:
fox.swapcase()
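As a quick check of the case methods above, the following sketch collects the result of each call on the example string. Strings are immutable, so each method returns a new string and fox itself is left unchanged. The results dictionary is only an illustrative container.

```python
fox = "tHe qUICk bROWn fOx."

# Each method returns a new string; fox itself is never modified
results = {
    "upper": fox.upper(),            # 'THE QUICK BROWN FOX.'
    "lower": fox.lower(),            # 'the quick brown fox.'
    "title": fox.title(),            # 'The Quick Brown Fox.'
    "capitalize": fox.capitalize(),  # 'The quick brown fox.'
    "swapcase": fox.swapcase(),      # 'ThE QuicK BrowN FoX.'
}

for name, value in results.items():
    print(f"{name:>10}: {value}")
```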
Another common need is to remove spaces (or other characters) from the beginning or end of the string.
The basic method of removing characters is the strip() method, which strips whitespace from the beginning and end of the line:
line = '         this is the content         '
line.strip()
To remove whitespace from just the right or left side, use rstrip() or lstrip() respectively:
line.rstrip()
line.lstrip()
To remove characters other than whitespace, you can pass the desired characters to the strip() method; the argument is treated as a set of characters to remove, not as a prefix or suffix:
num = "000000000000435"
num.strip('0')
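One detail that is easy to miss: strip() removes the given characters from both ends, and a multi-character argument acts as a set of characters rather than a substring. The sketch below illustrates both points; the padded and url values are made up for the example.

```python
num = "000000000000435"
stripped = num.strip('0')
print(stripped)                 # '435'

# Hypothetical value with zeros on both sides
padded = "0004350"
left_only = padded.lstrip('0')  # only the leading zeros are removed
both_ends = padded.strip('0')   # the trailing zero is removed as well
print(left_only)                # '4350'
print(both_ends)                # '435'

# The argument is a set of characters, not a prefix or suffix:
# any of 'w', '.', 'm', 'o', 'c' is stripped from either end
url = "www.example.com"
core = url.strip('w.moc')
print(core)                     # 'example'
```

When the goal is to drop leading zeros while keeping trailing ones, lstrip('0') is the safer choice.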
The opposite of this operation, adding spaces or other characters, can be accomplished using the center(), ljust(), and rjust() methods.
For example, we can use the center() method to center a given string within a given number of spaces:
line = "this is the content"
line.center(30)
Similarly, ljust() and rjust() will left-justify or right-justify the string within spaces of a given length:
line.ljust(30)
line.rjust(30)
All these methods additionally accept any character which will be used to fill the space. For example:
'435'.rjust(10, '0')
Because zero-filling is such a common need, Python also provides zfill(), which is a special method to left-pad a string with zeros:
'435'.zfill(10)
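To make the padding behavior concrete, here is a small sketch using custom fill characters. It also shows one way zfill() differs from rjust() with a '0' fill: zfill() keeps a leading sign in front of the padding. The number and negative values are illustrative.

```python
number = '435'
right = number.rjust(10, '0')
left = number.ljust(10, '.')
middle = number.center(9, '*')
print(right)    # '0000000435'
print(left)     # '435.......'
print(middle)   # '***435***'

# zfill() is sign-aware, rjust() is not
negative = '-435'
sign_aware = negative.zfill(10)
plain_pad = negative.rjust(10, '0')
print(sign_aware)  # '-000000435'
print(plain_pad)   # '000000-435'
```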
If you want to find occurrences of a certain character in a string, the find()/rfind(), index()/rindex(), and replace() methods are the best built-in methods.
find() and index() are very similar, in that they search for the first occurrence of a character or substring within a string, and return the index of the substring:
line = 'the quick brown fox jumped over a lazy dog'
line.find('fox')
line.index('fox')
The only difference between find() and index() is their behavior when the search string is not found; find() returns -1, while index() raises a ValueError:
line.find('bear')
line.index('bear')
The related rfind() and rindex() work similarly, except they search for the first occurrence from the end rather than the beginning of the string:
line.rfind('a')
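The differences between these four methods can be sketched as follows. Because find() returns 0 for a match at the very start of the string and -1 for no match, its result should not be used directly as a truth value; the in operator is the usual choice for a plain membership test.

```python
line = 'the quick brown fox jumped over a lazy dog'

first_fox = line.find('fox')
missing = line.find('bear')
first_a = line.find('a')
last_a = line.rfind('a')
print(first_fox, missing, first_a, last_a)   # 16 -1 32 35

# index() raises instead of returning -1
try:
    line.index('bear')
    raised = False
except ValueError:
    raised = True
print(raised)                                # True

# A match at position 0 is falsy, so use `in` for membership tests
at_start = line.find('the')
print(at_start)                              # 0
print('the' in line)                         # True
```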
For the special case of checking for a substring at the beginning or end of a string, Python provides the startswith() and endswith() methods:
line.endswith('dog')
line.startswith('fox')
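Both methods also accept a tuple of candidates and return True if any of them matches. A minimal sketch, with a made-up list of filenames:

```python
line = 'the quick brown fox jumped over a lazy dog'

ends_dog = line.endswith('dog')
starts_fox = line.startswith('fox')
ends_any = line.endswith(('cat', 'dog'))   # tuple: any match counts
print(ends_dog, starts_fox, ends_any)      # True False True

# Hypothetical filenames, filtered by extension
filenames = ['notes.txt', 'data.csv', 'plot.png']
tabular = [name for name in filenames if name.endswith(('.csv', '.tsv'))]
print(tabular)                             # ['data.csv']
```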
To go one step further and replace a given substring with a new string, you can use the replace() method.
Here, let's replace 'brown' with 'red':
line.replace('brown', 'red')
The replace() method returns a new string, and will replace all occurrences of the input:
line.replace('o', '--')
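If only some occurrences should change, replace() takes an optional third argument that limits the number of replacements. The sketch below compares the two forms and confirms that the original string is untouched.

```python
line = 'the quick brown fox jumped over a lazy dog'

replaced_all = line.replace('o', '--')
replaced_first = line.replace('o', '--', 1)   # at most one replacement

print(line.count('o'))   # 4
print(replaced_all)      # 'the quick br--wn f--x jumped --ver a lazy d--g'
print(replaced_first)    # 'the quick br--wn fox jumped over a lazy dog'
print(line)              # unchanged
```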
For a more flexible approach to this replace() functionality, see the discussion of regular expressions in Flexible Pattern Matching with Regular Expressions.
If you would like to find a substring and then split the string based on its location, the partition() and/or split() methods are what you're looking for.
Both will return a sequence of substrings.
The partition() method returns a tuple with three elements: the substring before the first instance of the split-point, the split-point itself, and the substring after:
line.partition('fox')
The rpartition() method is similar, but searches from the right of the string.
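A short sketch of both methods follows, including what they return when the split-point is absent. The filename value is hypothetical and only serves to show the left/right difference.

```python
line = 'the quick brown fox jumped over a lazy dog'

before, sep, after = line.partition('fox')
print(repr(before), repr(sep), repr(after))
# 'the quick brown ' 'fox' ' jumped over a lazy dog'

# When the split-point is missing, the whole string lands in one slot
not_found = line.partition('bear')     # (line, '', '')
not_found_r = line.rpartition('bear')  # ('', '', line)

# Hypothetical filename with two dots
filename = 'archive.tar.gz'
from_left = filename.partition('.')    # ('archive', '.', 'tar.gz')
from_right = filename.rpartition('.')  # ('archive.tar', '.', 'gz')
print(from_left, from_right)
```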
The split() method is perhaps more useful; it finds all instances of the split-point and returns the substrings in between.
The default is to split on any whitespace, returning a list of the individual words in a string:
line.split()
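Splitting with no argument is not the same as splitting on a single space: the default collapses runs of whitespace, while an explicit separator yields an empty string for each adjacent pair of separators. The sketch below shows this, plus the optional maxsplit argument; the spaced and row values are made up.

```python
line = 'the quick brown fox jumped over a lazy dog'

# Hypothetical string with irregular spacing
spaced = 'a  b   c'
collapsed = spaced.split()       # ['a', 'b', 'c']
literal = spaced.split(' ')      # ['a', '', 'b', '', '', 'c']
print(collapsed, literal)

# Explicit separators keep empty fields
row = '1,2,,3'.split(',')        # ['1', '2', '', '3']
print(row)

# maxsplit limits the number of splits performed
head = line.split(maxsplit=2)
print(head)   # ['the', 'quick', 'brown fox jumped over a lazy dog']
```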
A related method is splitlines(), which splits on newline characters.
Let's do this with a haiku, popularly attributed to the 17th-century poet Matsuo Bashō:
haiku = """matsushima-ya
aah matsushima-ya
matsushima-ya"""
haiku.splitlines()
Note that if you would like to undo a split(), you can use the join() method, which returns a string built from a separator and an iterable of strings:
'--'.join(['1', '2', '3'])
A common pattern is to use the special character "\n" (newline) to join together lines that have been previously split, and recover the input:
print("\n".join(['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']))
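The round trip can be checked directly, as in the sketch below. Two caveats are worth knowing: join() only accepts strings, so other values need converting first, and a trailing newline in the input is not restored by the join. The numbers list and with_trailing string are illustrative.

```python
haiku = """matsushima-ya
aah matsushima-ya
matsushima-ya"""

lines = haiku.splitlines()
rebuilt = "\n".join(lines)
print(rebuilt == haiku)          # True

# join() requires strings, so convert other types first
numbers = [1, 2, 3]
joined = '--'.join(str(n) for n in numbers)
print(joined)                    # '1--2--3'

# A trailing newline is dropped by splitlines() and not restored
with_trailing = "a\nb\n"
round_trip = "\n".join(with_trailing.splitlines())
print(repr(round_trip))          # 'a\nb'
```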
In the preceding methods, we have learned how to extract values from strings, and to manipulate strings themselves into desired formats.
Another use of string methods is to manipulate string representations of values of other types.
Of course, string representations can always be found using the str() function; for example:
pi = 3.14159
str(pi)
For more complicated formats, you might be tempted to use string arithmetic as outlined in the chapter on Operators:
"The value of pi is " + str(pi)
A more flexible way to do this is to use format strings, which are strings with special markers (noted by curly braces) into which string-formatted values will be inserted. Here is a basic example:
"The value of pi is {}".format(pi)
Inside the {} marker you can also include information on exactly what you would like to appear there.
If you include a number, it will refer to the index of the argument to insert:
"""First letter: {0}. Last letter: {1}.""".format('A', 'Z')
If you include a string, it will refer to the key of any keyword argument:
"""First letter: {first}. Last letter: {last}.""".format(last='Z', first='A')
Finally, for numerical inputs, you can include format codes which control how the value is converted to a string. For example, to print a number as a floating point with three digits after the decimal point, you can use the following:
"pi = {0:.3f}".format(pi)
As before, here the "0" refers to the index of the value to be inserted.
The ":" marks that format codes will follow.
The ".3f" encodes the desired precision: three digits beyond the decimal point, floating-point format.
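A few more format codes, as a sketch rather than a complete reference. The same codes also work with the built-in format() function and, assuming Python 3.6 or later, inside f-strings.

```python
pi = 3.14159

basic = "pi = {0:.3f}".format(pi)        # three digits after the point
reused = "{0}-{1}-{0}".format('A', 'Z')  # an index can be used twice
aligned = "{:>10.2f}|".format(pi)        # right-aligned in 10 columns
zero_pad = "{:08.3f}".format(pi)         # zero-padded to width 8
grouped = "{:,}".format(1234567)         # thousands separator
percent = "{:.1%}".format(0.256)         # percentage, one decimal

for value in (basic, reused, aligned, zero_pad, grouped, percent):
    print(value)

# Equivalent spellings of the first example
via_builtin = "pi = " + format(pi, '.3f')
via_fstring = f"pi = {pi:.3f}"           # requires Python 3.6+
print(via_builtin == basic, via_fstring == basic)
```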
This style of format specification is very flexible, and the examples here barely scratch the surface of the formatting options available. For more information on the syntax of these format strings, see the Format Specification section of Python's online documentation.