Part 2: Dealing with text files

Sequence objects

Sequence objects are objects that contain elements which are referred to by indices: lists, arrays, text strings, etc. The elements of any sequence object seq can be obtained by indexing: seq[0], seq[1]. It is also possible to index from the end: seq[-1] is the last element of a sequence, seq[-2] the next-to-last etc.

Example:

text = 'abc'
print text[1]
print text[-1]

Subsequences can be extracted by slicing: seq[0:5] is a subsequence containing the first five elements, numbered 0, 1, 2, 3, and 4, but not element number 5. Negative indices are allowed: seq[1:-1] strips off the first and last element of a sequence.

Example:

text = 'A somewhat longer string.'
print text[2:10]
print text[-7:-1]

The length of a sequence can be determined by len(seq).

Lists

Lists are sequences that can contain arbitrary objects (numbers, strings, vectors, other lists, etc.):

some_prime_numbers = [2, 3, 5, 7, 11, 13]
names = ['Smith', 'Jones']
a_mixed_list = [3, [2, 'b'], Vector(1, 1, 0)]

Elements and subsequences of a list can be changed by assigning a new value:

names[1] = 'Python'
some_prime_numbers[3:] = [17, 19, 23, 29]

A new element can be appended at the end with list.append(new_element). A list can be reversed with list.reverse() and sorted with list.sort().

Two lists can be concatenated like text strings: [0, 1] + ['a', 'b'] gives [0, 1, 'a', 'b'].

Lists can also be repeated like text strings: 3*[0] gives [0, 0, 0].
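The list operations described above can be tried out in a few lines. This is only a small sketch; the variable names are made up for the illustration, and print is written with parentheses so that the lines also run on newer Python versions. Note that append, sort, and reverse change the list itself rather than returning a new one.

```python
# Illustrative names; any list would do.
primes = [2, 3, 5, 7]
primes.append(11)                  # primes is now [2, 3, 5, 7, 11]

names = ['Smith', 'Jones', 'Adams']
names.sort()                       # in place: ['Adams', 'Jones', 'Smith']
names.reverse()                    # in place: ['Smith', 'Jones', 'Adams']

combined = [0, 1] + ['a', 'b']     # concatenation: [0, 1, 'a', 'b']
zeros = 3*[0]                      # repetition: [0, 0, 0]

print(primes)
print(names)
```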

Tuples

Tuples are much like lists, except that they cannot be changed in any way. Once created, a tuple will always have the same elements. They can therefore be used in situations where a modifiable sequence does not make sense (for example, as a key in a database).

Example:

tuple_1 = (1, 2)
tuple_2 = ('a', 'b')
combined_tuple = tuple_1 + 2*tuple_2
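The following sketch shows what "cannot be changed" means in practice, and one situation where that is useful. Python's built-in dictionary is used here as a stand-in for the database mentioned above; the name distances and its contents are hypothetical.

```python
tuple_1 = (1, 2)
tuple_2 = ('a', 'b')
combined_tuple = tuple_1 + 2*tuple_2   # (1, 2, 'a', 'b', 'a', 'b')

# Assigning to an element of a tuple raises an error.
try:
    tuple_1[0] = 5
    changed = True
except TypeError:
    changed = False

# Because a tuple never changes, it can serve as a lookup key.
# 'distances' is a hypothetical table indexed by pairs of numbers.
distances = {(0, 1): 1.5, (0, 2): 2.5}
print(distances[(0, 1)])
```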

Loops over sequences

It is often necessary to repeat some operation for each element of a sequence. This is called a loop over the sequence.

for prime_number in [2, 3, 5, 7]:
    square = prime_number**2
    print square

Note: The operations that are part of the loop are indicated by indentation.

Loops work with any sequence. Here is an example with text strings:

for vowel in 'aeiou':
    print 10*vowel

Loops over a range of numbers are just a special case:

from Numeric import sqrt

for i in range(10):
    print i, sqrt(i)

The function range(n) returns a list containing the first n integers, i.e. from 0 to n-1. The form range(i, j) returns all integers from i to j-1, and the form range(i, j, k) returns i, i+k, i+2*k, etc., stopping before j is reached.
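The three forms of range can be compared side by side. In this sketch each call is wrapped in list(), which is an addition made for the illustration: on newer Python versions range produces its numbers only on demand, and list() makes them visible all at once.

```python
first_ten = list(range(10))           # 0, 1, ..., 9
three_to_seven = list(range(3, 8))    # 3, 4, 5, 6, 7
every_third = list(range(1, 12, 3))   # 1, 4, 7, 10 (13 would exceed the limit)

print(every_third)
```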

Testing conditions

The most frequent conditions that are tested are equality and order:

equal	`a == b`
not equal	`a != b`
greater than	`a > b`
less than	`a < b`
greater than or equal	`a >= b`
less than or equal	`a <= b`

Several conditions can be combined with and and or, and negations are formed with not. The result of a condition is 1 for "true" and 0 for "false".
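A short sketch of combined conditions, with arbitrary example values for a and b:

```python
a = 3
b = 5

in_order = a < b and b < 10     # true only if both conditions hold
either = a == b or a > 0        # true if at least one condition holds
different = not a == b          # negation of a condition

if in_order and different:
    print("a is smaller than b, and b is below 10")
```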

Most frequently conditions are used for decisions:

if a == b:
    print "equal"
elif a > b:
    print "greater"
else:
    print "less"

There can be any number of elif branches (including none), and the else branch is optional.

Conditions can also be used to control a loop:

x = 1.
while x > 1.e-2:
    print x
    x = x/2

Text files

Text file objects are defined in the module Scientific.IO.TextFile.

Reading

Text files can be treated as sequences of lines, with the limitation that the lines must be read in sequence. The following program will print all lines of a file:

from Scientific.IO.TextFile import TextFile

for line in TextFile('some_file'):
    print line[:-1]

Why line[:-1]? The last character of each line is the new-line character, which we don't want to print (it would create blank lines between the lines of the file).
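The effect of line[:-1] can be seen without reading a file at all. In the sketch below a list of strings stands in for the lines of a file (an assumption made for the illustration). It also shows that line[:-1] removes the final character whatever it is, so a last line without a new-line character loses a letter; the string method rstrip('\n'), which is not used elsewhere in this text, removes only new-line characters.

```python
# Stand-in for the lines of a file; the last line has no new-line character.
lines = ['first line\n', 'second line\n', 'last line']

for line in lines:
    print(line[:-1])      # the third line comes out as 'last lin'

# Alternative that leaves a line without new-line character untouched.
stripped = [line.rstrip('\n') for line in lines]
```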

Text file objects can also deal with compressed files. Any file name ending in ".Z" or ".gz" will be assumed to refer to a compressed file. Programs will of course receive the uncompressed version of the data. You can even use a URL (Uniform Resource Locator, familiar from Web addresses) instead of a filename and thus read data directly via the internet!

Writing

Text files can be opened for writing instead of reading:

from Scientific.IO.TextFile import TextFile

file = TextFile('a_compressed_file.gz', 'w')
file.write('The first line\n')
file.write('And the')
file.write(' second li')
file.write('ne')
file.write('\n')
file.close()

Files opened for writing should be closed at the end to make sure that all data is actually written to the file. At the end of a program, all open files will be closed automatically, but it is better not to rely on this.
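One possible way to make sure that a file is always closed is a try/finally block. The sketch below is an illustration under assumptions: it uses Python's built-in open function as a stand-in for TextFile, and it writes to a hypothetical file in a temporary directory.

```python
import os
import tempfile

# Hypothetical output location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

out = open(path, 'w')
try:
    out.write('The first line\n')
    out.write('And the second line\n')
finally:
    out.close()       # runs even if one of the write calls fails

# Read the file back to check that both lines arrived.
infile = open(path)
written_lines = infile.readlines()
infile.close()
```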

Note that automatic compression is available for writing too, but not URLs, because most servers on the internet do not permit write access for good reasons!

Some useful string operations

The module string contains common string operations that are particularly useful for reading and writing text files. Only the most important ones will be described here; see the Python Library Reference for a complete list.

Getting data out of a string

The function strip(string) removes leading and trailing white space from a string. The function split(string) returns a list of the words in the string, with "words" being anything between stretches of white space (spaces, tabs, new-line characters). The word separator can be changed to any arbitrary string by using split(string, separator).

To extract numbers from strings, use the functions atoi(string) (returns an integer) and atof(string) (returns a real number).

To find a specific text in a string, use find(string, text). It returns the first index at which text occurs in string, or -1 if it doesn't occur at all.
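The operations above can be tried on a single line of text. This sketch assumes a Python version in which strings carry these operations as methods, and it uses the built-in functions int and float in place of atoi and atof; the sample data is made up.

```python
line = '  C1   12   3.75  \n'

cleaned = line.strip()              # 'C1   12   3.75'
words = cleaned.split()             # ['C1', '12', '3.75']
count = int(words[1])               # the integer 12
value = float(words[2])             # the real number 3.75

fields = 'root:x:0:0'.split(':')    # separator changed to ':'
position = cleaned.find('12')       # index of the first occurrence
missing = cleaned.find('xyz')       # -1, since 'xyz' does not occur
```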

Example: The following program reads a file and prints the sum of all numbers in the second column.

from Scientific.IO.TextFile import TextFile
import string

sum = 0.
for line in TextFile('data'):
    sum = sum + string.atof(string.split(line)[1])

print "The sum is: ", sum

Example: The following program prints the name of all user accounts on your computer:

from Scientific.IO.TextFile import TextFile
from string import split

for line in TextFile('/etc/passwd'):
    print split(line, ':')[0]

Converting data into a string

Any Python object (numbers, strings, vectors, ...) can be turned into a string by writing it in inverse apostrophes:

from Scientific.Geometry import Vector

a = 42
b = 1./3.
c = Vector(0, 2, 1)

print `a` + ' ' + `b` + ' ' + `c`

This program prints "42 0.333333333333 Vector(0,2,1)".

Another way to convert anything into a string is the function str(data) (which is not in the module string, but part of the core language). The two methods do not always give the same result, although they do for numbers. In general, str(data) produces a "nice" representation of the value, whereas the inverse apostrophes return a representation that indicates not only the value, but also the type of the data. For example, if s is a string, then str(s) is the same as s, whereas `s` returns s enclosed in apostrophes to indicate that the data is a string. In practice, try both and use the one you like best.
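The difference is easiest to see with a string. The sketch below uses the built-in function repr(data), which gives the same result as the inverse apostrophes and is also available in Python versions that no longer accept them.

```python
s = 'abc'
a = 42

plain = str(s)        # the three characters abc
quoted = repr(s)      # five characters: abc enclosed in apostrophes

print(plain)
print(quoted)

# For an integer, both methods give the same string.
same_for_numbers = str(a) == repr(a)
```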

The function join(words) takes a list of strings and returns the concatenation with words separated by a space. The last line of the preceding program could therefore simply be print string.join([`a`, `b`, `c`]). The word separator can again be changed to an arbitrary string by passing it as a second argument.

The functions lower(string) and upper(string) convert a string to lower- or uppercase letters.

The function ljust(string, width) returns a string of at least width characters in which string is left-justified. The functions rjust and center work similarly, but place the supplied text at the end or in the center.
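A brief sketch of these formatting operations, written with string methods on the assumption that the Python version in use provides them; the sample words are arbitrary. The brackets in the printed line only serve to make the padding visible.

```python
words = ['42', 'abc', 'xyz']

spaced = ' '.join(words)        # '42 abc xyz'
with_commas = ','.join(words)   # separator changed: '42,abc,xyz'

shout = 'Python'.upper()        # 'PYTHON'
quiet = 'Python'.lower()        # 'python'

left = 'ab'.ljust(6)            # 'ab' followed by four spaces
right = 'ab'.rjust(6)           # four spaces followed by 'ab'
middle = 'ab'.center(6)         # two spaces on either side of 'ab'

print('[' + left + '][' + right + '][' + middle + ']')
```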

Some useful functions not described here

Python has a very large collection of code for dealing with more or less specialized forms of text. It is impossible to describe them all here, or even list them. You can find all the information you need in the Python Library Reference.

First, there are many functions in the module string that have not been described here.

An important set of functions deals with finding and changing data according to patterns, called regular expressions. These functions are located in the module re. They are very powerful, but the syntax of regular expressions (also used by Unix tools like grep and editors like vi and emacs) is a bit complicated.
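To give an impression of what regular expressions look like, here is a minimal sketch using two functions of the module re. The input strings and the patterns are made up for the illustration.

```python
import re

text = 'x = 1.5, y = -0.25, n = 3'

# Find every number that contains a decimal point, with optional minus sign.
reals = re.findall(r'-?\d+\.\d+', text)

# Replace every stretch of white space by a single space.
tidy = re.sub(r'\s+', ' ', 'a   b \t c')

print(reals)
```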

The module htmllib contains functions to extract data from HTML files, which are typically used on the World-Wide Web. The module formatter provides a way to create HTML files.

Fortran-formatted files

Fortran programs use text files that are emulations of punched card decks, and therefore use different formatting conventions. Items of data are identified by their position in the line rather than by separators like spaces. The layout of a line is defined by a format specification. For example, 2I4, 3F12.5 indicates two four-character fields containing integers, followed by three twelve-character fields containing real numbers with five decimal digits.
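What the format 2I4, 3F12.5 means for a line of text can be shown by hand, without any special module. The sketch below is an illustration only: it builds a line with Python's % formatting, taken here as a rough counterpart of the Fortran format, and then cuts the line into fields by position. The data values are made up.

```python
# Two integers in 4-character fields, three reals in 12-character fields.
line = '%4d%4d%12.5f%12.5f%12.5f' % (1, 42, 1.5, -0.25, 100.0)

# The integer fields occupy characters 0-3 and 4-7.
integers = [int(line[0:4]), int(line[4:8])]

# The real fields start at characters 8, 20, and 32.
reals = []
for start in range(8, 44, 12):
    reals.append(float(line[start:start+12]))

print(line)
```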

The module Scientific.IO.FortranFormat takes care of converting between Python data objects and Fortran-formatted text strings. The first step is the creation of a format object, representing the Fortran format specification. The second step is creating a Fortran-formatted line from a list of data values (for output), or the inverse operation (for input).

The following example reads a PDB file and prints the name and position of each atom. Note that each line must be analyzed twice: the first time only the initial six characters are extracted, to identify the record type, and in the case of an atom definition the actual data is extracted using the specific format.

from Scientific.IO.TextFile import TextFile
from Scientific.IO.FortranFormat import FortranFormat, FortranLine
from Scientific.Geometry import Vector

generic_format = FortranFormat('A6')

atom_format = FortranFormat('A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,' +
                            '3X,3F8.3,2F6.2,7X,A4,2A2')
# Contents of the fields:
# record type, serial number, atom name, alternate location indicator,
# residue name, chain id, residue sequence number, insertion code,
# x coordinate, y coordinate, z coordinate, occupancy, temperature factor,
# segment id, element symbol, charge

for line in TextFile('protein.pdb'):
    record_type = FortranLine(line, generic_format)[0]
    if record_type == 'ATOM  ' or record_type == 'HETATM':
        data = FortranLine(line, atom_format)
        atom_name = data[2]
        position = Vector(data[8:11])
        print "Atom ", atom_name, " at position ", position

The next example shows how to write data in a Fortran-compatible way. The output file contains a sequence of numbers in the first column and their square roots in the second column.

from Scientific.IO.TextFile import TextFile
from Scientific.IO.FortranFormat import FortranFormat, FortranLine
from Numeric import sqrt

format = FortranFormat('2F12.5')

file = TextFile('sqrt.data', 'w')
for n in range(100):
    x = n/10.
    file.write(str(FortranLine([x, sqrt(x)], format)) + '\n')
file.close()

Exercises

Write a program that counts the number of lines and words in a file.
Write a program that reads a PDB file and counts the number of carbon atoms (i.e. the atoms whose name begins with 'C').
Write a program that converts a PDB file to the XYZ format used by XMol (and some other programs). The XYZ format is very simple: The first line contains the number of atoms, the second line a comment (use whatever you like), and the remaining lines contain one atom each, with four entries: first the element symbol (e.g. 'C' for carbon), and then the coordinates x, y, and z. The entries are separated by one or more spaces. This is an example for a single water molecule:
```
3
One water molecule
O 0.  0.     0.
H 0.  0.957  0.
H 0. -0.24  -0.927
```
Note that the data does not have to be lined up in columns.
