Solutions to the exercises of part 2

Counting the lines and words in a file

from Scientific.IO.TextFile import TextFile
from string import split

lines = 0
words = 0
for line in TextFile('some_file'):
    lines = lines + 1
    words = words + len(split(line))

print lines, " lines, ", words, " words."

This program can be tested by running the Unix utility wc for comparison.

Counting the carbon atoms in a PDB file

from Scientific.IO.TextFile import TextFile
from Scientific.IO.FortranFormat import FortranFormat, FortranLine
from string import strip

generic_format = FortranFormat('A6')

atom_format = FortranFormat('A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,' +
                            '3X,3F8.3,2F6.2,7X,A4,2A2')

carbons = 0
for line in TextFile('protein.pdb'):
    record_type = FortranLine(line, generic_format)[0]
    if record_type == 'ATOM  ' or record_type == 'HETATM':
        data = FortranLine(line, atom_format)
        atom_name = strip(data[2])
	if atom_name[0] == 'C':
	    carbons = carbons + 1

print carbons, " carbon atoms"

Note the use of strip to remove the spaces around the atom name that are present in a properly formatted PDB file (but not in the PDB files produced by most programs).

Testing this program is less straightforward. One possibility is to apply a slight change to make it count only C-alpha atoms. That must give the number of residues in the file.

Conversion from PDB to XYZ

from Scientific.IO.TextFile import TextFile
from Scientific.IO.FortranFormat import FortranFormat, FortranLine
from string import join, strip

generic_format = FortranFormat('A6')

atom_format = FortranFormat('A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,' +
                            '3X,3F8.3,2F6.2,7X,A4,2A2')

atom_lines = []

for line in TextFile('protein.pdb'):
    record_type = FortranLine(line, generic_format)[0]
    if record_type == 'ATOM  ' or record_type == 'HETATM':
        data = FortranLine(line, atom_format)
        atom_name = strip(data[2])
	element = atom_name[0]
	atom_lines.append(join([element, str(data[8]),
				str(data[9]), str(data[10])]) + '\n')

output_file = TextFile('protein.xyz', 'w')
output_file.write(`len(atom_lines)` + '\n\n')
output_file.write(join(atom_lines, ''))
output_file.close()

There are several ways to avoid reading the file twice for obtaining the number of atom, but the general idea is to collect information into a list and then writing it out in a separate step. In the example above the list contains the text lines of the output files, but it could also be used to store the input lines or some intermediate data representation.

Using the output lines in the list has the advantage of not requiring a second loop to generate the output file; a single write operation plus a call to join is sufficient. In fact, this combination occurs so frequently that a shorthand was introduced which is even more efficient: instead of output_file.write(join(atom_lines, '')) you can write output_file.writelines(atom_lines).

Table of Contents