from Scientific.IO.TextFile import TextFile from string import split lines = 0 words = 0 for line in TextFile('some_file'): lines = lines + 1 words = words + len(split(line)) print lines, " lines, ", words, " words."This program can be tested by running the Unix utility wc for comparison.
from Scientific.IO.TextFile import TextFile from Scientific.IO.FortranFormat import FortranFormat, FortranLine from string import strip generic_format = FortranFormat('A6') atom_format = FortranFormat('A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,' + '3X,3F8.3,2F6.2,7X,A4,2A2') carbons = 0 for line in TextFile('protein.pdb'): record_type = FortranLine(line, generic_format)[0] if record_type == 'ATOM ' or record_type == 'HETATM': data = FortranLine(line, atom_format) atom_name = strip(data[2]) if atom_name[0] == 'C': carbons = carbons + 1 print carbons, " carbon atoms"Note the use of strip to remove the spaces around the atom name that are present in a properly formatted PDB file (but not in the PDB files produced by most programs).
Testing this program is less straightforward. One possibility is to apply a slight change to make it count only C-alpha atoms. That must give the number of residues in the file.
from Scientific.IO.TextFile import TextFile from Scientific.IO.FortranFormat import FortranFormat, FortranLine from string import join, strip generic_format = FortranFormat('A6') atom_format = FortranFormat('A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,' + '3X,3F8.3,2F6.2,7X,A4,2A2') atom_lines = [] for line in TextFile('protein.pdb'): record_type = FortranLine(line, generic_format)[0] if record_type == 'ATOM ' or record_type == 'HETATM': data = FortranLine(line, atom_format) atom_name = strip(data[2]) element = atom_name[0] atom_lines.append(join([element, str(data[8]), str(data[9]), str(data[10])]) + '\n') output_file = TextFile('protein.xyz', 'w') output_file.write(`len(atom_lines)` + '\n\n') output_file.write(join(atom_lines, '')) output_file.close()There are several ways to avoid reading the file twice for obtaining the number of atom, but the general idea is to collect information into a list and then writing it out in a separate step. In the example above the list contains the text lines of the output files, but it could also be used to store the input lines or some intermediate data representation.
Using the output lines in the list has the advantage of not requiring a second loop to generate the output file; a single write operation plus a call to join is sufficient. In fact, this combination occurs so frequently that a shorthand was introduced which is even more efficient: instead of output_file.write(join(atom_lines, '')) you can write output_file.writelines(atom_lines).