Scripts with Python, Part II

Continuing the discussion of writing Python programs to manage stages of an analysis pipeline...

In this notebook:

  • stage 2: removing duplicate sequences
  • closer look at init and main functions
  • the "big picture" -- do we really need a Python program to manage a pipeline?

Project Subdirectories

A suggestion for how to organize a complex project like the 16S analysis pipeline:

  • collect all your data files and put them in a subfolder called data
  • create one subfolder for each stage of the analysis
  • the script that manages a stage reads from one folder and writes new data into a separate folder

For the 16S rRNA pipeline:

  • FASTQ files are in a subfolder named data
  • the assemble_pairs script creates a new folder named paired; the app for this stage (pandaseq) reads from data and writes to paired
  • the remove_duplicates script creates a folder called unique; the app (cd-hit-dup) reads from paired and writes to unique

For example, if my project directory is called 16S this is what it would look like after I run the scripts for the first two stages:

$ cd 16S
$ ls
assemble_pairs.py*        
data/             
paired/
remove_duplicates.py*
unique/
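A few lines of Python can set up this layout automatically. Here is a minimal sketch using pathlib; the function name make_project_dirs and the default stage names are just for illustration, not part of the actual scripts:

```python
from pathlib import Path

def make_project_dirs(project, stages=('data', 'paired', 'unique')):
    """Create the project folder and one subfolder per pipeline stage."""
    root = Path(project)
    for name in stages:
        # exist_ok lets this be re-run without raising an error
        (root / name).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in root.iterdir() if p.is_dir())

print(make_project_dirs('16S'))   # → ['data', 'paired', 'unique']
```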

Stage 2: Dereplication

The program that removes duplicate sequences from a FASTA file is named cd-hit-dup.

Running the program is trivial. Just specify the name of the input file and the name of the output file on the command line.

One plan is to give the output file the same name as the input file. You'll be able to tell them apart by which directory they are in. For example, to remove duplicates from D1.fasta the command is

cd-hit-dup -i paired/D1.fasta -o unique/D1.fasta

Another idea is to add something to the output file name so it's easier to tell apart from the original:

cd-hit-dup -i paired/D1.fasta -o unique/D1.unique.fasta
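Building that second style of name is easy with os.path. The helper below is a hypothetical sketch (unique_name is not part of the actual script): it splits off the extension and inserts a tag before it:

```python
import os.path

def unique_name(input_path, out_dir='unique', tag='unique'):
    """Build an output path like unique/D1.unique.fasta from paired/D1.fasta."""
    base = os.path.basename(input_path)     # e.g. 'D1.fasta'
    stem, ext = os.path.splitext(base)      # ('D1', '.fasta')
    return os.path.join(out_dir, f'{stem}.{tag}{ext}')

print(unique_name('paired/D1.fasta'))   # → unique/D1.unique.fasta
```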
In [1]:
! ls
D1.fasta             Untitled1.ipynb      huh                  paired               run_panda.ipynb      stage2.ipynb         test.fasta.clstr
Remote Shell.pdf     aciss.pdf            log.txt              pooled               scripts.html         stage3.ipynb         test.fasta2.clstr
Untitled.ipynb       data                 otus                 run_cd_hit_dup.ipynb scripts.ipynb        test.fasta           unique
In [2]:
! cd-hit-dup -i paired/D1.fasta -o test.fasta
Total number of sequences: 216
Longest: 253
Shortest: 171
Sorted by length ...
Start clustering duplicated sequences ...
Number of clusters found: 85
Number of clusters with abundance above the cutoff (=1): 85
Number of clusters with abundance below the cutoff (=1): 0
Writing clusters to files ...
Done!

Script

As was the case with the script for the previous stage, give the names of two folders when you run the script. It will find all the FASTA files in the input folder and figure out what to pass to cd-hit-dup:

% remove_duplicates.py paired unique
cd-hit-dup -i paired/D1.fasta -o unique/D1.fasta
cd-hit-dup -i paired/D2.fasta -o unique/D2.fasta
cd-hit-dup -i paired/D3.fasta -o unique/D3.fasta
...
In [3]:
from glob import glob
In [4]:
glob('paired/*')
Out[4]:
['paired/D1.fasta',
 'paired/D2.fasta',
 'paired/D3.fasta',
 'paired/F1.fasta',
 'paired/F2.fasta',
 'paired/F3.fasta']
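Putting glob together with subprocess gives the core of the script. This is only a sketch of the idea, not the actual remove_duplicates.py; the dry_run flag is an addition so the commands can be inspected without cd-hit-dup installed:

```python
import os.path
import subprocess
from glob import glob

def remove_duplicates(in_dir, out_dir, dry_run=True):
    """Build (and optionally run) one cd-hit-dup command per FASTA file."""
    commands = []
    for path in sorted(glob(os.path.join(in_dir, '*.fasta'))):
        out = os.path.join(out_dir, os.path.basename(path))
        cmd = ['cd-hit-dup', '-i', path, '-o', out]
        commands.append(' '.join(cmd))
        print(' '.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if cd-hit-dup fails
            subprocess.run(cmd, check=True)
    return commands
```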

★ Defining Functions with Default Arguments

When we define a function we specify the names of the arguments it needs. For example, to calculate body mass index we need to know height and weight, so the def statement for the bmi function includes two names:

In [5]:
def bmi(weight, height):
    return (weight / height**2) * 703

When we call the function we pass two numbers, one for each argument:

In [6]:
bmi(200, 70)
Out[6]:
28.693877551020407

Often it's convenient to give a default value to an argument. For example, if you're going to use your bmi function to track your own BMI, you can define the function so it uses your height as a default. Someone who is 5'10" would write this definition:

In [7]:
def bmi(weight, height = 70):
    return (weight / height**2) * 703

Now the function can be called with only one argument, which will be used as the weight:

In [8]:
bmi(200)
Out[8]:
28.693877551020407
In [9]:
bmi(180)
Out[9]:
25.82448979591837

But it can still be given two arguments if it's ever used for the general situation of any height and weight:

In [10]:
bmi(150, 65)
Out[10]:
24.958579881656807

Default Arguments and the Command Line

The same idea can be used to write the function that reads command line arguments. Our init function will call parse_args, a method of the ArgumentParser objects defined in Python's argparse library.

  • normally we call parse_args with no arguments, and it gets its input from sys.argv
  • if we pass a list of strings, parse_args will use those instead

Below is a version of the BMI program that uses argparse to get values from the command line.

In [ ]:
import argparse

def bmi(weight, height):
    return (weight / height**2) * 703

def init(arglist = None):
    parser = argparse.ArgumentParser(
        description = "Demonstrate argument parsing"
        )
    parser.add_argument('-H', '--height', type=float, required=True, help="Height (in inches)")
    parser.add_argument('-W', '--weight', type=float, required=True, help="Weight (in pounds)")
    return parser.parse_args(arglist)

if __name__ == '__main__':
    args = init()
    print(bmi(args.weight, args.height))

We can copy and paste that code into a text file and run the file from the command line -- when we do, the init function will get the height and weight from the command line.

If we want to test the init function we can make a list of words based on what users might type when they run the program (note that -h prints the help message and then exits):

In [ ]:
args = '-H 70 -W 180'
init(args.split())
In [ ]:
args = '-h'
init(args.split())
In [ ]:
args = '--weight 200'
init(args.split())

Why Write a Script?

At this point you might be wondering if it's worth the trouble to write a program like assemble_pairs or remove_duplicates.

With command line editing it wouldn't be that hard to run pandaseq 80 times. As soon as you get the command working the way you want on the first pair of files just go back and change D1 to D2 and hit return:

pandaseq -N -f data/D1_r1.fastq -r data/D1_r2.fastq -w paired/D1.fasta ...
pandaseq -N -f data/D2_r1.fastq -r data/D2_r2.fastq -w paired/D2.fasta ...
pandaseq -N -f data/D3_r1.fastq -r data/D3_r2.fastq -w paired/D3.fasta ...
You do have to remember to change all three places where D1 is used.

If it takes pandaseq 10 seconds to complete one run, and it takes you 5 seconds to edit and check a command before hitting return again, you could finish the entire project in 80 * 15 = 1200 seconds, or 20 minutes.

It's going to take a lot longer than that to write the assemble_pairs script. So why bother?

It's More Fun

No, really, it is.

It's More Reliable

The script (once it's debugged and you trust it) will make sure the same name is used in each command, that you don't accidentally skip a data set, or ...

It's Reproducible

You can run the computation again.

  • You might lose the files with the output and need to rerun all the calls to pandaseq
  • Something convinces you to come back to this step and change one of the parameters used by pandaseq (e.g. minimum output sequence length)
  • You want to run the same analysis again on a new data set

It Provides a Record

You can have the scripts in the analysis pipeline record the commands they run. Have each script append to a log file, so that in the future you (and others) know exactly what you did.
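A minimal sketch of that idea, assuming a plain-text log file (the function name log_command is hypothetical):

```python
import sys
from datetime import datetime

def log_command(cmd, logfile='pipeline.log'):
    """Append a timestamped record of a command to a plain-text log file."""
    stamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    with open(logfile, 'a') as f:
        # record when, which script, and what command was run
        f.write(f'{stamp}\t{sys.argv[0]}\t{cmd}\n')
```

A script would call log_command just before each subprocess call, so the log is written even if the app fails partway through.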

The pipeline I run saves results and log messages in a database.

sqlite> select * from log where script like '%remove%' limit 5;
time                 script                                      event       message                                                               
-------------------  ------------------------------------------  ----------  ----------------------------------------------------------------------
2014-12-05 18:23:41  /home1/conery/pipping/remove_duplicates.py  start       /home1/conery/pipping/remove_duplicates.py adam.db --force --load_seqs
2014-12-05 18:23:41  /home1/conery/pipping/remove_duplicates.py  exec        cd-hit-dup -i panda/merged.1.fasta -o uniq/unique.1.fasta             
2014-12-05 18:23:45  /home1/conery/pipping/remove_duplicates.py  exec        cd-hit-dup -i panda/merged.2.fasta -o uniq/unique.2.fasta             
2014-12-05 18:23:52  /home1/conery/pipping/remove_duplicates.py  exec        cd-hit-dup -i panda/merged.3.fasta -o uniq/unique.3.fasta             
2014-12-05 18:23:57  /home1/conery/pipping/remove_duplicates.py  exec        cd-hit-dup -i panda/merged.4.fasta -o uniq/unique.4.fasta  
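For comparison, a table like the one above can be created and queried from Python with the standard sqlite3 module. The schema here is inferred from the sample output, so treat it as a sketch rather than the actual pipeline code:

```python
import sqlite3
from datetime import datetime

# in-memory database for demonstration; a real pipeline would use a file
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE log (time TEXT, script TEXT, event TEXT, message TEXT)')
con.execute('INSERT INTO log VALUES (?, ?, ?, ?)',
            (datetime.now().isoformat(' ', 'seconds'),
             'remove_duplicates.py', 'exec',
             'cd-hit-dup -i paired/D1.fasta -o unique/D1.fasta'))
rows = con.execute("SELECT event, message FROM log WHERE script LIKE '%remove%'").fetchall()
print(rows)
```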
In [ ]: