Continuing the discussion of writing Python programs to manage stages of an analysis pipeline...
In this notebook:
- init and main functions

A suggestion for how to organize a complex project like the 16S analysis pipeline: keep the original input files in a folder named data, and have each stage read from one folder and write its results to a new folder.

For the 16S rRNA pipeline:
- the assemble_pairs script creates a new folder named paired; the app for this stage (pandaseq) reads from data and writes to paired
- the remove_duplicates script creates a folder called unique; the app (cd-hit-dup) reads from paired and writes to unique

For example, if my project directory is called 16S this is what it would look like after I run the scripts for the first two stages:
$ cd 16S
$ ls
assemble_pairs.py*  data/  paired/  remove_duplicates.py*  unique/
The program that removes duplicate sequences from a FASTQ file is named cd-hit-dup.
Running the program is trivial. Just specify the name of the input file and the name of the output file on the command line.
One plan is to give the output file the same name as the input file. You'll be able to tell them apart by which directory they are in. For example, to remove duplicates from D1.fasta the command is
cd-hit-dup -i paired/D1.fasta -o unique/D1.fasta
Another idea would be to add something to the output file's name so it's easier to tell apart from the original:
cd-hit-dup -i paired/D1.fasta -o unique/D1.unique.fasta
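One way to build that second kind of output name is with Python's pathlib. This is just a sketch; the function name output_name is made up for illustration and is not part of the pipeline:

```python
from pathlib import Path

def output_name(input_path, outdir='unique', tag='unique'):
    # Map e.g. paired/D1.fasta to unique/D1.unique.fasta
    p = Path(input_path)
    return str(Path(outdir) / (p.stem + '.' + tag + p.suffix))

print(output_name('paired/D1.fasta'))
```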
! ls
! cd-hit-dup -i paired/D1.fasta -o test.fasta
As was the case with the script for the previous stage, give the names of two folders when you run the script. It will find all the FASTA files in the input folder and figure out what to pass to cd-hit-dup:
% remove_duplicates.py paired unique
cd-hit-dup -i paired/D1.fasta -o unique/D1.fasta
cd-hit-dup -i paired/D2.fasta -o unique/D2.fasta
cd-hit-dup -i paired/D3.fasta -o unique/D3.fasta
...
from glob import glob
glob('paired/*')
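The glob results are all the script needs to build the commands for a stage. A sketch of how that might look, assuming a helper named make_commands (the real remove_duplicates.py may structure this differently):

```python
from glob import glob
from pathlib import Path

def make_commands(indir, outdir, app='cd-hit-dup'):
    # One command per FASTA file:  app -i indir/X.fasta -o outdir/X.fasta
    cmds = []
    for fn in sorted(glob(indir + '/*.fasta')):
        cmds.append([app, '-i', fn, '-o', str(Path(outdir) / Path(fn).name)])
    return cmds

for cmd in make_commands('paired', 'unique'):
    print(' '.join(cmd))
```

Each command is a list of strings, ready to hand to subprocess.run(cmd, check=True) if you want the script to actually launch cd-hit-dup.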
When we define a function we specify the names of the arguments it needs. For example, to calculate body mass index we need to know height and weight, so the def statement for the bmi function includes two names:
def bmi(weight, height):
    return (weight / height**2) * 703
When we call the function we pass two numbers, one for each argument:
bmi(200, 70)
Often it's convenient to give a default value to an argument. For example, if you're going to use your bmi function to track your own BMI, you can define the function so it uses your height as a default. Someone who is 5'10" would write this definition:
def bmi(weight, height = 70):
    return (weight / height**2) * 703
Now the function can be called with just one argument, which will be used for the weight:
bmi(200)
bmi(180)
But it can still be given two arguments if it's ever used for the general situation of any height and weight:
bmi(150, 65)
The same idea can be used to write the function that reads command line arguments. Our init function will call parse_args, a function in Python's argparse library.
If we call parse_args with no arguments, it gets its input from sys.argv (the words the user typed on the command line). If we pass it a list of strings instead, parse_args will use those. Below is a version of the BMI program that uses argparse to get values from the command line.
import argparse

def bmi(weight, height):
    return (weight / height**2) * 703

def init(arglist = None):
    parser = argparse.ArgumentParser(
        description = "Demonstrate argument parsing"
    )
    parser.add_argument('-H', '--height', type=float, required=True, help="Height (in inches)")
    parser.add_argument('-W', '--weight', type=float, required=True, help="Weight (in pounds)")
    return parser.parse_args(arglist)

if __name__ == '__main__':
    args = init()
    print(bmi(args.weight, args.height))
We can copy and paste that code into a text file and run the file from the command line. When we do, the init function will get the height and weight from the command line.
If we want to test the init function we can make a list of words based on what users might type when they run the program:
args = '-H 70 -W 180'
init(args.split())
args = '-h'
init(args.split())
args = '--weight 200'
init(args.split())
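One thing to watch for when testing this way: parse_args does not return normally when it prints the help message or reports a missing argument; it exits the program by raising SystemExit. In a notebook or a test you can catch that exception (the parser below just repeats the one from the program above):

```python
import argparse

parser = argparse.ArgumentParser(description="Demonstrate argument parsing")
parser.add_argument('-H', '--height', type=float, required=True)
parser.add_argument('-W', '--weight', type=float, required=True)

try:
    parser.parse_args('--weight 200'.split())   # missing the required -H
except SystemExit:
    print('parse_args exited: a required argument was missing')
```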
At this point you might be wondering if it's worth the trouble to write a program like assemble_pairs or remove_duplicates.
With command line editing it wouldn't be that hard to run pandaseq 80 times. As soon as you get the command working the way you want on the first pair of files just go back and change D1 to D2 and hit return:
pandaseq -N -f data/D1_r1.fastq -r data/D1_r2.fastq -w paired/D1.fasta ...
pandaseq -N -f data/D2_r1.fastq -r data/D2_r2.fastq -w paired/D2.fasta ...
pandaseq -N -f data/D3_r1.fastq -r data/D3_r2.fastq -w paired/D3.fasta ...

You do have to remember to change all three places D1 is used.
If it takes pandaseq 10 seconds to complete one run, and it takes you 5 seconds to edit and check a command before hitting return again, you could finish the entire project in 80 * 15 = 1200 seconds, or 20 minutes.
It's going to take a lot longer than that to write the assemble_pairs script. So why bother?
No, really, it is.
The script (once it's debugged and you trust it) will make sure the same name is used on each line, and that you don't accidentally skip a data set, or ...
You can run the computation again.
For example, you might want to rerun the pipeline with a new version of pandaseq, or with different options for pandaseq (e.g. minimum output sequence length).

You can have scripts in the analysis pipeline record the commands they run. Have each script append to a log file, so in the future you (and others) know exactly what you did.
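A minimal version of that idea: have each script append a timestamped line to a plain text log before launching a command. This is a sketch; the log file name and the helper log_and_run are made up for illustration:

```python
import datetime
import subprocess

LOGFILE = 'pipeline.log'          # hypothetical name; use one per project

def log_and_run(cmd):
    # Record the exact command (with a timestamp), then execute it.
    stamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    with open(LOGFILE, 'a') as log:
        log.write(stamp + '  ' + ' '.join(cmd) + '\n')
    subprocess.run(cmd, check=True)

# e.g. log_and_run(['cd-hit-dup', '-i', 'paired/D1.fasta', '-o', 'unique/D1.fasta'])
```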
The pipeline I run saves results and log messages in a database.
sqlite> select * from log where script like '%remove%' limit 5;
time                 script                                      event    message
-------------------  ------------------------------------------  -------  ----------------------------------------------------------------------
2014-12-05 18:23:41  /home1/conery/pipping/remove_duplicates.py  start    /home1/conery/pipping/remove_duplicates.py adam.db --force --load_seqs
2014-12-05 18:23:41  /home1/conery/pipping/remove_duplicates.py  exec     cd-hit-dup -i panda/merged.1.fasta -o uniq/unique.1.fasta
2014-12-05 18:23:45  /home1/conery/pipping/remove_duplicates.py  exec     cd-hit-dup -i panda/merged.2.fasta -o uniq/unique.2.fasta
2014-12-05 18:23:52  /home1/conery/pipping/remove_duplicates.py  exec     cd-hit-dup -i panda/merged.3.fasta -o uniq/unique.3.fasta
2014-12-05 18:23:57  /home1/conery/pipping/remove_duplicates.py  exec     cd-hit-dup -i panda/merged.4.fasta -o uniq/unique.4.fasta