In Running Jobs on Oscar, we saw how to create a batch file to run a job on Oscar. However, sometimes you face a situation where you have to perform the same analysis on a number of data sets stored in separate files. If you only have 3 or 4 data files, the obvious answer is to copy and edit your slurm file for each one. If you have 300 or 400 files to run the procedure on, that approach is more than a bit tedious.
The Slurm batch system provides job arrays as a mechanism for dealing with this situation. A job array is a collection of jobs that all run the same program, but with different values of a parameter drawn from a range or list of integers. This page describes job arrays and shows a simple example.
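For example, a minimal job array batch file looks something like this. This is just a sketch; my_analysis and the data file naming are made up for illustration:
===================================================================
#!/bin/bash
#SBATCH --time=10:00:00
#SBATCH --mem=4G
#SBATCH --array=1-10

# Slurm runs this script once for each index in the range and sets
# SLURM_ARRAY_TASK_ID to the current value (1 through 10 here).
# my_analysis and the data file names are made up for this example.
./my_analysis data_set_${SLURM_ARRAY_TASK_ID}.txt
===================================================================
A single sbatch call on this file queues ten jobs, one per index.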
Unfortunately, job array parameters are limited to simple integers. It's possible to work around this limitation with a little programming in your application (a sketch of that approach follows). However, I prefer to write a program (actually, steal from programs I've already written) that creates a series of batch files directly. As a friend of mine says, I would rather write a program to write a program than write a program.
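For completeness, here's roughly what that workaround looks like: keep a text file with one data file name per line, and let the integer index select a line. Again, a sketch only; file_list.txt is a made-up name:
===================================================================
#!/bin/bash
#SBATCH --time=10:00:00
#SBATCH --array=1-7

# file_list.txt (made up here) holds one chromosome file name per line;
# use the task index to pick out the Nth line
CHR_FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" file_list.txt)
module load bowtie2/2.1.0
# %%.* strips everything from the first dot, leaving e.g. chr2L
bowtie2-build ${CHR_FILE} ${CHR_FILE%%.*}
===================================================================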
To make this a bit more concrete, let's consider how we could create bowtie2 index files for each of the Drosophila melanogaster chromosomes. The chromosome sequences are stored in a genome directory, and we'll write the index files to a different path. The idea is to read the genome directory and build a list of the appropriate fasta file names. For each of these files, we create a batch file and queue it to be run.
Here's the code:
===================================================================
import os
import re


def GetFileList(path, extension):
    """
    GetFileList - search a directory path for files with a matching extension

    path - the directory path
    extension - a regular expression that identifies file types
    returns - a list of file names; the path is not prepended to the names
    """
    file_list = []
    for f in os.listdir(path):
        if re.search(extension, f):
            file_list.append(f)
    return file_list


def LaunchBowtieIndex(chrom, tmp_path, genome_path, index_path):
    """
    LaunchBowtieIndex - launch a job on the Oscar batch queue

    chrom - a chromosome name, e.g. chr2L
    tmp_path - path for the slurm file and its stdout/stderr
    genome_path - location of the genome files
    index_path - destination path for the index

    LaunchBowtieIndex creates a slurm file in tmp_path containing the
    commands to run bowtie2-build and submits the slurm file with sbatch.
    """
    job_name = '_'.join(['bowtie_index', chrom])
    slurm_file = ''.join([tmp_path, job_name, '.slurm'])
    slurm_out = ''.join([tmp_path, job_name, '.slurm.out'])
    chr_file = genome_path + chrom + '.fa.masked'
    index = index_path + chrom
    bowtie_command = 'bowtie2-build ' + chr_file + ' ' + index

    # write the batch file: resources, job name, and output files first,
    # then the commands to run
    with open(slurm_file, 'w') as f:
        f.write('#!/bin/bash' + '\n')
        f.write('#SBATCH --time=10:00:00' + '\n')
        f.write('#SBATCH --mem=4G' + '\n')
        f.write('#SBATCH -J ' + job_name + '\n')
        f.write('#SBATCH -o ' + slurm_out + '\n')
        f.write('#SBATCH -e ' + slurm_out + '\n\n')
        f.write('module load bowtie2/2.1.0' + '\n')
        f.write(bowtie_command + '\n')

    # queue the job
    os.system('sbatch ' + slurm_file)


def Main():
    """
    Find the fasta files in the genome directory and build an index for each.
    """
    bowtie_index_path = '/users/thompson/data/fly/bowtie2_index/'
    genome_path = '/users/lalpert/scratch/Illumina/Project_Robert_Reenan_lane5_130523/First_Pass_Masked_Genome/chromFaMasked/'
    tmp_path = '/gpfs/scratch/thompson/illumina/tmp/fly/'

    # the masked chromosome files end in .fa.masked, e.g. chr2L.fa.masked
    chr_files_list = GetFileList(genome_path, r'\.fa\.masked$')
    for chr_file in chr_files_list:
        chrom = chr_file.split('.', 1)[0]
        LaunchBowtieIndex(chrom, tmp_path, genome_path, bowtie_index_path)


if __name__ == '__main__':
    Main()
==================================================================
If you decide to use this, you will need to change the genome path, the index path, and the temporary path that holds the batch files.
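Once the paths are set, running the script submits all the jobs at once; I then watch them with squeue. The script name here is whatever you saved the code under:
===================================================================
python build_bowtie_indexes.py   # made-up name; use your own file name
squeue -u $USER                  # watch the jobs run
===================================================================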
As always, if you have questions or problems, drop me an email or comment here.