In running jobs on oscar, we saw how to create a batch file to run a job on oscar. Sometimes, however, you have to perform the same analysis on a number of data sets stored in multiple files. If you only have 3 or 4 data files, the obvious answer is to copy and edit your slurm file. If you have 3 or 400 files to run the procedure on, this approach is more than a bit tedious.

The Linux Slurm system provides job arrays as a mechanism for dealing with this situation. A job array is a collection of jobs that all run the same program, but with different values of a parameter. The parameter is a range or list of integers. This page describes job arrays and shows a simple example. 

Unfortunately, job arrays are limited to simple integer parameters. It's possible to work around this limitation by programming the mapping from integer to input in your app. However, I prefer to write (actually steal from some of my previously written programs) a script that creates a series of batch files directly. As a friend of mine says, "I would rather write a program to write a program than write a program."
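For comparison, here is a minimal sketch of the in-app workaround. It assumes the program runs as a job array task, so that Slurm has set the SLURM_ARRAY_TASK_ID environment variable, and it uses that integer to pick one file from a sorted list. The function name pick_file_for_task and the sample file names are hypothetical, chosen only for illustration.

```python
import os

def pick_file_for_task(file_names, environ=None):
    """
    pick_file_for_task - choose the input file for this job array task
    file_names - list of candidate file names
    environ - mapping holding SLURM_ARRAY_TASK_ID (defaults to os.environ)

    returns - the file name whose position in the sorted list equals the task id
    """
    if environ is None:
        environ = os.environ

    # Slurm sets SLURM_ARRAY_TASK_ID for each task in a job array.
    task_id = int(environ['SLURM_ARRAY_TASK_ID'])

    # Sort so that every task sees the files in the same order.
    ordered = sorted(file_names)
    return ordered[task_id]

# Hypothetical file names, with the task id passed in explicitly.
names = ['chr3L.fa.masked', 'chr2L.fa.masked', 'chrX.fa.masked']
chosen = pick_file_for_task(names, {'SLURM_ARRAY_TASK_ID': '1'})
```

With three files, the array would be submitted with a zero-based range such as sbatch --array=0-2, so that each task id is a valid index into the list.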

To make this a bit more concrete, let's consider how we could create bowtie2 index files for each of the Drosophila melanogaster chromosomes. The chromosomes are stored in a separate genome directory and we'll write the index files to a different path. The idea is to read the genome directory and create a list of names of the appropriate fasta files. For each of these files, we create a batch file and queue it to be run.

import os
import re

def GetFileList(path, extension):
    """
    GetFileList - search a directory path for files with a matching extension
    path - the directory path
    extension - a regular expression that identifies file types

    returns - a list of file names. The path is not added to the files.
    """
    file_list = []

    for f in os.listdir(path):
        # Keep only the names that match the extension pattern.
        if re.search(extension, f):
            file_list.append(f)

    return file_list

def LaunchBowtieIndex(chrom, tmp_path, genome_path, index_path):
    """
    LaunchBowtieIndex - launch a job on the oscar batch queue
    chrom - a chromosome name, e.g. chr2L
    tmp_path - path for the slurm file and for slurm stdout and stderr
    genome_path - location of the genome files
    index_path - destination path for the index

    LaunchBowtieIndex creates a slurm file in tmp_path containing commands to
    run bowtie2-build and launches the slurm file
    """
    job_name = '_'.join(['bowtie_index', chrom])
    slurm_file = ''.join([tmp_path, job_name, '.slurm'])
    slurm_out = ''.join([tmp_path, job_name, '.slurm.out'])

    chr_file = genome_path + chrom + '.fa.masked'
    index = index_path + chrom
    bowtie_command = 'bowtie2-build ' + chr_file + ' ' + index

    # The with block closes the file, so the whole script is on disk
    # before sbatch reads it.
    with open(slurm_file, 'w') as f:
        f.write('#!/bin/bash' + '\n')
        f.write('#SBATCH --time=10:00:00' + '\n')
        f.write('#SBATCH --mem=4G' + '\n')
        # Use the value of job_name, not the literal string 'job_name'.
        f.write('#SBATCH -J ' + job_name + '\n')
        f.write('#SBATCH -o ' + slurm_out + '\n')
        f.write('#SBATCH -e ' + slurm_out + '\n\n')
        f.write('module load bowtie2/2.1.0' + '\n')
        f.write(bowtie_command + '\n\n')

    os.system('sbatch ' + slurm_file)

def Main():
    """
    Find the fasta files in the genome directory and build an index for each
    """
    bowtie_index_path = '/users/thompson/data/fly/bowtie2_index/'
    genome_path = '/users/lalpert/scratch/Illumina/Project_Robert_Reenan_lane5_130523/First_Pass_Masked_Genome/chromFaMasked/'
    tmp_path = '/gpfs/scratch/thompson/illumina/tmp/fly/'

    # Match the '.fa.masked' suffix that LaunchBowtieIndex appends to the
    # chromosome name.
    chr_files_list = GetFileList(genome_path, r'\.fa\.masked$')

    for chr_file in chr_files_list:
        # 'chr2L.fa.masked' -> 'chr2L'
        chrom = chr_file.split('.', 1)[0]
        LaunchBowtieIndex(chrom, tmp_path, genome_path, bowtie_index_path)

if __name__ == '__main__':
    Main()

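If you decide to use this script, you need to change the index path and the temporary path that holds the batch files. Both paths must end with a slash, because the script builds file names by string concatenation.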
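One way to avoid editing the script for every project is to read the paths from the command line. The following is a minimal sketch, not part of the original script: the function name parse_paths and the option names --genome-path, --index-path and --tmp-path are assumptions made for illustration. It also appends a trailing separator to each path when one is missing.

```python
import argparse
import os

def parse_paths(argv=None):
    """
    parse_paths - read the three directory paths from the command line
    argv - list of arguments (defaults to sys.argv[1:] when None)

    returns - a tuple (genome_path, index_path, tmp_path), each ending
              with a path separator
    """
    parser = argparse.ArgumentParser(
        description='Queue one bowtie2-build job per chromosome file.')
    # Hypothetical option names, not taken from the original script.
    parser.add_argument('--genome-path', required=True)
    parser.add_argument('--index-path', required=True)
    parser.add_argument('--tmp-path', required=True)
    args = parser.parse_args(argv)

    # os.path.join(p, '') adds a trailing separator only if it is missing.
    return tuple(os.path.join(p, '')
                 for p in (args.genome_path, args.index_path, args.tmp_path))

# Example call with made-up directories.
paths = parse_paths(['--genome-path', '/data/genome',
                     '--index-path', '/data/index/',
                     '--tmp-path', '/scratch/tmp'])
```

The three returned values could then be passed along in place of the hard-coded paths set in Main.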

