running jobs on oscar

This may be old news to you. If so, just ignore this message.

CCV doesn't want you to run long-running or computationally intensive jobs on the login nodes on oscar. They expect you to submit jobs to one of the queues. Your job will then be executed when the resources you have requested, such as compute nodes with a given amount of memory, become available.

Some of the jobs we run are both compute and memory intensive, and long-running. In addition, many of the kinds of tasks we need to perform can be described as embarrassingly parallel. For example, we often have to perform the same operations on each chromosome or gene in a genome. If you have 10 genes you want to operate on, the speedup of running all 10 in parallel over running them one after the other is obvious.
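As a sketch of that pattern, a short bash loop can submit one job per chromosome. The script names below are hypothetical, and the echo makes it a dry run so you can check the commands before actually submitting anything:

```shell
#!/bin/bash
# Submit one alignment job per chromosome (script names are examples).
# The "echo" makes this a dry run; remove it to actually call sbatch.
for chr in chr2L chr2R chr3L chr3R chr4 chrX; do
    echo sbatch "bowtie2_${chr}.slurm"
done
```

All six jobs then sit in the queue at once, and the scheduler runs them as resources free up.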

To launch a job on the oscar cluster, you use the sbatch program. The basics of running jobs on oscar are described at https://ccv.brown.edu/doc/running-jobs.html. I just want to point out a few things that I have found useful when running on oscar. sbatch sends jobs to the batch queue by default. There are other queues, but we can deal with them later if needed.

One problem that you will run into is data management. We often deal with the output from a number of different experiments. The Illumina experiments from the Reenan lab have an ID associated with them, something like YAS0003_AAGACG_L003_R1_001. The reads from a particular experiment will be in a file called something like YAS0003_AAGACG_L003_R1_001.fastq. In analyzing the data from a particular run, I typically align the reads to each chromosome separately using bwa or bowtie2.

I use the Linux file system to keep track of the results of individual experiments. For example, I have a directory called /users/thompson/data/fly/bowtie_output/YAS0003_AAGACG_L003_R1_001/. It contains output files like chr2L.sam from bowtie2. It doesn't matter what naming scheme you use, but it's a good idea to use something and be consistent. It's amazing how much output and data you can build up in a short time. Putting a readme file in each directory describing what the directory contains is also a good idea. You could get fancy and use something like SQL to manage a database of your results, but I have found that by the time you get a database designed, we have moved on to something else.
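A minimal sketch of that layout, using the run ID above and a path under your own home directory (the exact path is just an example):

```shell
#!/bin/bash
# One output directory per experiment, named after the run ID,
# with a short README describing what the directory holds.
run=YAS0003_AAGACG_L003_R1_001
outdir="$HOME/data/fly/bowtie_output/$run"
mkdir -p "$outdir"
echo "bowtie2 alignments of $run reads against fly chromosomes" > "$outdir/README"
```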

Output and fastq files are LARGE. Bowtie2, Python, and Perl can all read gzipped files, so you may want to gzip your output and fastq data to save space.
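For example, you can compress a finished run and align straight from the compressed file; the file and index names below are just the examples from this note:

```shell
# Compress the reads; bowtie2 accepts the .gz file directly,
# so you never need to decompress it again just to align.
gzip YAS0003_AAGACG_L003_R1_001.fastq

# Later, pass the compressed file straight to -U:
# bowtie2 --phred33 -x bowtie2_index/chr2L \
#     -U YAS0003_AAGACG_L003_R1_001.fastq.gz -S chr2L.sam
```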

I have attached a script file for indexing a chromosome and running bowtie2. I've commented it so you can see what it does. If you want to try it, you have to change the output directories to your own. There are also sample batch scripts for a variety of tasks in the batch_scripts/ directory under your home directory.

The bowtie2 aligner runs considerably faster in parallel. I usually run it with 8 threads, requesting 8 cores on a single node. Bowtie2 requires an index file of the genome or chromosome you are aligning against. The attached script assumes that the fly chromosomes have already been indexed. You can also build an index of the entire genome. In some cases having the chromosomes indexed separately is convenient; at other times operating on the entire genome is useful.
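For reference, indexing is done with bowtie2-build, which takes a fasta file and a basename for the index files it writes. A sketch of both styles (the whole-genome fasta name is hypothetical):

```shell
# Index a single masked chromosome (writes chr2L.*.bt2 index files):
bowtie2-build chr2L.fa.masked chr2L

# Or index the entire genome at once (fasta name is an example):
bowtie2-build dmel_all_chromosomes.fa dmel
```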

To run the script, you just enter 
sbatch bowtie2_chr2L.fa.masked.YAS0003_AAGACG_L003_R1_001.slurm 
at an oscar command prompt.

Here's the code for the script:

==========================================================
#!/bin/bash

# comments start with #
# batch commands begin with #SBATCH
# This is a bash script so the top line is needed

# launch this file with
# sbatch bowtie2_chr2L.fa.masked.YAS0003_AAGACG_L003_R1_001.slurm

# Request 10 hours of runtime: this is overkill for bowtie2.
# You should probably ask for only 1 hour.

#SBATCH --time=10:00:00

# We're going to ask for 4G of memory and 8 cores (tasks).
# 4G is overkill.

#SBATCH -n 8
#SBATCH --mem=4G

# Specify a job name:
# job names help you keep track of what is running.
# The names show up when you run myq

#SBATCH -J bowtie.YAS0003_AAGACG_L003_R1_001

# Specify a stdout and stderr output file; write both to the same file.
# This is useful for tracking down errors.
# I write these files to a temp directory and
# delete them later if all goes well.
# If you don't use your own file names, files with odd names
# based on the job number will show up.
# That makes things hard to keep track of.
# If you run this, you need to change the directories below.

#SBATCH -o /gpfs/scratch/thompson/illumina/tmp/fly/bowtie_chr2L.fa.masked.YAS0003_AAGACG_L003_R1_001.slurm.out
#SBATCH -e /gpfs/scratch/thompson/illumina/tmp/fly/bowtie_chr2L.fa.masked.YAS0003_AAGACG_L003_R1_001.slurm.out

# Now we describe what we want to do.
# CCV doesn't have all programs and all versions available by default.
# They use a module system. Since this is a bash script, it will inherit your
# environment, so you could load bowtie2 before running,
# but I usually forget, so
# we'll make sure the right version is available.
#
# Modules are described at https://ccv.brown.edu/doc/software-modules.html

# load the path to bowtie2
module load bowtie2/2.1.0

# Bowtie2 is described at http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
# We'll align reads against D. melanogaster chromosome 2L.
# This command uses 8 threads, i.e. runs in parallel on the 8 cores we requested.
# -x indicates the index location for this chromosome
# -U is the file containing the single end reads.
# -S is the sam output file.
# sam files are LARGE. You may want to zip it to save space.
# --phred33 says that the read quality scores are based on the phred33 scale.
# More about that later. See https://en.wikipedia.org/wiki/Phred_quality_score
# Run the command. You can run multiple commands sequentially.

# If you want to try this, you need to change the -S option
# to point to your own directory.

bowtie2 --phred33 --threads 8 -x /users/thompson/data/fly/bowtie2_index/chr2L -U /users/lalpert/data/lalpert/LoxP_RNA/LoxP_RNA1/YAS0003_AAGACG_L003_R1_001.fastq -S /users/thompson/data/fly/bowtie_output/YAS0003_AAGACG_L003_R1_001/chr2L.sam

echo Finished execution at `date`


====================================================================

You can keep track of your jobs by typing myq at a command prompt. You can kill a job with scancel job_number, where job_number is the number that myq shows for your job.
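For example (the job number here is made up):

```shell
myq               # list your queued and running jobs, with job numbers
scancel 1234567   # cancel the job with that number
```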

More later,
