Getting Started with flies

I'll write up the procedure that I used to get the context specific error rates for phiX. In the mean time, here are some of the tools that we have been using. Folks might want check them out before getting started on any analysis.

If you have questions, no matter how trivial they may seem, about any of this, please drop me an e-mail at william_thompson_1@brown.edu

The fly genome is in Lauren's directory on oscar. If you don't have an account for oscar, you can get one by filling out the form linked from this page https://www.ccv.brown.edu/start/account. You want to indicate that you are working with Prof. Lawrence and me and are part of CCMB when you fill out the form. This will give you access to the CCMB condo and more disk storage. The oscar user manual is at https://ccv.brown.edu/doc/getting-started.html

Usually Lauren or I receive files on oscar containing the  reads. The reads come from the genome center's Illumina sequencing machine. This document is kind of old, but covers the basics http://www.brown.edu/Research/CGP/download/illumina-public/Overview-Illumina-Next-Gen-sequencing.pdf of sequencing. The attached file has some more info about Illumina sequencing.

We get the reads in the form of compressed fastq files. The fastq format is described at https://en.wikipedia.org/wiki/FASTQ_format

The reads must be aligned to the genome. Since we're working with Drosophila melanogaster, the fly genome files are  located in Lauren's directory on oscar at /users/lalpert/scratch/Illumina/Project_Robert_Reenan_lane5_130523/First_Pass_Masked_Genome/chromFaMasked/
The files for each chromosome are listed as chrxx.fa and the files named chrxx.fa.masked have had the repeat regions masked. The genome files are in Fasta format, https://en.wikipedia.org/wiki/FASTA_format. You can find out more about the fly genome at http://genome.ucsc.edu/cgi-bin/hgGateway?hgsid=363003373.

The reads have to be mapped to the fly genome. We have been using bwa to align the sequences. 
http://bio-bwa.sourceforge.net/  The bwa commands are at http://bio-bwa.sourceforge.net/bwa.shtml. The latest version of bwa is available on oscar at /users/thompson/bwa-0.7.5a/bwa or you can download your own. The system version installed on oscar is an older version. You need to index the genome files before aligning with bwa. The index files are on oscar at /users/thompson/data/fly/bwa_index/ You can use those or create your own index.  bwa produces .sam and .bam files. The sam file format is human-readable. bam files are compressed sam files. Sam files are described at http://samtools.sourceforge.net/SAMv1.pdf. Sam and bam files can be manipulated with samtools http://samtools.sourceforge.net/  This page http://davetang.org/wiki/tiki-index.php?page=SAMTools is useful when using samtools.

Here's a typical pipeline for aligning reads to the fly genome. We align the reads one chromosome at a time.
# align the reads to the genome
/users/thompson/bwa-0.7.5a/bwa aln -t 4 $db_index $fastq_file > $sai_file  
# make a sam file
/users/thompson/bwa-0.7.5a/bwa samse $db_index $sai_file $fastq_file > $sam_file 
# make a sorted bam file
/gpfs/runtime/opt/samtools/0.1.18/bin/samtools view -bS $sam_file | $SAMTOOLS sort - $bam_name
# index the bam file
/gpfs/runtime/opt/samtools/0.1.18/bin/samtools index $bam_file $bam_index
# zip the sam file if we need it later or else delete it
gzip -v $sam_file

replace the files starting with "$" with your file names. $db_index is the bam index file, something like /users/thompson/data/fly/bwa_index/chr2L.fa.masked. $fastq_file is the fastq file. $sai_file is a sam index. $bam_file is the final bam file and $bam_index is its index. We don't really need the sam file, so it could be deleted. It can always be reproduced from the bam file. Check out http://davetang.org/wiki/tiki-index.php?page=SAMTools to see the details of these commands.

Once you have the bam file, you can process the alignments in various ways. Lauren is working on the EM algorithm. I have been counting mismatches in order to determine error rates.

Here are some useful tools for analyzing the alignment:
Python
biopython http://biopython.org/wiki/Main_Page Biopython has tools for reading fasta, fastq etc. files. It's installed on oscar. It's big and has lots of useful routines. Read the tutorial.
Pysam https://code.google.com/p/pysam/ This is a Python wrapper for samtools. It's faster than writing your own. Read the docs carefully. I got tricked by the docs because I wasn't careful enough.
Numpy/Scipy/Matplotlib etc. http://www.scipy.org/ - this is your one stop shopping place for data analysis in python. 
Ipython - make python easier to use. Get it at scipy.org or it's already on oscar.

Matlab
The bioinformatics toolbox will read sam files. Check out http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/ref/bioindexedfileclass.html I haven't used this, but it might be useful if you are a Matlab fan.

Perl
BioPerl http://www.bioperl.org/wiki/Main_Page Like BioPython, but even bigger.
Bio::DB::Sam - like Pysam, but bigger.

I think that's enough for now. Anything that doesn't make sense, please ask.

No comments:

Post a Comment