The Analytic Garden: How to Count

I described a method for counting mismatches previously. This time, I'll count something different in a slightly different manner, but the principles will be the same. In this case, we're looking for mismatches to the official genome in sequence reads, but we want to know how the mismatch rate varies along the read. The position along the read is called the cycle, because it indicates the machine cycle that sequenced the particular base.

The basic method is similar to the one used in the previous post, but this time we'll use perl instead of python. Perl receives a lot of bad press these days, having gone from the being the Swiss-army chainsaw of the web to having its web processing thunder stolen by Php and its general programming ubiquity being usurped by python. Still, I like perl and use it frequently. Perhaps, this is because I have been using it for so long that I know many of its idiosyncrasies and can either workaround them or use them to my advantage. Despite the bad reputation perl has among python programmers, it's far from dead.

I have what might be considered slightly unorthodox style of perl programming. I make a great deal of use of the Method::Signatures module. This module allows you to declare the type and characteristics of each parameter being passed to a subroutine. It's extremely flexible, and allows for named parameters, aliases, type and value constraints on parameters, and required or optional parameters. I use it as way to document my programs by indicating what type of parameters are expected to be passed to a subroutine and it helps eliminate subtle errors like passing the wrong kind of ref variable.

Counting Mismatches

To do the actual heavy lifting of the counting routine, we'll make use of the Bio::DB::Sam module. This module is part of the extensive BioPerl library. Like PySam, this module is a wrapper for samtools. The program is, I think, relatively straight forward. You give it the path to a.bam file generated by BWA or Bowtie2 and the name of where you want the output to go. The program opens the bam file and uses the pileup routine to call a counting subroutine. Like the PySam version, Bio::DB::Sam's pileup routine uses a callback subroutine. In perl, we pass pileup a CodeRef (pointer to some code) to the subroutine that pileup should call for each position in reference genome.

The callback routine is passed three parameters: a string indicating the reference sequence, a position within the reference, and an ArrayRef (a pointer to an array) to an array of Bio::DB::Bam::PileupWrapper objects. These objects contain the position within the read and information about indels, and a Bio::DB::Bam::AlignWrapper object that describes the read and the alignment of the read to the reference, among other things.

Here's the counting routine

func CountMismatches(Str $seqid, Int $pos,
                       ArrayRef $pileup) {
    my $refbase = $bam->segment($seqid, $pos, $pos)->dna();
    foreach my $p (@{$pileup}) {
      my $aln = $p->alignment();
      next if(! ReadOk($aln, \%bad_reads));

      next if ($p->indel() or $p->is_refskip()
               or $p->is_del());

      my $qbase = substr($aln->qseq(), $p->qpos(), 1);
      next if ($qbase =~ /[nN]/);

      if ($refbase ne $qbase) {
         $mismatches[$p->pos()]++;
      }
      $totals[$p->pos()]++;
    }
}

func is a keyword supplied by Method::Signatures. It indicates that the subroutine will use Method::Signature parameter types. The routine is declare as receiving 3 paramters: a string, an integer, and a reference to an array. We could actually make the ArrayRef more specifice by declaring it as an ArrayRef[ Bio::DB::Bam::PileupWrapper], a reference to an array of Bio::DB::Bam::PileupWrappers.

We extract the base from the reference, and then loop through the pileups. For each pileup, we extract the alignment and check to see if the read fits our criteria. As in the python counting example, we use a hash, perl equivalent of a python dictionary, to keep track of the identities of bad reads. We pass the base checking routine a reference to a hash of read names that we want to ignore. We also ignore any indels, extract the base from the read sequence, compare it to the reference base, count it if it is a mismatch, and add one to the totals.

Here's how we determine if the read passes our filter:

func ReadOk(Bio::DB::Bam::AlignWrapper $align,
            HashRef $bad_reads) {
if ($bad_reads->{$align->query()->name()}) {
    return;
}

my @qscores = $align->qscore();
if (any {$_ < $MIN_BASE_QUALITY} @qscores) {
    $bad_reads->{$align->query()->name()} = 1;
    return;
}

my $r1 = Bio::Range->new(-start => $align->start(),
           -end    => $align->end(),
                           -strand => 1);

foreach my $r (@SNPS) {
    if ($r1->contains($r)) {
      $bad_reads->{$align->query()->name()} = 1;
      return;
    }
}

return 1;
}
We pass this routine the alignment object that describes the read and a hash containing any already flagged reads. It checks to see if we have flagged it before. If not, we extract the base call qualities and check to see if any are less than our cutoff of 30. Finally, as we did in the python example, we check to see if the read overlaps any of the end positions or the SNP position. If so we flag it. In the python example, we used sets of positions and looked for intersection. We could done that here, but instead, we expressed the special positions as Bio::Range objects and look for overlap with the read.

One problem that you might notice is that $bam, %ref_base, @mismatches, and @totals are not declared in the counting routine. This is because we don't have control over which parameters the pileup callback will be be passed. As in the python example, we declare the counting routine within another subroutine that contains declarations for these three variables. This violates my principle of keeping variable score within the lowest level, but it's a necessity here.

Perl lexical variables, variables declared with my won't retain their values between subroutine calls to a nested subroutine. There are a number of ways to work around this limitation. The simplest, available in recent versions of perl, is to declare the shared variables in the enclosing routine as state variables. State variables are similar to static variables in C/C++. They are initialized once and maintain their values between subroutine calls.

Here's part of the the enclosing routine:

func Main(Str $bam_file, Str $out_file) {
my $genome_file = '/users/thompson/data/phiX2/from_illumina/data/phiX.illumina.fa';

state %bad_reads;

state @mismatches;
@mismatches = (0) x ($READ_LENGTH+1);

state @totals;
@totals = (0) x ($READ_LENGTH+1);

state $bam = Bio::DB::Sam->new(-bam => $bam_file,
                                 -fasta => $genome_file);
my ($seq_id) = $bam->seq_ids();
$bam->pileup($seq_id, \&CountMismatches);

The routine is passed the bam and output file names. The genome file is declared within the routine. It really should have been declared a constant, because that's what it is. Notice the variables shared with the counting routine are declared as state variables and the counts are initialized to zero. We open the bam file and call pileup to feed the reads to the counting routine. One pileup is finished, the only thing left is to write the counts and totals to a CSV file for later processing.

Here's the full deal:   download

package main;
use FileHandle;
use strict;
use warnings FATAL => 'all';
use autodie;
use Method::Signatures;
use feature qw(say);
use English;
use Bio::DB::Sam;
use Bio::Range;
use Readonly;
use List::MoreUtils qw(any);

Readonly::Scalar our $MIN_BASE_QUALITY => 30;
Readonly::Scalar our $READ_LENGTH => 50;

Readonly::Array our @SNPS =>
                  (Bio::Range->new(-start => 1, -end => 1, -strand => 1),
                   Bio::Range->new(-start => 2, -end => 2, -strand => 1),
                   Bio::Range->new(-start => 3, -end => 3, -strand => 1),
                   Bio::Range->new(-start => 4, -end => 4, -strand => 1),
                   Bio::Range->new(-start => 1301, -end => 1301, -strand => 1),
                   Bio::Range->new(-start => 5382, -end => 5382, -strand => 1),
                   Bio::Range->new(-start => 5383, -end => 5383, -strand => 1),
                   Bio::Range->new(-start => 5384, -end => 5384, -strand => 1),
                   Bio::Range->new(-start => 5385, -end => 5385, -strand => 1),
                   Bio::Range->new(-start => 5386, -end => 5386, -strand => 1));

if (@ARGV < 2) {
say("usage: $PROGRAM_NAME bam_file out_file");
exit(1);
}

Main($ARGV[0], $ARGV[1]);

func Main(Str $bam_file, Str $out_file) {
my $genome_file = '/users/thompson/data/phiX2/from_illumina/data/phiX.illumina.fa';

state %bad_reads;

state @mismatches;
@mismatches = (0) x ($READ_LENGTH+1);

state @totals;
@totals = (0) x ($READ_LENGTH+1);

state $bam = Bio::DB::Sam->new(-bam => $bam_file,
                -fasta => $genome_file);
my ($seq_id) = $bam->seq_ids();
$bam->pileup($seq_id, \&CountMismatches);

open(my $out, '>', $out_file);
$out->say(join(',', 'cycle', 'mismatches', 'total'));
foreach my $i (1..$READ_LENGTH) {
    $out->say(join(',', $i, $mismatches[$i], $totals[$i]));
}
$out->close();

func CountMismatches(Str $seqid, Int $pos,
             ArrayRef $pileup) {
    my $refbase = $bam->segment($seqid, $pos, $pos)->dna();
    foreach my $p (@{$pileup}) {
      my $aln = $p->alignment();
      next if(! ReadOk($aln, \%bad_reads));

      next if ($p->indel() or $p->is_refskip() or $p->is_del());

      my $qbase = substr($aln->qseq(), $p->qpos(), 1);
      next if ($qbase =~ /[nN]/);

      if ($refbase ne $qbase) {
    $mismatches[$p->pos()]++;
      }
      $totals[$p->pos()]++;
    }
}
}

func ReadOk(Bio::DB::Bam::AlignWrapper $align, HashRef $bad_reads) {
if ($bad_reads->{$align->query()->name()}) {
    return;
}

my @qscores = $align->qscore();
if (any {$_ < $MIN_BASE_QUALITY} @qscores) {
    $bad_reads->{$align->query()->name()} = 1;
    return;
}

my $r1 = Bio::Range->new(-start => $align->start(),
               -end    => $align->end(),
               -strand => 1);

foreach my $r (@SNPS) {
    if ($r1->contains($r)) {
      $bad_reads->{$align->query()->name()} = 1;
      return;
    }
}

return 1;
}
A couple of details in the way some of the constants were declared. The Readonly module allows you to declare named constants of a specific type. The list of positions we wish to ignore is declared as an array of Bio::Range objects. I don't mind global constants. It's global variables that make me nervous.

The Analytic Garden

How to Count - Part Deux

Counting Mismatches

No comments:

Post a Comment

Labels

Contributors

wfmu