Showing posts from February, 2014

How to Count

Counting is hard. It seems like it shouldn't be. After all, we all learned to count as part of our first experiences in school, or maybe even before. In computational biology, we often have to count to determine basic properties of a collection of sequences or to estimate probabilities and expected values. The problem we frequently run into is determining what to count and how to go about counting it.

Sequencing Errors

One of the questions we have been interested in answering is: what are the basic error rates in the base calls coming from the genome center's Illumina sequencing machine? A brief overview of the sequencing and alignment processes can be downloaded here. If you watch https://www.youtube.com/watch?v=l99aKKHcxC4 you can see how reads are built one nucleotide at a time. The base calling process determines the nucleotide type by image processing of emitted fluorescence. For each nucleotide, intensity values are calculated for each type: A, C, G, a...
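To make the "what to count" question concrete, here is a minimal sketch of one possible counting scheme. It is not the procedure used in the post. It assumes the reads have already been aligned to a reference, that each alignment is gap-free (no insertions or deletions), and that a mismatch between the reference base and the called base counts as an error. The function names `count_base_pairs` and `error_rate` are hypothetical.

```python
from collections import Counter


def count_base_pairs(aligned_pairs):
    """Count (reference_base, called_base) pairs over gap-free alignments.

    aligned_pairs is an iterable of (reference, read) strings of equal length.
    """
    counts = Counter()
    for ref, read in aligned_pairs:
        if len(ref) != len(read):
            raise ValueError("this sketch assumes gap-free alignments of equal length")
        for ref_base, called_base in zip(ref, read):
            counts[(ref_base, called_base)] += 1
    return counts


def error_rate(counts):
    """Fraction of counted positions where the called base differs from the reference."""
    total = sum(counts.values())
    errors = sum(n for (ref_base, called), n in counts.items() if ref_base != called)
    return errors / total if total else 0.0


# Toy data: three reads of four bases each, with one mismatch (G called as C).
pairs = [("ACGT", "ACGT"), ("ACGT", "ACCT"), ("GGCA", "GGCA")]
counts = count_base_pairs(pairs)
rate = error_rate(counts)
```

Keeping the full table of (reference, called) pairs rather than a single error total leaves room to ask finer questions later, such as whether some substitutions are more common than others.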

Accessing oscar

This probably should have been the first post in this series, but better late than never. For Linux and Mac users, accessing oscar is relatively simple: open a terminal window and type: ssh -Y username@ssh.ccv.brown.edu where username is your user name. Enter your password and you're ready to go. For Windows users, life is tougher. Windows doesn't have a built-in ssh client. ssh is a network protocol for accessing remote systems. An ssh client is the program that runs on your computer and manages the communication with a remote host. There are a number of ssh clients for Windows. A few are listed here. CCV recommends PuTTY as the ssh client for Windows. You can download PuTTY from here. Besides PuTTY, you should download PSCP. Here's how to get started with PuTTY. I assume you downloaded it and put it in some directory on your Windows machine. Double click it to start it. In the Host Name box, type: ssh.ccv.brown.edu Make sure Port is set to 22 and SSH butto...

Running multiple jobs on oscar

In running jobs on oscar, we saw how to create a batch file to run a job on oscar. However, sometimes you face a situation where you have to perform an analysis on a number of data sets stored in multiple files. If you only have 3 or 4 data files, the obvious answer is to copy and edit your slurm file. If you have 3 or 400 files to run the procedure on, this approach is more than a bit tedious. The Linux Slurm system provides job arrays as a mechanism for dealing with this situation. A job array is a collection of jobs that all run the same program, but with different values of a parameter. The parameter is a range or list of integers. This page describes job arrays and shows a simple example. Unfortunately, job arrays are limited to simple integer parameters. It's possible to work around this limitation by programming around it in your app. However, I prefer to write a program (actually, steal from some previously written programs) that creates a series of batch files directly. As a frien...
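The idea of generating a series of batch files directly can be sketched as follows. This is an illustration under stated assumptions, not the author's own script: the analysis command `./analyze`, the default time and memory requests, and the function names `make_batch_script` and `write_batch_files` are all hypothetical placeholders. The `#SBATCH --job-name`, `--time`, and `--mem` directives are standard Slurm options.

```python
from pathlib import Path

# Template for one Slurm batch file; the placeholders are filled in per data file.
TEMPLATE = """#!/bin/bash
#SBATCH --job-name={job_name}
#SBATCH --time={time_limit}
#SBATCH --mem={memory}

# Hypothetical analysis command; replace with your own program.
./analyze {data_file}
"""


def make_batch_script(data_file, time_limit="1:00:00", memory="4G"):
    """Return the text of a batch script for a single data file."""
    job_name = Path(data_file).stem
    return TEMPLATE.format(
        job_name=job_name,
        time_limit=time_limit,
        memory=memory,
        data_file=data_file,
    )


def write_batch_files(data_files, out_dir):
    """Write one batch script per data file and return the script paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    script_paths = []
    for data_file in data_files:
        script_path = out_dir / (Path(data_file).stem + ".sh")
        script_path.write_text(make_batch_script(data_file))
        script_paths.append(script_path)
    return script_paths
```

Because the file name is substituted as a string, this approach is not restricted to integer parameters the way job arrays are. Each generated script would then be submitted separately with sbatch.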

Running jobs on oscar

This may be old news to you. If so, just ignore this message. CCV doesn't want you to run long-running or computationally intensive jobs on the login nodes on oscar. They expect you to submit jobs to one of the queues. Your job will then be executed when the resources you have requested (compute nodes, the amount of memory, etc.) become available. Some of the jobs we run are both compute and memory intensive, and long-running. In addition, many of the kinds of tasks we need to perform can be described as embarrassingly parallel. For example, we often have to perform the same operations on each chromosome or gene in a genome. If you have 10 genes you want to operate on, the speedup of running all 10 in parallel over running them one after the other is obvious. To launch a job on the oscar cluster, you use the sbatch program. The basics of running jobs on oscar are described here, https://ccv.brown.edu/doc/running-jobs.html . I just want to point out a few things tha...

Getting Started with flies

I'll write up the procedure that I used to get the context specific error rates for phiX. In the meantime, here are some of the tools that we have been using. Folks might want to check them out before getting started on any analysis. If you have questions, no matter how trivial they may seem, about any of this, please drop me an e-mail at william_thompson_1@brown.edu. The fly genome is in Lauren's directory on oscar. If you don't have an account for oscar, you can get one by filling out the form linked from this page: https://www.ccv.brown.edu/start/account . You want to indicate that you are working with Prof. Lawrence and me and are part of CCMB when you fill out the form. This will give you access to the CCMB condo and more disk storage. The oscar user manual is at https://ccv.brown.edu/doc/getting-started.html . Usually Lauren or I receive files on oscar containing the reads. The reads come from the genome center's Illumina sequencing machine. This...

PROLOG from the Bottom Up

This is the first of two articles that describe a very basic implementation of PROLOG. Our focus in the first article is not how to program in PROLOG but how PROLOG operates and what features distinguish it from more conventional programming languages. In the next column we will discuss the programming structures that are used to implement these features. This will not only provide you with a valuable look inside the language but also demonstrate some interesting Pascal programming techniques. Click here for a PDF version: PROLOG from the Bottom Up - Part 1