Fetch!

A student in our group asked whether it was possible to open a .bam file located on a remote site and process it locally without having to download it. While it's possible to do that in general with some programming trickery, unless you have some sort of special access it's usually better to download a large file and deal with it locally.

If you have ssh access and a reasonable connection speed to the site containing the file, you can mount the remote directory as a local file system and access files as if they resided on your computer. On Linux, you can use sshfs. On a Mac, there is osxfuse, which provides sshfs support for Mac OS X. For Windows, there is win-sshfs. After connecting to a remote site with one of these programs, the remote directory appears as a local path. You pay a performance price for operating on files across the internet, but often the convenience outweighs the performance hit.
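For example, on Linux or a Mac the mount is a single command; the host name and paths here are just placeholders:

  sshfs user@remote.example.org:/data/bam_files ~/remote_bam

Once mounted, ~/remote_bam behaves like an ordinary local directory, and you unmount it with fusermount -u ~/remote_bam on Linux or umount ~/remote_bam on a Mac.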

Rather than directly accessing the file across the net, let's look at a simpler problem: finding a list of files on a site and downloading them. The process of reading a web site and extracting information with a program is called web scraping. The idea is simple: you write a program to automate the process of fetching a web page and pulling out the information that you would otherwise extract by hand. In our case, that means pulling out all of the links to .bam files on one of the modENCODE pages. Web crawlers for the big search engines like Google or Yahoo do this sort of thing constantly.

Web Scraping

Before we look at an example, there are a few things to note about scraping someone's web page. First, check the terms and conditions, if any, on the use of the page. It's someone else's data, and they may not like what you're about to do with it. Second, avoid banging away at a site. A program can hit a page much more rapidly than you can by clicking on links, and some servers can't take that load. Sites like the UCSC Genome Browser and NCBI have rules about how rapidly you can request data, and they provide alternate methods for accessing large amounts of it, for example the UCSC Table Browser. Violating a website's rules can get your IP address banned. A simple way to be polite is to pause between requests, as in the sketch below. Finally, be flexible and program for change. Websites change all the time, and code that works today may not work tomorrow because the layout of the site has changed.
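As a minimal illustration of pacing requests, here is a sketch that sleeps between fetches. The function name and the five-second pause are placeholders of mine, not rules from any particular site.

import time
import urllib2

def FetchPagesPolitely(urls, pause_seconds=5):
  """
  FetchPagesPolitely
  urls - a list of urls to fetch
  pause_seconds - how long to wait between requests (an arbitrary placeholder)
  """

  pages = []
  for u in urls:
    pages.append(urllib2.urlopen(u).read())  # fetch one page
    time.sleep(pause_seconds)                # wait before asking the server for the next one
  return pages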

Downloading Bam Files from modENCODE

As a practical case, let's look at downloading some .bam files from the modENCODE site http://intermine.modencode.org/query/experiment.do?experiment=Genomic+Assembly+of+D.+melanogaster+strains+and+cell+lines. This site contains data for genomic assemblies of D. melanogaster strains and cell lines. We're interested in the .bam files on the page. The links to download the .bam files are embedded in <a> tags, which are hyperlinks; the href attribute of each tag holds the URL that your browser follows when you click the link. If you use your browser to view the page source, you will see code like this:

  <a href="http://submit.modencode.org/submit/public/get_file/5511/extracted/BS372_samtools_sorted.bam"  
    title="Download Binary Sequence_Alignment/Map (BAM) file BS372_samtools_sorted.bam" class="value extlink"> BS372_samtools_sorted.bam </a>  
  <a href="http://submit.modencode.org/submit/public/get_file/5511/extracted/BS372_AC01WMACXX_1_2_sequence.txt.gz"  
    title="Download FASTQ file BS372_AC01WMACXX_1_2_sequence.txt.gz" class="value extlink"> BS372_AC01WMACXX_1_2_sequence.txt.gz </a>  

We want to extract the target of each href and, if it points to a .bam file, download it.

Python provides a number of useful libraries for client-side handling of web pages. To fetch pages there are urllib and urllib2. One of the better libraries for parsing web pages is BeautifulSoup, which provides methods for parsing and navigating HTML and XML documents. If you plan on doing a lot of web scraping, it's worth your while to examine the BeautifulSoup documentation. We will use urllib2 to fetch the modENCODE page and BeautifulSoup to extract the links to bam files.
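To get a feel for what BeautifulSoup returns, here is a small sketch of mine (not part of the original program) that parses a trimmed copy of the first <a> tag shown above and prints the target of its href:

from bs4 import BeautifulSoup   # BeautifulSoup 4; with BeautifulSoup 3 the import is: from BeautifulSoup import BeautifulSoup

html = '<a href="http://submit.modencode.org/submit/public/get_file/5511/extracted/BS372_samtools_sorted.bam" class="value extlink"> BS372_samtools_sorted.bam </a>'

soup = BeautifulSoup(html)
for tag in soup('a'):      # soup('a') returns every <a> tag in the document
  print tag['href']        # prints the .bam url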

The code is relatively simple. We use urllib2 to open the URL and read the page, and BeautifulSoup parses it and pulls out the <a> tags. We then loop through the tags and collect the links that point to bam files.

import re
import urllib          # used by FetchBamFile below to download the files
import urllib2
from bs4 import BeautifulSoup   # BeautifulSoup 4; with BeautifulSoup 3: from BeautifulSoup import BeautifulSoup

def GetBamFileList(url):
  """
  GetBamFileList
  url - a url to the page containing links to bam files

  returns
  bam_links - a list of urls to the bam files
  """

  bam_links = []

  c = urllib2.urlopen(url)        # open the url
  soup = BeautifulSoup(c.read())  # read and parse the page
  links = soup('a')               # get all of the <a> tags
  for l in links:
    if re.search(r'\.bam', str(l)):   # keep only the links to bam files
      bam_links.append(l['href'])

  return bam_links

Once we have a list of links, we can use urllib to download a bam file. We use the file name at the end of the URL to name the local file and then grab the file from the website.

def FetchBamFile(bam_url, out_path):
  """
  FetchBamFile
  bam_url - the url of the bam file to be downloaded
  out_path - the directory where the file should go
  """

  if out_path[-1] != '/':        # make sure the output path ends with a separator
    out_path += '/'

  m = re.search(r'(BS\d+_.+?\.bam$)', bam_url)  # extract the file name from the url
  if m is None:                  # skip urls that don't look like the expected bam files
    return
  name = m.group(1)
  out_file = out_path + name

  urllib.urlretrieve(bam_url, out_file)  # fetch the file and write it locally
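
Putting the two functions together, a short driver script downloads everything in one pass. The output directory and the pause between downloads are placeholders of mine, not part of the original program:

import os
import time

experiment_url = 'http://intermine.modencode.org/query/experiment.do?experiment=Genomic+Assembly+of+D.+melanogaster+strains+and+cell+lines'
out_dir = './bam_files'                 # placeholder output directory

if not os.path.exists(out_dir):
  os.makedirs(out_dir)                  # urlretrieve won't create the directory for us

for link in GetBamFileList(experiment_url):
  FetchBamFile(link, out_dir)
  time.sleep(5)                         # pause between downloads so we don't hammer the server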
 
One thing to note is that we had to know what we were looking for on the website in order to get the proper links. That usually requires some digging into how the site is designed.

Download the full program.
