Swimming Upstream
For a recent project I needed the sequence regions upstream of (that is, preceding the 5' end of) each gene in a set of orthologous genes. The orthologs for a gene of interest can be obtained from https://www.ncbi.nlm.nih.gov/gene. For example, searching for JAK2 orthologs at that site yields a table of JAK2 genes for a large number of species. After selecting species, the ortholog table can be downloaded.

Fetching Genomes

Since I wanted to analyze a number of different genes, I decided to automate the process of getting the upstream regions. The first step was to fetch the GenBank records for the genomes of the selected species. The GenBank IDs for each species are included in the downloaded ortholog table.

Fetching genomes is straightforward, if a bit slow. The script uses Pandas to read the ortholog table from NCBI and Biopython's Bio.Entrez module to download the complete GenBank record for each genome.

def main():
    args = GetArgs()
    genome_path = args.genome_path
    ortholog_tab...