Boltz-1 Democratizing Biomolecular Modeling

Boltz-1 is pretty cool. It is an open-source deep learning model for predicting biomolecular structures from their sequences. According to the developers, Boltz-1 achieves AlphaFold3-level accuracy. They have released training and inference code, model weights, datasets, and benchmarks under the open-source MIT license. They're democratizing biomolecular modeling. You can read the introductory paper here and a press article about it here.

I just downloaded Boltz-1 two days ago, so this will not be an in-depth look at it. Maybe that will come later. Right now, I just want to try it out.

Downloading and installing Boltz-1 was easy; clone the GitHub repo and you are ready to go.

I used the reference H5N1 HA amino acid sequence that I used for this post. I extracted the sequence from the GenBank file.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
aa_from_gb.py - extract the amino acid sequence from a GenBank file
author: Bill Thompson
license: GPL 3
copyright: 2025-01-09
"""
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

def main():
    gb_file = "/home/bill/boltz/boltz/H5N1/data/HA_reference.gb"
    aa_file = "/home/bill/boltz/boltz/H5N1/data/HA_reference.fasta"

    with open(gb_file, "r") as handle:
        for gb_record in SeqIO.parse(handle, "genbank"):
            # Each 'gb_record' in this loop is a full GenBank record
            for feature in gb_record.features:
                # We look for the 'CDS' feature, which usually contains the protein translation
                if feature.type == "CDS" and "translation" in feature.qualifiers:
                    protein_seq = feature.qualifiers["translation"][0]
                    protein_id = "A|protein"     # use Boltz format

                    # make a record for the amino acid sequence
                    # (named aa_record so it does not shadow the GenBank record)
                    aa_record = SeqRecord(Seq(protein_seq), id=protein_id, description="")
                    # note: each write replaces aa_file, so only the last CDS is kept
                    SeqIO.write(aa_record, aa_file, "fasta")
                        
if __name__ == "__main__":
    main()

Boltz-1 requires a multiple sequence alignment (MSA) in a3m format for proteins. If you don't have one, you can ask Boltz-1 to generate the MSA using the MMseqs2 server by passing the --use_msa_server flag.
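
If you already have an a3m file, my understanding from the Boltz README is that the FASTA header can carry its path as a third field: >CHAIN_ID|ENTITY_TYPE|MSA_PATH. Here is a minimal sketch of building such a header. The function names and the list of entity types are my own assumptions, so check them against the Boltz documentation before relying on them.

```python
from pathlib import Path
from typing import Optional

# Entity types as I understand them from the Boltz README (an assumption).
ENTITY_TYPES = {"protein", "dna", "rna", "smiles", "ccd"}


def boltz_header(chain_id: str, entity_type: str, msa_path: Optional[str] = None) -> str:
    """Build a FASTA header of the form >CHAIN_ID|ENTITY_TYPE[|MSA_PATH]."""
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity_type}")
    fields = [chain_id, entity_type]
    if msa_path is not None:
        fields.append(msa_path)  # optional path to a precomputed a3m file
    return ">" + "|".join(fields)


def write_boltz_fasta(path: str, chain_id: str, sequence: str,
                      msa_path: Optional[str] = None, width: int = 60) -> None:
    """Write a one-protein FASTA file with a Boltz-style header."""
    lines = [boltz_header(chain_id, "protein", msa_path)]
    # wrap the sequence at a fixed width, as most FASTA writers do
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    Path(path).write_text("\n".join(lines) + "\n")
```

With no MSA path, boltz_header("A", "protein") gives the same ">A|protein" header that the extraction script writes.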

For simple examples, running Boltz-1 is dead easy.

$ time boltz predict H5N1/data/HA_reference.fasta --use_msa_server  --num_workers 12 --output_format pdb
Checking input data.
Running predictions for 1 structure
Processing input data.
  0%|                                                                               | 0/1 [00:00<?, ?it/s]Generating MSA for H5N1/data/HA_reference.fasta with 1 protein entities.
COMPLETE: 100%|████████████████████████████████████████████████| 150/150 [elapsed: 00:02 remaining: 00:00]
100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.48s/it]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/home/bill/miniforge3/envs/boltz/lib/python3.13/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Predicting DataLoader 0: 100%|████████████████████████████████████████████| 1/1 [3:28:45<00:00,  0.00it/s]Number of failed examples: 0
Predicting DataLoader 0: 100%|████████████████████████████████████████████| 1/1 [3:28:45<00:00,  0.00it/s]
[2]+  Done                    emacs examples/prot.fasta &> /dev/null

real    211m4.420s
user    83m39.415s
sys     112m0.400s

Processing the sequence was a bit slow on my not-super-fast desktop box (about three and a half hours of wall-clock time), but the timing was acceptable given what it was doing. Boltz-1 will output its prediction in either mmCIF or PDB format. I used the PDB format so I could use MATLAB's molviewer function to view the final prediction.


Not bad.
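
If you don't have a viewer handy, a quick sanity check is to count the residues in the predicted PDB file and compare that with the length of the input sequence. This is only a sketch using the fixed-column layout of PDB ATOM records. The function name is mine, and you pass in the path to whatever PDB file Boltz-1 wrote on your machine.

```python
import sys


def residues_per_chain(pdb_text: str) -> dict:
    """Count distinct residues per chain from the ATOM records of a PDB file."""
    seen = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # skip HETATM, TER, END and header lines
        chain_id = line[21]      # column 22: chain identifier
        res_key = line[22:27]    # columns 23-27: residue number + insertion code
        seen.setdefault(chain_id, set()).add(res_key)
    return {chain: len(keys) for chain, keys in seen.items()}


if __name__ == "__main__" and len(sys.argv) > 1:
    # usage: python count_residues.py <path to predicted .pdb file>
    with open(sys.argv[1]) as handle:
        for chain, count in residues_per_chain(handle.read()).items():
            print(f"chain {chain}: {count} residues")
```

For a single-chain prediction like this one, the count for chain A should match the number of amino acids in the FASTA file.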

I hope open-source models like Boltz-1 are the future of LLMs. Too much public data is locked up in proprietary models. Models like this put a new tool in the hands of researchers who are not at multibillion-dollar corporations.
