Boltz-1 Democratizing Biomolecular Modeling

 Boltz-1 is pretty cool. Boltz-1 is an open-source deep learning model for predicting biomolecular structures based on their sequences. According to the developers, Boltz-1 achieves AlphaFold3 level accuracy. They have released training and inference code, model weights, datasets, and benchmarks under the MIT open license. They're democratizing biomolecular modeling. You can read the introductory paper here and a press article about it here.

I just downloaded Boltz-1 two days ago, so this will not be an in depth look into Boltz-1. Maybe that will come later. Right now, I just wanted to try it out.

Downloading and installing Boltz-1 was easy; clone the GitHub repo and you are ready to go.

I used the reference H5N1 HA amino acid sequence that I used for this post. I extracted the sequence from the GenBank file for the Influenza A virus (A/cattle/Texas/24-008749-003/2024(H5N1) PP755589.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
aa_from_gb.py - extract the amino acid sequence from a GenBank file
author: Bill Thompson
license: GPL 3
copyright: 2025-01-09
"""
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

def main():
    gb_file = "/home/bill/boltz/boltz/H5N1/data/HA_reference.gb"
    aa_file = "/home/bill/boltz/boltz/H5N1/data/HA_reference.fasta"

    with open(gb_file, "r") as handle:
        for record in SeqIO.parse(handle, "genbank"):
            # Each 'record' in this loop is a full GenBank record
            for feature in record.features:
                # We look for the 'CDS' feature, which usually contains the protein translation
                if feature.type == "CDS":
                    # Check if the 'translation' qualifier is present
                    if "translation" in feature.qualifiers:
                        protein_seq = feature.qualifiers["translation"][0]    
                        protein_id = "A|protein"     # use Boltz format
                        
                        # make a record for the amino acid sequence
                        record = SeqRecord(Seq(protein_seq), id = protein_id, description = '')
                        SeqIO.write(record, aa_file, 'fasta')
                        
if __name__ == "__main__":
    main()

Boltz-1 requires a multiple sequence alignment (MSA) in a3m format for proteins. If you don't have one, you can ask Boltz-1 to generate the MSA using the mmseqs2 server.

For simple examples running Boltz-1 is dead easy. 

$ time boltz predict H5N1/data/HA_reference.fasta --use_msa_server  --num_workers 12 --output_format pdb
Checking input data.
Running predictions for 1 structure
Processing input data.
  0%|                                                                               | 0/1 [00:00<?, ?it/s]Generating MSA for H5N1/data/HA_reference.fasta with 1 protein entities.
COMPLETE: 100%|████████████████████████████████████████████████| 150/150 [elapsed: 00:02 remaining: 00:00]
100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.48s/it]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/home/bill/miniforge3/envs/boltz/lib/python3.13/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Predicting DataLoader 0: 100%|████████████████████████████████████████████| 1/1 [3:28:45<00:00,  0.00it/s]Number of failed examples: 0
Predicting DataLoader 0: 100%|████████████████████████████████████████████| 1/1 [3:28:45<00:00,  0.00it/s]
[2]+  Done                    emacs examples/prot.fasta &> /dev/null

real    211m4.420s
user    83m39.415s
sys     112m0.400s

Processing the sequences was a bit slow on my not-super-fast desktop box, but the timing was acceptable given what it was doing. Boltz-1 will output its prediction in either mmcif or pdb format. I used the pdb format so I could use MATLAB's molviewer function to view the final prediction.


Not bad.

Here is another view.




I hope open-source models like Botz-1 are the future of LLMs. Too much public data is locked up in proprietary models. Models like this put a new tool in the hands of researchers who are not at multibillion dollar corporations. 

H5N1 Part 2


 https://en.wikipedia.org/wiki/Hemagglutinin_(influenza)#/media/File:Flu_und_legende_color_c.jpg

In a previous post, I looked at the countries posting H5N1 HA sequences to GISAID and the host species that the samples were drawn from. This time I want to look at the variation in the data to identify possible mutations.

It has been reported that a glutamine to leucine mutation at residue 226 of the HA amino acid sequence increased specificity of host recognition from avian to human. 

In order to get a sense of the current mutation landscape as reported to GISAID, I aligned 3,432 sequences from the time period January 1 2024 to December 12, 2024 using MAFFT. The sequences were aligned to a reference sequence, PP755589 HA sequence from GenBank. See also this GitHub page.

time mafft --6merpair --maxambiguous 0.05 --preservecase --thread -1 --addfragments data/gisaid_epiflu_sequence_HA_2024_12_12.fasta data/HA_reference.fasta > data/gisaid_HA_2024_12_12_aln.fasta

To process the sequences, I used BioPython to read the aligned sequences  into a dictionary.

def read_alignment_file(filename: str) -> tuple[dict[str, SeqRecord.SeqRecord], Align.MultipleSeqAlignment]:
    """ Read an MSA FASTA file

    Parameters
    ----------
    filename : str
        MSA file

    Returns
    -------
    tuple[dict[str, SeqRecord.SeqRecord], Align.MultipleSeqAlignment]
        A tuple:
            a dictionary with record ID as key and SeqRecords as value
            a BioAlign Multiple Sequence Alignment
    """
    alignment = AlignIO.read(open(filename), "fasta")

    align_dict = dict()
    for record in alignment:
        align_dict[record.id] = record
            return (align_dict, alignment)

Next, the aligned positions were mapped to the reference sequence positions.

def ref_pos_to_alignment(align: dict[str, SeqRecord.SeqRecord], ref_id: str) -> dict[int, int]:
    """Map aligned positions to reference sequence

    Parameters
    ----------
    align : Dictionary returned by read_alignment_file
        A BioPython MSA
    ref_id : str
        ID of reference seq

    Returns
    -------
    dict[int, int]
        A dictionary
            key: alignment position
            value: position in reference sequence, or -1 id a gap position
    """
    ref_seq = align[ref_id].seq

    pos_map = dict()
    ref_pos = 0    # 0 based
    for pos in range(len(ref_seq)):
        if ref_seq[pos] in ['A', 'C', 'G', 'T']:
            pos_map[pos] = ref_pos
            ref_pos += 1
        else:
            pos_map[pos] = -1

    return pos_map

We then select the aligned columns containing variation.

def get_varying_columns(align: Align.MultipleSeqAlignment,
                        consensus_cutoff: float = 1.0,
                        start: int = 0, end: int = -1) -> dict[int, VaryingCol]:
    """Iterate across columns and find columns in alignment that vary more than consensus_cutoff

    Parameters
    ----------
    align : Align.MultipleSeqAlignment
        A BioPython MSA
    consensus_cutoff : float, optional
        Cutoff for consensus. Only columns with consensus freq less that this value are accepted, by default 1.0
    start : int, optional
        Start searching at this columns, by default 0
    end : int, optional
        last column to search. If -1, end is last column. , by default -1

    Returns
    -------
    dict[int, VaryingCol]
        A dictionary
            key: column position
            value: a VaryingCol object
    """
    if end == -1:
        end = align.get_alignment_length()
        
    variant_cols = dict()
    for col in range(start, end):
        c = Counter(align[:, col])
        common = c.most_common(1)
        if common[0][0] in ['A', 'C', 'G', 'T']:
            denom = sum([c[k] for k in ['A', 'C', 'G', 'T', '-']])
            freq = common[0][1] / denom
            if freq < consensus_cutoff:
                # variant_cols[col] = (freq, tuple(list(align[:, col])))
                variant_cols[col] = VaryingCol(freq, tuple(list(align[:, col])))

    return variant_cols

    return pos_map

The function returns a VaryingCol object in a dictionary.

@dataclass(frozen=True)
class VaryingCol:
    """
    A dataclass for holding info about an aligned column with variation
    """
    pct:    float
    data:   tuple   # tuple so that it is immutable

Once we have the columns, we can count the mutations and retrieve the changed codons.


def count_mutations(variant_cols: dict[int, tuple[float, list[float, list[str]]]],
                    pos_map: dict[int, int],
                    align_length: int) -> dict[int, MutationCounts]:
    """Count the common nucleotides and alternates

    Parameters
    ----------
    variant_cols : dict[int, tuple[float, list[float, list[str]]]]
        A dictionary returned by et_varying_column
    pos_map : dict[int, int]
        A dictionary returned by ref_pos_to_alignment
    align_length : int
        The number of rows in the alignment

    Returns
    -------
    dict[str, dict[str, int | float | str]]
        A dictionary
            key: reference position
            value: a MutationCounts object
    """
    mutations = dict()
    for pos, rec in variant_cols.items():
        ref_pos = pos_map[pos]
        nucleotides = rec.data
        c = Counter(nucleotides)
        ref_nuc = c.most_common(2)[0][0]
        ref_freq = c.most_common(2)[0][1] / align_length
        alt_nuc = c.most_common(2)[1][0]
        alt_freq = c.most_common(2)[1][1] / align_length
        mutations[ref_pos] = MutationCounts(align_pos = pos, ref_pos = ref_pos,
                                            ref_nuc = ref_nuc, ref_freq = ref_freq, 
                                            alt_nuc = alt_nuc, alt_freq = alt_freq)
        
    return mutations

def get_codons(gb_ref: SeqRecord, mutations:  dict[int, MutationCounts]) -> None:
    """get codons and alternate codons for mutated positions

    Parameters
    ----------
    gb_ref : SeqRecord
        Genbank record of reference sequnec
    mutations : dict[int, MutationCounts]
        a dictionary of MutationCounts for mutatated columns returned by count_mutations
        key: reference seq column
        value: a MutationCounts object
        
    Note: mutations is altered by this function
    """
    seq = list(gb_ref.seq)
    for pos in range(len(seq)):
        if pos in mutations.keys():
            mutations[pos].codon_pos = int(pos / 3) * 3
            
            mutations[pos].codon = "".join(seq[mutations[pos].codon_pos:mutations[pos].codon_pos+3])
            mutations[pos].aa = str(Seq(mutations[pos].codon).translate())
            mutations[pos].aa_name = seq3(mutations[pos].aa)
            
            pos_in_codon = pos % 3
            s = list(mutations[pos].codon)
            s[pos_in_codon] = mutations[pos].alt_nuc
            mutations[pos].alt_codon = "".join(s)
            mutations[pos].alt_aa = str(Seq(mutations[pos].alt_codon).translate())
            mutations[pos].alt_aa_name = seq3(mutations[pos].alt_aa)

The MutationCounts class is defined as:


@dataclass
class MutationCounts:
    """
    A data class for holding mutation data
    """
    align_pos:      int     # positions are 0 indexed
    ref_pos:        int
    ref_nuc:        str
    ref_freq:       float
    alt_nuc:        str
    alt_freq:       float
    codon_pos:      int = -1 
    codon:          str = ""
    aa:             str = ""
    aa_name:        str = ""
    alt_codon:      str = ""
    alt_aa:         str = ""
    alt_aa_name:    str = ""
    
    def to_list(self) -> list:
        """convert self to a list

        Returns
        -------
        list
            A list of elements from self
        """
        return [self.ref_pos + 1, self.align_pos + 1,
                self.ref_freq, self.alt_freq,
                self.ref_nuc, self.codon_pos + 1,
                self.aa, self.aa_name, self.codon,
                self.alt_nuc, self.alt_aa, self.alt_aa_name, self.alt_codon,
                self.aa + str(self.codon_pos + 1) + self.alt_aa,
                'syn' if self.aa == self.alt_aa else 'non_syn']

The rest of the code just writes mutated columns as a CSV file.  The first 10 columns are shown.

reference_positionalignment_posref_freqalt_freqref_nucleotidecodon_posref_aaaa_nameref_codonalt_nucleotidealt_aaalt_aa_namealt_codonmutationsynonomous
4370.86636771300448430.11689088191330343G4EGluGAGAKLysAAGE4Knon_syn
18510.77399103139013450.22002989536621823A16LLeuCTATLLeuCTTL16Lsyn
931260.78385650224215240.21614349775784752A91QGlnCAAGQGlnCAGQ91Qsyn
1712060.74618834080717490.22511210762331837A169LLeuCTACLLeuCTCL169Lsyn
1742100.75665171898355750.24275037369207772C172CCysTGCTCCysTGTC172Csyn
1772130.74917787742899860.2508221225710015C175DAspGACTDAspGATD175Dsyn
1782140.83826606875934230.16053811659192826C178LLeuCTATLLeuTTAL178Lsyn
1952310.75156950672645740.2436472346786248A193PProCCATPProCCTP193Psyn
2102460.75605381165919280.24394618834080717C208DAspGACTDAspGATD208Dsyn
2132490.74319880418535120.25650224215246636C211CCysTGCTCCysTGTC211Csyn

The good news, at least so fa, is that the Gln to Leu in at position 226 in amino acid sequence isn't found with a significance in the data.

You can get the full source code on GitHub.


H5N1

The H5N1 strain of avian influenza (bird flu) is a significant health concern. Although human infections of H5N1 are relative rare, the effect on bird populations has been devastating in some areas. In addition to birds, H5N1 infections have been found in domestic cattle, cats, goats, and other mammals.  So far, Human cases appear to come from contact with infected animals. Although there are cases of undetermined origin, effective human-to-human transmission hasn't been determined.

H5N1 belongs to the A subtype of the influenza virus. It's a negative sense RNA virus.

I wanted to get a picture of H5N1 genomics. In particular, I am interested in what is happening in the HA protein. The viral genome consists of eight proteins: PB1, PB2, PA, HA, NP, NA, M, and NS. See  H5N1 genetic structure for more details and references.

I'm particularly interested in changes in the hemagglutinin (HA) protein. HA is found on the surface of viral envelope. It is responsible for binding the viron to the host cell surface, so it's a key element in infection. In particular, a single change of a Gln to Leu in the HA protein at position 226 in amino acid sequence switches HA's binding specificity to bind to human cell receptors. See https://www.science.org/doi/10.1126/science.adt0180.


https://en.wikipedia.org/wiki/Hemagglutinin_(influenza)#/media/File:PDB_1hgd_EBI.jpg

On December 12 2024, I downloaded 3,432 HA genomic sequences covering the time period from 2024-01-01 to 2024-12-12 from GISAID's EpiFlu database. EpiFlu doesn't provide a metadate file for the sequences, but you can specify much of the same information to be included in the FASTA header.  For example, 

>A/environment/India/240005/2024|EPI_ISL_19521827|A_/_H5N1|2.3.4.4b|2024-06-15|HA||EPI3625882

The elements are separated by "|" marks. 

A/environment/India/240005/2024 indicates the A subtype, the host or source of the sample, the country of origin, sequence number, and year.
EPI_ISL_19521827 is the GISAID ID.
A_/_H5N1 indicates influenza subtype and virus type.
2.3.4.4b is the virus clade.
2024-06-15 is the collection date.
HA is the protein type.
 EPI3625882 is the accession number.

For a first step, I reformatted the headers into a table so I could see where the data came from and which host the sample was drawn from. One problem is that submissions were not formatted uniformly. Let's examine hosts and country of origin to see the problem. For example, sometimes country is entered as USA and sometimes by state. Hosts are reported in a variety of ways for the same species. For example, CHICKEN and HEN seem to be the same species, or maybe not. I may have missed a few items in converting the host and country names.

Here's the R code for reading the FASTA headers and generating a dataframe.


#' fasta2dataframe - read a GISAID EpiFlu fasta file to a dataframe
#'
#' @param fasta_file - a string containg that fasta file name
#' @param remove_ref - boolean. Should reference sequence be removed from dataframe
#' @param ref_row - the row number of teh reference sequence, default = 1. Ignored if remove_ref = FALSE
#' @param min_seq_length - sequences with few elements are removed
#'
#' @returns - a dataframe with 16 columns
#' "Influenza_type"   "Host"             "Country"          "ID"               "Year"             "Isolate_name"     "Isolate_ID"      
#' "Type"             "Clade"            "Collection_date"  "Segment"          "Lineage"          "DNA_Accession_no" "seq.name"        
#' "seq.text"         "seq_length"
#'
fasta2dataframe <- function(fasta_file, 
                            remove_ref = FALSE, 
                            ref_row = 1,
                            min_seq_length = 1700) {
  require(phylotools)  # for reading the FASTA file
  require(tidyverse)
  
  # a list of US states for converting atates to USA
  states <- c("Alabama", "Alaska", "Arizona",  "Arkansas", "California", "Colorado",       
              "Connecticut", "Delaware", "Florida","Georgia", "Hawaii",
              "Idaho","Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
              "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",      
              "Mississippi", "Missouri", "Montana",  "Nebraska", "Nevada",  
              "New_Hampshire", "New_Jersey", "New_Mexico", "New_York", "North_Carolina", "North_Dakota",
              "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode_Island", 
              "South_Carolina", "South_Dakota", "Tennessee", "Texas", "Utah", 
              "Vermont", "Virginia", "Washington", "West,Virginia", "Wisconsin", "Wyoming")
  
  df <- read.fasta(fasta_file)
  if(remove_ref) {
    df <- df %>% slice(-ref_row)
  }
  
  # split sequence  IDs and get rid of any sequence that doesn't have full information
  df <- df %>% separate_wider_delim(seq.name, 
                                    delim = "|", 
                                    names = c("Isolate_name", "Isolate_ID", "Type", "Clade", 
                                              "Collection_date", "Segment", "Lineage", "DNA_Accession_no"),
                                    too_many = "debug", too_few = "debug")
  df <- df %>% filter(`seq.name_ok`)
  
  df <- df %>% 
    separate_wider_delim(`Isolate_name`, 
                         delim = "/", 
                         names = c("Influenza_type", "Host", "Country", "ID", "Year"), 
                         too_many = "debug", too_few = "debug")
  df <- df %>% filter(`Isolate_name_ok`)
  
  # map states to USA and map the German entries to Germany. Foramt a few others.
  df <- df %>% mutate(Country = if_else(Country %in% states, "USA", Country))
  df <- df %>% mutate(Country = if_else(grepl("Germany", Country), "Germany", Country))
  df <- df %>% mutate(Country = if_else(grepl("Nordrhein-Westfalen", Country), "Germany", Country))
  df <- df %>% mutate(Country = if_else(Country %in% c("Sao_Paulo", "Rio_de_Janeiro"), "Brazil", Country))
  df <- df %>% mutate(Country = if_else(Country %in% c("Kagoshima", "Ishikawa", "Iwate", "Hokkaido", "Hiroshima"), "Japan", Country))
  
  # Combine common hosts
  df <- df %>% mutate(Host = str_to_upper(Host))
  df <- df %>% mutate(Host = if_else(grepl("DUCK", Host), "DUCK", Host))
  df <- df %>% mutate(Host = if_else(grepl("MALLARD", Host), "DUCK", Host))
  df <- df %>% mutate(Host = if_else(grepl("HEN", Host), "CHICKEN", Host))
  df <- df %>% mutate(Host = if_else(grepl("CAT", Host), "CAT", Host))
  df <- df %>% mutate(Host = if_else(grepl("FELINE", Host), "CAT", Host))
  df <- df %>% mutate(Host = if_else(grepl("GOOSE", Host), "GOOSE", Host))
  df <- df %>% mutate(Host = if_else(grepl("CROW", Host), "CROW", Host))
  df <- df %>% mutate(Host = if_else(grepl("FOX", Host), "FOX", Host))
  
  # filter sequence columns
  df <- df %>% filter(str_length(seq.text) >= min_seq_length) 
  df <- df %>% mutate(seq.text = str_to_upper(seq.text))
  df <- df %>% mutate(seq_length = str_length(seq.text))
  
  # Since we removed the rows with missing values, remove the debug columns
  df <- df %>% select(-c("Isolate_name_ok", "Isolate_name_pieces", "Isolate_name_remainder"))
  df <- df %>% select(-c("seq.name_ok", "seq.name_pieces", "seq.name_remainder"))
  
  return(df)
}

There's a lot of detail there, but the transformations are pretty simple. The results are a dataframe with 16 columns. Here are the first 10 rows. I left out the column with the sequences.


Influenza_typeHostCountryIDYearIsolate_nameIsolate_IDTypeCladeCollection_dateSegmentLineageDNA_Accession_noseq.name
AENVIRONMENTIndia2400052024A/environment/India/240005/2024EPI_ISL_19521827A_/_H5N12.3.4.4b6/15/2024HA EPI3625882A/environment/India/240005/2024|EPI_ISL_19521827|A_/_H5N1|2.3.4.4b|2024-06-15|HA||EPI3625882
ACHICKENIndia2400082024A/chicken/India/240008/2024EPI_ISL_19521829A_/_H5N12.3.4.4b6/15/2024HA EPI3625898A/chicken/India/240008/2024|EPI_ISL_19521829|A_/_H5N1|2.3.4.4b|2024-06-15|HA||EPI3625898
ACROWIndia2400132024A/crow/India/240013/2024EPI_ISL_19521828A_/_H5N12.3.4.4b6/15/2024HA EPI3625890A/crow/India/240013/2024|EPI_ISL_19521828|A_/_H5N1|2.3.4.4b|2024-06-15|HA||EPI3625890
AMUTE_SWANMangystau022024A/Mute_Swan/Mangystau/02/2024EPI_ISL_19259709A_/_H5N12.3.4.4b1/9/2024HA EPI3433322A/Mute_Swan/Mangystau/02/2024|EPI_ISL_19259709|A_/_H5N1|2.3.4.4b|2024-01-09|HA||EPI3433322
ACROWJapanB0802024A/large-billed_crow/Hokkaido/B080/2024EPI_ISL_18932042A_/_H5N12.3.4.4b1/22/2024HA EPI3067350A/large-billed_crow/Hokkaido/B080/2024|EPI_ISL_18932042|A_/_H5N1|2.3.4.4b|2024-01-22|HA||EPI3067350
ADAIRY_COWUSA24-008766-0012024A/dairy_cow/Kansas/24-008766-001/2024EPI_ISL_19145035A_/_H5N12.3.4.4b3/21/2024HA EPI3297748A/dairy_cow/Kansas/24-008766-001/2024|EPI_ISL_19145035|A_/_H5N1|2.3.4.4b|2024-03-21|HA||EPI3297748
AGOOSEGermany2024AI012912024A/Barnacle_Goose/Germany-SH/2024AI01291/2024EPI_ISL_19014078A_/_H5N12.3.4.4b2/9/2024HA EPI3158594A/Barnacle_Goose/Germany-SH/2024AI01291/2024|EPI_ISL_19014078|A_/_H5N1|2.3.4.4b|2024-02-09|HA||EPI3158594
ACHICKENGermany2024AI008602024A/chicken/Germany-MV/2024AI00860/2024EPI_ISL_19014079A_/_H5N12.3.4.4b2/2/2024HA EPI3158602A/chicken/Germany-MV/2024AI00860/2024|EPI_ISL_19014079|A_/_H5N1|2.3.4.4b|2024-02-02|HA||EPI3158602
ADAIRY_COWUSA24-010303-0022024A/dairy_cow/Michigan/24-010303-002/2024EPI_ISL_19194255A_/_H5N12.3.4.4b4/3/2024HA EPI3364348A/dairy_cow/Michigan/24-010303-002/2024|EPI_ISL_19194255|A_/_H5N1|2.3.4.4b|2024-04-03|HA||EPI3364348
ACHICKENItaly24VIR9865-72024A/laying_hen/Italy/24VIR9865-7/2024EPI_ISL_19554808A_/_H5N12.3.4.4b11/9/2024HA EPI3651272A/laying_hen/Italy/24VIR9865-7/2024|EPI_ISL_19554808|A_/_H5N1|2.3.4.4b|2024-11-09|HA||EPI3651272

We can generate a list of the top 10 countries submitting sequences,


> df_H5N1 %>% group_by(Country) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head(n = 10)
# A tibble: 10 × 2
   Country            n
   <chr>          <int>
 1 USA             2627
 2 Czech_Republic   231
 3 Japan             64
 4 Poland            62
 5 BC                45
 6 Italy             43
 7 Germany           38
 8 Austria           26
 9 Bangladesh        26
10 Vietnam           18
> 

We can also get a list of the most common hosts in the data.

> df_H5N1 %>% group_by(Host) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head(n = 10)
# A tibble: 10 × 2
   Host           n
   <chr>      <int>
 1 DAIRY_COW   1928
 2 CHICKEN      539
 3 TURKEY       167
 4 DUCK         158
 5 GOOSE         81
 6 CROW          62
 7 CAT           54
 8 MUTE_SWAN     54
 9 GOAT          15
10 SANDERLING    12
> 

Next post, I'll look at the common mutations in the data set.

You can get the source code for creating the table on GitHub.

Hallucinations

Created by Dall-e

https://chatgpt.com/g/g-2fkFE8rbu-dall-e/c/6772fc7f-2754-800d-ab66-c2271a4fbde9

I always struggle a bit with I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.

We direct their dreams with prompts. The prompts start the dream, and based on the LLM's hazy recollection of its training documents, most of the time the result goes someplace useful.

It's only when the dreams go into deemed factually incorrect territory that we label it a "hallucination". It looks like a bug, but it's just the LLM doing what it always does.

Andrej Karpathy

I hate the term "hallucination" when applied to Large Language Models (LLMs). Hallucination implies that there is some sort of entity that is mis-perceiving reality, like a teenager in a bad 60s LSD movie. In one sense LLMs can't hallucinate, there is no entity to mis-perceive anything. When an LLM provides information at odds with reality, for example a reference to an article that doesn't exist, it's not misperceiving reality, it's simply doing what it does. LLMs build strings of words that make some sort of sense to humans. Those word strings may or may not conform to our human perception of reality

LLMs don't have a model of reality. They have a model of language as trained from the internet. In whatever sense internet language reflects reality, that's as close to real world knowledge as an LLM will get.

In order to make LLMs more accurate, they need to be connected to other modules that more closely model reality. For example, to do arithmetic without errors, the LLM needs a calculator agent.



2024

 2024 was a difficult year for me. Bev Thompson died on May 17. Bev was my life partner, soulmate, wife, guidance, and guardian angel.



I'm going to try and write more in 2025.

SARS-CoV-2 JN.1

COVID-19 is still with us and it's underlying virus, SARS-CoV-2, is still evolving. The latest variant of concern is JN.1. As of February 2, 2024, the CDC estimates that JN.1 is the viral source of about 93% of all US COVID cases.

JN.1 evolved from BA.2.86. BA.2.86 has a large number changes in the spike protein when compared to the original Wuhan sequence. The mutations gave BA.2.86 improved ability to evade the human immune system.

In order to compare JN.1 changes to to BA.2.86, on January 22 I downloaded metadata for 16,460,375  sequences from GISAID. I also downloaded the full collection of FASTA SARS-CoV-2 sequences. 

The analysis that follows is similar to this post. GISAID FASTA headers changed so I had to update some of code. I also wanted to add argument types to the Python code. 

I used R to extract the BA.2.86 and JN.1 sequences from the metadata file.

filter_chunk <- function(pango_lineage) {
    function(df, pos) {
      df <- suppressMessages(as_tibble(df, .name_repair = 'universal'))
      df <- df %>% 
        filter(Pango.lineage == pango_lineage) %>%
        mutate(Collection.date = 
                 as.Date(Collection.date, format = '%Y-%m-%d')) %>%
        mutate(Submission.date = 
                 as.Date(Submission.date, format = '%Y-%m-%d'))
    }
}  


system.time(df_BA_2_86 <- read_tsv_chunked('data/metadata.tsv.gz',
                                           DataFrameCallback$new(filter_chunk(pango_lineage = 'BA.2.86')),
					   chunk_size = 1000000))

system.time(df_JN_1 <- read_tsv_chunked('data/metadata.tsv.gz',
                                        DataFrameCallback$new(filter_chunk(pango_lineage = 'JN.1')),
					chunk_size = 1000000))

write_tsv(df_BA_2_86, 'data/BA_2_86.tsv')
write_tsv(df_JN_1, 'data/JN_1.tsv')

I plotted the variant counts.

system.time(df_variants_world <- get_selected_variants_ch('data/metadata.tsv.gz', selected_variants = c('BA.2.86', 'JN.1'), start_date = as.Date('2023-01-01'), country = NULL, title = 'World'))
system.time(df_variants_usa <- get_selected_variants_ch('data/metadata.tsv.gz', selected_variants = c('BA.2.86', 'JN.1'), start_date = as.Date('2023-01-01'), title = 'USA'))




As you can see, BA.2.86 didn't get much traction. JN.1 quickly took over.

Some of the FASTA sequences were shorter than full length and some had a large number of Ns. The Ns represent area where sequence identity wasn't determined.

ggplot(df_BA_2_86, aes(x = N.Content)) + geom_histogram(color = 'black', fill = 'white') + ggtitle('BA_2_86')
ggplot(df_JN_1, aes(x = N.Content)) + geom_histogram(color = 'black', fill = 'white') + ggtitle('JN_1')



I selected the BA.2.86 and JN.1 sequences from the FASTA file.

time python select_pango_sequences.py -i data/sequences.fasta.gz -m /data/BA_2_86.tsv -o /data/BA_2_86.fasta
time python select_pango_sequences.py -i data/sequences.fasta.gz -m data/JN_1.tsv -o data/JN_1.fasta

Next, the short sequences and sequences with a larger fraction of Ns were removed.

time python gisaid_remove_seqs.py -i data/BA_2_86.fasta -m /data/BA_2_86.tsv -o data/BA_2_86_processed.fasta -n 0.05 -l 28000
time python gisaid_remove_seqs.py -i data/JN_1.fasta -m data/JN_1.tsv -o data/JN_1_processed.fasta -n 0.05 -l 28000

Each set of sequences was aligned with MAFTT using the original Wuhan sequence as a reference..

time mafft --auto --preservecase --thread -1 --addfragments data/BA_2_86_processed.fasta data/EPI_ISL_402124.fasta > data/BA_2_86_aln.fasta
time mafft --auto --preservecase --thread -1 --addfragments data/JN_1_processed.fasta data/EPI_ISL_402124.fasta > data/JN_1_aln.fasta

Once the sequences are aligned, I counted the mutations the positions that differed from the Wuhan sequence.

time python gisaid_mutations.py -i data/BA_2_86_aln.fasta -o output/BA_2_86_mutations.csv -g data/NC_045512.2.gb
time python gisaid_mutations.py -i data/JN_1_aln.fasta -o output/JN_1_mutations.csv -g data/NC_045512.2.gb

The full mutation tables can be found on GitHub.

Finally, we can read the mutations in R and see the differences.

> BA_2_86_mutations <- read_csv('output/BA_2_86_mutations.csv')
> JN_1_mutations <- read_csv('output/JN_1_mutations.csv')
> JN_1_mutations %>% filter(! (mutation %in% BA_2_86_mutations$mutation)) %>% select(mutation, product)

  mutation product             
  <chr>    <chr>               
1 NA       NA                  
2 G282G    nsp3                
3 K1155R   nsp3                
4 R252K    nsp6                
5 L44L     nsp9                
6 C285C    3'-to-5' exonuclease
7 L455S    surface glycoprotein
8 F19L     ORF7b  

There is only one difference in the spike protein (surface glycoprotein) but it's enough to increase the transmissibility of JN.1 over BA.2.86.

The code for this analysis is similar to the code for this post, but has been updated to include argument types and the updated GISAID FASTA headers. The code can be found on GitHub.


Thanks wsltty folks!


I use this blog to post small programming projects that implement some aspect of a subject that I'm interested in. This post is different.

The computer that I use for development uses Windows 11. For most of my programming work, I use WSL2. Last week, Windows update KB5032190 caused havoc with several apps that I use regularly: VSCode, wsltty, Task Scheduler, and Windows itself.

I use VSCode as my IDE. It operates with WSL via an exposed port. VSCode is actually running in Windows, but connects to WSL so it can run Linux compilers, apps, etc. After the update, it refused to connect to WSL. I fixed this problem the way many Windows app problems are solved; I deleted VSCode and reinstalled. After that it worked. I have no idea why it failed initially.

I run a backup program every night at 1:00 am. It is launched by the Windows Task Scheduler. The actual program is ancient. It was written in perl about ten years ago. It has been operating flawlessly on different systems for that time. Last week, Task Scheduler began to randomly skip launching the task. No errors or any events are reported, other apps were launched, but the backup app was simply ignored. I submitted feedback to Microsoft on this issue. I haven't received a reply yet.

After the update, icons mysteriously disappeared from my taskbar. Clicking on the blank space where the icon should be will bring up the window. This has supposedly been fixed in the Windows Insider channel. I'm hesitant to join the Insider program because I worry that I'll be facing broken apps on a regular basis.

wsltty is my go-to terminal for Linux. Its roots are in mintty, a terminal emulator for Cygwin. It's a simple terminal window that behaves like a Linux terminal. When wsltty began to fail, I found several viable workarounds, but I still preferred wsltty. I submitted an issue on the wsltty GitHub page. It quickly became apparent that many other suffered the same problem. Some people dove into the problem and located where it arose. Within a week, the good folks who support wsltty provided an updated version. This is the sort of thing that makes open source so great!

THANK YOU WSLTTY FOLKS!!!