A few days ago I saw this post on Hacker News, Ziplm: Gzip-Backed Language Model. It described ziplm, a simple language model built using lossless compression. The Hacker News thread led me to this paper, “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors by Jiang et al. The paper describes a text classification method based on compressors such as gzip. The authors claim that the results are competitive with non-pretrained deep learning methods on some datasets, and that it even outperforms BERT on some tests.
The most interesting feature of the paper is that it implements the method in fourteen lines of Python. Compared to the code behind a deep learning language model like BERT, that is some impressive compression.
Using a lossless compression method like gzip as a language model makes sense. Lossless compression implements a simple form of "understanding" of the text: gzip's DEFLATE algorithm looks for strings of characters that have occurred earlier in the text and replaces each repetition with a short back-reference, namely the distance to the previous occurrence and the number of characters to copy. A repeated pattern is stored once and referenced thereafter.
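You can see this pattern-capturing behavior directly. Here is a quick illustration (my own example, not from the paper): repetitive text compresses to far fewer bytes than random text of the same length.

import gzip
import random
import string

# highly repetitive text: DEFLATE back-references do most of the work
repetitive = ("the cat sat on the mat. " * 40).encode()
# random text of the same length: almost nothing to reference
random_text = "".join(
    random.choices(string.ascii_lowercase + " ", k=len(repetitive))
).encode()

print(len(repetitive), len(gzip.compress(repetitive)))    # 960 -> roughly 60 bytes
print(len(random_text), len(gzip.compress(random_text)))  # 960 -> roughly 600 bytes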
The idea behind using compression for text classification is that texts from the same class show similar patterns. Since lossless compression methods capture regular patterns in the text, texts from the same class should compress in a similar way.
More specifically, let $C(x)$ represent the compressed size of some text $x$, and let $xy$ be the concatenation of $x$ and $y$. Then $C(xy) - C(x)$ represents how many more bytes we need to encode the text $y$ given the information in $x$.
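A small demonstration of the idea (again my own example): if $y$ shares content with $x$, encoding $y$ after $x$ costs only a few extra bytes.

import gzip

def C(s: str) -> int:
    """Compressed size of s in bytes."""
    return len(gzip.compress(s.encode()))

x = "the quick brown fox jumps over the lazy dog"
y_similar = "the quick brown fox naps beside the lazy dog"
y_different = "colorless green ideas sleep furiously tonight"

print(C(x + y_similar) - C(x))    # few extra bytes: y overlaps heavily with x
print(C(x + y_different) - C(x))  # more extra bytes: little shared content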
To implement a text classifier, we use a simplified version of information distance, with $C(x)$ approximating the Kolmogorov complexity of $x$. The approximation, called the normalized compression distance (NCD), is defined as
\[ NCD(x,y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))} \]
See the Jiang paper for a derivation.
Given the NCD, the paper presents Python code implementing a simple k-nearest-neighbor classifier.
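I won't reproduce the paper's listing verbatim, but the whole method fits in a few lines. Here is a sketch in the same spirit; the function names and signatures are mine, not the paper's.

import gzip

def ncd(x: str, y: str) -> float:
    """Normalized compression distance between strings x and y."""
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(test_text: str, training: list[tuple[str, str]], k: int = 2) -> str:
    """Return the majority label among the k training texts nearest to test_text."""
    dists = sorted((ncd(test_text, text), label) for text, label in training)
    top_k = [label for _, label in dists[:k]]
    return max(set(top_k), key=top_k.count)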
Classifying Proteins
In a previous post, I used hyperdimensional computing (HDC) to classify human and yeast protein sequences. I decided to try using the NCD to classify the same set of proteins.
Before looking at the code, I want to point out a few things. The data set is split into training and test sets. The code in the paper re-compresses the training data for every comparison. I decided to compress each sequence once, when it is read in, rather than every time an NCD is calculated.
The second item to point out is that when I implemented the classifier, it performed much better classifying yeast protein sequences than the human sequences. It turned out that there were a number of very short sequences in the human collection. The compressed length of those short sequences was greater than the original sequence length; in other words, there were no repeatable patterns in the sequences for gzip to exploit. I decided to eliminate those sequences.
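This is easy to reproduce: gzip's fixed header and trailer overhead swamps any savings on very short inputs, so they "compress" to more bytes than they started with. The three-residue fragment below is a hypothetical example of mine.

import gzip

short_seq = "MKV"  # a hypothetical 3-residue fragment
print(len(short_seq))                          # 3
print(len(gzip.compress(short_seq.encode())))  # roughly 23 bytes, due to gzip overhead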
The paper implements a k-nearest-neighbor classifier with k = 2. As this blog post points out, that seems like an odd choice given the way ties are broken. In the final code, I used k = 1, i.e. choose the class with the minimum NCD.
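To see the issue: with k = 2 the two nearest labels can differ, and a majority vote of the kind used below degenerates into an arbitrary pick.

top_k = ["HUMAN", "YEAST"]
# both labels have count 1; the "winner" depends on set iteration order
print(max(set(top_k), key=top_k.count))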
import gzip

import numpy as np
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

def read_fasta(filename: str) -> list[tuple[SeqRecord, str, int]]:
    """read a collection of sequences from a FASTA file.

    Parameters
    ----------
    filename : str
        The file containing human and yeast sequences in FASTA format

    Returns
    -------
    list[tuple[SeqRecord, str, int]]
        a list of tuples, (BioPython SeqRecord, sequence species, compressed length of sequence)
    """
    seq_recs = []
    for seq in SeqIO.parse(filename, "fasta"):
        if seq.id.find("HUMAN") != -1:
            seq_type = "HUMAN"
        else:
            seq_type = "YEAST"
        comp_len = len(gzip.compress(str(seq.seq).encode()))
        # drop sequences whose compressed length exceeds their raw length
        if comp_len < len(str(seq.seq)):
            seq_recs.append((seq, seq_type, comp_len))
    return seq_recs
def predict(seq_recs: list[tuple[SeqRecord, str, int]],
            training_idx: list[int],
            test_idx: list[int],
            k: int = 2) -> list[str]:
    """Predict protein species using gzip compression

    Parameters
    ----------
    seq_recs : list[tuple[SeqRecord, str, int]]
        a list of tuples (BioPython SeqRecord, species name, compressed sequence length)
    training_idx : list[int]
        a list of integers indicating which sequences to use for training
    test_idx : list[int]
        a list of integers indicating which sequences to use for testing
    k : int, optional
        number of top comparisons to consider, by default 2

    Returns
    -------
    list[str]
        a list of species predictions

    Note
    ----
    The paper uses k = 2. In the code below, we use k = 1.
    """
    predictions = []
    # loop through the test set and compare each to the training sequences
    for ix in test_idx:
        (x, _, lx) = seq_recs[ix]
        dist = []
        for iy in training_idx:
            (y, _, ly) = seq_recs[iy]
            xy = " ".join([str(x.seq), str(y.seq)])
            lxy = len(gzip.compress(xy.encode()))
            # normalized compression distance between the two sequences
            dxy = (lxy - min(lx, ly)) / max(lx, ly)
            dist.append(dxy)
        # take the most common species among the k nearest training sequences
        idx = np.argsort(np.array(dist))
        top_k = [seq_recs[training_idx[i]][1] for i in idx[:k]]
        prediction = max(set(top_k), key=top_k.count)
        predictions.append(prediction)
    return predictions
def main():
    data_file = '/mnt/d/Documents/analytic_garden/hyperdim/data/sapiens_yeast_proteins.fasta'
    seq_recs = read_fasta(data_file)
    training_idx, test_idx = get_training_test_index(len(seq_recs))

    # make predictions and count them
    predictions = predict(seq_recs, training_idx, test_idx, k=1)
    m = np.sum(np.array([predictions[i] == seq_recs[test_idx[i]][1]
                         for i in range(len(predictions))]))
    print(m / len(predictions))
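The helper get_training_test_index isn't shown above. A minimal sketch consistent with how it's called might look like the following; the 90/10 split fraction is my assumption, not necessarily what the original code used.

def get_training_test_index(n: int, test_frac: float = 0.1) -> tuple[list[int], list[int]]:
    """Randomly split the indices 0..n-1 into training and test lists.
    The 10% test fraction is an assumption for illustration."""
    idx = np.random.permutation(n)
    n_test = int(n * test_frac)
    return list(idx[n_test:]), list(idx[:n_test])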
Does it Work?
$ time python src/gzip_classification.py
0.6391752577319587

real    0m3.451s
user    0m3.277s
sys     0m0.050s