A few days ago I saw this post on Hacker News, Ziplm: Gzip-Backed Language Model. It described ziplm, a simple language model built using lossless compression. The Hacker News thread led me to this paper, “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors by Jiang et al. The paper describes a text classification method based on compressors such as gzip. The authors claim that the results are competitive with non-pretrained deep learning methods on some datasets, and that it even outperforms BERT on some tests.
The most interesting feature of the paper is that it implements the method in fourteen lines of Python. Compared to the code behind a deep learning language model like BERT, that is some impressive compression.
Using a lossless compression method like gzip as a language model makes sense. Lossless compression implements a simple form of "understanding" of the text: gzip's DEFLATE algorithm looks for strings of characters that have occurred earlier in the text and replaces each repetition with a short back-reference, namely the distance to the previous occurrence and the number of characters to copy. A repeated pattern is stored once and referenced thereafter.
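You can see this pattern-capturing behavior directly. Here is a quick illustration (my own example, not from the paper): repetitive text compresses to far fewer bytes than random text of the same length.

import gzip
import random
import string

# highly repetitive text: DEFLATE back-references do most of the work
repetitive = ("the cat sat on the mat. " * 40).encode()
# random text of the same length: almost nothing to reference
random_text = "".join(
    random.choices(string.ascii_lowercase + " ", k=len(repetitive))
).encode()

print(len(repetitive), len(gzip.compress(repetitive)))    # 960 -> roughly 60 bytes
print(len(random_text), len(gzip.compress(random_text)))  # 960 -> roughly 600 bytes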
The idea behind using compression for text classification is that texts from the same class show similar patterns. Since lossless compression methods capture regular patterns in the text, texts from the same class should compress in a similar way.
More specifically, let $C(x)$ represent the compressed size of some text $x$, and let $xy$ be the concatenation of $x$ and $y$. Then $C(xy) - C(x)$ represents how many more bytes we need to encode the text $y$ given the information in $x$.
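A small demonstration of the idea (again my own example): if $y$ shares content with $x$, encoding $y$ after $x$ costs only a few extra bytes.

import gzip

def C(s: str) -> int:
    """Compressed size of s in bytes."""
    return len(gzip.compress(s.encode()))

x = "the quick brown fox jumps over the lazy dog"
y_similar = "the quick brown fox naps beside the lazy dog"
y_different = "colorless green ideas sleep furiously tonight"

print(C(x + y_similar) - C(x))    # few extra bytes: y overlaps heavily with x
print(C(x + y_different) - C(x))  # more extra bytes: little shared content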
To implement a text classifier, we use a simplified version of information distance, with $C(x)$ approximating the Kolmogorov complexity of $x$. The approximation, called the normalized compression distance (NCD), is defined as
\[ NCD(x,y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))} \]
See the Jiang paper for a derivation.
Given the NCD, the paper presents Python code implementing a simple k-nearest-neighbor classifier.
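I won't reproduce the paper's listing verbatim, but the whole method fits in a few lines. Here is a sketch in the same spirit; the function names and signatures are mine, not the paper's.

import gzip

def ncd(x: str, y: str) -> float:
    """Normalized compression distance between strings x and y."""
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(test_text: str, training: list[tuple[str, str]], k: int = 2) -> str:
    """Return the majority label among the k training texts nearest to test_text."""
    dists = sorted((ncd(test_text, text), label) for text, label in training)
    top_k = [label for _, label in dists[:k]]
    return max(set(top_k), key=top_k.count)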
Classifying Proteins
In a previous post, I used hyperdimensional computing (HDC) to classify human and yeast protein sequences. I decided to try using the NCD to classify the same set of proteins.
Before looking at the code, I want to point out a few things. The data set is split into training and test sets. The code in the paper re-compresses the training data for every comparison. I decided to compress each sequence once, when it is read in, rather than every time an NCD is calculated.
The second item to point out is that when I implemented the classifier, it performed much better classifying yeast protein sequences than the human sequences. It turned out that there were a number of very short sequences in the human collection. The compressed length of those short sequences was greater than the original sequence length; in other words, there were no repeatable patterns in the sequences for gzip to exploit. I decided to eliminate those sequences.
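This is easy to reproduce: gzip's fixed header and trailer overhead swamps any savings on very short inputs, so they "compress" to more bytes than they started with. The three-residue fragment below is a hypothetical example of mine.

import gzip

short_seq = "MKV"  # a hypothetical 3-residue fragment
print(len(short_seq))                          # 3
print(len(gzip.compress(short_seq.encode())))  # roughly 23 bytes, due to gzip overhead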
The paper implements a k-nearest-neighbor classifier with k = 2. As this blog post points out, that seems like an odd choice given the way ties are broken. In the final code, I used k = 1, i.e. choose the class with the minimum NCD.
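To see the issue: with k = 2 the two nearest labels can differ, and a majority vote of the kind used below degenerates into an arbitrary pick.

top_k = ["HUMAN", "YEAST"]
# both labels have count 1; the "winner" depends on set iteration order
print(max(set(top_k), key=top_k.count))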
import gzip

import numpy as np
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

def read_fasta(filename: str) -> list[tuple[SeqRecord, str, int]]:
    """read a collection of sequences from a FASTA file.

    Parameters
    ----------
    filename : str
        The file containing human and yeast sequences in FASTA format

    Returns
    -------
    list[tuple[SeqRecord, str, int]]
        a list of tuples, (BioPython SeqRecord, sequence species, compressed length of sequence)
    """
    seq_recs = []
    for seq in SeqIO.parse(filename, "fasta"):
        if seq.id.find("HUMAN") != -1:
            seq_type = "HUMAN"
        else:
            seq_type = "YEAST"
        comp_len = len(gzip.compress(str(seq.seq).encode()))
        # drop sequences whose compressed length exceeds their raw length
        if comp_len < len(str(seq.seq)):
            seq_recs.append((seq, seq_type, comp_len))
    return seq_recs
def predict(seq_recs: list[tuple[SeqRecord, str, int]],
            training_idx: list[int],
            test_idx: list[int],
            k: int = 2) -> list[str]:
    """Predict protein species using gzip compression

    Parameters
    ----------
    seq_recs : list[tuple[SeqRecord, str, int]]
        a list of tuples (BioPython SeqRecord, species name, compressed sequence length)
    training_idx : list[int]
        a list of integers indicating which sequences to use for training
    test_idx : list[int]
        a list of integers indicating which sequences to use for testing
    k : int, optional
        number of top comparisons to consider, by default 2

    Returns
    -------
    list[str]
        a list of species predictions

    Note
    ----
    The paper uses k = 2. In the code below, we use k = 1.
    """
    predictions = []
    # loop through the test set and compare each to the training sequences
    for ix in test_idx:
        (x, _, lx) = seq_recs[ix]
        dist = []
        for iy in training_idx:
            (y, _, ly) = seq_recs[iy]
            xy = " ".join([str(x.seq), str(y.seq)])
            lxy = len(gzip.compress(xy.encode()))
            # normalized compression distance between the two sequences
            dxy = (lxy - min(lx, ly)) / max(lx, ly)
            dist.append(dxy)
        # take the most common species among the k nearest training sequences
        idx = np.argsort(np.array(dist))
        top_k = [seq_recs[training_idx[i]][1] for i in idx[:k]]
        prediction = max(set(top_k), key=top_k.count)
        predictions.append(prediction)
    return predictions
def main():
    data_file = '/mnt/d/Documents/analytic_garden/hyperdim/data/sapiens_yeast_proteins.fasta'
    seq_recs = read_fasta(data_file)
    training_idx, test_idx = get_training_test_index(len(seq_recs))

    # make predictions and count them
    predictions = predict(seq_recs, training_idx, test_idx, k=1)
    m = np.sum(np.array([predictions[i] == seq_recs[test_idx[i]][1]
                         for i in range(len(predictions))]))
    print(m / len(predictions))
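The helper get_training_test_index isn't shown above. A minimal sketch consistent with how it's called might look like the following; the 90/10 split fraction is my assumption, not necessarily what the original code used.

def get_training_test_index(n: int, test_frac: float = 0.1) -> tuple[list[int], list[int]]:
    """Randomly split the indices 0..n-1 into training and test lists.
    The 10% test fraction is an assumption for illustration."""
    idx = np.random.permutation(n)
    n_test = int(n * test_frac)
    return list(idx[n_test:]), list(idx[:n_test])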
Does it Work?
$ time python src/gzip_classification.py
0.6391752577319587

real    0m3.451s
user    0m3.277s
sys     0m0.050s