In a previous post, I wrote about hyperdimensional computing (HDC) and showed some simple code illustrating the use of high-dimensional vectors (HDVs). There is a large technical literature on HDC, but it's spread across several disciplines and goes by several names, including Hyperdimensional Computing, Vector Symbolic Architectures (VSA), Holographic Reduced Representations (HRR), and others. The models differ in implementation details, but all use some form of HDVs. A two-part survey, part 1 and part 2, provides a comprehensive description of the field and the various approaches.
Despite the number of academic articles, I was only able to find two working software packages implementing HDC: VSA for R, which implements the HRR model, and torchhd for Python.
VSA
VSA implements the HRR model. I was able to install it on R 4.2.3, but on the latest version of R, 4.3.0, it fails to install due to an update in the BLAS libraries.
In HRR, hypervectors are real-valued rather than binary or bipolar: their components are i.i.d. samples from a normal distribution with mean 0 and variance 1/dim. The binding operation is implemented by circular convolution of the two vectors rather than by componentwise multiplication.
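To make this concrete, here's a minimal numpy sketch of HRR-style binding and unbinding, independent of either package. The involution helper is my own illustration of HRR's standard approximate inverse, which I assume is what vsa's '!' operator computes:

import numpy as np

d = 10_000
rng = np.random.default_rng(0)

# HRR hypervectors: i.i.d. normal components, mean 0, variance 1/d
x = rng.normal(0, 1 / np.sqrt(d), d)
y = rng.normal(0, 1 / np.sqrt(d), d)

# binding by circular convolution, computed efficiently via the FFT
bound = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(bound, x), cos(bound, y))   # both near 0: the bound vector resembles neither input

# approximate inverse (involution): x*[i] = x[-i mod d]
def involution(v):
    return np.concatenate(([v[0]], v[1:][::-1]))

# unbinding: convolving the inverse of x with the bound vector recovers a noisy copy of y
recovered = np.real(np.fft.ifft(np.fft.fft(involution(x)) * np.fft.fft(bound)))
print(cos(recovered, y))   # well above 0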
VSA provides functions for creating and printing VSA vectors; binding, unbinding, superposition, and scaling functions; functions for computing the similarity of two VSA vectors; and functions for creating VSA memories and finding closest matches.
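The memory functions implement what the HDC literature calls a cleanup memory: given a noisy vector, find the closest stored item. Here's a minimal numpy sketch of the idea (not the vsa package's actual API):

import numpy as np

rng = np.random.default_rng(0)
d = 10_000

# an item memory: named random hypervectors
memory = {name: rng.normal(0, 1 / np.sqrt(d), d)
          for name in ["usa", "mex", "usd", "peso"]}

def cleanup(memory: dict[str, np.ndarray], query: np.ndarray) -> str:
    """Return the name of the stored hypervector most similar to the query."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(memory, key=lambda name: cos(memory[name], query))

noisy = memory["peso"] + 0.5 * rng.normal(0, 1 / np.sqrt(d), d)   # corrupt a stored vector
print(cleanup(memory, noisy))   # 'peso'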
The documentation for VSA is a bit spare; you have to read between the lines, and I couldn't find many examples of code using the library online.
VSA overloads '+' to perform bundling of two hypervectors, '*' to perform binding (circular convolution), and '!' to take the approximate inverse used for unbinding.
Let's use VSA to see what the dollar of Mexico is:
> library(vsa)
Loading required package: abind
>
> usa <- newVec()    # newVec creates a new random HRR hypervector
> mex <- newVec()
> usd <- newVec()
> peso <- newVec()
> country <- newVec()
> currency <- newVec()
>
> usa_rec <- country * usa + currency * usd
> mex_rec <- country * mex + currency * peso
>
> # correspondence between usa and mex
> V <- (! usa_rec) * mex_rec    # ! takes the inverse of a vector: cosine(!x, x) ~ 0
>
> # usd * (! usa_rec) * mex_rec ~ peso
> dollar_role_us <- usd * V    # dollar's role in the US
>
> cosine(dollar_role_us, peso)    # peso is the dollar of Mexico
simval: 0.3355377
>
> cosine(dollar_role_us, usd)
simval: -0.005725137
>

A cosine of 0.34 may look modest, but unrelated hypervectors have cosine near 0 (as the usd comparison shows), so peso clearly stands out as the answer.
Torchhd
torchhd is a Python package for hyperdimensional computing built on PyTorch. Its default hypervectors are MAP (Multiply-Add-Permute) tensors: bipolar vectors for which binding is componentwise multiplication. MAP binding is its own inverse, so, unlike HRR, no explicit inverse operator is needed for unbinding. Let's repeat the dollar-of-Mexico example:
import torch, torchhd

d = 10_000   # size of hypervectors

usa, mex = torchhd.random(2, d)    # United States and Mexico
usd, peso = torchhd.random(2, d)   # United States and Mexico currency

# label variables
country, currency = torchhd.random(2, d)

usa_rec = torchhd.bundle(torchhd.bind(country, usa), torchhd.bind(currency, usd))
mex_rec = torchhd.bundle(torchhd.bind(country, mex), torchhd.bind(currency, peso))

dollar_role_us = torchhd.bind(usd, usa_rec)

# query
dollar_of_mexico = torchhd.bind(dollar_role_us, mex_rec)

# Peso is the dollar of Mexico
torchhd.cosine_similarity(dollar_of_mexico, peso)
Out[12]: MAPTensor(0.4987)

# dollar is not the dollar of Mexico
torchhd.cosine_similarity(dollar_of_mexico, usd)
Out[13]: MAPTensor(0.0086)
The next example uses torchhd to classify protein sequences as human or yeast. An Encoder class turns each sequence into a single hypervector by bundling the hypervectors of its overlapping 3-mers:

from Bio.SeqRecord import SeqRecord
import torchhd

class Encoder():
    """A class to encode sequences as torchhd hypervectors"""

    _amino_acids = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M',
                    'N', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y']

    def __init__(self, dim: int = 10_000):
        self._dim = dim
        self._trimers = self._trimer_hdv()   # initialize all possible trimers

    def encode(self, seq: SeqRecord) -> torchhd.tensors.map.MAPTensor:
        """Encode a BioPython SeqRecord as a torchhd tensor hypervector

        The sequence is broken into overlapping 3-mers. Each 3-mer is
        encoded as a hypervector. The hypervectors are bundled and quantized.

        Parameters
        ----------
        seq : SeqRecord
            a BioPython sequence record

        Returns
        -------
        torchhd.tensors.map.MAPTensor
            a quantized hypervector
        """
        hdv = torchhd.empty(len(seq) - 2, self._dim)   # a hypervector for each 3-mer
        for pos in range(len(seq) - 2):
            hdv[pos, :] = self._trimers[seq.seq[pos:(pos + 3)]]
        return torchhd.hard_quantize(torchhd.multiset(hdv))

    def _trimer_hdv(self) -> dict[str, torchhd.tensors.map.MAPTensor]:
        """Generate hdvs of all 21*21*21 possible amino acid trimers
        (the 20 standard amino acids plus selenocysteine, 'U')

        Returns
        -------
        dict[str, torchhd.tensors.map.MAPTensor]
            a dictionary, key: amino acid trimer, value: random hdv encoding the trimer
        """
        trimer_hdvs = dict()
        for aa1 in self._amino_acids:
            for aa2 in self._amino_acids:
                for aa3 in self._amino_acids:
                    trimer_hdvs[aa1 + aa2 + aa3] = torchhd.random(1, self._dim)
        return trimer_hdvs
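A hypothetical usage sketch (the peptide is made up; assumes BioPython and the Encoder above):

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

encoder = Encoder(dim = 10_000)
rec = SeqRecord(Seq("MKLVFSACDE"))   # a made-up 10-residue peptide
hv = encoder.encode(rec)             # its 8 overlapping 3-mers bundled into one hypervector
print(hv.shape)                      # torch.Size([10000])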
seq_recs, seq_species = read_fasta(data_file)
training_idx, test_idx = get_training_test_index(len(seq_recs))   # split the data
encoder = Encoder(dim = dim)

# build a prototype hdv for each species by bundling like sequences
yeast_prototype = torchhd.empty(1, dim)
human_prototype = torchhd.empty(1, dim)
for i in training_idx:
    hdv = encoder.encode(seq_recs[i])
    if seq_species[i] == "HUMAN":
        human_prototype = torchhd.bundle(human_prototype, hdv)
    else:
        yeast_prototype = torchhd.bundle(yeast_prototype, hdv)

# shrink the dimensions to match encoding and quantize
yeast_prototype = torch.squeeze(torchhd.hard_quantize(yeast_prototype))
human_prototype = torch.squeeze(torchhd.hard_quantize(human_prototype))
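The read_fasta, get_training_test_index, and prediction helpers (and the data_file and dim variables) come from elsewhere in the script and aren't shown here. Presumably prediction picks the prototype closest to the encoded sequence; here is a minimal sketch under that assumption (the label strings are guesses):

def prediction(human_prototype, yeast_prototype, hdv):
    # classify hdv by its nearest prototype under cosine similarity
    if torchhd.cosine_similarity(hdv, human_prototype) > \
            torchhd.cosine_similarity(hdv, yeast_prototype):
        return "HUMAN"
    return "YEAST"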
# test predictions and print accuracy
predictions = []
for i in test_idx:
    hdv = encoder.encode(seq_recs[i])
    predictions.append(prediction(human_prototype, yeast_prototype, hdv))

print(np.mean(np.array(predictions, dtype = str) ==
              np.array(seq_species, dtype = str)[test_idx]))
Running the full script prints the test-set accuracy, 87.5%, and finishes in under 20 seconds:

$ time python src/protein_torchhd.py
0.875

real    0m17.619s
user    1m56.212s
sys     0m5.462s