A Couple of Hyperdimensional Computing Libraries

In a previous post, I wrote about hyperdimensional computing (HDC) and showed some simple code illustrating the use of high-dimensional vectors (HDVs). There is a large technical literature on HDC, but it's spread over several disciplines and under several names, including Hyperdimensional Computing, Vector Symbolic Architectures (VSA), Holographic Reduced Representations (HRR), and others. The models differ in implementation details, but all use some form of HDVs. A two-part survey (part 1 and part 2) provides a comprehensive description of the field and the various approaches.

Despite the number of academic articles, I was only able to find two working software packages implementing HDC: VSA for R, which implements the HRR model, and torchhd for Python.

VSA

I was able to install VSA under R 4.2.3, but with the latest version of R, 4.3.0, the installation fails due to an update in the BLAS libraries.

In HRR, hypervectors are real-valued rather than binary or bipolar: their components are drawn iid from a normal distribution with mean 0 and variance 1/dim. Binding is implemented by circular convolution of the two vectors rather than componentwise multiplication.
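
To make binding concrete, here is a minimal NumPy sketch of HRR-style binding and unbinding (my illustration, not code from the VSA package); circular convolution is computed with the FFT, and the approximate inverse is the vector's involution:

import numpy as np

dim = 10_000
rng = np.random.default_rng(42)

# HRR hypervectors: components iid normal, mean 0, variance 1/dim
x = rng.normal(0.0, 1.0 / np.sqrt(dim), dim)
y = rng.normal(0.0, 1.0 / np.sqrt(dim), dim)

def bind(a, b):
    # binding: circular convolution, computed via the FFT
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def involution(a):
    # approximate inverse in HRR: a*[j] = a[-j mod dim]
    return np.roll(a[::-1], 1)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

bound = bind(x, y)
y_hat = bind(involution(x), bound)  # unbinding recovers a noisy copy of y
print(cosine(y_hat, y))  # well above chance, typically around 0.7
print(cosine(y_hat, x))  # near 0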

VSA provides functions for creating and printing VSA vectors; binding, unbinding, superposition, and scaling operations; functions for computing the similarity of two VSA vectors; and functions for creating VSA memories and finding the closest matches.

The documentation for VSA is a bit sparse; you have to read between the lines, and I couldn't find many examples of code using the VSA library online.

VSA overloads '+' to perform bundling (superposition) of two hypervectors and '*' to perform binding (circular convolution).

Let's use VSA to see what the dollar of Mexico is:

> library(vsa)
Loading required package: abind
> 
> usa <- newVec()   # newVec creates a new random HRR hypervector
> mex <- newVec()
> usd <- newVec()
> peso <- newVec()
> country <- newVec()
> currency <- newVec()
> 
> usa_rec <- country * usa + currency * usd
> mex_rec <- country * mex + currency * peso
> 
> # correspondence between usa and mex
> V <- (! usa_rec) * mex_rec  # ! gives the approximate inverse (involution): x * !x ~ identity
>
> # usd * (! usa_rec) * mex_rec ~ peso
> dollar_role_us <- usd * V   # dollar's role in US 
>
> cosine(dollar_role_us, peso) # peso is the dollar of mexico
simval: 0.3355377
>
> cosine(dollar_role_us, usd)
simval: -0.005725137
>

Torchhd

Torchhd is a Python library for hyperdimensional computing. Since it is built on top of PyTorch, you get PyTorch tensor operations as well as features specific to hypervectors. The GitHub site has several examples of the library's integration with PyTorch.
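
As a quick illustration (assuming a standard torchhd install), the hypervectors torchhd creates are ordinary torch tensors:

import torch, torchhd

a, b = torchhd.random(2, 10_000)    # two random MAP hypervectors
print(isinstance(a, torch.Tensor))  # True: any torch operation applies to them
print(torchhd.bind(a, b).shape)     # torch.Size([10000])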

Let's revisit the question of "what is the dollar of Mexico?"


import torch, torchhd

d = 10_000 # size of hypervectors

usa, mex = torchhd.random(2, d)  # United States and Mexico

usd, peso = torchhd.random(2, d)  # United States and Mexico currency

# label variables
country, currency = torchhd.random(2, d)

usa_rec = torchhd.bundle(torchhd.bind(country, usa), torchhd.bind(currency, usd))

mex_rec = torchhd.bundle(torchhd.bind(country, mex), torchhd.bind(currency, peso))

dollar_role_us = torchhd.bind(usd, usa_rec)

# query
dollar_of_mexico = torchhd.bind(dollar_role_us, mex_rec)

# Peso is the dollar of Mexico
torchhd.cosine_similarity(dollar_of_mexico, peso)
Out[12]: MAPTensor(0.4987)

# dollar is not the dollar of Mexico
torchhd.cosine_similarity(dollar_of_mexico, usd)
Out[13]: MAPTensor(0.0086)
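
One way to make the answer explicit is to "clean up" the noisy query result by comparing it against all the known hypervectors and taking the best match. A minimal sketch, reusing the vectors above:

# compare the query result to a codebook of the known hypervectors
codebook = torch.stack([usa, mex, usd, peso])
torchhd.cosine_similarity(dollar_of_mexico, codebook)
# the largest similarity should be at index 3, peso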

For a more complicated example, consider the problem of distinguishing human protein sequences from yeast protein sequences, using data from here. We'll use an approach similar to the one in the previous post: each sequence is encoded as a bundle of hypervectors representing its overlapping 3-mers.


import torchhd
from Bio.SeqRecord import SeqRecord


class Encoder():
    """A class to encode protein sequences as torchhd hypervectors."""
    _amino_acids = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 
                    'N', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y']

    def __init__(self, dim: int = 10_000):
        self._dim = dim
        self._trimers = self._trimer_hdv()  # initialize all possible trimers

    def encode(self, seq: SeqRecord) -> torchhd.tensors.map.MAPTensor:
        """Encode a BioPython SeqRecord as a torchhd hypervector.

        The sequence is broken into overlapping 3-mers, each 3-mer is
        mapped to its random hypervector, and the hypervectors are
        bundled and quantized.

        Parameters
        ----------
        seq : SeqRecord
            a BioPython sequence record

        Returns
        -------
        torchhd.tensors.map.MAPTensor
            a quantized hypervector
        """
        # one hypervector per overlapping 3-mer
        hdv = torchhd.empty(len(seq) - 2, self._dim)
        for pos in range(len(seq) - 2):
            hdv[pos, :] = self._trimers[str(seq.seq[pos:pos + 3])]

        return torchhd.hard_quantize(torchhd.multiset(hdv))
            
    def _trimer_hdv(self) -> dict[str, torchhd.tensors.map.MAPTensor]:
        """Generate hypervectors for all 21*21*21 possible amino acid trimers
        (the 20 standard amino acids plus U, selenocysteine).

        Returns
        -------
        dict[str, torchhd.tensors.map.MAPTensor]
            a dictionary, key: amino acid trimer, value: random hypervector encoding the trimer
        """
        trimer_hdvs = dict()
        for aa1 in self._amino_acids:
            for aa2 in self._amino_acids:
                for aa3 in self._amino_acids:
                    trimer_hdvs[aa1 + aa2 + aa3] = torchhd.random(1, self._dim)

        return trimer_hdvs 
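
As a quick sanity check, the encoder might be used like this on a made-up toy sequence (this snippet is mine, not from the original script):

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

encoder = Encoder(dim = 10_000)
toy = SeqRecord(Seq("MKTAYIAKQR"), id = "toy")  # hypothetical sequence
hv = encoder.encode(toy)
print(hv.shape)  # torch.Size([10000]): one quantized hypervector for the sequence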

We split the data into training and test sets. We encode each training sequence and bundle the resulting hypervectors into human and yeast prototype hypervectors.


    # read_fasta and get_training_test_index are helper functions from the
    # full script (see the repository link at the end of the post)
    seq_recs, seq_species = read_fasta(data_file)
    training_idx, test_idx = get_training_test_index(len(seq_recs))  # split the data

    encoder = Encoder(dim = dim)

    # build a prototype hdv for each species by bundling like sequences
    yeast_prototype = torchhd.empty(1, dim)
    human_prototype = torchhd.empty(1, dim)
    for i in training_idx:
        hdv = encoder.encode(seq_recs[i])
        if seq_species[i] == "HUMAN":
            human_prototype = torchhd.bundle(human_prototype, hdv)
        else:
            yeast_prototype = torchhd.bundle(yeast_prototype, hdv) 

    # squeeze out the leading singleton dimension to match the encoded
    # sequences' shape, then quantize
    yeast_prototype = torch.squeeze(torchhd.hard_quantize(yeast_prototype))
    human_prototype = torch.squeeze(torchhd.hard_quantize(human_prototype))

Finally we compare each sequence in the test set to the prototypes to see which species it is closer to.

    # test predictions and print accuracy
    predictions = []
    for i in test_idx:
        hdv = encoder.encode(seq_recs[i])
        predictions.append(prediction(human_prototype, yeast_prototype, hdv))
        
    print(np.mean(np.array(predictions, dtype = str) == np.array(seq_species, dtype = str)[test_idx]))
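
The prediction helper isn't shown in this excerpt; the version in the repository may differ, but a minimal sketch would label each sequence by its more similar prototype:

def prediction(human_prototype, yeast_prototype, hdv):
    # hypothetical sketch: return the label of the closer prototype
    human_sim = torchhd.cosine_similarity(hdv, human_prototype)
    yeast_sim = torchhd.cosine_similarity(hdv, yeast_prototype)
    return "HUMAN" if human_sim > yeast_sim else "YEAST"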

It does a reasonable job of predicting the species of the test sequences.

$ time python src/protein_torchhd.py
0.875

real    0m17.619s
user    1m56.212s
sys     0m5.462s

The torchhd code for the human/yeast example is available at https://github.com/analytic-garden/hyperdimensional_computing.
