High Dimensional Vectors

Hyperdimensional  computing recently made the news. This article gives some examples of hyper vector databases.

In a previous post, we looked at hyperdimensional computing (HDC) using high dimensional vectors. As we saw, orthogonality is an important property of high-d vectors. We said that two random (-1, 1) high-d vectors were orthogonal. What do we really mean by that?

In these lecture notes, Kothari and Arora show the following result,

\[P\left( {\left| {\cos ({\theta _{x,y}})} \right| > \sqrt {\frac{{\log (c)}}{N}} } \right) < \frac{1}{c}\].

What this is saying is that probability that the absolute value of the cosine of the angle between randomly generated high-d vectors is greater that a simple function of N and c. N is the dimension of the random vectors. c is an arbitrary constant. We can choose c to adjust the probability.

What is a good choice for c

If we choose $c = {e^{0.01N}}$, then $\sqrt {\frac{{\log (c)}}{N}}  = 0.1$.

In other words, 

\[P\left( {\left| {\cos ({\theta _{x,y}})} \right| > 0.1} \right) < {e^{ - 0.01N}}\]

If N = 10,000, ${e^{ - 0.01N}}$ is a very small number, 3.7e-44. . The cosine of the angle between randomly chosen high d vectors is only greater than 0.1 with a very small probability.

What angle has a cosine of 0.1?

import numpy as np

np.arccos(0.1) * 180/np.pi
Out[159]: 84.26082952273322

Two randomly 10,000 dimension vectors have a vanishingly small probability of being  outside of the range -84.3 to 84.2 degrees. They are orthogonal.

We could safely choose $\frac{{\sqrt c }}{2}$ random vectors and be confident that they are mutually orthogonal.

Let's see how varying N affects the probability that the cos of the angle between two high-d vectors is greater than 0.1.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def cos_prob(N, t = 0.1):
    c = np.exp(N * t**2)
    return N, 1/c, np.sqrt(c)/2

columns = ["N", "prob", "m"]
df = pd.DataFrame([cos_prob(n) for n in np.array(list(range(1, 11))) * 1000], columns = columns)

print(df)

sns.lineplot(x = df["N"], y = np.log(df["prob"]))
plt.ylabel("log prob")
plt.show()

       N          prob             m
0   1000  4.539993e-05  7.420658e+01
1   2000  2.061154e-09  1.101323e+04
2   3000  9.357623e-14  1.634509e+06
3   4000  4.248354e-18  2.425826e+08
4   5000  1.928750e-22  3.600245e+10
5   6000  8.756511e-27  5.343237e+12
6   7000  3.975450e-31  7.930067e+14
7   8000  1.804851e-35  1.176926e+17
8   9000  8.194013e-40  1.746714e+19
9  10000  3.720076e-44  2.592353e+21

For N = 1000, the probability of the cosine of the angle between randomly chosen vectors being greater than 0.1 is 4.5e-05, but we only feel safe choosing 74 vectors. At N = 2000, the probability of being outside the range is 2e-09 and we can be confident in choosing over a million random vectors.

The log probability of being out side of the range -84.3 to 84.2 decreases linearly with increasing N.

We can adjust the maximum angle also. If we choose, $c = {e^{\frac{N}{4}}}$, $\sqrt {\frac{{\log (c)}}{N}}  = 0.5$ or about 60 degrees, still orthogonal.

No comments:

Post a Comment