PeARS Fruit Fly is the newest PeARS project and is under active development. It aims to make the PeARS Web indexer smaller, more efficient, and more accurate.
Inspired by the olfactory system of *Drosophila melanogaster* (the common fruit fly), the indexer is a shallow, partially connected neural network that transforms the text of Web documents into a compact, searchable representation. Given its tiny size and simplicity, a fruit fly model can be stored and run efficiently on any device.
The Fruit Fly Algorithm (FFA), a technique derived from the olfactory system of *Drosophila melanogaster*, implements a kind of locality-sensitive hashing based on random projections. The usefulness of the biological algorithm for computer science was first noted by Dasgupta et al. (2016), who used it for hashing pre-trained document and image vectors. In our project, we aim to take the algorithm one step further and make it learn document representations directly from raw text.
The FFA mimics the olfactory system of the fruit fly, which assigns a pattern of binary activations to a particular smell (i.e., a combination of multiple chemicals), using sparse connections between just two neuronal layers. This mechanism allows the fly to 'conceptualise' its environment and to react appropriately to new smells by relating them to previous experiences.
In a computer science context, the FFA can be regarded as a very simple and elegant algorithm, implemented as a small feedforward neural network. Below, we follow the descriptions of Herbelot and Preissner (2019, 2020), who ported the original algorithm to a Natural Language Processing setting and made the fly learn word embeddings.
The figure below shows the architecture of the system.
The PN (Projection Neurons) layer takes features from raw text documents (see [this wiki page](https://github.com/PeARSearch/PeARS-fruit-fly/wiki/1.1.-Baselines) for more details). It is partially connected to the KC (Kenyon Cell) layer via random projections of a certain size. The *n* most activated Kenyon Cells yield a locality-sensitive hash for each document. That is, the hashes can be used to compute levels of similarity between any pair of documents.
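As a toy illustration of that last point, the snippet below compares two made-up binary hashes with a Jaccard-style overlap. Both the hashes and the similarity measure are purely illustrative; the actual measure used in PeARS may differ.

```python
import numpy as np

# Two made-up binary document hashes over 16 Kenyon Cells
# (1 = the cell was among the most activated for that document).
hash_a = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0])
hash_b = np.array([1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0])

# Similar documents tend to activate overlapping Kenyon Cells, so the
# overlap between hashes can serve as a similarity score.
similarity = np.sum(hash_a & hash_b) / np.sum(hash_a | hash_b)
print(f"hash similarity: {similarity:.2f}")  # 0.67 for these two hashes
```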
We briefly describe below the pipeline for computing a document vector with the FFA.
As features, we use a pre-computed sentencepiece vocabulary, which chunks a document into wordpieces (i.e. subword units). Given a text document *d*, we first encode *d* in terms of our wordpiece vocabulary.
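A minimal sketch of this step, assuming a sentencepiece model trained beforehand on a large corpus ('spm.model' is a placeholder filename, not the project's actual file):

```python
import sentencepiece as spm

# Load a pre-trained sentencepiece model (placeholder filename).
sp = spm.SentencePieceProcessor()
sp.load("spm.model")

doc = "The fruit fly navigates the world through smell."
pieces = sp.encode_as_pieces(doc)
print(pieces)  # e.g. ['▁The', '▁fruit', '▁fly', '▁navigat', 'es', ...]
```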
We then produce a frequency vector for *d*: we simply count the number of times each wordpiece in our vocabulary occurs in the document. To avoid giving too much importance to very frequent items such as articles or prepositions, we reweight the vector using the log probabilities computed by sentencepiece over its training corpus. This 'flattens' the frequency distribution of the document.
We then keep only the *k* most important wordpieces for the document, where *k* is a hyperparameter of the algorithm that can be tuned.
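The sketch below puts these two steps together: it counts wordpieces, reweights the counts with the sentencepiece log probabilities (here using -log p as an IDF-like weight, which is our assumption; the exact scheme in PeARS may differ), and keeps only the top *k* entries.

```python
from collections import Counter
import numpy as np

def doc_vector(pieces, vocab, logprobs, k):
    """Toy version of the document vector step.
    pieces:   wordpieces of the document (from sentencepiece)
    vocab:    dict mapping each wordpiece to a dimension index
    logprobs: sentencepiece log probability of each wordpiece
    k:        number of entries to keep (hyperparameter)
    """
    counts = Counter(p for p in pieces if p in vocab)
    vec = np.zeros(len(vocab))
    for piece, count in counts.items():
        # Reweight the raw count by the wordpiece's log probability.
        # -log p is large for rare pieces and small for frequent ones,
        # so articles and prepositions get flattened. (Assumed scheme,
        # not necessarily the one used in the project.)
        vec[vocab[piece]] = count * -logprobs[piece]
    # Keep only the k highest-weighted wordpieces; zero out the rest.
    if np.count_nonzero(vec) > k:
        threshold = np.sort(vec)[-k]
        vec[vec < threshold] = 0.0
    return vec
```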
Now, we use the resulting vector as input to the FFA. As explained above, the FFA aggregates input values via random projections linking the input layer to the Kenyon Cell layer of the architecture. In our case, this means that certain combinations of words will more or less strongly activate particular Kenyon Cells.
Example: say there is a random projection of size 5 linking the input words 'cat', 'run', 'think', 'of' and 'elaborate' to Kenyon Cell number 36 (KC36), and that our document contains the words 'cat', 'run' and 'of', but not 'elaborate' or 'think'. Then the weights of 'cat', 'run' and 'of' are added together to give the activation of KC36.
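In code, this example looks roughly as follows. The mini-vocabulary, weights and wiring are made up for illustration; the real model uses many more wordpieces, Kenyon Cells and projections.

```python
import numpy as np

# Made-up mini-vocabulary and document weights after reweighting:
# 'cat', 'run' and 'of' occur in the document; 'think' and 'elaborate' do not.
vocab = ["cat", "run", "think", "of", "elaborate"]
doc   = np.array([0.7, 0.4, 0.0, 0.2, 0.0])

# KC36 is wired to all five words by one random projection of size 5.
# Its activation is the sum of the document weights of those words.
kc36_wiring = np.array([1, 1, 1, 1, 1])
kc36_activation = kc36_wiring @ doc   # 0.7 + 0.4 + 0.2 = 1.3

# With a full 0/1 projection matrix (one row of wiring per Kenyon Cell),
# all activations are obtained in a single matrix-vector product:
# kc_activations = projection_matrix @ doc
```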
Finally, the IDs of the *n* Kenyon Cells with the highest activations are kept to produce a binary hash of the document. Again, *n* is a hyperparameter that can be tuned.
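A minimal sketch of this winner-take-all step (array sizes and values are made up):

```python
import numpy as np

def hash_from_activations(kc_activations, n):
    """Keep the IDs of the n most activated Kenyon Cells and turn them
    into a binary hash of the document."""
    winners = np.argsort(kc_activations)[-n:]   # IDs of the n strongest KCs
    doc_hash = np.zeros(len(kc_activations), dtype=int)
    doc_hash[winners] = 1
    return doc_hash

# Example with 8 Kenyon Cells and n = 3 (made-up activations):
activations = np.array([0.1, 1.3, 0.0, 0.8, 0.2, 0.9, 0.0, 0.4])
print(hash_from_activations(activations, n=3))  # [0 1 0 1 0 1 0 0]
```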