So remember those fruit flies? Remember how we got them to produce quality representations of Web documents? We're now trying to make them smaller. And for that, we're going to have to make many baby fruit flies...
Recall that the quality of the fruit fly representations is linked to the specific number and nature of the random projections inside the architecture. Better projections can potentially improve the accuracy of the model in document classification tasks, while reducing their overall number makes for a more compact model, easier to run on entry-level hardware. Following this reasoning, we optimize our fruit flies with a genetic algorithm. The advantage of the genetic algorithm over the simple hyperparameter search we've performed before is that it allows us to share genetic material (projections) between flies and to move from fully random projections to a set optimized for document representation.
A genetic algorithm (GA) is a search algorithm that finds a solution to a problem through a process of evolution. It is inspired by natural selection in the biological world. In each generation, good individuals are kept for the next generation, or passed through a crossbreeding and mutation process to produce offspring. Over many generations, individuals with good characteristics become dominant in the population, giving us a higher chance of finding the ideal solution to our optimization problem.
The quality of the evolution process depends on multiple pre-defined components: a good representation of individuals (a.k.a. chromosome), a fitness function to score the individual's performance, selection-crossover-mutation strategies, and other hyper-parameters. We are going to apply a GA to our Fruit Fly Algorithm (FFA).
Our goal is to find an ideal projection matrix of the FFA, and an ideal winner-take-all rate (WTA, in percentage). These weights and WTA rates have to satisfy two criteria: 1. High classification accuracy on all 3 datasets (Web of Science, Wikipedia, 20newsgroups); and 2. A small number of non-zero elements in the Kenyon cell (KC) layer. The second criterion reduces execution time when applying the FFA to a large retrieval task. There is a trade-off between the two criteria: the more non-zero elements in the KC layer, the better the classification scores.
In the current version, the fitness function is defined as the average accuracy over the development set of our three datasets:
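In code, the fitness function amounts to something like the following minimal sketch (the dataset keys and the fact that accuracies are passed in as a pre-computed dictionary are illustrative assumptions):

```python
def fitness(dev_accuracies):
    """Fitness = mean classification accuracy over the development
    sets of the three datasets (values assumed pre-computed)."""
    return sum(dev_accuracies.values()) / len(dev_accuracies)

# Example: a fly scoring 78%, 92% and 81% on the three dev sets
score = fitness({"wos": 0.78, "wikipedia": 0.92, "20newsgroups": 0.81})
```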
In order to keep the KC layer as small as possible, we implement a strategy where the first generation of flies only contains a reduced number of projections (10-100), and subsequent generations grow in size, with new offspring adding 10 to 30 more projections to the set they received from their parents. This allows us to monitor the effect of size on accuracy.
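This growth strategy can be sketched as follows. Here each projection is represented as a list of projection-neuron indices; the per-projection connectivity (10 PNs per KC) and the helper name are illustrative assumptions:

```python
import random

def grow_offspring(inherited, pn_size, min_new=10, max_new=30, conn=10):
    """Offspring keep the projections inherited from their parents and
    add 10-30 new random ones, so each generation grows in size.
    Each projection is a sorted list of PN indices a KC connects to."""
    n_new = random.randint(min_new, max_new)
    new_rows = [sorted(random.sample(range(pn_size), conn))
                for _ in range(n_new)]
    return inherited + new_rows

# Example: a child inheriting a single projection from its parents
offspring = grow_offspring([[0, 3, 7]], pn_size=100)
```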
There are two components to represent: the projection (weight) matrix and the WTA rate.
The weight matrix is a sparse, binary matrix. Each row represents one Kenyon cell, and each column represents a projection neuron. A 1 represents a connection between a Kenyon cell and a projection neuron, while a 0 denotes the absence of a connection. We apply the following operations to the flies' matrices:
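As an illustration of how GA operations can act on such a matrix, here is a sketch using two common choices: row-wise crossover and bit-flip mutation. Both strategies, and the matrix sizes, are assumptions for the example rather than the exact operations used:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fly's chromosome: sparse binary matrix with rows = Kenyon cells
# and columns = projection neurons (sizes here are illustrative).
kc, pn = 6, 12
parent_a = (rng.random((kc, pn)) < 0.2).astype(np.uint8)
parent_b = (rng.random((kc, pn)) < 0.2).astype(np.uint8)

def crossover(a, b):
    """Row-wise crossover: each Kenyon cell (row) of the child is
    inherited whole from one parent, chosen at random."""
    mask = rng.random(a.shape[0]) < 0.5
    return np.where(mask[:, None], a, b)

def mutate(m, rate=0.01):
    """Bit-flip mutation: each connection flips with small probability."""
    flips = rng.random(m.shape) < rate
    return (m ^ flips).astype(np.uint8)

child = mutate(crossover(parent_a, parent_b))
```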
The WTA rate is a floating-point number. Again, two operations apply to it during evolution:
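The excerpt doesn't spell these operations out, but for a scalar gene two standard choices are blend crossover and clipped Gaussian mutation, sketched below (both the strategies and the bounds are assumptions):

```python
import random

def wta_crossover(wta_a, wta_b):
    """Blend crossover: the child's WTA rate is a random convex
    combination of the parents' rates."""
    t = random.random()
    return t * wta_a + (1 - t) * wta_b

def wta_mutate(wta, sigma=2.0, lo=1.0, hi=50.0):
    """Gaussian mutation, clipped to a plausible percentage range."""
    return min(hi, max(lo, random.gauss(wta, sigma)))

blended = wta_crossover(22.0, 16.0)
child_wta = wta_mutate(blended)
```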
We apply tournament selection: randomly choose two candidates, and select the one with the higher fitness value. We repeat this process until the new population is full. Moreover, we mix the tournament method with elitism by keeping the two individuals with the highest fitness values. Note that the pool of candidates is just a sample of the whole population; the number of candidates is again controlled by a hyper-parameter.
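The selection step described above can be sketched as follows (the function name is illustrative; the tournament size `k` defaults to the two candidates mentioned in the text):

```python
import random

def select_next_generation(population, fitnesses, k=2, n_elites=2):
    """Tournament selection with elitism: keep the two fittest
    individuals, then fill the remaining slots by repeatedly picking
    the best of k randomly sampled candidates."""
    scored = list(zip(population, fitnesses))
    ranked = sorted(scored, key=lambda p: p[1], reverse=True)
    next_gen = [ind for ind, _ in ranked[:n_elites]]   # elitism
    while len(next_gen) < len(population):
        candidates = random.sample(scored, k)          # tournament pool
        winner = max(candidates, key=lambda p: p[1])[0]
        next_gen.append(winner)
    return next_gen

pop = ["fly%d" % i for i in range(10)]
fit = [0.1 * i for i in range(10)]
new_pop = select_next_generation(pop, fit)
```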
We conduct a side experiment to compare which selection strategy works best for our problem: tournament, tournament with elitism, roulette wheel, stochastic universal sampling, and rank selection. The results suggest that the mix of tournament and elitism works best.
Starting from a small population of 10, the GA shows early positive impact on our three datasets, as demonstrated by the graph below.
The graph shows the mean accuracy of each generation on the validation set of each dataset, as training progresses. As we can see, accuracy on the three datasets improves sharply in the first 100 generations and flattens afterwards: it continues to increase, but at a much smaller rate. This is important from the point of view of storage and efficiency: the flies in the last generations have increased in size compared to their ancestors, so there is a trade-off between model size and the amount of extra accuracy bought by new genetic material. We can use the learning curve of our flies to decide on a good compromise between accuracy and efficiency.
When training on individual datasets, we obtain validation scores as high as or higher than in the original setting with hyperparameter search: 78% on Web of Science, 92% on Wikipedia, and 81% on 20newsgroups. We also try to train on the three datasets jointly, which forces each generation of flies to search for the best general-purpose projections. In that setting, our best results are 77% (WoS), 91% (Wikipedia) and 70% (20newsgroups). The lower result on 20newsgroups indicates that it is in some way 'incompatible' with the features of both WoS and Wikipedia, and that the flies are drawn to optimizing on the two most similar datasets. This is a result to take into account in the future, given the wide variety of texts present on the Web.
In terms of model size, the advantages of the genetic algorithm become evident. On the Web of Science dataset, the 337th generation reaches 76% accuracy on the validation set with a KC layer of 5200 nodes and a WTA of 22%: compare this with the best setting from our hyperparameter search, which used 8800 nodes and a WTA of 15% (accuracy 77%). On the Wikipedia dataset, the 361st generation reaches 91% accuracy on the validation set with a KC layer of 5600 nodes and a WTA of 16%: compare this with the best setting from our hyperparameter search, which used 8200 nodes and a WTA of 19% (accuracy 92%). On the 20newsgroups dataset, finally, convergence is poorer, but the 255th generation of flies nevertheless achieves 67% accuracy with 4400 nodes and a WTA of 29% (our initial result was 70% accuracy with 8700 nodes and a WTA of 20%).