First baseline results from the Fruit Fly Algorithm

The PeARS team has been hard at work evaluating a biologically inspired algorithm for Web document classification (read more about the project here). If the experimental results are positive, the so-called 'Fruit Fly Algorithm' (FFA) will be integrated into the PeARS framework, making the system even more lightweight and efficient in terms of both storage and performance.

And... initial results are out! On this page, we report on the performance of the FFA 'out of the box'. Evaluation takes place over a document classification task, which assesses whether our document vectors are of sufficient quality to be accurately categorised into different topics.

Evaluation datasets

We are using two standard datasets for document classification, 20_newsgroups and Web of Science (links below). We also built our own dataset from Wikipedia pages linked to categories.

The model

Our aim is to evaluate the FFA's ability to generate good document vectors. Each document in the three datasets above is passed as input to the FFA and hashed. The resulting hashes are then used to classify the documents with a multiclass logistic regression classifier (we use the sklearn implementation, available here).
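To make the hashing step more concrete, here is a minimal sketch of fruit-fly-style hashing as it is commonly described: a sparse random projection onto a layer of Kenyon Cells, followed by winner-takes-all. It is an illustration under assumptions, not the PeARS code. The function names, the toy vocabulary size and the random stand-in document vector are all hypothetical, and keyword selection (Num. top words) is not modelled.

```python
import numpy as np


def make_projection(vocab_size, num_kc, proj_size, rng):
    """Build a sparse binary matrix: each Kenyon Cell samples `proj_size` input words."""
    projection = np.zeros((num_kc, vocab_size), dtype=np.int8)
    for kc in range(num_kc):
        # Each KC is wired to a random subset of vocabulary items, without repeats.
        inputs = rng.choice(vocab_size, size=proj_size, replace=False)
        projection[kc, inputs] = 1
    return projection


def fly_hash(doc_vector, projection, wta_percent):
    """Project a document vector onto the KCs and keep only the top `wta_percent`%."""
    activations = projection @ doc_vector
    num_kept = max(1, int(round(len(activations) * wta_percent / 100)))
    # Indices of the most strongly activated KCs (winner-takes-all).
    winners = np.argsort(activations)[-num_kept:]
    binary_hash = np.zeros(len(activations), dtype=np.int8)
    binary_hash[winners] = 1
    return binary_hash


rng = np.random.default_rng(0)
vocab_size = 50                   # toy vocabulary, far smaller than a real one
doc = rng.random(vocab_size)      # stand-in for a document's word weights
projection = make_projection(vocab_size, num_kc=200, proj_size=4, rng=rng)
doc_hash = fly_hash(doc, projection, wta_percent=10)
print(doc_hash.sum())             # 20 of the 200 bits are set
```

The output is a sparse binary vector, which is what makes this kind of representation attractive for storage.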

Hyperparameter tuning

We tune the fruit fly model and classifier concurrently, using Bayesian Optimization. We show below the range of values considered for each hyperparameter.

Model hyperparameters

  • Number of Kenyon Cells (Number KC): 3000-9000
  • Size of random projections (Proj. size): 2-10
  • Percentage of KCs to retain in the final hash (WTA): 2-20
  • Number of keywords from document (Num. top words): 10-250
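As a rough illustration of how these ranges might be handed to an optimiser, the sketch below stores them as bounds and turns a continuous proposal into usable settings. It assumes the optimiser proposes real-valued points and that the model expects whole numbers. The dictionary keys and helper names are hypothetical, and the example proposal is made up.

```python
# Bounds copied from the list above; the dictionary keys are hypothetical names.
MODEL_BOUNDS = {
    "num_kc": (3000, 9000),
    "proj_size": (2, 10),
    "wta": (2, 20),
    "num_top_words": (10, 250),
}


def to_settings(proposal, bounds=MODEL_BOUNDS):
    """Round a continuous proposal to integers and clip it to the search range."""
    settings = {}
    for name, (low, high) in bounds.items():
        value = int(round(proposal[name]))
        settings[name] = min(max(value, low), high)
    return settings


def active_kcs(settings):
    """Number of Kenyon Cells left switched on after winner-takes-all."""
    return settings["num_kc"] * settings["wta"] // 100


# A made-up proposal, deliberately overshooting the keyword range.
settings = to_settings(
    {"num_kc": 6000.4, "proj_size": 7.2, "wta": 9.6, "num_top_words": 260.0}
)
print(settings)
print(active_kcs(settings))  # 10% of 6000 KCs -> 600 active bits
```

Because WTA is a percentage, the number of bits set in each hash grows with the number of KCs, so the two settings interact.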

Classification hyperparameters

  • C: 1-100
  • Max. iterations: fixed at 2000 for the 20_newsgroups dataset and at 50 for the other two datasets
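For readers unfamiliar with the classifier, the sketch below shows where these two settings go in sklearn's LogisticRegression. The data is a synthetic stand-in for document hashes (three classes, each favouring its own block of bits), and the values of C and max_iter are arbitrary picks from the ranges above, not the tuned ones.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy binary "hashes": each class tends to switch on its own block of bits.
num_docs, num_bits, num_classes = 300, 30, 3
labels = rng.integers(0, num_classes, size=num_docs)
hashes = (rng.random((num_docs, num_bits)) < 0.1).astype(np.int8)
block = num_bits // num_classes
for i, label in enumerate(labels):
    start = label * block
    hashes[i, start:start + block] |= (rng.random(block) < 0.5).astype(np.int8)

# C controls the inverse regularisation strength; max_iter caps the solver's iterations.
classifier = LogisticRegression(C=10.0, max_iter=50)
classifier.fit(hashes[:200], labels[:200])
accuracy = classifier.score(hashes[200:], labels[200:])
print(f"held-out accuracy on toy data: {accuracy:.2f}")
```

A low iteration cap can make the solver stop before it has converged, in which case sklearn emits a warning, so the cap is a trade-off between training time and fit.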

Results

We first report below our results on the validation data, showing the 5 best sets of hyperparameters for each dataset.

20_newsgroups dataset

Score   Number KC  Proj. size  Num. top words  WTA  C parameter
0.7043  8730       7           223             20   23
0.7039  8739       4           248             14   98
0.7039  8559       7           241             14   5
0.7038  8638       7           241             11   93
0.7035  8960       6           247             19   98

The average score for the 5 settings above on the test set is: 0.8021

Web of Science dataset

Score   Number KC  Proj. size  Num. top words  WTA  C parameter
0.7712  8834       10          242             15   93
0.7707  8662       8           244             19   49
0.7700  8891       5           250             18   4
0.7700  8409       5           247             16   98
0.7699  8015       9           171             8    69

The average score for the 5 settings above on the test set is: 0.7833

Wikipedia dataset

Score   Number KC  Proj. size  Num. top words  WTA  C parameter
0.9186  8271       6           241             19   40
0.9179  8604       8           204             18   35
0.9178  8808       5           240             11   83
0.9172  8786       5           243             16   100
0.9171  8995       9           249             10   80

The average score for the 5 settings above on the test set is: 0.9179

Discussion

The fruit fly gives very decent performance 'out of the box'. Its performance is in fact at state-of-the-art level on the 20_newsgroups dataset: compare with the results given on this page, which use a heavier architecture. Nevertheless, we expect that it can be improved in terms of raw performance and/or model size, and this is what the next steps of the project will investigate.