The PeARS framework relies on a distributed infrastructure in which individual users index parts of the Web into coherent topics. It would be useful to have a list of predetermined topics, covering wide areas of knowledge, to which users can contribute. We describe below how we identify such topics and seed them with document representations, providing preliminary PeARS pods that users will be able to expand with new content.
To cover as many topics as possible, we set out to extract topics from the English Wikipedia. We assumed that this resource is the best starting point for creating a structured ontology of human knowledge and, therefore, a grid for classifying Web content. The basic idea is that, if Wikipedia contains the categories dog, quantum physics and online retailers, there will be Web content which can be coherently classified into those three categories. A person using the fruit fly algorithm to index Web documents on dogs could then contribute their hashes to the dog category.
Moreover, Wikipedia itself links to external Web content, which can be used to seed topics with fruit fly representations. We took advantage of the URLs linked from a given Wikipedia page and the categories assigned to that page. For example, scrolling down the Wikipedia page on Algorithm, we find the heading "External Links". The links listed there allowed us to identify Web content related to the categories of the page, scrape the corresponding HTML, hash the documents and, finally, associate the newly created hashes with the relevant categories. In this specific example, the categories are: Algorithms, Mathematical logic, and Theoretical computer science.
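The association between external links and categories can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used here: the record layout, the function name `seed_categories`, the second page and the example URLs are all hypothetical, and scraping and hashing are left out.

```python
from collections import defaultdict

# Hypothetical page records: the categories of each Wikipedia page and the
# URLs found under its "External Links" heading (example.org URLs are made up).
pages = [
    {
        "title": "Algorithm",
        "categories": ["Algorithms", "Mathematical logic",
                       "Theoretical computer science"],
        "external_links": ["https://example.org/algorithms-intro"],
    },
    {
        "title": "Sorting algorithm",  # hypothetical second page
        "categories": ["Algorithms"],
        "external_links": ["https://example.org/sorting",
                           "https://example.org/algorithms-intro"],
    },
]


def seed_categories(pages):
    """Map each category to the set of external URLs linked from its pages."""
    category_to_urls = defaultdict(set)
    for page in pages:
        for category in page["categories"]:
            # Every external link of the page seeds every category of the page.
            category_to_urls[category].update(page["external_links"])
    return dict(category_to_urls)


seeds = seed_categories(pages)
```

Using sets means that a URL linked from several pages of the same category is only collected once for that category.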
The categories assigned to Wikipedia pages can be overwhelmingly specific, such as Documentary films about high school in the United States. For evaluation purposes, we needed broader, more general topics, which we call meta-categories; in this case, Documentary films better suits our scope. We therefore developed a method to semi-automatically group categories into meta-categories.
We first collected URLs and their webpages from 674,529 Wikipedia categories. We then created meta-categories by grouping category names that share an n-gram occurring with a frequency higher than 100. This yielded 491 meta-categories covering a total of 141,938 categories, an average of 289 categories per meta-category. We removed meta-categories that do not correspond to a semantically meaningful topic, such as 'groups', as well as those that cut across too many topics, namely '20th century' and 'New Zealand': the latter two could equally be part of 'sports people', 'business people', 'poets', etc.
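One possible reading of the n-gram grouping step is sketched below. The function names, the word-level tokenisation, the maximum n-gram length `max_n` and the `exclude` argument (standing in for the manual removal step) are assumptions made for illustration; in this sketch a category may also fall under several meta-categories, which may differ from the actual procedure.

```python
from collections import Counter, defaultdict


def ngrams(name, max_n=3):
    """Return the set of word n-grams (n = 1..max_n) of a category name."""
    words = name.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


def build_meta_categories(category_names, min_freq=100, max_n=3, exclude=()):
    """Group category names under each n-gram found in more than min_freq names."""
    counts = Counter()
    for name in category_names:
        counts.update(ngrams(name, max_n))  # each name counts once per n-gram
    excluded = {e.lower() for e in exclude}
    groups = defaultdict(list)
    for name in category_names:
        for gram in ngrams(name, max_n):
            if counts[gram] > min_freq and gram not in excluded:
                groups[gram].append(name)
    return dict(groups)


# Toy data, so the threshold is lowered to 1 instead of 100.
names = [
    "Documentary films about high school in the United States",
    "Documentary films about music",
    "Dog breeds",
]
meta = build_meta_categories(names, min_freq=1, max_n=2, exclude=["films about"])
```

On the toy data, uninformative n-grams such as 'about' also pass the frequency threshold, which illustrates why a manual filtering step is needed after the automatic grouping.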
Finally, we checked the distribution of Web documents that had been collected for those meta-categories. As a starting point, we decided to keep the ones with more than 1,880 documents each, to provide a preliminary set of seeded PeARS pods. In total, our final dataset consists of 338,744 documents divided into 180 meta-categories.
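The selection by document count amounts to a simple threshold filter, sketched here. The function name `select_pods` and the counts are hypothetical; only the threshold of 1,880 comes from the text.

```python
def select_pods(doc_counts, min_docs=1880):
    """Keep the meta-categories with more than min_docs collected documents."""
    return {name: n for name, n in doc_counts.items() if n > min_docs}


# Hypothetical document counts, for illustration only.
doc_counts = {"documentary films": 2500, "dog breeds": 1880, "poets": 1881}
pods = select_pods(doc_counts)
```

The comparison is strict, so a meta-category with exactly 1,880 documents is not kept.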
After preparing our Wikipedia-category datasets, we process them with the fruit fly algorithm and create a vector representation for each document.