Introduction

Grouperfish is built to perform text clustering for Firefox Input. Due to its generic nature, it also serves as a testbed to prototype machine learning algorithms.

How does it work?

Grouperfish is a document transformation system for high-throughput applications.

Roughly summarized:

  • users put documents into Grouperfish using a REST interface
  • transformations are performed on one or several subsets of these documents
  • results can be retrieved by users over the REST interface
  • all components are distributed for high volume applications
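
The flow summarized above can be sketched with a small in-memory stand-in. This is only an illustration of the put / transform / retrieve cycle, not the Grouperfish REST API: the class ToyStore, its method names, and the "namespace" argument are all hypothetical, and nothing here is distributed.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Document = Dict[str, str]


class ToyStore:
    """In-memory stand-in for a document store (hypothetical, not Grouperfish)."""

    def __init__(self) -> None:
        # namespace -> document id -> document
        self.documents: Dict[str, Dict[str, Document]] = defaultdict(dict)
        # namespace -> result name -> result
        self.results: Dict[str, Dict[str, object]] = defaultdict(dict)

    def put_document(self, namespace: str, doc: Document) -> None:
        # A producer adds a document; putting the same id again overwrites it.
        self.documents[namespace][doc["id"]] = doc

    def run_transform(
        self,
        namespace: str,
        name: str,
        select: Callable[[Document], bool],
        transform: Callable[[List[Document]], object],
    ) -> None:
        # Pick a subset of the documents and store the transformed result.
        subset = [d for d in self.documents[namespace].values() if select(d)]
        self.results[namespace][name] = transform(subset)

    def get_result(self, namespace: str, name: str) -> object:
        # A consumer retrieves a previously generated result.
        return self.results[namespace][name]


store = ToyStore()
store.put_document("input", {"id": "1", "product": "firefox", "text": "crashes on startup"})
store.put_document("input", {"id": "2", "product": "firefox", "text": "love the new tabs"})
store.put_document("input", {"id": "3", "product": "mobile", "text": "slow to load"})

# Transform only the subset of documents about one product.
store.run_transform(
    "input",
    "firefox-count",
    select=lambda d: d["product"] == "firefox",
    transform=len,
)
print(store.get_result("input", "firefox-count"))  # prints 2
```

In the real system each of these three steps would be a request against the REST interface rather than a method call on a local object.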

What can be done?

Assume a scenario where a steady stream of documents is generated. For example:

  • user feedback
  • software crash reports
  • Twitter messages

Now, these documents can be processed to make them more useful. For example:

  • clustering (grouping related documents together, detecting common topics)
  • classification (associating documents with predefined categories, such as spam)
  • trending (identifying new topics over time)
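
To make "clustering" concrete, here is a toy sketch that groups short messages by word overlap. It is not the algorithm Grouperfish uses; the greedy single-pass strategy, the Jaccard similarity measure, the 0.3 threshold, and the sample messages are all assumptions chosen for illustration.

```python
from typing import List, Set


def tokens(text: str) -> Set[str]:
    # Lowercase and split on whitespace; real systems would do far more.
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    # Size of the intersection divided by size of the union.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(texts: List[str], threshold: float = 0.3) -> List[List[str]]:
    # Greedy single pass: join the first cluster whose first member is
    # similar enough, otherwise start a new cluster.
    clusters: List[List[str]] = []
    for text in texts:
        for group in clusters:
            if jaccard(tokens(text), tokens(group[0])) >= threshold:
                group.append(text)
                break
        else:
            clusters.append([text])
    return clusters


feedback = [
    "firefox crashes on startup",
    "firefox crashes on exit",
    "love the new tabs",
    "the new tabs look great",
]
groups = cluster(feedback)
for group in groups:
    print(group)
```

With these four messages, the two crash reports end up in one group and the two comments about tabs in another.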

Vocabulary

Grouperfish users can assume one of three roles (or any combination thereof):

Document Producer
A user (usually another piece of software) that puts documents into the system.
Result Consumer
A user or piece of software that retrieves the generated results.
Admin
A user who configures which subsets of documents to transform, as well as how and when to transform them.