Introduction

Grouperfish is built to perform text clustering for Firefox Input. Due to its generic nature, it also serves as a testbed to prototype machine learning algorithms.

How does it work?

Grouperfish is a document transformation system for high-throughput applications.

Roughly summarized:

users put documents into Grouperfish through a REST interface
transformations are performed on one or several subsets of these documents
users retrieve the results through the same REST interface
all components are distributed to support high-volume applications
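
The workflow above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the class `ToyGrouperfish` and its methods `put`, `transform`, and `results` are hypothetical names, not Grouperfish's actual REST API, and a real deployment would perform these steps over HTTP across distributed components.

```python
from typing import Callable, Dict, List


class ToyGrouperfish:
    """Hypothetical in-memory stand-in for the put/transform/retrieve flow."""

    def __init__(self) -> None:
        self._documents: Dict[str, dict] = {}
        self._results: Dict[str, object] = {}

    def put(self, doc_id: str, document: dict) -> None:
        # Step 1: a producer stores a document
        # (done over the REST interface in the real system).
        self._documents[doc_id] = document

    def transform(
        self,
        name: str,
        subset: Callable[[dict], bool],
        transformation: Callable[[List[dict]], object],
    ) -> None:
        # Step 2: select a subset of documents and run a transformation on it.
        selected = [doc for doc in self._documents.values() if subset(doc)]
        self._results[name] = transformation(selected)

    def results(self, name: str) -> object:
        # Step 3: a consumer retrieves the stored result
        # (also done over the REST interface in the real system).
        return self._results[name]


system = ToyGrouperfish()
system.put("1", {"product": "firefox", "text": "crashes on startup"})
system.put("2", {"product": "firefox", "text": "love the new tabs"})
system.put("3", {"product": "thunderbird", "text": "search is slow"})

# Use a document count over the "firefox" subset as a trivial transformation.
system.transform("firefox-count", lambda d: d["product"] == "firefox", len)
print(system.results("firefox-count"))  # prints 2
```

The subset is expressed as a predicate and the transformation as a plain function, so the same stored documents can feed several independent transformations, which mirrors the "one or several subsets" wording above.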

What can be done?

Assume a scenario where a steady stream of documents is generated. For example:

user feedback
software crash reports
Twitter messages

Now, these documents can be processed to make them more useful. For example:

clustering (grouping related documents together, detecting common topics)
classification (associating documents with predefined categories, such as spam)
trending (identifying new topics over time)
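
To make the idea of clustering concrete, here is a minimal sketch that groups short feedback messages by word overlap (Jaccard similarity). It is a toy technique chosen for brevity, not the algorithm Grouperfish uses; the function names, the sample messages, and the `0.3` threshold are all assumptions made for illustration.

```python
from typing import List, Set


def tokens(text: str) -> Set[str]:
    # Lowercase and split on whitespace; real systems tokenize more carefully.
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    # Size of the intersection divided by size of the union.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs: List[str], threshold: float = 0.3) -> List[List[str]]:
    # Greedy single pass: join the first group whose first document is
    # similar enough, otherwise start a new group.
    clusters: List[List[str]] = []
    for doc in docs:
        for group in clusters:
            if jaccard(tokens(doc), tokens(group[0])) >= threshold:
                group.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters


feedback = [
    "firefox crashes on startup",
    "firefox crashes on exit",
    "love the new tab page",
    "the new tab page is great",
]
groups = cluster(feedback)
for group in groups:
    print(group)
```

With these sample messages, the two crash reports end up in one group and the two comments about the new tab page in another.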

Vocabulary

Grouperfish users can assume one of three roles (or any combination thereof):

Document Producer: A user (usually another piece of software) that puts documents into the system.
Result Consumer: A user or piece of software that retrieves the generated results.
Admin: A user who configures which subsets of documents to transform, as well as how and when to transform them.