Spam Filtering Using Compression Models
Andrej Bratko and Bogdan Filipic

Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. This paper summarizes our experiments for the TREC 2005 spam track, in which we consider the use of adaptive statistical data compression models for the spam filtering task. The nature of these models allows them to be employed as Bayesian text classifiers based on character sequences. Since messages are modeled as sequences of characters, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We present experimental results indicating that compression models perform well in comparison to established spam filters. We also show that the method is extremely robust to noise, which should make such filters difficult to defeat.

Note: Certain algorithms used in this work are implemented in the PSMSLib software package.
Note: This technical report has been superseded by a journal paper.

Download:
 [pdf] [ps]

Download slides (compiled from slides presented at TREC 2005 and Microsoft Research):
 [pdf]

Download poster (presented at TREC 2005):
 [pdf]

BibTex entry:
@techreport { bratko05using,
  author      = {Bratko, A. and Filipi\v{c}, B.},
  title       = {Spam Filtering Using Compression Models},
  institution = {Department of Intelligent Systems, Jo\v{z}ef Stefan Institute, Ljubljana, Slovenia},
  number      = {IJS-DP-9227},
  year        = {2005} }