OSBF-Lua


Text classification module for the Lua programming language
and a production-class anti-spam filter written in Lua using the module

#1 in CEAS 2008 Spam Filter Live Challenge

Winner of TREC's Spam Track 2006




Overview

OSBF-Lua (Orthogonal Sparse Bigrams with confidence Factor) is a Lua C module for text classification. It is a port of the OSBF classifier implemented in the CRM114 project. This implementation tries to focus on the classification task itself by using Lua as the scripting language. Lua is powerful yet lightweight and fast, which makes it easier to build and test more elaborate filters and training methods.

The OSBF algorithm is a typical Bayesian classifier enhanced with two techniques that I originally developed for the CRM114 project: Orthogonal Sparse Bigrams (OSB), for feature extraction, and Exponential Differential Document Count (EDDC, a.k.a. Confidence Factor), for automatic feature selection. Combined, these two techniques produce a highly accurate classifier. OSBF was developed with two classes in mind, SPAM and NON-SPAM, so its performance with more than two classes may not be the same.
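As a rough illustration of the OSB idea, the Python sketch below pairs each token with the few tokens that precede it and records the gap between them, so that pairs at different distances remain distinct features. It is not taken from the module's C code: the function name, the tuple representation and the window size of 5 are assumptions made for this example, and EDDC weighting is not shown.

```python
def osb_features(tokens, window=5):
    """Return orthogonal sparse bigrams for a list of tokens.

    Each token is paired with each of the (at most) window - 1 tokens
    before it. A feature is (earlier_token, skipped, current_token),
    where `skipped` is the number of tokens lying between the two.
    """
    features = []
    for i, current in enumerate(tokens):
        for distance in range(1, window):
            j = i - distance
            if j < 0:
                break  # ran off the start of the text
            features.append((tokens[j], distance - 1, current))
    return features


# Example: every feature produced for a four-token message.
for feature in osb_features("buy cheap pills now".split()):
    print(feature)
```

With a window of 5, every token after the first few contributes four features, which keeps the feature count linear in the message length.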

spamfilter.lua is an anti-spam filter written in Lua using the OSBF-Lua module. It takes special advantage of EDDC to introduce TONE-HR, a highly effective training method. The combination of OSB, EDDC and TONE-HR to enhance a classical Bayesian classifier resulted in the best spam filtering performance in TREC's Spam Track 2006 and in the CEAS 2008 Live Challenge.

The Confidence Factor was officially introduced in the paper "Exponential Differential Document Count - A Feature Selection Factor for Improving Bayesian Filters Accuracy", presented at the MIT Spam Conference 2006, after being in experimental use for more than a year in both projects, CRM114 and OSBF-Lua. The conference slides are also available.

The OSB technique was officially announced in the paper "Combining Winnow with Orthogonal Sparse Bigrams for Incremental Spam Filtering", a work headed and presented by Christian Siefkes at the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), in September 2004.

The CRM114 implementation of OSBF was one of the classifiers submitted to TREC's Spam Track 2005 by the CRM114 team, but its first results were not good because of a bug. Later, the bug was fixed and the OSBF-Lua version was submitted to the track coordinator, Prof. Gordon Cormack, for an extra evaluation. The new results were comparable to those of the best participants, with the advantage of being 5 to 10 times faster. Our notebook paper comments on the results of the four filters submitted by the CRM114 team: OSBF, Winnow, OSB Unique and OSB.

OSBF-Lua is free software released under the GPL version 2. A copy of the license is included in this distribution in the file gpl.txt.

What's new

Download and Contributions

Installation

OSBF-Lua requires Lua 5.1 installed with dynamic loading enabled. OSBF-Lua was developed and tested against the work, alpha, beta and final releases of Lua 5.1. It probably won't work with earlier versions.

Installation steps:

If you want to install in the default dir, you must be root to do the "make install" step. If you don't have root access, you may set PREFIX to point to a dir you have write access to, for instance $HOME/lib. You then need to add the new installation dir to LUA_CPATH so that the Lua loader can find osbf.so.

Ex: Installing in $HOME/lib

<edit config and set PREFIX to $HOME/lib>
$ mkdir $HOME/lib
$ make install
# Lua separates path templates with ';', and ';;' stands for the default path
$ export LUA_CPATH="$HOME/lib/?.so;;"

After the osbf module is properly installed, you may want to install the spamfilter, a Lua script that uses the OSBF-Lua module to classify and tag messages as spam or non-spam (ham) according to the score they get, or according to the white/blacklists, if any:

$ make install_spamfilter

The spamfilter files are installed in /usr/local/osbf-lua by default. If the dir doesn't exist, it will be created.

The next step is to configure your email account to use the spamfilter, for example with a procmail recipe like the following in your .procmailrc:

# set OSBF_LUA_DIR to where spamfilter.lua, spamfilter_command.lua etc. were installed
OSBF_LUA_DIR=/usr/local/osbf-lua # change '/usr/local' to your PREFIX
OSBF_LUA_USER_DIR=$HOME/osbf-lua

# let the Lua interpreter find the "osbf" module.
# uncomment if you installed a local copy of the osbf module (e.g. no root access)
#LUA_CPATH="$HOME/lib/?.so;;"

:0fw: .msgid.lock
* < 350000 # don't check messages greater than 350000 bytes
| $OSBF_LUA_DIR/spamfilter.lua --udir $OSBF_LUA_USER_DIR

Check your installation by sending a message to yourself with the following command in the subject line:

help <your password>

You should receive a message with help on the spamfilter. Then, send another command in the subject line to verify that the databases were created correctly:

stats <your password>

You should get a statistics report on the newly created databases.

From now on, every message you receive that is smaller than the maximum size specified in the procmail recipe will be classified and tagged according to its score:


Tag     Meaning
[--]    almost certainly spam: score <= -20
[-]     probably spam (reinforcement zone): -20 < score < 0
[+]     probably not spam (reinforcement zone): 0 <= score < 20
[++]    almost certainly not spam: score >= 20. This tag is listed only for symmetry and is not used; an empty tag is used in its place so as not to pollute the messages.
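The thresholds in the table can be read as a simple mapping from score to tag. The Python sketch below is only an illustration of that mapping, not the code in spamfilter.lua; the function names and the `zone` parameter are hypothetical, with 20 as the limit used in the table.

```python
def score_to_tag(score, zone=20):
    """Map a classification score to the subject tag from the table."""
    if score <= -zone:
        return "[--]"  # almost certainly spam
    if score < 0:
        return "[-]"   # probably spam, inside the reinforcement zone
    if score < zone:
        return "[+]"   # probably not spam, inside the reinforcement zone
    return ""          # almost certainly not spam: empty tag instead of "[++]"


def in_reinforcement_zone(score, zone=20):
    """True when the message is a candidate for reinforcement training."""
    return -zone < score < zone


for s in (-35.0, -4.2, 0.0, 19.9, 20.0):
    print(s, repr(score_to_tag(s)), in_reinforcement_zone(s))
```

Passing a smaller `zone`, such as 10, narrows both tag boundaries at once in this sketch.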

If the classification is wrong, you must train the filter by replying to the message (use "Reply", not "Forward"), sending it back to yourself with the subject replaced by the corresponding training command:

learn <password> spam

or

learn <password> nonspam

The body of the message may be erased; it is not required. The original message, temporarily saved on the server by the spamfilter, will be recovered through the SFID (Spam Filter ID), a special mark added to the header when the message arrived at the server.

If you make a mistake, you should undo the training with the command "unlearn". Ex:

unlearn <password> spam

if wrongly trained as spam.
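The subject-line commands shown so far share one shape: a command word, the password, and for training commands a class name. The Python sketch below parses just those forms as an illustration. It is not the parser used by the spamfilter: the function name and the returned dictionary are hypothetical, "unlearn <password> nonspam" is assumed by symmetry, and password checking is left out.

```python
def parse_command(subject):
    """Parse a subject-line command; return a dict, or None if not a command."""
    parts = subject.split()
    # "help <password>" and "stats <password>"
    if len(parts) == 2 and parts[0] in ("help", "stats"):
        return {"command": parts[0], "password": parts[1]}
    # "learn <password> spam|nonspam" and "unlearn <password> spam|nonspam"
    if (len(parts) == 3
            and parts[0] in ("learn", "unlearn")
            and parts[2] in ("spam", "nonspam")):
        return {"command": parts[0], "password": parts[1], "class": parts[2]}
    return None


print(parse_command("learn s3cret spam"))
print(parse_command("Re: lunch tomorrow?"))
```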

Training when the classification is wrong is essential for accuracy. Training on messages that fall in the reinforcement zone, called reinforcement, is highly recommended for raising the accuracy and keeping it high. Once you have a well-trained filter, say 99% accuracy or better, you may want to reduce the reinforcement zone, e.g. to [-10, 10], so that you don't have to do many reinforcements a day. You may change the reinforcement zone, tags, etc. by editing the file spamfilter_config.lua.

As of version 2.0.2, there's a new feedback mechanism, based on an HTML form report sent to the user, which completely removes the nuisance of sending a training command in the subject line for each message. The training report shows a table of the messages in the cache dir with scores between -20 and +20, listing Date, From, Subject and a list of actions for each message, up to 50 messages per report. The user selects the proper action for each message and clicks the "Send Actions" button to generate a pre-formatted training message, ready to be sent.

The training report can be sent to the users at scheduled times, from a cron job. After 10 learnings on each class, the training report suggests the proper training action for each message, based on its score, and colors each line of the table in red when the suggested action is "Train as Spam", or in blue for "Train as Non-spam". This makes the training process even easier because, for well-trained databases, most of the time the user doesn't have to do anything more than click the "Send Actions" button. See the script cache_report.lua for details.
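The selection and suggestion rules described above can be sketched as follows. This Python example is only an illustration and is not derived from cache_report.lua: the function name, the input shapes and the row dictionaries are hypothetical, the score limits are treated as exclusive, and it assumes that a negative score leads to the "Train as Spam" suggestion.

```python
def report_rows(messages, learnings, zone=20, max_rows=50, min_learnings=10):
    """Build the rows of a training report.

    messages:  iterable of (sfid, score) pairs for the cached messages
    learnings: dict with the number of learnings per class,
               e.g. {"spam": 12, "nonspam": 30}
    """
    # Suggestions start only after enough learnings on both classes.
    suggest = all(learnings.get(c, 0) >= min_learnings
                  for c in ("spam", "nonspam"))
    rows = []
    for sfid, score in messages:
        if not -zone < score < zone:
            continue  # outside the reported score range
        action, color = None, None
        if suggest:
            if score < 0:
                action, color = "Train as Spam", "red"
            else:
                action, color = "Train as Non-spam", "blue"
        rows.append({"sfid": sfid, "score": score,
                     "action": action, "color": color})
        if len(rows) == max_rows:
            break  # at most max_rows messages per report
    return rows


cached = [("id-1", -25.0), ("id-2", -5.0), ("id-3", 3.0), ("id-4", 40.0)]
for row in report_rows(cached, {"spam": 10, "nonspam": 12}):
    print(row)
```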

Credits

The OSBF-Lua lib and spamfilter.lua were designed and implemented by Fidelis Assis, who holds the primary copyright.

The OSB technique, as well as the OSBF classify and learn code, is based on the OSB and OSBF I originally developed for the CRM114 project, as a derivative work based on Bill Yerazunis' CRM114 Markovian classifier. Bill Yerazunis holds the secondary copyright on the OSBF-Lua lib.

Contact

For more information please email me. Comments are welcome!