Text classification module for the
Winner of TREC's Spam Track 2006
OSBF-Lua (Orthogonal Sparse Bigrams with confidence Factor) is a Lua C module for text classification. It is a port of the OSBF classifier implemented in the CRM114 project. This implementation focuses on the classification task itself by using Lua, a powerful yet lightweight and fast language, as the scripting language, which makes it easier to build and test more elaborate filters and training methods.
The OSBF algorithm is a typical Bayesian classifier enhanced with two techniques that I originally developed for the CRM114 project: Orthogonal Sparse Bigrams (OSB), for feature extraction, and Exponential Differential Document Count (EDDC, a.k.a. Confidence Factor), for automatic feature selection. Combined, these two techniques produce a highly accurate classifier. OSBF was developed with two classes in mind, SPAM and NON-SPAM, so its performance with more than two classes may not be the same.
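To give an idea of what OSB feature extraction produces, here is a minimal sketch in plain Lua. It is not the code used by the osbf module: the whitespace tokenizer, the 5-token window, the "<skip>" gap marker and the function names are all assumptions made for illustration, and a real implementation would typically hash the features rather than keep them as strings.

```lua
-- Sketch of Orthogonal Sparse Bigram (OSB) feature extraction.
-- Assumptions: whitespace tokenization, a 5-token window and "<skip>" as
-- the gap marker. None of these names come from the OSBF-Lua API.
local function tokenize(text)
  local tokens = {}
  for word in string.gmatch(text, "%S+") do
    tokens[#tokens + 1] = word
  end
  return tokens
end

local function osb_features(text, window)
  window = window or 5
  local tokens = tokenize(text)
  local features = {}
  for i = 2, #tokens do
    -- Pair the newest token with each earlier token still inside the window,
    -- recording the distance between them as a run of "<skip>" markers.
    for dist = 1, window - 1 do
      local j = i - dist
      if j < 1 then break end
      local gap = string.rep("<skip> ", dist - 1)
      features[#features + 1] = tokens[j] .. " " .. gap .. tokens[i]
    end
  end
  return features
end

local feats = osb_features("buy cheap pills now")
for _, f in ipairs(feats) do print(f) end
```

Each token is paired only with the tokens that precede it in the window, so every feature is a bigram that differs from the others in either its words or its gap length.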
spamfilter.lua is an anti-spam filter written in Lua using the OSBF-Lua module. It takes special advantage of EDDC to introduce TONE-HR, a highly effective training method. The combination of OSB, EDDC and TONE-HR to enhance a classical Bayesian classifier resulted in the best spam filtering performance in TREC's Spam Track 2006 and in the CEAS 2008 Live Challenge.
The Confidence Factor was officially introduced in the paper "Exponential Differential Document Count - A Feature Selection Factor for Improving Bayesian Filters Accuracy", presented at the MIT Spam Conference 2006, after being in experimental use for more than a year in both projects, CRM114 and OSBF-Lua. The conference slides are also available.
The OSB technique was officially announced in the paper "Combining Winnow with Orthogonal Sparse Bigrams for Incremental Spam Filtering", a work headed and presented by Christian Siefkes at the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), in September 2004.
The CRM114 implementation of OSBF was one of the classifiers submitted to TREC's Spam Track 2005 by the CRM114 team, but its first results were not good because of a bug. Later, the bug was fixed and the OSBF-Lua version was submitted to the track coordinator, Prof. Gordon Cormack, for an extra evaluation. The new results were comparable to those of the best participants, with the advantage of being 5 to 10 times faster. Our notebook paper comments on the results of the four filters submitted by the CRM114 team: OSBF, Winnow, OSB Unique and OSB.
OSBF-Lua is free software, released under the GPL version 2. You can get a copy of the license at GPL. The distribution also includes a copy of the license in the file gpl.txt.
See the full CHANGES
The sources can be downloaded from LuaForge.
OSBF-Lua requires Lua 5.1 installed with dynamic loading enabled. It was developed and tested under the work, alpha, beta and final releases of Lua 5.1, and probably won't work with earlier versions.
Install Lua with dynamic loading enabled:
For Linux, execute "make linux" and "make install". For other operating systems, read the instructions in the INSTALL file and your OS documentation on how to create shared libraries.
You might want to change the occurrences of the O2 flag in CFLAGS to O3, in all makefiles, for increased speed.
Install the OSBF-Lua module:
edit the "config" file to suit your platform (not necessary for Linux) or to change the default installation PREFIX dir (/usr/local).
If you want to install in the default dir, you must be root to do the "make install" step. If you don't have root access, you may set PREFIX to point to a dir you have write access to, for instance $HOME/lib. You then need to add the new installation dir to LUA_CPATH so that the Lua loader can find osbf.so.
Ex: Installing in $HOME/lib
<edit config and set PREFIX to $HOME/lib>
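If you prefer not to depend on the environment, the search path can also be extended from inside a Lua script through package.cpath, which is what LUA_CPATH initializes. The sketch below assumes that osbf.so ends up directly in $HOME/lib; the helper name add_to_cpath is made up for this example, and the directory should be adjusted to your own PREFIX.

```lua
-- Sketch: make a locally installed osbf.so visible to require().
-- Assumption: osbf.so sits directly in $HOME/lib; change the directory to
-- match the PREFIX you actually used.
local function add_to_cpath(dir)
  -- "?" is replaced by the module name when the loader searches for it.
  local pattern = dir .. "/?.so"
  -- Plain (non-pattern) search, so "?" and "." are matched literally.
  if not string.find(package.cpath, pattern, 1, true) then
    package.cpath = pattern .. ";" .. package.cpath
  end
  return package.cpath
end

local home = os.getenv("HOME") or "."
add_to_cpath(home .. "/lib")

-- pcall keeps the script alive if the module is not installed yet.
local ok = pcall(require, "osbf")
print(ok and "osbf module loaded" or "osbf module not found")
```

Prepending the directory means a local copy takes precedence over a system-wide one, which is usually what you want when testing a new build.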
After the osbf module is properly installed, you may want to install the spamfilter, a Lua script that uses the OSBF-Lua module to classify and tag messages as spam or non-spam (ham) according to the score they get, or to the white/blacklists, if any:
The spamfilter files are installed in /usr/local/osbf-lua. If the dir doesn't exist, it will be created.
The next step is to configure your email account to use the spamfilter:
do the following steps under your account, not as root
create your local osbf-lua dir:
create your log and cache dirs:
Note: Old messages in the cache dir should be deleted regularly, typically from a cron job, to preserve disk space. Check Christian Siefkes' trainfilter for his clean-up script.
copy the spamfilter config file to your dir:
edit spamfilter_config.lua to set your password
change the current dir to your osbf-lua dir and create the spamfilter databases (spam.cfc and nonspam.cfc)
lua /usr/local/osbf-lua/create_databases.lua # change '/usr/local' to your PREFIX
add the following lines to your .procmailrc
set OSBF_LUA_DIR to where spamfilter.lua, spamfilter_command.lua etc
# change '/usr/local' to your PREFIX
let the Lua interpreter find the "osbf" module.
uncomment if you installed a local copy of the osbf module (e.g. no
< 350000 # don't check messages greater than 350000 bytes
Note: The "osbf-lua" dir and all files and dirs under it must be writable by the user or group that procmail runs under.
Check your installation by sending a message to yourself with the following command in the subject line:
You should receive a help message for the spamfilter. Then, send another command in the subject line to verify that the databases were created correctly:
You should get a statistics report on the newly created databases.
From now on, every message you receive that is smaller than the maximum size specified in the procmail recipe will be classified and tagged according to the score it gets:
almost certainly spam - score <= -20
probably spam (reinforcement zone) - score < 0 and > -20
probably not spam (reinforcement zone) - score >= 0 and < 20
almost certainly not spam - score >= 20. This tag is listed just for symmetry; it is not used. An empty tag is used in its place so as not to pollute the messages.
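The four score ranges above can be expressed as a small function. This is only a sketch of the ranges as listed, not code taken from spamfilter.lua; the function name and the descriptive zone names are made up for the example, and the real tags and limits are whatever spamfilter_config.lua defines.

```lua
-- Sketch of the four score zones listed above.
-- The zone names are descriptive labels invented for this example; they are
-- not the tags that the spamfilter writes into messages.
local function score_zone(score, threshold)
  threshold = threshold or 20
  if score <= -threshold then
    return "spam"
  elseif score < 0 then
    return "probable spam (reinforcement zone)"
  elseif score < threshold then
    return "probable non-spam (reinforcement zone)"
  else
    return "non-spam"
  end
end

for _, s in ipairs({-35.2, -20, -3.1, 0, 19.9, 20}) do
  print(s, score_zone(s))
end
```

Note that a score of exactly 0 falls on the non-spam side and that both limits of the reinforcement zone are exclusive, matching the ranges in the list.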
If the classification is wrong, you must train the filter by replying to the message (it must be a "Reply", not a "Forward"), addressing the reply back to yourself and replacing the subject with the corresponding training command:
The body of the message may be erased; it is not required. The original message, temporarily saved on the server by the spamfilter, will be recovered through the SFID (Spam Filter ID), a special mark added to the header when the message arrived at the server.
If you make a mistake, you should undo the training with the command "unlearn". Ex:
if wrongly trained as spam.
Training when the classification is wrong is essential for accuracy. Training when the score is in the reinforcement zone, called reinforcement, is highly recommended for increasing accuracy and keeping it high. After you have a well trained filter, say 99% accuracy or better, you may want to reduce the reinforcement zone, e.g. to [-10, 10], so as not to do many reinforcements a day. You may change the reinforcement zone, tags, etc., by editing the file spamfilter_config.lua.
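The training policy described here (always train on errors, and reinforce correct classifications that fall inside the reinforcement zone) can be sketched as follows. This illustrates the rule as stated in the text and is not the TONE-HR code from spamfilter.lua; the function name and the returned strings are hypothetical.

```lua
-- Sketch: decide whether a message should be used for training.
-- score     : the filter's output, negative meaning spam
-- is_spam   : the user's verdict (boolean)
-- threshold : half-width of the reinforcement zone (20 by default here)
local function training_action(score, is_spam, threshold)
  threshold = threshold or 20
  local classified_as_spam = score < 0
  if classified_as_spam ~= is_spam then
    return "learn"      -- wrong classification: training is essential
  elseif math.abs(score) < threshold then
    return "reinforce"  -- right, but inside the reinforcement zone
  else
    return "none"       -- right and confident: nothing to do
  end
end

print(training_action(5, true))        -- spam that scored as non-spam
print(training_action(-5, true))       -- spam, but with a weak score
print(training_action(15, false, 10))  -- ham, outside a reduced zone
```

Passing a smaller threshold, such as 10, models the reduced reinforcement zone suggested for well trained filters: fewer correctly classified messages then ask for reinforcement.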
As of version 2.0.2, there's a new feedback mechanism, based on an HTML form report sent to the user, which completely removes the nuisance of sending training commands in the subject line for each message. The training report shows a table of the messages in the cache dir with scores between -20 and +20, listing Date, From, Subject and a list of actions for each message, up to 50 messages per report. The user selects the proper action for each message and clicks on the "Send Actions" button to generate a pre-formatted training message, ready to be sent.
The training report can be sent to the users at scheduled times, from a cron job. After 10 learnings on each class, the training report suggests the proper training action for each message, based on its score, and colors each line of the table in red when the suggested action is "Train as Spam", or in blue for "Train as Non-spam". This makes the training process even easier because, for well trained databases, most of the time the user doesn't have to do anything more than click the "Send Actions" button. See the script cache_report.lua for details.
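The selection and suggestion rules for the training report can be sketched like this. It is not an excerpt from cache_report.lua: the table layout, the function name and the exclusive score limits are assumptions, and mapping negative scores to "Train as Spam" is inferred from the score ranges used by the filter.

```lua
-- Sketch: choose the rows of a training report.
-- messages  : list of { subject = ..., score = ... } (hypothetical layout)
-- learnings : { spam = n, nonspam = n }, trainings done so far per class
local function build_report(messages, learnings, limit, max_rows)
  limit = limit or 20        -- only scores strictly inside (-limit, limit)
  max_rows = max_rows or 50  -- at most this many messages per report
  -- Suggestions start only after 10 learnings on each class.
  local suggest = learnings.spam >= 10 and learnings.nonspam >= 10
  local rows = {}
  for _, msg in ipairs(messages) do
    if #rows >= max_rows then break end
    if math.abs(msg.score) < limit then
      local row = { subject = msg.subject, score = msg.score }
      if suggest then
        if msg.score < 0 then
          row.action, row.color = "Train as Spam", "red"
        else
          row.action, row.color = "Train as Non-spam", "blue"
        end
      end
      rows[#rows + 1] = row
    end
  end
  return rows
end

local sample = {
  { subject = "Cheap pills",    score = -4.2 },
  { subject = "Meeting notes",  score = 35.0 },
  { subject = "Re: your order", score = 7.5 },
}
for _, r in ipairs(build_report(sample, { spam = 12, nonspam = 15 })) do
  print(r.subject, r.score, r.action, r.color)
end
```

With fewer than 10 learnings in either class the same rows are returned, but without a suggested action or color, which leaves the choice entirely to the user.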
The OSBF-Lua lib and spamfilter.lua were designed and implemented by Fidelis Assis, who holds the primary copyright.
The OSB technique, as well as the OSBF classify and learn code, is based on the OSB and OSBF I originally developed for the CRM114 project, as a derivative work based on Bill Yerazunis' CRM114 Markovian classifier. Bill Yerazunis holds the secondary copyright on the OSBF-Lua lib.
For more information please email me. Comments are welcome!