Sunday, May 03, 2009

General-purpose command-line Bayesian filter

The idea of using Bayesian classification (the kind of probabilistic estimation that guesses whether an e-mail is spam based on how well its features match the features of things you have previously called "spam") to filter things like Twitter status updates is now out there (via Tao of Mac). As I often view such updates from the command line, I decided to give it a shot.

There's a wonderful little program called dbacl, which stands for digramic Bayesian classifier. It's really nicely documented and installs without much trouble under OS X. I only had one installation issue: a weird error message about a file called "missing" being missing, but it did not interfere with installation. One perplexing trait of the program: when I copied example commands from the documentation, they sometimes failed. The reason is that the examples put options after the filename, like:
% dbacl -c one -c two -c three sample4.txt -N
The solution was to rearrange the flags so the filename comes last:
% dbacl -N -c one -c two -c three sample4.txt

I wrote this little special-purpose pair of scripts to apply Bayesian filtering to Facebook updates (as pulled from the RSS feed).

The training script looks like this:

#!/usr/bin/perl
$DBACL_PATH="/Users/surly/.dbacl";
#TODO: capture keypresses instantly with Term::ReadKey;
@lines=`curl --fail "http://www.facebook.com/feeds/friends_status.php?id=[fill-in]&key=[the-blanks]&format=rss20" -A "Mozilla/4.0"|grep title|grep -v "s Friends"|perl -p -e "s/^\s+//"|perl -p -e "s/&apos;/'/g"|perl -p -e "s/ 's/'s/g"|perl -p -e "s/&amp;/&/g"|perl -p -e "s/<title>//g"`;
$i=0;
while($lines[$i])
{ $lines[$i]=~s/\s?\<\/title\>//g;   # strip trailing </title> left over from the feed
# predict a category using the current filters
$cat=`echo "$lines[$i]"|dbacl -v -c ok -c bad -c urgent`;
if ($cat=~/ok/) { print " $lines[$i]";}
elsif ($cat=~/bad/) { print "-$lines[$i]";}
elsif ($cat=~/urgent/) { print "*$lines[$i]";}
#categorization prediction shown; now ask for the real category
print "Categorize as: o)k, b)ad, u)rgent:\n";
$cat=<>;$cat=~s/\n//;
if ($cat=~/o/)
{ `echo "$lines[$i]">>$DBACL_PATH/ok.d`;
print "Categorized as ok.\n";}
elsif ($cat=~/b/)
{ `echo "$lines[$i]">>$DBACL_PATH/bad.d`;
print "Categorized as bad.\n";}
elsif ($cat=~/u/)
{ `echo "$lines[$i]">>$DBACL_PATH/urgent.d`;
print "Categorized as urgent.\n";}
$i++;
}

It marks each status update with a symbol indicating what the prediction would be, based on the existing training data.
dbacl has one counter-intuitive trait: it has no memory. One must keep all the text one wants to train the filters on (here, in the files ok.d, bad.d, and urgent.d) and then refresh each filter by retraining it on the entire corpus, as is done in the script below. Briefly, it fetches the updates, trains the filters (using regular expressions to define the Bayesian features, specifically so I can filter on the @somebody grammar that Twitter uses), and then displays all the statuses, using ANSI escape codes to emphasize or de-emphasize them based on their categorization.
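To make the retrain-from-corpus idea concrete, here is a minimal naive Bayes sketch in Python. It is a hypothetical illustration of the workflow (keep category files, rebuild models from scratch each run), not a description of dbacl's internals; the function names and smoothing choice are my own.

```python
import math
from collections import Counter

def train(corpus):
    """Rebuild every category model from the full corpus each time,
    mirroring the dbacl workflow of keeping ok.d/bad.d/urgent.d
    and re-running the learner over them.
    corpus maps category name -> list of example lines."""
    models = {}
    for cat, lines in corpus.items():
        words = Counter(w for line in lines for w in line.lower().split())
        models[cat] = (words, sum(words.values()))
    return models

def classify(models, line, alpha=1.0):
    """Return the category with the highest Laplace-smoothed log score."""
    vocab = set(w for words, _ in models.values() for w in words)
    best, best_score = None, -math.inf
    for cat, (words, total) in models.items():
        score = 0.0
        for w in line.lower().split():
            score += math.log((words[w] + alpha) / (total + alpha * len(vocab)))
        if score > best_score:
            best, best_score = cat, score
    return best
```

Because training starts from nothing every run, appending a line to the right category file is all the "memory" the system needs.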


#!/usr/bin/perl
$DBACL_PATH="/Users/surly/.dbacl";
@lines=`curl --fail "http://www.facebook.com/feeds/friends_status.php?id=[fill-in]&key=[the-blanks]&format=rss20" -A "Mozilla/4.0"|grep title|grep -v "s Friends"|perl -p -e "s/^\s+//"|perl -p -e "s/&apos;/'/g"|perl -p -e "s/ 's/'s/g"|perl -p -e "s/&amp;/&/g"|perl -p -e "s/<title>//g"`;

`dbacl -l $DBACL_PATH/ok -g "^([[:alpha:]]+)" -g "[^[:alpha:]]([[:alpha:]]+)" -g "^(@[[:alpha:]]+)" -g "[^[:alpha:]](@[[:alpha:]]+)" $DBACL_PATH/ok.d`;
`dbacl -l $DBACL_PATH/bad -g "^([[:alpha:]]+)" -g "[^[:alpha:]]([[:alpha:]]+)" -g "^(@[[:alpha:]]+)" -g "[^[:alpha:]](@[[:alpha:]]+)" $DBACL_PATH/bad.d`;
`dbacl -l $DBACL_PATH/urgent -g "^([[:alpha:]]+)" -g "[^[:alpha:]]([[:alpha:]]+)" -g "^(@[[:alpha:]]+)" -g "[^[:alpha:]](@[[:alpha:]]+)" $DBACL_PATH/urgent.d`;
$i=0;
while($lines[$i])
{ $lines[$i]=~s/\s?\<\/title\>//g;   # strip trailing </title> left over from the feed
# classify with the freshly trained filters
$cat=`echo "$lines[$i]"|dbacl -v -c ok -c bad -c urgent`;
if ($cat=~/ok/)
{ print "$lines[$i]";}
elsif ($cat=~/bad/)
{ print "\033[0;37;48m$lines[$i]";   # de-emphasize in gray
print "\033[0m";}
elsif ($cat=~/urgent/)
{ print "\033[0;34;48m$lines[$i]";   # emphasize in blue
print "\033[0m";}
$i++;
}
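The -g patterns in the training calls above define two kinds of features: plain alphabetic words, and @mention tokens kept with their @ sign (so "@bob" and "bob" are distinct features). A rough Python equivalent of that extraction, as a sketch (the function name is mine, and this approximates rather than reproduces dbacl's tokenizer):

```python
import re

def extract_features(line):
    """Pull out the two feature classes the dbacl -g patterns target:
    @mentions at the start of the line or after a non-letter, and
    plain alphabetic words. Note the word pattern also matches the
    name inside an @mention, just as dbacl's does."""
    mentions = re.findall(r'(?:^|(?<=[^A-Za-z]))@[A-Za-z]+', line)
    words = re.findall(r'(?:^|(?<=[^A-Za-z]))[A-Za-z]+', line)
    return words + mentions
```

Keeping the @ attached lets the filter learn, say, that updates mentioning a particular person tend to be urgent, independently of other uses of that name.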


This is a first cut. The dbacl documentation suggests not using two-word combinations when the corpus is small, because of the risk of overfitting. Much remains to be learned about the art of Bayesian filtering.
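A toy illustration of why two-word features overfit on a tiny corpus: in a handful of updates, almost every bigram occurs exactly once, so a model leaning on bigrams mostly memorizes the training set. (This is my own example, not dbacl output.)

```python
from collections import Counter

updates = [
    "heading to the beach",
    "stuck at the office",
    "the weekend starts now",
    "back at the office again",
]

# Count single words and adjacent word pairs across the corpus.
unigrams = Counter(w for u in updates for w in u.split())
bigrams = Counter(
    (a, b) for u in updates for a, b in zip(u.split(), u.split()[1:])
)

# Fraction of features seen only once: a singleton feature carries
# no reusable evidence for classifying new text.
uni_singletons = sum(1 for c in unigrams.values() if c == 1) / len(unigrams)
bi_singletons = sum(1 for c in bigrams.values() if c == 1) / len(bigrams)
```

Even in this four-line corpus the bigram singleton rate is higher than the unigram one, and the gap grows as features get longer.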

Think of the possibilities for Bayesian filtering beyond spam. Imagine using it to filter social networks or music (see TheFilter). Imagine a web browser that watches which links you click and which pages you bookmark (or mark as "good"), then highlights links it predicts you will like, possibly by pre-fetching and evaluating the link contents. I love the feeling of unbounded potential.



Further reading:
A much improved status update reader with Bayesian filtering
