There's a wonderful little program called dbacl, which stands for "digramic Bayesian classifier." It's nicely documented and installs without much trouble under OS X. I hit only one installation hiccup: a weird error message about a file called "missing" being missing, but it didn't interfere with the installation. One perplexing trait of the program: when I copy example commands from the documentation, they sometimes fail. The reason is that some examples put options after the filename, like:
% dbacl -c one -c two -c three sample4.txt -N
The solution is simply to rearrange the flags so the filename comes last.
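For example, the documentation's command above should work once the trailing option is moved in front of the filename (the category names and sample file here are the ones from the documentation's own example):

```shell
# Same command as above, with -N moved so the filename comes last.
# "one", "two", "three" and sample4.txt come from the dbacl docs' example.
dbacl -N -c one -c two -c three sample4.txt
```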
I wrote this little (specific purpose) pair of scripts for applying Bayesian filtering to Facebook updates (as pulled from the RSS feed).
The training script looks like this:
#!/usr/bin/perl
@lines=`curl --fail "http://www.facebook.com/feeds/friends_status.php?id=[fill-in]&key=[the-blanks]&format=rss20" -A "Mozilla/4.0"|grep title|grep -v "s Friends"|perl -p -i -e "s/^\s+//"|perl -p -i -e "s/'/\'/g"|perl -p -i -e "s/ \'s/'s/g"|perl -p -i -e "s/\&/\&/g"|perl -p -i -e "s/\
$DBACL_PATH="/Users/surly/.dbacl";
#TODO: capture keypresses instantly with Term::ReadKey;
$i=0;
while($lines[$i])
{ # strip the closing tag before classifying, so the text matches what we train on
$lines[$i]=~s/\s?\<\/title\>//g;
$cat=`echo "$lines[$i]"|dbacl -v -c ok -c bad -c urgent`;
if ($cat=~/ok/) { print " $lines[$i]";}
elsif ($cat=~/bad/) { print "-$lines[$i]";}
elsif ($cat=~/urgent/) { print "*$lines[$i]";}
#categorization prediction, done
print "Categorize as: o)k, b)ad, u)rgent:\n";
$cat=<>;$cat=~s/\n//;
if ($cat=~/o/)
{ `echo "$lines[$i]">>$DBACL_PATH/ok.d`;
print "Categorized as ok.\n";}
elsif ($cat=~/b/)
{ `echo "$lines[$i]">>$DBACL_PATH/bad.d`;
print "Categorized as bad.\n";}
elsif ($cat=~/u/)
{ `echo "$lines[$i]">>$DBACL_PATH/urgent.d`;
print "Categorized as urgent.\n";}
$i++;
}
It marks each status update with a symbol indicating how the filter would categorize it, based on the training data collected so far.
dbacl has one counter-intuitive feature: it has no memory. You must keep all of the text you want to train the filters on (here, in the files ok.d, bad.d, and urgent.d) and refresh the filters by retraining on the entire corpus, as the script below does. Briefly, it fetches the updates, trains the filters (using regular expressions to define the Bayesian features, specifically so I can filter on the @somebody grammar that Twitter uses), and then displays all the statuses, using ANSI escape codes to emphasize or de-emphasize them based on their categorization.
#!/usr/bin/perl
@lines=`curl --fail "http://www.facebook.com/feeds/friends_status.php?id=[fill-in]&key=[the-blanks]&format=rss20" -A "Mozilla/4.0"|grep title|grep -v "s Friends"|perl -p -i -e "s/^\s+//"|perl -p -i -e "s/'/\'/g"|perl -p -i -e "s/ \'s/'s/g"|perl -p -i -e "s/\&/\&/g"|perl -p -i -e "s/\
$DBACL_PATH="/Users/surly/.dbacl";
`dbacl -l $DBACL_PATH/ok -g "^([[:alpha:]]+)" -g "[^[:alpha:]]([[:alpha:]]+)" -g "^(@[[:alpha:]]+)" -g "[^[:alpha:]](@[[:alpha:]]+)" $DBACL_PATH/ok.d`;
`dbacl -l $DBACL_PATH/bad -g "^([[:alpha:]]+)" -g "[^[:alpha:]]([[:alpha:]]+)" -g "^(@[[:alpha:]]+)" -g "[^[:alpha:]](@[[:alpha:]]+)" $DBACL_PATH/bad.d`;
`dbacl -l $DBACL_PATH/urgent -g "^([[:alpha:]]+)" -g "[^[:alpha:]]([[:alpha:]]+)" -g "^(@[[:alpha:]]+)" -g "[^[:alpha:]](@[[:alpha:]]+)" $DBACL_PATH/urgent.d`;
$i=0;
while($lines[$i])
{ $lines[$i]=~s/\s?\<\/title\>//g;
$cat=`echo "$lines[$i]"|dbacl -v -c ok -c bad -c urgent`;
if ($cat=~/ok/)
{ print "$lines[$i]";}
elsif ($cat=~/bad/)
{ print "\033[0;37;48m$lines[$i]";
print "\033[0m";}
elsif ($cat=~/urgent/)
{ print "\033[0;34;48m$lines[$i]";
print "\033[0m";}
$i++;
}
This is a first cut. The dbacl documentation suggests avoiding two-word combinations when the corpus is small, because of the risk of overfitting. There's still much to learn about the art of Bayesian filtering.
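For intuition about what a filter like this is doing under the hood, here is a minimal sketch of naive Bayes classification over single-word features with add-one smoothing. The category names and toy training lines are invented for illustration, and dbacl's actual model is considerably more sophisticated than this:

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (category, text) pairs. Returns per-category word counts."""
    counts = {}
    for cat, text in docs:
        counts.setdefault(cat, Counter()).update(text.lower().split())
    return counts

def classify(counts, text):
    """Pick the category with the highest smoothed log-likelihood."""
    vocab = set(w for c in counts.values() for w in c)
    best, best_score = None, -math.inf
    for cat, c in counts.items():
        total = sum(c.values())
        # add-one smoothing so unseen words don't zero out a category
        score = sum(
            math.log((c[w] + 1) / (total + len(vocab)))
            for w in text.lower().split()
        )
        if score > best_score:
            best, best_score = cat, score
    return best

docs = [
    ("ok", "had a nice walk in the park"),
    ("bad", "stuck in traffic again ugh"),
    ("urgent", "call me back as soon as possible"),
]
counts = train(docs)
print(classify(counts, "traffic was awful"))  # prints "bad"
```

With a corpus this tiny, even single words barely have enough statistics; pairs of words would be seen at most once each, which is exactly the overfitting the dbacl documentation warns about.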
Think of the possibilities for using Bayesian filtering on things beyond spam filtering. Imagine using it to filter through social networks or music (see TheFilter). Imagine a web browser that watches what links you click on and which pages you bookmark (or mark as "good") and then highlights links that it predicts you will like, possibly based on pre-fetching and evaluating the link contents. I love the feeling of unbounded potential.
Further reading:
A much improved status update reader with Bayesian filtering