Saturday, May 09, 2009

In search of the ideal Twitter command-line client

After testing out a bunch of candidates, my new default command-line Twitter client is Twyt. Do not be fooled by the spartan front page. Active development is taking place. It's written in Python and (as of version 0.9.2) has the virtue of offering the following command-line options for the "check stati of friends" command:

-s SINCE, --since=SINCE
The date or ID of a message to list status messages
-P PAGE, --page=PAGE Lists the PAGEth page of status updates
-n COUNT, --count=COUNT
The number of statuses to retrieve (max 200)

Unfortunately, it seems to be suffering from the same malady as other programs: No matter what value I put in for the --since flag, I still get the same 20 stati... The --page flag does work though, and the --count flag is doing something...

However, I also like the twixer client because it has implemented a few commands that I have not found in any other command line program. According to its command line options, it can 1) list your friends, 2) list your followers, 3) show the little information box on a user, and 4) delete stati and direct messages.

Unfortunately (and maybe I am doing something wrong here), but as of version 0.1.1, it will only show the little information box on me, no matter who I put in. But status deletion works great!

Sunday, May 03, 2009

General-purpose command-line Bayesian filter

The idea for using Bayesian classification (e.g., the kind of probabilistic estimation of whether an e-mail is spam, based on how well its features match the features of things you have previously called "spam") for filtering things like Twitter status update is now out there (via Tao of Mac). As I often view such updates from the command line, I decided to give it a shot.

There's a wonderful little program called dbacl which stands for digramic Bayesian classifier. It's really nicely documented and installs without much trouble under OS X. I only had one installation issue: There was some weird error message about a file called "missing" being missing, but this did not interfere with installation. One perplexing trait of the program: When I copy example commands from the documentation, they sometimes fail. The reason is that the examples had options on the end of the command, like:
% dbacl -c one -c two -c three sample4.txt -N
The solution was to just rearrange the flags so the filename is at the end.

I wrote this little (specific purpose) pair of scripts for applying Bayesian filtering to Facebook updates (as pulled from the RSS feed).

The training script looks like this:

#TODO: capture keypresses instantly with Term::ReadKey;
@lines=`curl --fail "[fill-in]&key=[the-blanks]&format=rss20" -A "Mozilla/4.0"|grep title|grep -v "s Friends"|perl -p -i -e "s/^\s+//"|perl -p -i -e "s/'/\'/g"|perl -p -i -e "s/ \'s/'s/g"|perl -p -i -e "s/\&/\&/g"|perl -p -i -e "s/\//g"`;
{ $cat=`echo "$lines[$i]"|dbacl -v -c ok -c bad -c urgent`;
if ($cat=~/ok/) { print " $lines[$i]";}
elsif ($cat=~/bad/) { print "-$lines[$i]";}
elsif ($cat=~/urgent/) { print "*$lines[$i]";}
#categorization prediction, done
print "Categorize as: o)k, b)ad, u)rgent:\n";
if ($cat=~/o/)
{ `echo "$lines[$i]">>$DBACL_PATH/ok.d`;
print "Categorized as ok.\n";}
elsif ($cat=~/b/)
{ `echo "$lines[$i]">>$DBACL_PATH/bad.d`;
print "Categorized as bad.\n";}
elsif ($cat=~/u/)
{ `echo "$lines[$i]">>$DBACL_PATH/urgent.d`;
print "Categorized as urgent.\n";}

It marks each status update with a symbol to indicate what the prediction would be based on the existing set of information.
dbacl has one counter-intuitive feature; it does not have any memory. One must keep all the information that one wants to train the filters on (here, in the files ok.d, bad.d, and urgent.d) and then refresh the filter by training it on the entire corpus, as is done in the script below. Briefly, it fetches the updates, trains the filters (using regular expressions for defining Bayesian features, specifically so I can filter on the @somebody grammar that Twitter uses), and then displays all the stati, using ANSI escape codes to emphasize or de-emphasize them based on their categorization.

@lines=`curl --fail "[fill-in]&key=[the-blanks]&format=rss20" -A "Mozilla/4.0"|grep title|grep -v "s Friends"|perl -p -i -e "s/^\s+//"|perl -p -i -e "s/'/\'/g"|perl -p -i -e "s/ \'s/'s/g"|perl -p -i -e "s/\&/\&/g"|perl -p -i -e "s/\//g"`;

`dbacl -l $DBACL_PATH/ok -g "^([[:alpha:]]+)" -g "[^[:alpha:]]([[:alpha:]]+)" -g "^(@[[:alpha:]]+)" -g "[^[:alpha:]](@[[:alpha:]]+)" $DBACL_PATH/ok.d`;
`dbacl -l $DBACL_PATH/bad -g "^([[:alpha:]]+)" -g "[^[:alpha:]]([[:alpha:]]+)" -g "^(@[[:alpha:]]+)" -g "[^[:alpha:]](@[[:alpha:]]+)" $DBACL_PATH/bad.d`;
`dbacl -l $DBACL_PATH/urgent -g "^([[:alpha:]]+)" -g "[^[:alpha:]]([[:alpha:]]+)" -g "^(@[[:alpha:]]+)" -g "[^[:alpha:]](@[[:alpha:]]+)" $DBACL_PATH/urgent.d`;
{ $lines[$i]=~s/\s?\<\/title\>//g;
$cat=`echo "$lines[$i]"|dbacl -v -c ok -c bad -c urgent`;
if ($cat=~/ok/)
{ print "$lines[$i]";}
elsif ($cat=~/bad/)
{ print "\033[0;37;48m$lines[$i]";
print "\033[0m";}
elsif ($cat=~/urgent/)
{ print "\033[0;34;48m$lines[$i]";
print "\033[0m";}

This is a first cut. The dbacl documentation suggests not using two-word combinations when the corpus is small, due to the possibility of overfitting. Much remains to be learned about the art of Bayesian filtering.

Think of the possibilities for using Bayesian filtering on things beyond spam filtering. Imagine using it to filter through social networks or music (see TheFilter). Imagine a web browser that watches what links you click on and which pages you bookmark (or mark as "good") and then highlights links that it predicts you will like, possibly based on pre-fetching and evaluating the link contents. I love the feeling of unbounded potential.

Further reading:
A much improved status update reader with Bayesian filtering