Sunday, May 30, 2010

Command-line Bayesian Twitter reader with ad-blocking

UPDATE: The method below and the corresponding script recently stopped working because Twitter changed its login method to OAuth. The necessary changes will be (a) switching to an updated Twitter client such as TTYtter (which looks pretty awesome) and (b) possibly revising the parsing of the Twitter client's output. I've just started working on this and don't know when or if I will release updates. The script still works as is for Facebook updates.

Following up on my initial post on command-line Bayesian filtering and the subsequent update, I have now generalized the script to incorporate not just Facebook status updates but Twitter posts as well.

I've been refining this script over the last year. In addition to providing a convenient way of incorporating Bayesian filtering in the tracking of posts from Twitter and Facebook, it now has the ability to review a particular Twitter user's posting history (via the "rewind" option) and to whitelist or blacklist arbitrary terms, which can be useful in permanently filtering out undesirable subjects or ads. And it tries to auto-expand those annoying shortened URLs and display the actual title of the web page being referenced.
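To make the whitelist/blacklist idea concrete, here is a minimal Python sketch of how fixed term lists might override a Bayesian verdict. It is not taken from the twi script: the function name `apply_term_lists`, the case-insensitive substring matching, and the rule that a whitelist hit beats a blacklist hit are all assumptions made for illustration.

```python
def apply_term_lists(text, bayes_label, whitelist=(), blacklist=()):
    """Return the final label for a post.

    bayes_label is whatever the Bayesian filter decided ("ok", "bad", ...).
    Assumption: terms are matched as case-insensitive substrings, and a
    whitelist match takes precedence over a blacklist match.
    """
    lowered = text.lower()
    if any(term.lower() in lowered for term in whitelist):
        return "ok"
    if any(term.lower() in lowered for term in blacklist):
        return "bad"
    # No list matched, so keep the filter's own verdict.
    return bayes_label


# A blacklisted phrase forces "bad" even if the filter said "ok".
verdict = apply_term_lists("Buy cheap watches now", "ok",
                           blacklist=["cheap watches"])
```

The point of keeping the lists separate from the trained data is that a blacklisted term stays blocked no matter how the filter's statistics drift.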

This script requires that your system have the following programs installed: links (a terminal-based web browser), curl, dbacl (a Bayesian decision engine), and twyt (a command-line Twitter interface).

(I actually use another command-line Twitter client, twixer, for following and unfollowing other people's accounts. And while Twyt should work for posting on Twitter, I often post with a third program, twerp, which I started using before I found Twyt.)

Typical usage of the Bayesian twi script:

Teapot:~ surly$ twi
0: [5325528218] StephenAtHome: wearing a mask is
so sweaty. i don’t know how those scooby-doo
villians did it. (Sun Nov 01 00:40:05 2009 via web)
o)k, b)ad, u)rgent, open l)ink, c)opy, add to q)uotes, r)ewind, <CR>=next

The number "0" is just an index that counts down to zero to help identify the status messages you will be processing in this session. The number in brackets comes from the twyt program and is a unique Twitter message ID.

The options presented after the post are:
o)k: Add the post to the "ok" data file (for stuff you want to see more of in the future) and go to the next item.
b)ad: Add the post to the "bad" data file (stuff you don't want to see) and go to the next item.
u)rgent: Add the post to the "urgent" data file and go to the next item. This is a category I came up with for messages that say things like "Party at my place tonight" or "Free T-shirts for the first ten respondents". All I am really doing with them at the moment is highlighting them in a brighter color, but my idea is that in some future revision, these might be actively fetched and brought to my attention. At the moment, so few messages qualify as "urgent" that it's still a half-baked idea.
open l)ink: Opens the first extracted hyperlink in Safari.
c)opy: Copies the text of the post to the OS X system clipboard.
add to q)uotes: Appends the post to a text file with my collection of quotes. In the future, I might cause this to also mark the post as a "favorite" on Twitter, but this is not a priority for me.
r)ewind: The rewind feature was designed to compensate for situations where your Bayesian filter causes posts to filter below your radar and then suddenly another one from the same person bubbles up to the surface, but you don't understand it because you are missing the context. Rewinding causes the program to start going backwards through the locally cached history of all posts (in the "all.d" file) searching for posts from the user in question. It will keep feeding you previous posts as you keep hitting keys (or until you type q for "quit"). This feature only works on Twitter accounts at the moment, due to the way it is extracting the name to grep for.
<CR>=next: You've got to hit the carriage return to advance to the next post. If you type "o" and hit enter, the post will be categorized as "ok". If you just hit enter, your Bayesian filter will learn nothing from this post. Initially, it's good to give your filter as much definite feedback as you can, but once you are happy with its performance, you can just hit enter all the way through if you like.
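The rewind option described above amounts to a backwards search through the cached history for one user's posts. Here is a rough Python sketch of that idea. The line format is an assumption modelled on the sample output earlier ("[id] user: text" on a single line); the real "all.d" file may well be laid out differently, and the names `POST_RE` and `rewind` are made up for this example.

```python
import re

# Assumed shape of one cached post: "[5325528218] StephenAtHome: text..."
POST_RE = re.compile(r"^\[(\d+)\]\s+(\w+):\s*(.*)$")


def rewind(history_lines, user):
    """Yield (message_id, text) for posts by `user`, newest first.

    history_lines is the cached history in chronological order, so the
    search walks it in reverse, like stepping back through a timeline.
    """
    for line in reversed(history_lines):
        match = POST_RE.match(line.strip())
        if match and match.group(2) == user:
            yield match.group(1), match.group(3)


history = [
    "[101] alice: first post",
    "[102] bob: unrelated",
    "[103] alice: second post",
]
posts = list(rewind(history, "alice"))
```

Because `rewind` is a generator, a caller can show one earlier post per keypress and simply stop iterating when the user quits.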

Importantly, but perhaps not obviously, any set of these commands can be entered for a given post, and they will all be executed. Example:
o)k, b)ad, u)rgent, open l)ink, c)opy, add to q)uotes, r)ewind, <CR>=next

Entering the letters o, l, and q together at this prompt (for instance, olq) will categorize the item as "ok", open the associated hyperlink, and add the item to the quotes file. There is nothing stopping you from categorizing the item as ok, bad, and urgent all at once, but this will almost certainly just confuse the Bayesian filter.
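One simple way to get this "every letter counts" behavior is to scan the reply one character at a time. The Python sketch below shows that approach; it is only an illustration, and the `ACTIONS` table and the `parse_commands` name are hypothetical rather than lifted from the twi script.

```python
# Letters from the prompt mapped to the actions they stand for.
ACTIONS = {
    "o": "ok",
    "b": "bad",
    "u": "urgent",
    "l": "open link",
    "c": "copy",
    "q": "add to quotes",
    "r": "rewind",
}


def parse_commands(reply):
    """Return the action for each recognised letter, in the order typed.

    Unrecognised characters are ignored, and an empty reply (just Enter)
    yields no actions, which corresponds to "next" with no training.
    """
    return [ACTIONS[ch] for ch in reply.strip() if ch in ACTIONS]


actions = parse_commands("olq")
```

Nothing in this scheme prevents contradictory input such as "ob"; both actions would simply run, which matches the caveat above about confusing the filter.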

Finally, you will be prompted to "Categorize which other posts as ok?"

When the program has finished processing all the new status messages, it reports an accuracy rating for the current session. Basically, if you don't mark any of the messages it classified as "ok" as bad, and don't correct any of the "bad" ones to ok, it thinks it has 100% accuracy. You can use this as a rough metric for how well trained the filter currently is.
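The exact formula the script uses is not spelled out here, but one plausible reading is "the percentage of posts whose predicted category you did not override". A small Python sketch under that assumption (the function name and the data layout are invented for the example):

```python
def session_accuracy(predicted, corrections):
    """Percentage of predictions the user left uncontradicted.

    predicted   -- list of labels the filter assigned, one per post
    corrections -- dict mapping a post's index to the label the user chose
    A post counts as wrong only when the user chose a different label;
    posts skipped with a bare Enter count as correct.
    """
    if not predicted:
        return None  # nothing was processed, so there is no rating
    wrong = sum(
        1
        for index, label in enumerate(predicted)
        if index in corrections and corrections[index] != label
    )
    return 100.0 * (len(predicted) - wrong) / len(predicted)


# Four posts, one "bad" verdict corrected to "ok".
rating = session_accuracy(["ok", "bad", "ok", "bad"], {1: "ok"})
```

Note that under this reading a session where you just hit Enter on everything scores 100%, so the number only means something when you are actively correcting mistakes.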

Limitations: The current version of Twyt does not seem to be picking up the newfangled "Retweet" posts. Based on the messages that I am missing, this is something of a net advantage.

The twi script can be found here.