March 19, 2009

TwitGraph-en


By popular demand I'm reposting this one in English. I usually keep this blog Hebrew-only because I want it to reach Israeli developers, but as I said, I was asked to post this one in English as well, and who could resist such a request? ;)

For a while I've been trying to solve the following problem: how do you effectively get feedback from users, specifically online users? How do you measure the success of a product launch or a campaign?
It's quite obvious I'm not the first one to think of this problem, and I'm sure there's already an established industry out there working on just that, but still... I had one specific problem that I just couldn't get solved any other way. So I built the solution with my own hands. It was fun, so I'm sharing the experience here.

Problem definition: I work on a product that gets massive attention on the web. Once we release a new feature of the product I'd like to see how the community reacts. To do so I've set up several blog searches and fed them to Google Reader. I've also subscribed to a Twitter search and fed that RSS feed to Google Reader as well, but that wasn't enough. Some days I get dozens or hundreds of tweets and it gets too hard to measure both the buzz and the users' happiness. What I wanted to know was: 1) how many tweets are tweeted about my product and 2) whether their attitude was positive or negative.
So I put the rest of the blogosphere and the rest of the world aside and simply concentrated on Twitter.

I created this: a web application that graphs how many tweets a day there are on your subject of interest, plus how many of them had a positive, negative or neutral attitude.



And now... on to the gory tech details. You can stop reading now if you don't care about fun software.

I used Google's App Engine as my server, so a lot of the code is written in Python. Naturally I also used a bit of JavaScript and family for the client-side implementation. Arguably, I could have used more CSS to make it prettier, but I didn't (any volunteers who want to help with that?)

There were several interesting challenges, so I'll speak about them here and explain how I solved them.

1. Getting the data
What I want to do is query a date range, say the past 7 days, and for each day graph how many tweets there are for that specific query term. So, challenge #1 is how to get that data.
Twitter has a pretty cool and slick search API which lets you search programmatically for all kinds of stuff. Here are just a few examples of it:
http://search.twitter.com/search.atom?q=twitter
http://search.twitter.com/search.atom?q=from%3Aalexiskold (from a user)
http://search.twitter.com/search.atom?q=twitter until:2009-03-01 (until a date; you may also use since: with a date)

So, this API is quite nice. It lets you ask not only for an Atom (XML) result but also for JSON and even JSONP, which is JSON with a callback and is useful for cross-domain requests. I'll talk a little bit about that later.
But the API has some very painful restrictions, the most important of which is the limit on the number of results: at most 100 per call. Now, coming from a search engine company I can clearly see why they do that, they can't possibly leave the number of results unlimited, but as a developer who uses that API... that was a challenge.
It's important to me to get all the results, and to get the raw results, i.e. not aggregated, b/c I'd like to run a text categorization algorithm on them later. Not that there's a way of getting aggregate results from Twitter, but even if there were I wouldn't have used it.
So here's what I did: I called the API to get the first 100 results. Then I called it to get the next 100 results, and then the next 100... etc. That's mean, I know... API servers don't necessarily like that and they might block me sooner or later, but what other choice did I have?
This method actually works quite well for micro-trends. I.e. don't try to feed twitgraph with popular search terms such as "google" or "youtube" b/c it'll simply drown the server. That's one of the most annoying shortcomings of my service. But for micro-trends, i.e. less popular terms, it works pretty well, so that was really nice.
The first version was implemented almost entirely on the client side: all the logic was in JavaScript and the Google App Engine server was hardly involved. The second version changed that, and nowadays all the logic is implemented on the server, including recursively fetching the results.

Problem 1 solved (but only for micro-trends) by fetching results recursively.
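To make the paging idea concrete, here's a minimal sketch of it (not the actual twitgraph code). It keeps asking for the next page until a page comes back with fewer than 100 results or a page cap is hit. The rpp and page parameter names are how I remember the search API and should be treated as assumptions, and fetchPage is a hypothetical stand-in for whatever makes the real HTTP call.

```javascript
var PAGE_SIZE = 100;  // The per-call limit mentioned above.

// Builds the URL for one page of results. The `rpp` (results per page)
// and `page` parameter names are assumptions, not verified here.
function buildSearchUrl(query, page) {
  return 'http://search.twitter.com/search.json' +
      '?q=' + encodeURIComponent(query) +
      '&rpp=' + PAGE_SIZE +
      '&page=' + page;
}

// Collects results page by page. `fetchPage(url)` is a hypothetical
// function that performs the request and returns an array of results.
function fetchAllResults(query, fetchPage, maxPages) {
  var all = [];
  for (var page = 1; page <= maxPages; ++page) {
    var results = fetchPage(buildSearchUrl(query, page));
    all = all.concat(results);
    if (results.length < PAGE_SIZE) {
      break;  // A short page means there is nothing left to fetch.
    }
  }
  return all;
}
```

The maxPages cap is there to stay polite to the API server: a popular term would otherwise keep the loop going for a very long time.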

2. How to analyze the data for positive and negative attitude?
The Twitter search API has a neat feature: if you append ":)" or ":(" to a search term you get back happy or sad results. That's the first API I've seen that actually uses emoticons, cool. Only problem is that it sucks :( and :( again. The results were absolute rubbish, so I could not use them at all. They would sometimes get the happy/sad sentiment right, but in most cases they would just say "don't know", and in some cases they would return the wrong answer. Bottom line: nice try, but can't use it.
So I had to do it myself. I didn't know what to do, so I posted a question on Stack Overflow.
I got plenty of answers there, so now I knew what to do :)
Here's what I did: I fetched all the results from the server and then used a naive Bayesian classifier to tag them as :-), :-( or :-|.
Basically a naive Bayesian classifier works like this:
First you train it by feeding it examples of :-) tweets, :-( tweets and :-| tweets which you prepared beforehand, and then you ask it to guess the sentiment of the next tweet. That works surprisingly well!
I used a Bayesian classifier from here, which was pretty simple to use. To bootstrap the system I fed it a list of known good words and a list of known bad words that I found somewhere. That's not ideal for a Bayesian classifier, BTW, but it worked reasonably well. Then I added a dynamic learning feature: as you get the search results back, you, the user, can teach twitgraph the correct sentiment of each and every tweet. Next time we use this data as a signal, and it turns out to be a very good signal. I've now tagged several dozen tweets and classification is already getting really, really good.

So - problem 2 solved - fetch all results and use an open-source bayesian classifier. Happy happy :)
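For the curious, here's roughly what such a classifier does under the hood. This is a toy sketch of a multinomial naive Bayes classifier with add-one smoothing, written from scratch for illustration. It is not the open-source library used in twitgraph, and the tokenizer and the training sentences are made up.

```javascript
// Toy naive Bayes classifier. All names here are illustrative.
function NaiveBayes() {
  this.docCounts = Object.create(null);   // label -> number of training texts
  this.wordCounts = Object.create(null);  // label -> (word -> count)
  this.totalWords = Object.create(null);  // label -> number of words seen
  this.vocabulary = Object.create(null);  // every word seen in training
  this.totalDocs = 0;
}

// Very crude tokenizer: lowercase runs of letters and apostrophes.
NaiveBayes.prototype.tokenize = function(text) {
  return text.toLowerCase().match(/[a-z']+/g) || [];
};

NaiveBayes.prototype.train = function(text, label) {
  if (!this.wordCounts[label]) {
    this.wordCounts[label] = Object.create(null);
    this.docCounts[label] = 0;
    this.totalWords[label] = 0;
  }
  this.docCounts[label]++;
  this.totalDocs++;
  var words = this.tokenize(text);
  for (var i = 0; i < words.length; ++i) {
    var w = words[i];
    this.wordCounts[label][w] = (this.wordCounts[label][w] || 0) + 1;
    this.totalWords[label]++;
    this.vocabulary[w] = true;
  }
};

// Returns the label with the highest log-probability for the text.
NaiveBayes.prototype.classify = function(text) {
  var words = this.tokenize(text);
  var vocabSize = Object.keys(this.vocabulary).length;
  var best = null;
  var bestScore = -Infinity;
  for (var label in this.docCounts) {
    // Prior: how common this label was in training.
    var score = Math.log(this.docCounts[label] / this.totalDocs);
    for (var i = 0; i < words.length; ++i) {
      var count = this.wordCounts[label][words[i]] || 0;
      // Add-one smoothing so unseen words don't zero out the score.
      score += Math.log((count + 1) / (this.totalWords[label] + vocabSize));
    }
    if (score > bestScore) {
      bestScore = score;
      best = label;
    }
  }
  return best;
};
```

The dynamic learning part then comes for free: whenever a user corrects a tweet's sentiment, call train() again with that tweet and the corrected label.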

3. How to graph the data?
That was actually the easiest part of them all!
I used the Google Visualization API javascript library which is pretty easy to use. Really, with only a few lines of code I created those nice graphs. To prove that I'll paste the two functions that draw the graphs here.


twitgraph.Grapher.prototype.drawLineChart = function() {
  var aggregate = this.result.aggregate;
  // Create and populate the data table.
  var data = new google.visualization.DataTable();
  data.addColumn('string', 'Date');
  data.addColumn('number', ':-(');
  data.addColumn('number', ':-)');
  data.addColumn('number', ':-|');
  data.addRows(aggregate.length);
  for (var i = 0; i < aggregate.length; ++i) {
    data.setCell(i, 0, aggregate[i].date);
    data.setCell(i, 1, aggregate[i].neg);
    data.setCell(i, 2, aggregate[i].pos);
    data.setCell(i, 3, aggregate[i].neu);
  }

  // Create and draw the visualization.
  twitgraph.Utils.$('twg-graph').innerHTML = '';
  var chart = new google.visualization.AreaChart(twitgraph.Utils.$('twg-graph'));
  chart.draw(data, {legend: 'bottom',
                    isStacked: true,
                    width: 600,
                    height: 300,
                    colors: ["#FF4848", "#4AE371", "#2F74D0"]});
};

twitgraph.Grapher.prototype.drawPieChart = function() {
  var stats = this.result.stats;
  // Create and populate the data table.
  var data = new google.visualization.DataTable();
  data.addColumn('string', 'Sentiment');
  data.addColumn('number', 'Tweet count');
  data.addRows(3);
  data.setValue(0, 0, ':-(');
  data.setValue(0, 1, stats.neg);
  data.setValue(1, 0, ':-)');
  data.setValue(1, 1, stats.pos);
  data.setValue(2, 0, ':-|');
  data.setValue(2, 1, stats.neu);

  // Create and draw the visualization.
  twitgraph.Utils.$('twg-graph-pie').innerHTML = '';
  var chart = new google.visualization.PieChart(twitgraph.Utils.$('twg-graph-pie'));
  chart.draw(data, {legend: 'none',
                    is3D: true,
                    width: 300,
                    height: 300,
                    colors: ["#FF4848", "#4AE371", "#2F74D0"]});
};
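Both functions read data that was already counted: aggregate is a list of per-day counts with date, neg, pos and neu fields, and stats holds the totals. Here's a sketch of how those two shapes could be built from a list of classified tweets. In twitgraph this happens on the Python server, so the code below is only an illustration, and the tweet format it takes is an assumption.

```javascript
// Groups classified tweets by day and counts each sentiment.
// Assumed input: [{date: '2009-03-17', sentiment: 'pos'}, ...] where
// sentiment is always one of 'neg', 'pos' or 'neu'.
function aggregateByDay(tweets) {
  var byDate = {};
  for (var i = 0; i < tweets.length; ++i) {
    var t = tweets[i];
    if (!byDate[t.date]) {
      byDate[t.date] = {date: t.date, neg: 0, pos: 0, neu: 0};
    }
    byDate[t.date][t.sentiment]++;
  }
  // ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
  return Object.keys(byDate).sort().map(function(d) {
    return byDate[d];
  });
}

// Sums the per-day counts into the totals the pie chart needs.
function totalStats(aggregate) {
  var stats = {neg: 0, pos: 0, neu: 0};
  for (var i = 0; i < aggregate.length; ++i) {
    stats.neg += aggregate[i].neg;
    stats.pos += aggregate[i].pos;
    stats.neu += aggregate[i].neu;
  }
  return stats;
}
```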

Problem 3 solved.
Well, almost... I also wanted to have the option of static images, e.g. GIF. A common use case for static images is including the graph in an email, which doesn't allow running any JS. I solved that too, this time using the Google Charts Service. So now static graph images (for dynamic data, updated every day) are also available.
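A static chart like that is just an image URL with the data packed into the query string. Here's a rough sketch of building one for the pie chart. This isn't the twitgraph code, and the endpoint and the cht, chs, chd and chl parameters are written from my memory of the chart service, so check them against its documentation before relying on them.

```javascript
// Builds a static 3D pie chart image URL from the sentiment totals.
// Assumed parameters: cht (chart type), chs (size in pixels),
// chd (data, text encoding) and chl (labels separated by '|').
function staticPieUrl(stats) {
  var total = stats.neg + stats.pos + stats.neu;
  // Text-encoded data is expected in the 0-100 range, so send percentages.
  function percent(n) {
    return total === 0 ? 0 : Math.round(n / total * 1000) / 10;
  }
  var data = [percent(stats.neg), percent(stats.pos), percent(stats.neu)];
  var labels = [':-(', ':-)', ':-|'];
  var encodedLabels = [];
  for (var i = 0; i < labels.length; ++i) {
    encodedLabels.push(encodeURIComponent(labels[i]));
  }
  return 'http://chart.apis.google.com/chart' +
      '?cht=p3' +
      '&chs=300x300' +
      '&chd=t:' + data.join(',') +
      '&chl=' + encodedLabels.join('|');
}
```

The resulting URL can go straight into the src of an img tag, which is what makes it work in an email.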

Now 3 is really solved :)

4. How to embed in a 3rd party site?
I think twitgraph is useful if it can be embedded in 3rd party sites. But to do that you'd need to make an XmlHttp call across domains, which browsers just won't let you do. The solution to that problem is already well known and is called JSONP, JSON with padding. It's widely used across other web services, so I won't get into details and will just lay out the basic concept and the code here.
The idea is that you can't make an XmlHttp request across domains, but you can include JS across domains. And you can load JavaScript dynamically into any web page by adding a <script> element to its head at any time, even after the page has loaded. The JavaScript you added will call the callback function you named once it's loaded, and there - you have your data.
More about json and jsonp.
The code from my app is here:

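function jsonp(url, callbackName) {
  // Use '?' if the URL has no query string yet, '&' otherwise.
  url += (url.indexOf('?') == -1 ? '?' : '&') + 'callback=' + callbackName;
  addScript(url);
}

// Loads a script from any domain by appending a <script> element to the page.
function addScript(url) {
  var script = document.createElement("script");
  script.setAttribute("src", url);
  script.setAttribute("type", "text/javascript");
  document.body.appendChild(script);
}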
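The other half of JSONP lives on the server: instead of returning bare JSON it returns a tiny script that calls the callback you named, with the JSON as its argument. Here's a sketch of that wrapping step. The twitgraph server is written in Python, so this JavaScript version is only an illustration, and wrapJsonp is a made-up name.

```javascript
// Wraps a JSON payload in a call to the callback named by the client.
function wrapJsonp(callbackName, data) {
  // Only accept plain (optionally dotted) identifiers, so the callback
  // parameter can't be used to inject arbitrary script into the page.
  if (!/^[A-Za-z_$][\w$]*(\.[A-Za-z_$][\w$]*)*$/.test(callbackName)) {
    throw new Error('Invalid callback name: ' + callbackName);
  }
  return callbackName + '(' + JSON.stringify(data) + ');';
}
```

So a request made with callback=onData comes back as something like onData({"pos":3});, and the browser runs it as soon as the script element loads.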

5. Design
Well, what can I say... I haven't solved that one yet. I'm bad at design, so the application is still ugly. I'd be very happy to get professional help with this...

Hope the post was useful to all ya developers out there.

Would you like to contribute? The source code is here and you're welcome to drop me a line in the post comments.

1 comment:

Anonymous said...

nice...
here is the same idea but in a batch of posts that give you all the 'popular trends' per hour
http://twitter-buzz.blogspot.com