Monday, February 27, 2012

Tracking user sentiments on Twitter with Twitter4j and Esper

For newcomers to Complex Event Processing and the Twitter API, I hope this serves as a short tutorial and helps you get off the ground quickly.

Managing big data and mining useful information from it is one of the hottest discussion topics in technology right now. The explosive growth of semi-structured data flowing from social networks like Twitter, Facebook and LinkedIn is making technologies like Hadoop and Cassandra a part of every technology conversation. So as not to fall behind the competition, customer-centric organizations are actively creating social strategies. What can a company get out of data feeds from social networks? Think location-based services, targeted advertisements and algorithmic equity trading for starters. IDC Insights has some informative blogs on the relationship between big data and business analytics. Big data in itself will be meaningless unless the right analytic tools are available to sift through it, explains Barb Darrow in her blog post on gigaom.com.


Companies often listen in on social feeds to learn about customers’ interests in, or perceptions of, their products. They are also trying to identify “influencers”, the people with the most connections in a social graph, so they can make better offers to such individuals and get better mileage out of their marketing. Companies involved in equity trading want to know which publicly traded companies are discussed on Twitter and what users' sentiments about them are. From big companies like IBM to smaller start-ups, everyone is racing to make the most of the opportunities in big data management and analytics. Much documentation about big data, like the IBM ebook 'Big Data Platform', is freely available on the web. However, a lot of it covers theory only. Jouko Ahvenainen, in reply to Barb Darrow’s post above, makes a good point that “many people who talk about the opportunity of big data are on too general level, talk about better customer understanding, better sales, etc. In reality you must be very specific, what you utilize and how”.

It does sound reasonable, doesn't it? So I set out to investigate this a bit further by prototyping an idea, which is the only good way I know. If I could do it, anybody could. The code is remarkably simple, but that's exactly the point. Writing a CEP framework yourself is quite complex, but using one is not. In the same way, Twitter makes it really easy to get to the information through its API.



Big Data - http://www.bigdatabytes.com/managing-big-data-starts-here/

Complex Event Processing (CEP), which I blogged about previously (click here to read), is a critical component of the big data framework. Along with CEP, frameworks like Hadoop are used to compile, parse and make sense of the 24x7 stream of data from social networks. Today, Twitter's streaming API and CEP can be used together to capture the happiness levels of Twitter users. The code I present below listens to live tweets and generates a 'happy' event every time “lol” is found in the text of a tweet. CEP is used to capture the happy events, and an alert is raised every time the count of happy events exceeds a predetermined number within a predetermined time period. The assumption that a user is happy every time he or she uses “lol” is very simplistic, but it helps get the point across. In practice, gauging users' sentiment is not that easy because it involves natural language analysis. Consider the example below, which highlights the complexities of analyzing natural language.


Iphone has never been good.
Iphone has never been so good.

As you can see, the addition of just one word to the sentence completely changes its meaning. For this reason, natural language processing is considered one of the toughest problems in computer science. You can learn “natural language processing” from the free online lectures offered by Stanford University. This link takes you directly to the first lecture on natural language analysis by Christopher Manning. But, in my opinion, the pervasive use of abbreviations in social media, and in modern lingo in general, is making the task a little bit easier. Abbreviations like “lol” and “AFAIK” project their meaning accurately: “lol” projects “funny”, and “AFAIK” may indicate that the user is “unsure” of himself or herself.
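Both points can be made concrete with a small sketch. It is an illustration only: the word lists, labels and class name are my own assumptions and are not part of the listings in this post. A naive keyword scorer cannot tell the two iPhone sentences apart, while an abbreviation lookup is straightforward.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative only: word lists, labels and names are assumptions.
final class SentimentSketch {
    private static final Set<String> POSITIVE =
            new HashSet<String>(Arrays.asList("good", "great", "love"));
    private static final Set<String> NEGATORS =
            new HashSet<String>(Arrays.asList("never", "not", "no"));

    // Abbreviation pattern -> label, using the two examples from the text.
    private static final Map<Pattern, String> ABBREVIATIONS =
            new LinkedHashMap<Pattern, String>();
    static {
        ABBREVIATIONS.put(word("lol"), "funny");
        ABBREVIATIONS.put(word("afaik"), "unsure");
    }

    private SentimentSketch() {}

    private static Pattern word(String w) {
        // \b keeps "lol" from matching inside words such as "lollipop".
        return Pattern.compile("\\b" + w + "\\b", Pattern.CASE_INSENSITIVE);
    }

    /** Naive scorer: +1 per positive word, counted as -1 once a negator has been seen. */
    static int naiveScore(String sentence) {
        int score = 0;
        boolean negated = false;
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (NEGATORS.contains(token)) {
                negated = true;
            } else if (POSITIVE.contains(token)) {
                score += negated ? -1 : 1;
            }
        }
        return score;
    }

    /** Returns the label of the first known abbreviation in the tweet, or null if none. */
    static String abbreviationLabel(String tweet) {
        for (Map.Entry<Pattern, String> entry : ABBREVIATIONS.entrySet()) {
            if (entry.getKey().matcher(tweet).find()) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(naiveScore("Iphone has never been good."));
        System.out.println(naiveScore("Iphone has never been so good."));
        System.out.println(abbreviationLabel("LOL that was great"));
    }
}
```

The naive scorer gives both iPhone sentences the same score of -1, even though the second one is praise. The abbreviation lookup has no such ambiguity, which is why this prototype settles for “lol”.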

The code presented below uses the Twitter4j API to listen to the live Twitter feed and Esper CEP to listen to events and alert us when a threshold is met. You can download the Twitter4j binaries or source from http://twitter4j.org/en/index.html and Esper from http://esper.codehaus.org/ . Before you execute the code, make sure to create a Twitter account if you don’t have one, and also read Twitter’s guidelines and concepts for its streaming API here. Authentication with just a username and password is currently allowed by Twitter, but it is going to be phased out in favor of OAuth in the near future. Also, pay close attention to the ‘Access and Rate Limit’ section. The code below uses the streaming API in one thread. Please do not use another thread at the same time, to avoid hitting the rate limit. Hitting rate limits consistently can result in Twitter blacklisting your Twitter ID. It is also important to note that the streaming API does not send each and every tweet our way. Twitter typically samples the data, sending 1 out of every 10 tweets. This is not a problem for us, however, as long as we are interested in patterns in the data and not in any specific tweet. Twitter offers a paid service for businesses that need streaming data with no rate limits. The following diagram shows the components and the processing of data.


Diagram. Charts & DB not yet implemented in the code


Listing 1. Standard java bean representing a happy event.
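
A minimal sketch of such a bean, assuming the event only needs the user, the tweet text and a creation time. The class name HappyMessage and its properties are illustrative and may differ from the original listing. Esper can read event properties from plain Java objects through their getters, which is why a simple bean is enough.

```java
import java.util.Date;

// Illustrative JavaBean for a happy event; class and property names are assumptions.
// In a real project this would normally be a public class in its own file.
class HappyMessage {
    private final String user;
    private final String text;
    private final Date createdAt;

    HappyMessage(String user, String text, Date createdAt) {
        this.user = user;
        this.text = text;
        // Defensive copy, because java.util.Date is mutable.
        this.createdAt = new Date(createdAt.getTime());
    }

    public String getUser() {
        return user;
    }

    public String getText() {
        return text;
    }

    public Date getCreatedAt() {
        return new Date(createdAt.getTime());
    }

    @Override
    public String toString() {
        return "HappyMessage{user=" + user + ", text=" + text + "}";
    }

    public static void main(String[] args) {
        System.out.println(new HappyMessage("someUser", "lol, nice one", new Date()));
    }
}
```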




Listing 2. Esper listener is defined.




Listing 3.

A Twitter4j listener is created. This listener and the CEP listener start listening. Every Twitter post is parsed for ‘lol’. Every time ‘lol’ is found, a happy event is generated. The CEP listener raises an alert every time the total count of ‘lol’ exceeds 2 in the last 10 seconds.
The code establishes a long-running thread to get Twitter feeds. You will see output on the console every time the threshold is met. Please remember to terminate the program; it doesn't terminate on its own.
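
To show the logic without any library dependencies, here is a plain-Java stand-in for the pipeline just described. It does not use the Twitter4j or Esper APIs, and the class and method names are hypothetical. Tweets are passed in as a timestamp plus text, and a deque plays the role of the 10-second time window that the CEP engine would maintain.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Pattern;

// Plain-Java stand-in for the pipeline described above. It does not use the
// Twitter4j or Esper APIs, and all names here are hypothetical.
final class HappyAlertSketch {
    private static final Pattern LOL =
            Pattern.compile("\\blol\\b", Pattern.CASE_INSENSITIVE);

    private final long windowMillis;
    private final int threshold;
    // Timestamps of happy events still inside the window, oldest first.
    private final Deque<Long> happyTimes = new ArrayDeque<Long>();

    HappyAlertSketch(long windowMillis, int threshold) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
    }

    /**
     * Handles one tweet. Timestamps are assumed to arrive in non-decreasing order.
     * Returns true when this tweet pushes the windowed count above the threshold.
     */
    boolean onTweet(long timestampMillis, String text) {
        if (text == null || !LOL.matcher(text).find()) {
            return false; // not a happy event
        }
        System.out.println("******* lol found *****");
        happyTimes.addLast(timestampMillis);
        // Drop happy events that are no longer within the window.
        while (!happyTimes.isEmpty()
                && timestampMillis - happyTimes.peekFirst() >= windowMillis) {
            happyTimes.removeFirst();
        }
        if (happyTimes.size() > threshold) {
            System.out.println("exceeded the count, actual " + happyTimes.size());
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        HappyAlertSketch sketch = new HappyAlertSketch(10000L, 2);
        sketch.onTweet(0L, "lol");
        sketch.onTweet(1000L, "nothing to see here");
        sketch.onTweet(2000L, "LOL again");
        sketch.onTweet(3000L, "lol once more"); // third happy event within 10 s
    }
}
```

With Esper, the window bookkeeping is not written by hand. It is typically declared in an EPL statement over the happy event type, and the listener only reacts to the result. The sketch only makes explicit what “more than 2 in the last 10 seconds” means.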

Now that you have this basic functionality working, you can extend this prototype in a number of ways. You can handle additional data feeds (from sources other than Twitter) and use Esper to correlate data from the two feeds. For visually appealing output, you can feed the output to a charting library. For example, every time Esper identifies an event, the data point is used to render a point on a line graph. If you track the ‘happy event’ this way, the graph will essentially show the ever-changing level of happiness of Twitter users over a period of time.
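
As a rough sketch of the charting idea, happy events can be grouped into fixed time buckets so that each bucket becomes one point on the line graph. The class below is hypothetical and not tied to any particular charting library; it assumes non-negative millisecond timestamps.

```java
import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper: turns happy-event timestamps into (bucket start, count)
// points that a charting library could plot as a line graph.
final class HappinessSeries {
    private final long bucketMillis;
    private final SortedMap<Long, Integer> counts = new TreeMap<Long, Integer>();

    HappinessSeries(long bucketMillis) {
        if (bucketMillis <= 0) {
            throw new IllegalArgumentException("bucketMillis must be positive");
        }
        this.bucketMillis = bucketMillis;
    }

    /** Counts one happy event in the bucket that contains its timestamp. */
    void addHappyEvent(long timestampMillis) {
        long bucketStart = (timestampMillis / bucketMillis) * bucketMillis;
        Integer current = counts.get(bucketStart);
        counts.put(bucketStart, current == null ? 1 : current + 1);
    }

    /** Read-only view of the points, ordered by bucket start time. */
    SortedMap<Long, Integer> points() {
        return Collections.unmodifiableSortedMap(counts);
    }

    public static void main(String[] args) {
        HappinessSeries series = new HappinessSeries(60000L); // one point per minute
        series.addHappyEvent(0L);
        series.addHappyEvent(59999L);
        series.addHappyEvent(60000L);
        System.out.println(series.points());
    }
}
```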

Please use the comment section for your feedback, +1 to share, and let me know if you would like to see more posts on this subject.

6 comments:

  1. Really interesting topic. Yes, it would be good if you could write more about the subject. I'm interested in making a social listener based on Twitter, Facebook, etc., and also a social CRM.

    thanks in advance

  2. The provided code samples are not executing as expected. The Twitter API continues to generate a "TwitterException{exceptionCode=[ec814753-44a5356e 4eaddaa2-50017419]" even though I am using the latest 2.2.5 version. I also tried the 2.2.6-SNAPSHOT version, with the same result. It would be quite helpful if you could include a list of dependencies (jar files and versions used) with articles involving code.
    Thank you for this good article.

    Replies
    1. I have been busy with some other pressing matters for a while. I will be able to publish the list if you still haven't figured this out. Other readers were able to run the code, so I am assuming you were able to try it out successfully.

  3. Outstanding! Thanks a lot!

  4. I have a question, if you don't mind. I got the code to run, but I really can't understand what it is returning.

    I must mention that I have only 5 "lol" words on my Twitter account. Also, I know the meaning of the first 3 rows below. How can I check whether it is right?

    Thanks!

    Nov 11, 2012 11:24:57 PM com.espertech.esper.core.service.EPServiceProviderImpl doInitialize
    INFO: Initializing engine URI 'default' version 4.7.0
    [Twitter Stream consumer-1[initializing]] INFO twitter4j.TwitterStreamImpl - Establishing connection.
    [Twitter Stream consumer-1[Establishing connection]] INFO twitter4j.TwitterStreamImpl - Connection established.
    [Twitter Stream consumer-1[Establishing connection]] INFO twitter4j.TwitterStreamImpl - Receiving status stream.
    Got a status deletion notice id:232628414350254080
    ******* lol found *****
    Got a status deletion notice id:194815941673099264
    ******* lol found *****
    Got a status deletion notice id:192334088139583489
    Got a status deletion notice id:163152528408711169
    Got a status deletion notice id:168551524618874881
    Got a status deletion notice id:175282791137816578
    Got a status deletion notice id:157517124481462272
    Got a status deletion notice id:184313328968019968
    Got a status deletion notice id:190518243914555393
    Got a status deletion notice id:267334707086237696
    Got a status deletion notice id:164048083649445888
    Got a status deletion notice id:177638354211438592
    Got a status deletion notice id:250327253127409665
    Got a status deletion notice id:175783616205426688
    Got a status deletion notice id:207170465452654592
    Got a status deletion notice id:172800786571591681
    Got a status deletion notice id:187746656199000065
    Got a status deletion notice id:168203644842414080
    Got a status deletion notice id:207993157198155777
    ******* lol found *****
    exceeded the count, actual 3
    Got a status deletion notice id:184405561717174272
    Got a status deletion notice id:163309466689863681
    Got a status deletion notice id:267769098551820289
    Got a status deletion notice id:237008857354891264
    Got a status deletion notice id:90240781918543872
    Got a status deletion notice id:161871751553363968
    Got a status deletion notice id:189475208585940994
    Got a status deletion notice id:192284264002355201
    Got a status deletion notice id:187276285989502977
    Got a status deletion notice id:171972042437042176
    Got a status deletion notice id:215204256293195776
    Got a status deletion notice id:250918457049239553
    Got a status deletion notice id:94131103731957762
    Got a status deletion notice id:267756112995049473
    Got a status deletion notice id:267725779796893696
    Got a status deletion notice id:242691765981831168

    Replies
    1. You may want to read the 'deleted message FAQ' on Twitter. The very first item there is the following:
      What is a delete notice?

      Twitter sends us a notification whenever a user deletes a Tweet. We pass these notifications on to you as part of your stream. If you are storing Tweets you must take account of these delete messages in order to comply with Twitter's Terms of Service.
