Tuesday, November 22, 2011

R programming - It's super!

Last week Sybase announced that its market data analytics platform, Sybase RAP - The Trading Edition, now supports R. The R programming language enables faster algorithm development and the handling of huge amounts of data, providing analysis to traders, risk managers and quantitative analysts. For more information on the offering, please click here. Sybase also sponsored a webinar on this subject with researchers from Yale University. To listen to the webcast, click here.

R's official website (http://www.r-project.org/) provides much of the information you need if you want to try it, but to save you some time, I would like to summarize what I have learned so far from my quick review of R.

Handling and analyzing huge data sets, performing statistical calculations and rendering the results in very professional charts are its strengths. Its performance absolutely blew me away. I tried some of its features on a CSV file with 100,000 records and multiple fields, and the results were generated instantaneously. Remarkable! According to one of my friends, R is used heavily not only in financial companies but also in pharmaceutical and research companies where gene sequencing and analysis take place. Bank of America uses R in quantitative analysis. Having spent 10 years with the bank previously, I was surprised I didn't know about it then. The point is, you may not have heard about R because it is not a mainstream programming language, but it is widely used in some industries. Just to show how powerful this language is, these two lines of code are enough to read a CSV file with 100,000 lines, using "|" as the separator, and render a scatter plot from two of its columns (Durtn and Curs):

myFile <- read.csv("YourFileName.csv", header=TRUE, sep="|")  # read the whole file in one step
plot(myFile$Durtn, myFile$Curs, main="Duration vs. records", xlab="Duration", ylab="Records retrieved")
summary(myFile$Durtn)  # one-step summary statistics on a single column

Google R for more information, but here is the gist.
Some of the features worth mentioning are -

  • R is a language and a development environment built for statistical computing and graphics.
  • It's an open source project.
  • It was built to address some of the shortcomings of its predecessor, "S".
  • Classes, objects, methods - concepts similar to any object oriented language.
  • The extent of its functionality does not come close to Java or C++, but it addresses a niche and provides functionality that's far easier to use than Java or C++.
  • Lists are ordered collections of objects, with no need for the objects to be of the same type.
  • Data frames are a fantastic concept. R can handle multi-dimensional data sets through the use of data frames.
  • Reading a file is as easy as assigning a value to a variable.
  • Accessing a column in a file and running statistical functions on it is accomplished in just one step.
  • Charting and graphing are done in one step.
  • Multivariate analysis, a set of techniques dedicated to the analysis of data sets with more than one variable, can be done effectively in R. A use case of multivariate testing is finding the most effective layout of a web page (the one that yields the most clicks) by moving the assets around on the page.
  • The output could be sent to the console or a file system resource.
  • Packages are available to plug R into some popular IDEs.
In case you are wondering what other software/languages are used for statistical analysis, please read this thread.

Downloading and trying R is really easy. Just give it a try and let me know what you think.

Saturday, November 12, 2011

Why is the clique problem important? Think social networks like Facebook.

A question was asked in last Sunday's Boston Globe, in its 'Etiquette at work' section, about how to avoid the appearance of cliques. The person who asked the question had formed a group (a clique, if you will) at her office that eats lunch together one day every week. Her manager joins the group once in a while without an invitation, and the questioner was asking how to resolve the situation. Clearly, cliques are an important part of our social life, but what does that have to do with computer science?

There was a very interesting article in MSDN Magazine's October issue by Dr. McCaffrey. It's the first of a series of articles explaining what a clique is, the importance of finding the maximum clique in a graph, and an algorithm to solve the problem. So what is a clique in a graph? A clique is a subset of a graph's nodes in which every node is connected to every other node. The following paragraph is quoted directly from the MSDN article:

The maximum clique problem is encountered in a wide range of applications, including social network communication analysis, computer network analysis, computer vision and many others. For graphs of even moderate size, it turns out that the maximum clique problem is one of the most challenging and interesting problems in computer science. The techniques used to solve the maximum clique problem - which include tabu search, greedy search, plateau search, real-time parameter adaptation and dynamic solution history - can be used in many other problem scenarios. In short, code that solves the maximum clique problem can be directly useful to you, and the advanced techniques employed in the algorithm can be helpful for solving other difficult programming problems. In the following figure the set of nodes 0, 1, 3, 4 form the maximum clique.

Figure 1.
You can find more information on the clique problem on Wikipedia and many other websites.
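To make the definition concrete, here is a minimal brute-force sketch in Java (not Dr. McCaffrey's solution, and not the code this post refers to). It enumerates every subset of nodes and keeps the largest one that forms a clique, which is only feasible for small graphs. The example graph is hypothetical, chosen so that nodes 0, 1, 3 and 4 form the maximum clique, matching the figure.

```java
import java.util.ArrayList;
import java.util.List;

public class MaxClique {

    // Returns the largest clique, found by brute-force subset enumeration.
    // Exponential in the number of nodes, so only usable on small graphs.
    static List<Integer> maxClique(boolean[][] adj) {
        int n = adj.length;
        List<Integer> best = new ArrayList<>();
        for (int mask = 1; mask < (1 << n); mask++) {
            List<Integer> nodes = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) nodes.add(i);
            }
            if (nodes.size() > best.size() && isClique(nodes, adj)) best = nodes;
        }
        return best;
    }

    // A subset of nodes is a clique if every pair in it is connected.
    static boolean isClique(List<Integer> nodes, boolean[][] adj) {
        for (int i = 0; i < nodes.size(); i++)
            for (int j = i + 1; j < nodes.size(); j++)
                if (!adj[nodes.get(i)][nodes.get(j)]) return false;
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical 5-node graph: nodes 0, 1, 3, 4 are mutually connected;
        // node 2 is connected only to 0 and 1.
        int[][] edges = {{0,1},{0,2},{0,3},{0,4},{1,2},{1,3},{1,4},{3,4}};
        boolean[][] adj = new boolean[5][5];
        for (int[] e : edges) { adj[e[0]][e[1]] = true; adj[e[1]][e[0]] = true; }
        System.out.println(maxClique(adj)); // [0, 1, 3, 4]
    }
}
```

The serious algorithms mentioned in the quote (tabu search, plateau search and the rest) exist precisely because this brute-force approach collapses beyond a few dozen nodes.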
While working towards solving this problem, I found myself reflecting on my school years.
When I grew up in India, I disliked mathematics like everyone else. Most students were true haters, but others loved it only because it provided an opportunity to score perfectly on a test. Achieving the perfect score wasn't that hard in spite of our dislike, because every teacher and parent instilled the notion that a good score in mathematics is the yardstick of your intelligence. Whether you liked math or not, or scored perfectly or not, I seriously doubt any student was interested in knowing why mathematics was needed at all. Anyone who dared to raise the question would typically face the silent treatment from the teacher.

Throughout my school years, from the day I learned to count to the days of triple integration in engineering school, I must have asked myself this question countless times. However, by the time I graduated, I was grudgingly admitting to myself that mathematics had been, and would forever be, the foundation stone of so many fields of science. Despite my dislike, I thought I wouldn't escape mathematics given my engineering profession, but then personal computers took over the world, India's software industry boomed, and all any student could think of was becoming a software professional. It provided a level playing field for all, engineers and non-engineers alike. Anyone with decent analytical skills could become a programmer. Engineers, too, became software programmers en masse and marvelled at the thought of not having to do anything with mathematics any more.

Really? Would the software profession be that exciting or sexy if you took away the search algorithms of Google, the content distribution algorithms of Akamai, or the algorithms developed for ultra-fast equity trading by nerds on Wall Street? Take away this exciting stuff and what remains is mundane 'go to' and 'if then' statements, or the ever-present 'do while' and 'for' loops. So, whether we like it or not, mathematics is the foundation required to solve challenging problems.

The reason for this post, though, is the Java code I wrote to solve the maximum clique problem for the following graph. Readers are encouraged to see if the code works on other graphs. I didn't want to look at the solution presented by Dr. McCaffrey (which, by the way, is going to be C# code) before I tried it myself. Creating the data set (storing graph information like nodes and segments) is a challenge in itself; I have taken a shortcut by creating a static array of strings to represent the segments of the graph. My co-worker Tyler Perry worked on a C++ solution to this problem while I worked on the Java solution. If you'd like to see more posts like this, click the +1 button at the bottom or use the share widget on the right to share it with others. My code was tested on the following graph.

Let's do it a little differently this time. Instead of posting the code here, I would like to send it to you directly on request. Please leave a comment or contact me to receive the code.

Saturday, November 5, 2011

More on the Constrained Tree Schema

Last week I posted the Java code that identifies the TPC-C schema as a constrained tree schema. I also explained why the constrained tree schema is important for an OLTP application. However, a little more on the subject is in order to make it easier to understand. First, for those who may not have looked at the TPC-C schema before, the following ER diagram shows the table relationships.

Figure 1. TPC-C schema

The Warehouse table has w_id as its primary key, while the District table has a composite primary key consisting of d_id and d_w_id. The d_w_id column is a foreign key pointing to the Warehouse table. This pattern continues in other tables such as Customer, whose composite primary key consists of c_id, c_d_id and c_w_id, where c_d_id and c_w_id are foreign keys pointing to the District and Warehouse tables respectively.

For those who have used any kind of object-relational mapping framework, this type of relationship may look totally out of the ordinary. Typically, ORM tools such as Hibernate are proponents of using a surrogate key (like a sequence number) as the primary key of a table. Having a composite primary key on a table makes the Hibernate and Java code harder to write, as you have to create a separate Java class for the primary key. But for extreme OLTP applications, the use of composite primary keys makes perfect sense.
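As a sketch of what that "separate Java class" looks like, here is a plain composite key class for the Customer table. The field names follow the TPC-C columns, but the class itself is illustrative (shown without any ORM annotations); the essential part is that a composite key class must define equality and hashing over all of its parts.

```java
import java.util.Objects;

// Illustrative composite primary key class for the TPC-C Customer table.
// ORMs like Hibernate require such a class for composite keys; the exact
// annotations are omitted here, this only shows the value-object shape.
public class CustomerId {
    private final int cId;   // c_id
    private final int cDId;  // c_d_id -> District
    private final int cWId;  // c_w_id -> Warehouse

    public CustomerId(int cId, int cDId, int cWId) {
        this.cId = cId;
        this.cDId = cDId;
        this.cWId = cWId;
    }

    // Two keys are equal only if every component matches.
    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CustomerId)) return false;
        CustomerId other = (CustomerId) o;
        return cId == other.cId && cDId == other.cDId && cWId == other.cWId;
    }

    // hashCode must be consistent with equals, over the same components.
    @Override public int hashCode() {
        return Objects.hash(cId, cDId, cWId);
    }
}
```

This boilerplate is exactly the coding overhead composite keys impose, and why ORM users tend to avoid them unless, as here, the schema design demands it.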

In applications that process millions of transactions per minute, it's quite likely that performance, throughput and latency will degrade as the tables in which transaction data is kept start filling up rapidly. Throwing additional hardware at this problem works to an extent. When adding hardware doesn't provide a proportionate performance boost, the application is redesigned through various mechanisms, including horizontal and vertical partitioning of the relational tables.

Many applications use range partitioning to improve performance. It works somewhat like this. Let's say a company, ABC Inc., sells consumer durable goods in the United States and has 4 operational units, one for each of its 4 regions - east, west, north and south.

Its customer order entry application generates order numbers based on the region the order originated from. ABC Inc. has built application logic in which an order from the east region gets a number between 1 and 1 million, one from the west gets a number between 1 million and 2 million, and so on. To provide better performance, ABC's IT department has partitioned the order table into 4 partitions and hosted each partition on a different server. For many companies this works without much of a problem. However, this scheme has some inherent weaknesses, such as uneven load distribution. If one of ABC Inc.'s regions is doing twice as well as the others (in terms of sales), the server meant to capture that region's orders will be twice as busy as the others, giving inconsistent user response times across regions. Another factor to consider when partitioning tables is the co-location of related data. If the orders table is partitioned, what do we do about the other tables? Do your queries often join partitioned and non-partitioned tables? Why is this important?
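The range scheme above boils down to a simple routing function, sketched here in Java (the server names and range boundaries are hypothetical, matching the ABC Inc. example):

```java
// Sketch of range-based routing for ABC Inc.'s orders: the order number
// alone decides which server holds the row. Because the ranges are fixed,
// a region that sells twice as much simply makes its server twice as busy.
public class RangeRouter {
    static final long MILLION = 1_000_000L;

    static String serverFor(long orderNumber) {
        if (orderNumber <= 1 * MILLION) return "server-east";
        if (orderNumber <= 2 * MILLION) return "server-west";
        if (orderNumber <= 3 * MILLION) return "server-north";
        return "server-south";
    }

    public static void main(String[] args) {
        System.out.println(serverFor(42));        // server-east
        System.out.println(serverFor(1_500_000)); // server-west
    }
}
```

Note that nothing in this function can rebalance itself: the uneven-load problem described above is baked into the fixed ranges.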

Continuing with the ABC Inc. example - the IT department decides to partition the payments table too. A salesperson from ABC would very likely want to see orders from his or her customers, and also whether those customers have paid for the items sold to them. For the application to keep providing the best scalability and latency, it is imperative that the partitions of the payments table are co-located with the partitions of the orders table on the same server. Co-locating a customer's order and payment information on the same server is the key, but that adds to the complexity of your application. Add another related table and you can imagine how complex the design is going to get.

Hash partitioning solves one of these problems, that of uneven load on different servers. Which server a particular piece of data resides on is decided by a hash algorithm, which guarantees a more even load across servers. The other problem, co-location of data, is solved by the constrained tree schema and the hash algorithm together. Going back to our TPC-C schema, tables such as District, Customer and Stock all have the warehouse id in their primary keys. So, including the warehouse id in the partitioning hash for all the tables automatically guarantees co-location of data. That means data related to one warehouse and its related districts, customers and stock is located on the same server. You may really want to read "The End of an Architectural Era" for further details. It's a great read for anyone remotely interested in database design and improving performance. Low-latency.com is another great resource for those interested in extreme-performance financial applications.
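The co-location argument can be shown in a few lines of Java. This is a sketch, not production partitioning logic: the server count is hypothetical, and real systems use far more careful hash schemes. The point is only that when every table carries w_id in its key, hashing on w_id sends all of a warehouse's related rows to the same server.

```java
// Sketch of hash partitioning keyed on warehouse id. In a constrained tree
// schema, District, Customer and Stock rows all carry w_id in their primary
// keys, so routing every table by w_id co-locates a warehouse's data.
public class HashRouter {
    static final int SERVERS = 4; // hypothetical cluster size

    static int serverFor(int warehouseId) {
        // floorMod keeps the result in [0, SERVERS) even for negative hashes.
        return Math.floorMod(Integer.hashCode(warehouseId), SERVERS);
    }

    public static void main(String[] args) {
        int wId = 17;
        // A District row (d_id, d_w_id=17) and a Customer row (c_id, c_d_id,
        // c_w_id=17) both route by the same warehouse id...
        int districtServer = serverFor(wId);
        int customerServer = serverFor(wId);
        // ...so they necessarily land on the same server.
        System.out.println(districtServer == customerServer); // true
    }
}
```

A query touching one warehouse's districts, customers and stock therefore never needs to cross servers, which is exactly the property the constrained tree schema is designed to guarantee.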

I hope this helps clear up some of the concepts behind extreme scaling and high-performance application/database design. Don't forget to leave a comment if you have any.