Friday, March 23, 2012

Aleri - Complex Event Processing - Part I


Algo Trading - a CEP use case
Sybase's Aleri streaming platform is one of the more popular products in the CEP market segment. It is used in Sybase's trading platform, the RAP edition, which is widely used in capital markets to manage positions in a portfolio. Today, in the first part of this multi-part series, I want to provide an overview of the Aleri platform along with some code samples where required. In the second part, I will present the Aleri Studio, the Eclipse-based GUI that simplifies the task of modeling a CEP workflow and monitoring the Aleri server through a dashboard.





Fraud Detection - Another CEP use case

In my previous blog post on Complex Event Processing, I demonstrated the use of Esper, the open source CEP software, and the Twitter4J API to handle a stream of tweets from Twitter. A CEP product is much more than handling just one stream of data, though. A single stream of data can be handled easily by standard asynchronous messaging platforms and does not pose very challenging scalability or latency issues. But when it comes to consuming more than one real-time stream of data, analyzing it in real time, and correlating the streams with one another, nothing beats a CEP platform. The sources feeding a streaming platform can vary in speed, volume and complexity. A true enterprise-class CEP platform should deal with varied real-time, high-speed data like stock tickers and slower but voluminous offline batch uploads with equal ease. Apart from providing standard interfaces, a CEP product should also provide an easy programming language to query the streaming data and to generate continuous intelligence through features such as pattern matching and snapshot querying.
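
To make the multi-stream point concrete, here is a minimal, illustrative sketch using Esper (the open source engine from my earlier post). The Tick and NewsItem event classes, the stream names and the 30-second windows are assumptions I chose for the example, not anything prescribed by a particular product.

import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
import com.espertech.esper.client.EventBean;
import com.espertech.esper.client.UpdateListener;

public class TwoStreamCorrelation {

    // Illustrative event classes; any JavaBean-style class works with Esper.
    public static class Tick {
        private final String symbol;
        private final double price;
        public Tick(String symbol, double price) { this.symbol = symbol; this.price = price; }
        public String getSymbol() { return symbol; }
        public double getPrice() { return price; }
    }

    public static class NewsItem {
        private final String symbol;
        public NewsItem(String symbol) { this.symbol = symbol; }
        public String getSymbol() { return symbol; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("Tick", Tick.class);
        config.addEventType("NewsItem", NewsItem.class);
        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        // Correlate the two streams: average price per symbol over the last
        // 30 seconds, but only for symbols that also appeared on the news stream.
        String epl = "select t.symbol as symbol, avg(t.price) as avgPrice "
                   + "from Tick.win:time(30 sec) as t, NewsItem.win:time(30 sec) as n "
                   + "where t.symbol = n.symbol group by t.symbol";
        EPStatement stmt = engine.getEPAdministrator().createEPL(epl);
        stmt.addListener(new UpdateListener() {
            public void update(EventBean[] newEvents, EventBean[] oldEvents) {
                if (newEvents == null) return;
                for (EventBean event : newEvents) {
                    System.out.println(event.get("symbol") + " avg=" + event.get("avgPrice"));
                }
            }
        });

        // Feed a few sample events; in a real application these would come
        // from market data and news adapters.
        engine.getEPRuntime().sendEvent(new NewsItem("IBM"));
        engine.getEPRuntime().sendEvent(new Tick("IBM", 180.25));
        engine.getEPRuntime().sendEvent(new Tick("IBM", 181.00));
    }
}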

Sybase Trading Platform - the RAP edition
To keep it simple and at a high level, CEP can be broken down into three basic parts. The first is the mechanism to grab or consume source data. Next is the process of investigating that data and identifying events and patterns. The last is interacting with target systems by handing them the actionable items. The actionable events take different forms and formats depending on the application you are using CEP for. An action item could be selling an equity position based on calculated risk in a risk-monitoring application, flagging potential fraud in a money-laundering application, or alerting on a catastrophic event in a monitoring system that reads thousands of sensors in a chemical plant. There are literally thousands of scenarios where a manual, offline inspection of data is simply not an option. After you go through the following section, you may want to try Aleri yourself. This link http://www.sybase.com/aleriform takes you directly to the Aleri download page. An evaluation copy, valid for 90 days, is freely available from Sybase's official website. A good amount of documentation, an excellent tutorial and some sample code on the website should help you get started quickly.

If you are an existing user of any CEP product, I encourage you to compare Aleri with that product and share your findings with the community or comment on this blog. By somewhat dated estimates, Tibco is the biggest CEP vendor in the market. I am not sure how much market share another leading product, StreamBase, has. There is also a webinar on YouTube that explains CEP benefits in general and some key features of StreamBase in particular. For newcomers, it serves as an excellent introduction to CEP and a capital markets use case.

An application on the Aleri CEP platform is built by creating a model using the Studio (the GUI), using SPLASH (the language), or using the Aleri Modeling Language (ML) directly - the final form before the model is deployed.

Following is a list of the key features of SPLASH.

  • Data types - supports standard data types and XML; also supports 'typedef' for user-defined data types.
  • Access control - granular access control enabling access at the level of a stream or a module (containing many streams).
  • SQL - another way of building a model. Building a model in the Aleri Studio can take longer because of its visual paradigm; someone proficient with SQL should be able to do it much faster using Aleri SQL, which is very similar to the regular SQL we all know.
  • Joins - supported joins are inner, left, right and full.
  • Filter expressions - include Where, Having and Group Having.
  • ML - Aleri SQL produces the data model in the Aleri Modeling Language (ML); a proficient ML user might use ML alone (in place of the Aleri Studio and Aleri SQL) to build a model.
  • The pattern matching language - includes constructs such as 'within' to indicate an interval (sliding window), 'from' to indicate the stream of data, and the interesting 'fby' that indicates a sequence (followed by).
  • User-defined functions - the UDF interface provided in SPLASH allows you to create functions in C++ or Java and use them within a SPLASH expression in the model (a minimal sketch follows this list).
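
As a rough illustration of the last point, the sketch below is just a plain Java function of the kind you might expose to SPLASH. The class and method names are mine, and the actual steps for declaring the function to the Aleri model (so that a SPLASH expression can call it) are product-specific and not shown here.

// Illustrative only: a plain Java function that a SPLASH expression could
// call once it has been registered with the model via Aleri's UDF interface.
// The names below are assumptions, not part of the Aleri API.
public class RiskFunctions {

    // Signed notional exposure for a fill: price * shares, negative for sells.
    // A SPLASH expression might then use something like exposure(Price, Shares, Side).
    public static double exposure(double price, long shares, String side) {
        double notional = price * shares;
        return "SELL".equalsIgnoreCase(side) ? -notional : notional;
    }
}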


Advanced pattern matching capabilities are best explained through examples. The following code segments and their explanations are taken directly from Sybase's documentation on Aleri.
The first example checks whether a broker enters a buy order on a stock for his or her own account, then enters a buy order on the same stock for a customer, and then sells that stock from his or her own account. It creates a “buy ahead” event when those actions have occurred in that sequence.

within 5 minutes
from
BuyStock[Symbol=sym; Shares=n1; Broker=b; Customer=c0] as Buy1,
BuyStock[Symbol=sym; Shares=n2; Broker=b; Customer=c1] as Buy2,
SellStock[Symbol=sym; Shares=n1; Broker=b; Customer=c0] as Sell
on Buy1 fby Buy2 fby Sell
{
if ((b = c0) and (b != c1)) {
output [Symbol=sym; Shares=n1; Broker=b];
}
}


This example checks for three events, one following the other, using the fby relationship. Because the same variable sym is used in all three patterns, the Symbol values in the three events must be the same. Different variables might have the same value, though (e.g., n1 and n2). It outputs an event if the Broker and Customer from the Buy1 and Sell events are the same, and the Customer from the Buy2 event is different.

The next example shows Boolean operations on events. The rule describes a possible theft condition, when there has been a product reading on a shelf (possibly through RFID), followed by a non-occurrence of a checkout on that product, followed by a reading of the product at a scanner near the door.

within 12 hours
from
ShelfReading[TagId=tag; ProductName=pname] as onShelf,
CounterReading[TagId=tag] as checkout,
ExitReading[TagId=tag; AreaId=area] as exit
on onShelf fby not(checkout) fby exit
output [TagId=tag; ProductName=pname; AreaId=area];

The next example shows how to raise an alert if a user tries to log in to an account unsuccessfully three times within 5 minutes.

within 5 minutes
from
LoginAttempt[IpAddress=ip; Account=acct; Result=0] as login1,
LoginAttempt[IpAddress=ip; Account=acct; Result=0] as login2,
LoginAttempt[IpAddress=ip; Account=acct; Result=0] as login3,
LoginAttempt[IpAddress=ip; Account=acct; Result=1] as login4
on (login1 fby login2 fby login3) and not(login4)
output [Account=acct];

People wishing to break into computer systems often scan a number of TCP/IP ports for an open one, and attempt to exploit vulnerabilities in the programs listening on those ports. Here’s a rule that checks whether a single IP address has attempted connections on three ports, and whether those have been followed by the use of the “sendmail” program.

within 30 minutes
from
Connect[Source=ip; Port=22] as c1,
Connect[Source=ip; Port=23] as c2,
Connect[Source=ip; Port=25] as c3,
SendMail[Source=ip] as send
on (c1 and c2 and c3) fby send
output [Source=ip];

Aleri provides many interfaces out of the box for easy integration with source and target systems. Through these interfaces/adapters the Aleri platform can communicate with standard relational databases, messaging frameworks like IBM MQ, sockets and file system files. Data in various formats such as CSV, FIX, Reuters market data, SOAP, HTTP and SMTP is easily consumed by Aleri through standardized interfaces.

The following techniques are available for integrating Aleri with other systems.


  • Pub/sub API - a standard publish/subscribe mechanism, provided in Java, C++ and .NET.
  • SQL interface - SELECT, UPDATE, DELETE and INSERT statements issued over an ODBC or JDBC connection (see the sketch after this list).
  • Built-in adapters for market data and FIX.
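
As a quick illustration of the JDBC route, here is a minimal sketch that opens a connection and selects from a stream. The driver class, connection URL, credentials and the Positions stream name are assumptions for illustration only; the actual driver and URL format come from the Aleri documentation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AleriJdbcQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical driver class and URL - substitute the values shipped
        // with your Aleri installation; these are assumptions, not the documented API.
        Class.forName("com.sybase.aleri.jdbc.Driver");
        try (Connection con = DriverManager.getConnection(
                "jdbc:aleri://localhost:22000", "user", "password");
             Statement stmt = con.createStatement();
             // 'Positions' is an illustrative stream name.
             ResultSet rs = stmt.executeQuery(
                     "SELECT Symbol, Shares, Broker FROM Positions")) {
            while (rs.next()) {
                System.out.printf("%s %d %s%n",
                        rs.getString("Symbol"), rs.getLong("Shares"), rs.getString("Broker"));
            }
        }
    }
}
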
In the next part of this series, we will look at the Aleri Studio, the GUI that helps us build a CEP application the easy way.

Tuesday, March 13, 2012

Sybase IQ 15.4 - for big data analytics


The big data analytics platform Sybase IQ 15.4 Express Edition is now available for free. This article introduces Sybase IQ's big data features and shares some valuable resources with you. Sybase IQ, with its installation base of more than 4,500 clients all over the world, has long been a leading columnar database for mission-critical analytic functions. With new features like native support for MapReduce and in-database analytics, it positions itself as a premium offering for big data analytics.

Columnar databases have been in existence for almost two decades now. Row-oriented databases for OLTP transactions and columnar databases for analytics have, more or less, met the requirements of organizations. Extracting, transforming and loading (ETL) transactional data into an analytics platform has been a big business. However, the recent focus on big data and its nature (semi-structured, unstructured) is changing the landscape of analytics platforms and the way ETL is used. The boundary between OLTP and analytics platforms is blurring somewhat with the advent of real-time analytics.

Due in large part to the tremendous growth of e-commerce in recent years, much of the developer talent pool was attracted to low-latency, high-throughput transactional systems. Without a doubt, the same pool is now gravitating towards the acquisition, management and analysis of semi-structured data. A host of start-ups and established technology companies are creating new tools and methodologies in an effort to address the data deluge we are generating from our use of social networks. Even though there is a myriad of options available to handle big data and analytics, some clear trends and underlying technologies have come to the fore. Hadoop, the open source platform for batch operations on big data, and in-memory analytics platforms that provide real-time business intelligence are two of the most significant. Established vendors like Oracle, IBM and Informatica have been adding new products or updating their existing offerings to meet the new demands in this space.

Business intelligence gathered from only one type of data, whether transactional or semi-structured from social media or machine generated, is not good enough for today's organization. It is really important to glean information from each type of source and correlate one with the other to present the comprehensive, accurate intelligence that critical decision support systems rely on. So it is imperative that vendors provide a set of well-integrated tools that support structured, semi-structured and totally unstructured data like audio and video. "Polyglot persistence", the recent buzzword popularized by Martin Fowler, addresses the need for and nature of different types of data stores. You may have multiple systems for storage/persistence, but ultimately an enterprise needs a comprehensive view of its business through one lens. This is what Sybase IQ, the traditional columnar database product, is trying to provide by offering native support for Hadoop and MapReduce. There are other significant enhancements to the latest enterprise edition of Sybase IQ. What makes this product attractive to the developer community is its free availability. Please note, it is not an evaluation copy but a full-function enterprise edition, with the only restriction being a database size limit of 5 GB. This blog post is a quick summary of IQ's features and points to important resources that will help you try it out.

For the uninitiated, this blog entry by William McKnight provides an introduction to the concepts of columnar databases.

The following paragraph, taken directly from Sybase's web site, sums up the most important features.

Sybase® IQ 15.4 is revolutionizing “Big Data” analytics breaking down silos of data analysis and integrating it into enterprise analytic processes. Sybase IQ offers a single database platform to analyze different data – structured, semi-structured, unstructured – using different algorithms. Sybase IQ 15.4 expands these capabilities with the introduction of a native MapReduce API, advanced and flexible Hadoop integration, Predictive Model Markup Language (PMML) support, and an expanded library of statistical and data mining algorithms that leverage the power of distributed query processing across a PlexQ grid.

Some details on these features are in order.

  • User-defined functions enabling MapReduce - One of the best practices in a three-tier application architecture is to physically and logically separate the business logic from the data. But bringing data to the business logic layer and then moving it back to the persistence layer adds latency. Mission-critical applications typically implement many strategies to reduce that latency, but the underlying theme of all those solutions is the same: keep business logic and data as close to one another as possible. Sybase IQ's native C++ API allows developers to build user-defined functions that implement proprietary algorithms. This means you can build MapReduce-style functions right inside the database, which can yield 10x performance improvements. It also means you can use ordinary SQL from the higher-level business logic, keeping that layer simple, while taking advantage of MapReduce-based parallel processing for higher performance. The MapReduce jobs are executed in parallel on a grid of servers, called a Multiplex or PlexQ in Sybase IQ parlance.
  • Hadoop integration - We discussed the need for analyzing and correlating semi-structured data with structured data earlier. Sybase IQ provides four different ways in which this integration can happen: client-side federation, ETL, query federation and data federation. Hadoop is used to extract data points from unstructured or semi-structured data, which are then used with OLTP data for further analysis. You can request more in-depth information here.
  • PMML support - PMML (Predictive Model Markup Language) support allows the user to create predictive models using popular tools like SAS and R. These models can be executed in an automated fashion, extending the already powerful analytics platform further. A plug-in from Zementis provides the PMML validation and its transformation into Java UDFs.
  • R language support - SQL is woefully inadequate when you need statistical analysis on structured data, but its simplicity and wide adoption make it an attractive query tool. The RJDBC interface in IQ allows an application to use the R programming language to perform statistical functions. R is a very popular open source programming language used in many financial applications today. Please read my blog entry 'Programming with R, it's super' for further information on R.
  • In-database analytics - Sybase IQ, in its latest version 15.4, uses an in-database analytics library called DB Lytix from Fuzzy Logix. This analytics engine has the ability to perform advanced analytics through simple SELECT and EXECUTE statements (a small sketch follows this list). Sybase claims that some of these analytical functions are able to leverage the MapReduce API in some data mining algorithms. DB Lytix, according to Fuzzy Logix's website, supports mathematical and statistical functions, Monte Carlo simulations (univariate and multivariate), data mining and pattern recognition, principal component analysis, linear regression, logistic regression, other supervised learning methods, and clustering.
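
To show what analytics through simple SELECT statements can look like from an application, here is a minimal sketch that calls a scoring function from SQL over JDBC so the computation runs inside the database. The connection URL, table, column names and the score_logistic function are assumptions for illustration; the actual DB Lytix function names and signatures come from Fuzzy Logix's documentation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class InDatabaseScoring {
    public static void main(String[] args) throws Exception {
        // Connection details are illustrative; use the JDBC URL for your IQ server.
        try (Connection con = DriverManager.getConnection(
                "jdbc:sybase:Tds:localhost:2638", "dba", "sql")) {
            // 'score_logistic' stands in for an in-database analytic function;
            // the point is that the model is evaluated next to the data, and
            // only the scored result set crosses the wire.
            String sql = "SELECT customer_id, score_logistic(age, income, balance) AS churn_score "
                       + "FROM customers WHERE region = ?";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, "EMEA");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("customer_id") + " " + rs.getDouble("churn_score"));
                    }
                }
            }
        }
    }
}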

Sybase provides detailed documentation on the features of IQ on its website. If you would like to try Sybase IQ, use this direct link. Remember, this edition is not an eval copy; it is a full-featured IQ edition limited only by a database size of 5 GB.

As you can imagine, Sybase is not the only company offering big data solutions. This GigaOM article lists a few others as well. With so many good options to choose from, users have to look at factors other than just the technology to make a selection. Some of those factors are the reputation of the brand, its standing in the market, the extent of its user base and the ecosystem around the product. In that regard, Sybase IQ scores very high, which is one reason for its position in Gartner's leaders quadrant. Sybase IQ has been a leading columnar database in the market since the 90s and has established a robust ecosystem around it. Sybase's other products - PowerDesigner, the leading data modeling workbench; SAP BusinessObjects for reporting and analytics; and Sybase Control Center for administration and monitoring of IQ - support IQ and provide one of the most comprehensive analytics platforms in the industry.