Tuesday, September 25, 2012

SAP Sybase ASE - An introduction for a newbie

Since the day of its acquisition of Sybase, SAP has made great strides in porting its business suite onto Sybase ASE (Adaptive Server Enterprise, the Sybase relational database). Last week's announcement from SAP is just one of the many milestones that symbolise its commitment to Sybase products and highlight the successes. The highest number in the SD benchmark was achieved on a 2 core HP/Linux machine by ASE. Check out the details here.

As SAP works towards its stated goal of becoming a premier database company in the world, it is only natural that more and more SAP practitioners and enthusiasts become interested in ASE and look to use it not only for SAP applications but also for other in-house applications. For a Java programmer, familiarity with the JDBC API is enough to use a standard RDBMS. But once in a while one is required to connect directly to the underlying database to debug an issue. Unlike production systems, where you can depend on a DBA for support, in a development environment you are often left to deal with problems yourself. Quirks or peculiarities in a new relational database system can be a source of frustration. Some features that make total sense in one database may not work the same way in another. It's those little quirks that can drive one crazy when dealing with tight deadlines.

I have been working with SAP Sybase Adaptive Server Enterprise for more than a year now, and I feel comfortable sharing some bits and pieces I put together over that period. Much of this information is available on the web or in the product documentation, but digging through it can take time, and some of it sits in the most unlikely places. Having worked with other popular databases before ASE, I picked these key features because they seemed interesting or a bit different, or because I came across them too often. This is not a comprehensive list of ASE features, but it should save a newcomer some time in getting off the ground. So here it goes.
First, SAP allows free downloads of developer versions of most of the Sybase products. You can download the one you want from here. Consider revisiting this post before you install ASE.

1. Concept of devices and databases in ASE - They resemble the concepts of databases and schemas in Oracle.
2. Finding the version of the database you are running - Sybase products including ASE are evolving fast to cater to the SAP ecosystem. Major features are being added fairly regularly. When something isn't working as expected, the first thing you may want to do before calling tech support is to find the version of the database you are running and refer to the product documentation here.
  • dataserver -v 
3. Sybase environment variables are set in the SYBASE.csh file on a Linux installation. The server startup/shutdown scripts can be found in the following directory:
  • your sybase home directory/ASE-15_0/install
4. If you believe that an installed instance is not responding to your application code, the most likely reason could be that the thread is kept waiting due to the lack of log space. To free up the space you can use following commands. Refer to sybase documentation on what each does, but if you are not concerned about losing the running transaction, then either of the following should work. You should see your transaction going through as soon as you run the following command.
  • DUMP TRANSACTION your_database_name with no_log
Before you try above commands, you may want to look at all the running connections/threads to ASE and their status by executing sp_who stored procedure.  
5. One of the most important things you need to decide while installing ASE is the page size. All data from one row of a table is colocated in one page to improve performance. Once defined, the page size cannot be changed for the installed ASE instance. The list below shows the relationship between page sizes and the corresponding maximum possible database sizes.
  • 2K page size - 4 TB
  • 4K page size - 8 TB
  • 8K page size - 16 TB
  • 16K page size - 32 TB
If you are not sure what page size the installed ASE instance is using, then run the following command from isql utility, the client used to connect to ASE.
  • select @@maxpagesize
6. If you use Hibernate for data access, you may want to visit hibernate-sybase integration page here. Spending a few minutes on the required ASE settings for hibernate could  save you from some frustration later on.
7. Once the installation is complete you can run the Sybase Control Center application, which provides a GUI for all your ASE instances. Alternatively, you can use isql, a command line SQL interface, to interact with ASE. The super admin login is 'sa' and its password can be left blank.
8. Use following stored procedures to add users. sp_addlogin just adds a login and not a user.
  • sp_addlogin - adds only a login name 
  • sp_adduser - adds a new user.
9. Make sure to assign a role with adequate privileges to the new user by executing following procedure.
  • sp_role "grant", some_role, username 
10. Before you try any SQL you may want to check the reserved words of ASE by using the following command. I recommend going through this list if you are porting an existing application to ASE.
  • select name from master..spt_values where type = "W"
11. It's a common practice to add an autoincrement column to a table and use it as a primary key. Autoincrement is achieved in ASE by declaring a column as IDENTITY. ASE does not allow more than one autoincrement column in a table. There is no 'sequence' as in Oracle. ASE allows inserting user defined values into an autoincrement column after you run the following command.
  • set identity_insert tablename on
12. If you have used MS SQL Server, then you will find ASE's T-SQL very similar, as both share the same roots. T-SQL is similar to PL/SQL in Oracle.
13. ASE's native JDBC driver is called jConnect. jTDS, an open source driver, can also be used. jConnect supports changing connection properties through the connection URL. Just append 'property_name=property_value' to the connection URL. For example
  •  'jdbc:sybase:tds:servername:portnumber/database?QUOTED_IDENTIFIER=ON'
allows quoted identifiers in a sql statement like below.
  • SELECT * from  "dbo". "User_Table"
If you want to run your existing relational database application on ASE then it is really important to pay attention to this connection property. However, if you are building all sqls as 'preparedStatements' then quoted strings are handled correctly even when quoted_identifier is not explicitly turned on.
14. Also important to note is the difference in handling table aliases. Table aliases are allowed in 'SELECT' statements but not in 'UPDATE' or 'DELETE'. For example the first sql below is valid but the second returns an error.
  • select * from tablename t1 
  • update tablename t1 set columnname=value where column2=value2
15. The Sybase jConnect driver depends on some key metadata information to work correctly. In the absence of this information you may receive a SQL exception with a description similar to 'missing metadata'. If this happens, it means that you missed a step during ASE installation, the one that installs the metadata information in the master database. This error can be eliminated by running a stored procedure after the installation. Refer to the documentation here.
16. 'Select for Update' was introduced in ASE 15.7 and you would expect it to work by default in a new install. Alas, no such luck. I highly recommend reading this document before you try this feature. You need the 'datarows' locking scheme on the table on which the select is performed. You can make 'datarows' the server-wide default locking scheme for newly created tables, or switch a single existing table to it, using the following commands.
  • sp_configure "lock scheme", 0, datarows
  • alter table tablename lock datarows
17. ASE is case sensitive by default. To display case sensitivity, sort order and character set settings, use sp_helpsort stored procedure.
18. The following information is useful if you want to programmatically access metadata from ASE. The sysobjects table in each database holds one row per object, such as tables, views and stored procedures. To display all user tables in the current database use the following command.
  • Select * from sysobjects where type = 'U'
syscolumns table holds information of all columns from all tables.
sysreferences, sysconstraints hold information on relationships between  tables like foreign key constraints.
19. By joining above tables you can get pretty much all the information about a user table. Unfortunately there is no simple query like 'describe table' in Oracle. You can use following stored procedure to get all the details about a table, but it will include much more than just the column names and their types.
  • sp_help tablename
If you want just the column names of a user table, use following sql.
  • select sc.name from syscolumns sc, sysobjects so where sc.id=so.id and so.name='tablename'
20. Unlike Oracle, ASE truncates all trailing blanks while storing variable length varchar columns. So be careful if your code has comparisons on strings pulled from varchar columns.
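Items 13 and 20 can be illustrated with a small Java sketch. The class and method names (AseTips, jconnectUrl, sameIgnoringTrailingBlanks) are made up for this post and are not part of jConnect. The URL layout simply follows the example in item 13; joining several properties with '&' and the port 5000 are assumptions, so check them against the jConnect documentation and your own server.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class for illustration; not part of jConnect or ASE.
class AseTips {

    // Builds a URL shaped like the example in item 13:
    //   jdbc:sybase:tds:servername:portnumber/database?NAME=VALUE
    // Assumption: further properties are joined with '&'.
    static String jconnectUrl(String host, int port, String database,
                              Map<String, String> properties) {
        StringBuilder url = new StringBuilder("jdbc:sybase:tds:")
                .append(host).append(':').append(port)
                .append('/').append(database);
        char separator = '?';
        for (Map.Entry<String, String> p : properties.entrySet()) {
            url.append(separator).append(p.getKey()).append('=').append(p.getValue());
            separator = '&';
        }
        return url.toString();
    }

    // Removes trailing spaces only; leading spaces are kept.
    static String stripTrailingBlanks(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == ' ') {
            end--;
        }
        return s.substring(0, end);
    }

    // Compares an application string with a value read back from a varchar
    // column, ignoring the trailing blanks that ASE drops on storage.
    static boolean sameIgnoringTrailingBlanks(String a, String b) {
        return stripTrailingBlanks(a).equals(stripTrailingBlanks(b));
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("QUOTED_IDENTIFIER", "ON");
        System.out.println(jconnectUrl("servername", 5000, "database", props));
        System.out.println(sameIgnoringTrailingBlanks("abc  ", "abc"));
    }
}
```

Normalizing both sides before comparing keeps the application code independent of whether the value has already made a round trip through the database.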

I hope this list will help you install and navigate ASE in early stages. I mentioned these in particular because I had to use these commands  more often than any others. Once you spend a little bit of time on ASE, you will of course come across other things that seem more useful. I will try to list those in my next post.

Sunday, April 8, 2012

Understanding Aleri - Complex Event Processing - Part II

Aleri, the complex event processing platform from Sybase was reviewed at high level in my last post.

This week, let's review the Aleri Studio, the user interface to Aleri platform and the use of pub/sub api, one of many  ways to interface with the Aleri platform. The studio is an integral part of the platform and comes packaged with the free evaluation copy. If you haven't already done so, please download a copy from here.  The fairly easy installation process of Aleri product gets you up and running in a few minutes.

The Aleri Studio is an authoring platform for building the model that defines interactions and sequencing between various data streams. It can also merge multiple streams to form one or more streams. With this Eclipse-based studio, you can test the models you build by feeding them test data and monitor the activity inside the streams in real time. Let's look at the various types of streams you can define in Aleri and their functionality.

Source Stream - Only this type of stream can handle incoming data. The operations that incoming data can perform are insert, update, delete and upsert. Upsert, as the name suggests, updates a row if the key defining it is already present in the stream; otherwise, it inserts a new record into the stream.

Aggregate Stream - This stream creates a summary record for each group defined by a specific attribute. This provides functionality equivalent to 'group by' in ANSI SQL.

Copy stream - This stream is created by copying another stream but with a different retention rule.

Compute Stream - This stream allows you to use a function on each row of data to get a new computed element for each row of the data stream.

Extend Stream - This stream is derived from another stream by adding column expressions.

Filter Stream - You can define a filter condition for this stream. Just like extend and compute streams, this stream applies filter conditions on other streams to derive a new stream.

Flex Stream - Significant flexibility in handling streaming data is achieved through custom coded methods. Only this stream allows you to write your own methods to meet special needs.

Join Stream - Creates a new stream by joining two or more streams on some condition. Both inner and outer joins can be used to join streams.

Pattern Stream - Pattern matching rules are applied with this stream.

Union Stream - As the name suggests, this combines two or more streams with the same row data structure. Unlike the join stream, this stream includes all the data from all the participating streams.
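To make the four source stream operations concrete, here is a minimal Java sketch of a keyed store. It is not Aleri code: the class name KeyedStreamSketch is invented, and rejecting an insert on an existing key or an update on a missing key is my assumption about sensible behaviour, not something taken from the Aleri documentation. Only the upsert rule comes straight from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a keyed source stream (id -> text).
class KeyedStreamSketch {
    private final Map<Long, String> rows = new LinkedHashMap<>();

    // Insert succeeds only if the key is new (assumed behaviour).
    boolean insert(long id, String text) {
        if (rows.containsKey(id)) {
            return false;
        }
        rows.put(id, text);
        return true;
    }

    // Update succeeds only if the key already exists (assumed behaviour).
    boolean update(long id, String text) {
        if (!rows.containsKey(id)) {
            return false;
        }
        rows.put(id, text);
        return true;
    }

    // Delete reports whether a row was actually removed.
    boolean delete(long id) {
        return rows.remove(id) != null;
    }

    // Upsert: update if the key is present, otherwise insert.
    void upsert(long id, String text) {
        rows.put(id, text);
    }

    String get(long id) {
        return rows.get(id);
    }

    int size() {
        return rows.size();
    }

    public static void main(String[] args) {
        KeyedStreamSketch stream = new KeyedStreamSketch();
        stream.upsert(1L, "first version");   // key absent: behaves like insert
        stream.upsert(1L, "second version");  // key present: behaves like update
        System.out.println(stream.get(1L) + " / rows: " + stream.size());
    }
}
```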

By using some of these streams and the pub API of Aleri, I will demonstrate the segregation of a live Twitter feed into two different streams. The live feed is consumed by a listener from the Twitter4j library. If you just want to try the Twitter4j library first, please follow my earlier post 'Tracking user sentiments on Twitter'. The data received by the Twitter4j listener is fed to a source stream in our model by using the publication API from Aleri. Building on the example from my previous post, we will divide the incoming stream into two streams based on tweet content. One stream will get any tweets that contain 'lol' and the other gets tweets with a smiley face ":)" in the text. First, let's list the tasks we need to perform to make this a working example.
  1. Create a model with three streams
  2. Validate the model is error free
  3. Create a static data file
  4. Start the Aleri server and feed the static data file to the stream manually to confirm correct working of the model.
  5. Write java code to consume twitter feed. Use the publish API to publish the tweets to Aleri platform.
  6. Run the demo and see the live data as it flows through various streams. 

Image 1 - Aleri Studio - the authoring view
This image is a snapshot of the Aleri Studio with the three streams - one on the left named "tweets" is a source stream and two on the right named "lolFilter" and "smileyFilter" are of the filter type. Source stream accepts incoming data while filter streams receive the filtered data. Here is how I defined the filter conditions -
like (tweets.text, '%lol%').
tweets is the name of the stream and text is the field in the stream we are interested in. %lol% means: select any tweets that have the string 'lol' in the content. Each stream has only 2 fields - id and text. They map to the id and the text message sent by Twitter. Once you define the model, you can check it for errors by clicking on the check mark in the ribbon at the top. Errors, if any, will show up in the panel at the bottom right of the image. Once your model is error free, it's time to test it.
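For readers who think in Java rather than in stream models, the two filter conditions behave roughly like the sketch below. It is plain Java, not the Aleri API: the names TweetFilterSketch, Tweet and containing are invented for illustration, and I am assuming the match is a case-sensitive substring test, which is how '%lol%' reads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the lolFilter and smileyFilter conditions.
class TweetFilterSketch {

    // Same two fields as the streams in the model: id and text.
    static class Tweet {
        final long id;
        final String text;

        Tweet(long id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    // Keeps the tweets whose text contains the fragment, like '%fragment%'.
    static List<Tweet> containing(List<Tweet> source, String fragment) {
        List<Tweet> matches = new ArrayList<>();
        for (Tweet tweet : source) {
            if (tweet.text.contains(fragment)) {
                matches.add(tweet);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Tweet> tweets = Arrays.asList(
                new Tweet(1L, "that was funny lol"),
                new Tweet(2L, "good morning :)"),
                new Tweet(3L, "nothing to see here"));
        System.out.println("lolFilter rows: " + containing(tweets, "lol").size());
        System.out.println("smileyFilter rows: " + containing(tweets, ":)").size());
    }
}
```

Note that a tweet containing both 'lol' and ':)' lands in both filter streams, since each filter is evaluated independently against the source stream.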

Image 2 - Aleri Studio - real time monitoring view

The following image shows the test interface of the studio. Try running your model with a static data file first. The small red square at the top indicates that Aleri server is currently running. The console window at the bottom right shows server messages like successful starts and stops etc. The Run-test tab in the left pane, is where you pick a static data file to feed the source stream. The pane on the right shows all the currently running streams and live data processed by the streams.

The image below shows the format of the data file used to test the model

The next image shows the information flow.

Image 3 - Aleri Publishing - Information flow

The source code for this exercise is at the bottom.
Remember that you need to have twitter4j library in the build path and have Aleri server running before you run the program. Because I have not added any timer to the execution thread, the only way to stop the execution is to abort it. For brevity and to keep the code line short, I have deleted all the exception handling and logging. The code utilizes only the publishing part of the pub/sub api of Aleri.  I will demonstrate the use of sub side of the api in my next blog post.

This blog intends to provide something simple but useful to the developer community.  Feel free to leave your comment and share this article if you like it.

Friday, March 23, 2012

Aleri - Complex Event Processing - Part I

Algo Trading - a CEP use case
Sybase's Aleri streaming platform is one of the more popular products in the CEP market segment. It is used in Sybase's trading platform - the RAP edition, which is widely used in capital markets to manage positions in a portfolio. Today, in the first of a multi-part series, I want to provide an overview of the Aleri platform and provide some code samples where required. In the second part, I will present the Aleri Studio, the Eclipse-based GUI that simplifies the task of modeling a CEP workflow and monitoring the Aleri server through a dashboard.

Fraud Detection - Another CEP use case

In my previous blog post on Complex Event Processing, I demonstrated the use of Esper, the open source CEP software, and the Twitter4J API to handle a stream of tweets from Twitter. A CEP product is much more than handling just one stream of data, though. A single stream of data could easily be handled through standard asynchronous messaging platforms and does not pose very challenging scalability or latency issues. But when it comes to consuming more than one real-time stream of data and analyzing it in real time, and when correlation between the streams is important, nothing beats a CEP platform. The sources feeding a streaming platform can vary in speed, volume and complexity. A true enterprise class CEP should deal with real-time high speed data like stock tickers and with slower but voluminous offline batch uploads with equal ease. Apart from providing standard interfaces, a CEP should also provide an easy programming language to query the streaming data and to generate continuous intelligence through such features as pattern matching and snapshot querying.

Sybase Trading Platform - the RAP edition
To keep it simple and at a high level, CEP can be broken down into three basic parts. The first is the mechanism to grab/consume source data. Next is the process of investigating that data and identifying events and patterns. The last is interacting with target systems by providing them with actionable items. The actionable events take different forms and formats depending on the application you are using the CEP for. An action item could be selling an equity position based on calculated risk in a risk monitoring application, indicating potential fraud events in a money laundering application, or alerting to a catastrophic event in a monitoring system that reads thousands of sensors in a chemical plant. There literally are thousands of scenarios where a manual and off-line inspection of data is simply not an option. After you go through the following section, you may want to try Aleri yourself. This link http://www.sybase.com/aleriform takes you directly to the Aleri download page. An evaluation copy valid for 90 days is freely available from Sybase’s official website. A good amount of documentation, an excellent tutorial and some sample code on the website should help you get started quickly.

If you are an existing user of any CEP product, I encourage you to compare Aleri with that product and share your findings with the community or comment on this blog. By somewhat dated estimates, Tibco is the biggest CEP vendor in the market. I am not sure how much market share another leading product, StreamBase, has. There is also a webinar you can listen to on YouTube that explains CEP benefits in general and some key features of StreamBase in particular. For newcomers, it serves as an excellent introduction to CEP and a capital markets use case.

An application on Aleri CEP is built by creating a model using the Studio (the GUI), using Splash (the language), or using the Aleri Modeling Language (ML) - the final stage before it is deployed.

Following is a list of the key features of Splash.

  • Data Types - Supports standard data types and XML. Also supports ‘Typedef’ for user defined data types.
  • Access Control - granular access control enabling access to a stream or to modules (containing many streams)
  • SQL - another way of building a model. Building an Aleri Studio model could take longer due to its visual paradigm. Someone proficient with SQL should be able to do it much faster using Aleri SQL, which is very similar to the regular SQL we all know.
  • Joins - supported joins are Inner, Left, Right and Full
  • Filter expressions - include Where, Having and Group Having
  • ML - Aleri SQL produces a data model in the Aleri Modeling Language (ML). A proficient ML user might use only ML (in place of Aleri Studio and Aleri SQL) to build a model.
  • The pattern matching language - includes constructs such as ‘within’ to indicate an interval (sliding window), ‘from’ to indicate the stream of data and the interesting ‘fby’ that indicates a sequence (followed by)
  • User defined functions - the user defined function interface provided in Splash allows you to create functions in C++ or Java and to use them within a Splash expression in the model.

Advanced pattern matching - capabilities are explained through examples here. The following code segments and their explanations are taken directly from Sybase's documentation on Aleri.
The first example checks to see whether a broker sends a buy order on the same stock as one of his or her customers, then inserts a buy order for the customer, and then sells that stock. It creates a “buy ahead” event when those actions have occurred in that sequence.

within 5 minutes
BuyStock[Symbol=sym; Shares=n1; Broker=b; Customer=c0] as Buy1,
BuyStock[Symbol=sym; Shares=n2; Broker=b; Customer=c1] as Buy2,
SellStock[Symbol=sym; Shares=n1; Broker=b; Customer=c0] as Sell
on Buy1 fby Buy2 fby Sell
if ((b = c0) and (b != c1)) {
output [Symbol=sym; Shares=n1; Broker=b];
}

This example checks for three events, one following the other, using the fby relationship. Because the same variable sym is used in three patterns, the values in the three events must be the same. Different variables might have the same value, though (e.g., n1 and n2.) It outputs an event if the Broker and Customer from the Buy1 and Sell events are the same, and the Customer from the Buy2 event is different.

The next example shows Boolean operations on events. The rule describes a possible theft condition, when there has been a product reading on a shelf (possibly through RFID), followed by a non-occurrence of a checkout on that product, followed by a reading of the product at a scanner near the door.

within 12 hours
ShelfReading[TagId=tag; ProductName=pname] as onShelf,
CounterReading[TagId=tag] as checkout,
ExitReading[TagId=tag; AreaId=area] as exit
on onShelf fby not(checkout) fby exit
output [TagId=tag; ProductName=pname; AreaId=area];

The next example shows how to raise an alert if a user tries to log in to an account unsuccessfully three times within 5 minutes.

within 5 minutes
LoginAttempt[IpAddress=ip; Account=acct; Result=0] as login1,
LoginAttempt[IpAddress=ip; Account=acct; Result=0] as login2,
LoginAttempt[IpAddress=ip; Account=acct; Result=0] as login3,
LoginAttempt[IpAddress=ip; Account=acct; Result=1] as login4
on (login1 fby login2 fby login3) and not(login4)
output [Account=acct];
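
The intent of the login rule above can be approximated in ordinary Java, which may help if you are new to pattern languages. This is only a loose sketch and not how Aleri evaluates patterns: the class FailedLoginDetector is invented, events are assumed to arrive in time order, and a successful login is assumed to simply reset the failure history for that IP address and account.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical approximation of "three failed logins within 5 minutes".
class FailedLoginDetector {
    static final long WINDOW_MS = 5 * 60 * 1000L;

    // Recent failure timestamps per (ip, account) pair.
    private final Map<String, Deque<Long>> failures = new HashMap<>();
    private final List<String> alerts = new ArrayList<>();

    // Assumption: calls arrive in non-decreasing timeMs order.
    void onLoginAttempt(long timeMs, String ip, String account, boolean success) {
        String key = ip + "|" + account;
        if (success) {
            failures.remove(key); // plays the role of not(login4)
            return;
        }
        Deque<Long> times = failures.computeIfAbsent(key, k -> new ArrayDeque<>());
        times.addLast(timeMs);
        // Drop failures that fell out of the 5 minute window.
        while (timeMs - times.peekFirst() > WINDOW_MS) {
            times.removeFirst();
        }
        if (times.size() >= 3) {
            alerts.add(account); // corresponds to output [Account=acct]
            times.clear();
        }
    }

    List<String> alerts() {
        return alerts;
    }

    public static void main(String[] args) {
        FailedLoginDetector detector = new FailedLoginDetector();
        detector.onLoginAttempt(0L, "10.0.0.1", "alice", false);
        detector.onLoginAttempt(60_000L, "10.0.0.1", "alice", false);
        detector.onLoginAttempt(120_000L, "10.0.0.1", "alice", false);
        System.out.println(detector.alerts());
    }
}
```

The amount of bookkeeping needed here, even for such a simple rule, is a good argument for the declarative 'within' and 'fby' constructs.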

People wishing to break into computer systems often scan a number of TCP/IP ports for an open one, and attempt to exploit vulnerabilities in the programs listening on those ports. Here’s a rule that checks whether a single IP address has attempted connections on three ports, and whether those have been followed by the use of the “sendmail” program.

within 30 minutes
Connect[Source=ip; Port=22] as c1,
Connect[Source=ip; Port=23] as c2,
Connect[Source=ip; Port=25] as c3,
SendMail[Source=ip] as send
on (c1 and c2 and c3) fby send
output [Source=ip];

Aleri provides many interfaces out of the box for an easy integration with source and target systems.  Through these interfaces/adapters the Aleri platform can communicate with standard relational databases, messaging frameworks like IBM MQ, sockets and file system files. Data in various formats like csv, FIX, Reuters market data, SOAP, http, SMTP is easily consumed by Aleri  through standardized interfaces.

Following are available techniques for integrating Aleri with other systems.

  • Pub/sub API, provided in Java, C++ and .NET - a standard pub/sub mechanism
  • SQL interface with SELECT, UPDATE, DELETE and INSERT statements, used through ODBC and JDBC connections
  • Built-in adapters for market data and FIX
In the next part of this series we will look at the Aleri Studio, the GUI that helps us build a CEP application the easy way.

Tuesday, March 13, 2012

Sybase IQ 15.4 - for big data analytics

A big data analytics platform, Sybase IQ 15.4 Express Edition, is now available for free. This article introduces Sybase IQ's big data features and shares some valuable resources with you. With an installed base of more than 4,500 clients all over the world, Sybase IQ has always been a leading columnar database for mission critical analytic functions. With new features like native support for MapReduce and in-database analytics, it positions itself as a premium offering for big data analytics.

Columnar databases have been in existence for almost 2 decades now. Row-oriented databases for OLTP transactions and columnar databases for analytics have more or less met the requirements of organizations. Extracting, transforming and loading (ETL) transactional data into an analytic platform has been a big business. However, the recent focus on big data and its nature (semi-structured, unstructured) is changing the landscape of analytic platforms and the way ETL is used. Boundaries between OLTP and analytic platforms are blurring somewhat due to the advent of real-time analytics.

Due in large part to the tremendous growth of ecommerce in recent years, much of the developer talent pool was attracted to low-latency, high-throughput transactional systems. Without a doubt, the same pool is now gravitating towards the acquisition, management and analysis of semi-structured data. A host of start-ups and established technology companies are creating new tools and methodologies in an effort to address the data deluge we are generating through our use of social networks. Even though there are a myriad of options available to handle big data and analytics, some clear trends and underlying technologies have come to the fore. Hadoop, the open source implementation for batch operations on big data, and in-memory analytic platforms that provide real-time business intelligence are two of the most significant trends. Established vendors like Oracle, IBM and Informatica have been adding new products or updating their existing offerings to meet the new demands in this space.

Business intelligence gathered through only one type of data, whether transactional, semi-structured from social media, or machine generated, is not good enough for today's organization. It's really important to glean information from each type of source and correlate one with the other to present comprehensive and accurate intelligence that assists critical decision support systems. So it's imperative that vendors provide a set of well integrated tools that support structured, semi-structured and totally unstructured data like audio and video. "Polyglot Persistence", the buzzword of late made popular by Martin Fowler, addresses the need and nature of different types of data. You may have multiple systems for storage/persistence, but ultimately an enterprise needs a comprehensive view of its business through one lens. This is what Sybase IQ, the traditional columnar database product, is trying to provide by offering native support for Hadoop and MapReduce. There are other significant enhancements in the latest enterprise edition of Sybase IQ. What makes this product attractive to the developer community is its free availability. Please note, it's not an evaluation copy, but a full function enterprise edition with the only restriction being a database size limit of 5 GB. This blog post is a quick summary of IQ's features and aims to point to important resources that will help you in trying it out.

For the uninitiated, this blog entry by William McKnight provides an introduction to the concepts of columnar databases.

The following paragraph, taken directly from Sybase's web site, sums up the most important features.

Sybase® IQ 15.4 is revolutionizing “Big Data” analytics breaking down silos of data analysis and integrating it into enterprise analytic processes. Sybase IQ offers a single database platform to analyze different data – structured, semi-structured, unstructured – using different algorithms. Sybase IQ 15.4 expands these capabilities with the introduction of a native MapReduce API, advanced and flexible Hadoop integration, Predictive Model Markup Language (PMML) support, and an expanded library of statistical and data mining algorithms that leverage the power of distributed query processing across a PlexQ grid.

Some details on these features are in order.

  • User Defined Functions Enabling MapReduce - One of the best architecture practices in a three tier application is to physically and logically separate business logic and data. But bringing data to the business logic layer and then moving it back to the persistence layer adds latency. Mission critical applications typically implement many strategies to reduce the latency, but the underlying theme of all those solutions is the same: keep business logic and data as close to one another as possible. Sybase IQ's native C++ API allows developers to build user defined functions to implement proprietary algorithms. This means you can build MapReduce functions right inside the database that can yield 10X performance improvements. It also means you can use ordinary SQL from the higher level business logic, keeping that layer simpler while taking advantage of MapReduce based parallel processing for higher performance. The MapReduce jobs are executed in parallel on the grid of servers, called Multiplex or PlexQ in Sybase IQ parlance.
  • Hadoop Integration - We discussed the need for analyzing and correlating semi-structured data with structured data earlier. Sybase IQ provides 4 different ways in which this integration can happen: Client-side Federation, ETL based, Query Federation and Data Federation. Hadoop is used to extract data points from unstructured or semi-structured data, which are then used with OLTP data for further analysis. You can request more in-depth information here.
  • PMML Support - PMML (Predictive Model Markup Language) support allows the user to create predictive models using popular tools like SAS and R. These models can be executed in automated fashion extending the already powerful analytics platform further. A  plug-in from Zementis is used to provide the PMML validation and its transformation to Java UDFs.
  • R Language Support - SQL is woefully inadequate when you need statistical analysis on structured data. But, the simplicity and wide adoption of SQL makes it an attractive query tool. The RJDBC interface in IQ allows an application to use the R programming language to perform statistical functions. R is a very popular open source programming language used in many financial applications today. Please read my blog entry 'Programming with R, it's super' for further information on R.
  • In-Database Analytics - Sybase IQ, in its latest version 15.4, uses an in-database analytics library called DB Lytix from Fuzzy Logix. This analytics engine can perform advanced analytics through simple SELECT and EXECUTE statements. Sybase claims that some of these analytical functions are able to leverage the MapReduce API in some data mining algorithms. DB Lytix, according to Fuzzy Logix's website, supports mathematical and statistical functions, Monte Carlo simulations (univariate and multivariate), data mining and pattern recognition, principal component analysis, linear regression, logistic regression, other supervised learning methods, and clustering.

Sybase provides detailed documentation on the features of IQ on its website. If you would like to try Sybase IQ, use this direct link. Remember, this edition is not an eval copy. It is a full-featured IQ edition limited only by a maximum database size of 5 GB.

As you can imagine, Sybase is not the only company offering big data solutions. This gigaom article lists a few others as well. With so many good options to choose from, users have to look at factors other than just the technology to make a selection. Some of those factors are the reputation of the brand, its standing in the market, the extent of its user base and the ecosystem around the product. In that regard, Sybase IQ scores very high, which is the reason for its position in the leadership quadrant by Gartner. Sybase IQ has been a leading columnar database in the market since the 90's and has established a robust ecosystem around it. Sybase's other products (PowerDesigner, the leading data modelling workbench; SAP BusinessObjects for reporting and analytics; and Sybase Control Center for administration and monitoring of IQ) support IQ and provide one of the most comprehensive analytics platforms in the industry.

Monday, February 27, 2012

Tracking user sentiments on Twitter with Twitter4j and Esper

For newcomers to Complex Event Processing and the Twitter API, I hope this serves as a short tutorial and helps them get off the ground quickly.

Managing big data and mining useful information from it is the hottest discussion topic in technology right now. The explosive growth in semi-structured data flowing from social networks like Twitter, Facebook and LinkedIn is making technologies like Hadoop and Cassandra a part of every technology conversation. So as not to fall behind the competition, all customer-centric organizations are actively engaged in creating social strategies. What can a company get out of data feeds from social networks? Think location-based services, targeted advertisements and algorithmic equity trading for starters. IDC Insights has some informative blogs on the relationship between big data and business analytics. Big data in itself will be meaningless unless the right analytic tools are available to sift through it, explains Barb Darrow in her blog post on gigaom.com.

Companies often listen in to social feeds to learn about customers' interest in, or perception of, their products. They are also trying to identify "influencers" (the people with the most connections in a social graph) so they can make better offers to such individuals and get better mileage out of their marketing. Companies involved in equity trading want to know which publicly traded companies are discussed on Twitter and what the users' sentiments about them are. From big companies like IBM to smaller start-ups, everyone is racing to make the most of the opportunities of big data management and analytics. Much documentation about big data, like this ebook from IBM, 'Big Data Platform', is freely available on the web. However, a lot of it covers theory only. Jouko Ahvenainen, in reply to Barb Darrow's post above, makes a good point that “many people who talk about the opportunity of big data are on too general level, talk about better customer understanding, better sales, etc. In reality you must be very specific, what you utilize and how”.

It does sound reasonable, doesn't it? So I set out to investigate this a bit further by prototyping an idea, the only good option I know. If I could do it, anybody could do it. The code is remarkably simple, but that's exactly the point. Writing a CEP framework yourself is quite complex, but using one is not. In the same way, Twitter makes it really easy to get to the information through its REST API.

Big Data - http://www.bigdatabytes.com/managing-big-data-starts-here/

Complex Event Processing (CEP), which I blogged about previously (click here to read), is a critical component of the big data framework. Along with CEP, frameworks such as Hadoop are used to compile, parse and make sense of the 24x7 stream of data from the social networks. Today, Twitter's streaming API and CEP can be used together to capture the happiness levels of Twitter users. The code I present below listens in to live tweets and generates a 'happy' event every time "lol" is found in the text of a tweet. CEP is used to capture the happy events, and an alert is raised every time the count of happy events exceeds a pre-determined number in a pre-determined time period. The assumption that a user is happy every time he or she uses "lol" is very simplistic, but it helps get the point across. In practice, gauging users' sentiment is not that easy because it involves natural language analysis. Consider the example below, which highlights the complexities of analyzing natural language.

Iphone has never been good.
Iphone has never been so good.

As you can see, the addition of just one word to the sentence completely changed its meaning. For this reason, natural language processing is considered one of the toughest problems in computer science. You can learn "natural language processing" using free online lectures offered by Stanford University. This link takes you directly to the first lecture on natural language analysis by Christopher Manning. But, in my opinion, the pervasive use of abbreviations in social media, and in modern lingo in general, is making the task a little bit easier. Abbreviations like "lol" and "AFAIK" project their meaning accurately. The use of "lol" projects "funny", and "AFAIK" may indicate that the user is "unsure" of him or herself.

The code presented below uses the Twitter4j API to listen to the live Twitter feed and Esper CEP to listen to events and alert us when a threshold is met. You can download Twitter4j binaries or source from http://twitter4j.org/en/index.html and Esper from http://esper.codehaus.org/ . Before you execute the code, make sure to create a Twitter account if you don't have one, and also read Twitter's guidelines and concepts for its streaming API here. Authentication through just a username and password combination is currently allowed by Twitter, but it is going to be phased out in favor of OAuth authentication in the near future. Also, pay close attention to their 'Access and Rate Limit' section. The code below uses the streaming API in one thread. Please do not use another thread at the same time, to avoid hitting the rate limit. Hitting rate limits consistently can result in Twitter blacklisting your Twitter ID. It is also important to note that the streaming API does not send each and every tweet our way. Twitter typically samples the data by sending 1 out of every 10 tweets our way. This is not a problem for us, however, as long as we are interested in patterns in the data and not in any specific tweet. Twitter offers a paid service for businesses that need streaming data with no rate limits. The following diagram shows the components and the processing of data.

Diagram. Charts & DB not yet implemented in the code

Listing 1. Standard java bean representing a happy event.

Listing 2. Esper listener is defined.

Listing 3.

A Twitter4j listener is created. This listener and the CEP listener start listening. Every Twitter post is parsed for 'lol'. Every time 'lol' is found, a happy event is generated. The CEP listener raises an alert every time the total count of 'lol' exceeds 2 in the last 10 seconds.
The code establishes a long-running thread to get Twitter feeds. You will see the output on the console every time the threshold is met. Please remember to terminate the program; it doesn't terminate on its own.
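
To make the counting rule concrete without the Twitter4j and Esper pieces, here is a minimal plain-Java sketch of the same idea: look for 'lol' in each tweet and raise an alert when more than 2 happy events fall inside a 10 second window. This is only an illustration and not one of the listings above. The class name HappyEventCounter and its methods are made up for this sketch, the whole-word matching is my assumption about how 'lol' should be detected, and timestamps are passed in by hand instead of coming from a live stream.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a CEP time window: counts "happy" tweets in a sliding window. */
class HappyEventCounter {
    // Assumption: "lol" counts only as a whole word, in any letter case.
    private static final Pattern LOL = Pattern.compile("\\blol\\b", Pattern.CASE_INSENSITIVE);

    private final long windowMillis;
    private final int threshold;
    private final Deque<Long> happyTimes = new ArrayDeque<>();

    HappyEventCounter(long windowMillis, int threshold) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
    }

    /** Returns true when the tweet text contains the word "lol". */
    static boolean isHappy(String tweetText) {
        return tweetText != null && LOL.matcher(tweetText).find();
    }

    /**
     * Feeds one tweet into the window. Timestamps are expected in non-decreasing order.
     * Returns true when this tweet is happy and pushes the window count above the threshold.
     */
    boolean onTweet(String tweetText, long nowMillis) {
        // Drop happy events that have slid out of the time window.
        while (!happyTimes.isEmpty() && nowMillis - happyTimes.peekFirst() >= windowMillis) {
            happyTimes.pollFirst();
        }
        if (!isHappy(tweetText)) {
            return false;
        }
        happyTimes.addLast(nowMillis);
        return happyTimes.size() > threshold;
    }

    public static void main(String[] args) {
        HappyEventCounter counter = new HappyEventCounter(10_000, 2);
        String[] tweets = {"lol nice one", "monday again", "LOL", "that was great, lol"};
        long now = 0;
        for (String tweet : tweets) {
            System.out.println(tweet + " -> alert=" + counter.onTweet(tweet, now));
            now += 1_000; // pretend one tweet arrives per second
        }
    }
}
```

A CEP engine does this bookkeeping for you and lets you express the rule as a query, which is the main reason to use one once you have more than a single rule or stream.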

Now that you have this basic functionality working, you can extend this prototype in a number of ways. You can handle additional data feeds (from sources other than Twitter) and use Esper to correlate data from the two feeds. For visually appealing output, you can feed the output to a charting library. For example, every time Esper identifies an event, the data point can be used to render a point on a line graph. If you track the 'happy event' this way, the graph will essentially show the ever-changing level of happiness of Twitter users over a period of time.

Please use the comment section for your feedback, click +1 to share, and let me know if you would like to see more postings on this subject.

Friday, February 24, 2012

End of ERP as we know it?

A friend of mine on Facebook drew my attention to this blog post, 'End of ERP' by Tien Tzuo on Forbes.com. With the professional lives of millions tied to ERP in some way, I can imagine the buzz this post must be creating. With SAP being the biggest ERP software maker in the world and the parent company of my employer, I read this with interest. So as not to be influenced by others' arguments, I haven't read any responses to this post yet.

If you haven't already, you can read the original post by Tien Tzuo here.
To get your opinion on this matter, I have created a short survey of only 5 questions that you can access by clicking here. I will publish the results of the survey soon. A link to the survey also appears at the bottom of this post for your convenience. In my opinion, this notable post (its reputation derives from the fact that it appeared on Forbes) is way biased, as many posts often are. Could Tien's earlier job at SalesForce.com as a marketing officer be the reason? Predicting the end of something epic, or of a most trusted technology, is sure to generate a lot of buzz, which is what bloggers often set out to do. The post would have been a lot better and more valuable had he compared ERP's strengths and weaknesses and explained why the weaknesses are so glaring that ERP customers would be willing to walk away from ERP, something so crucial to their existence. A case cannot succeed if it lacks even a semblance of honest acknowledgment of the other side of the argument.

In support of his argument, Tien mentions some key changes in consumer behavior and consumption patterns. The main theme of 'End of ERP' is that the change in the ways customers engage with a company is driving ERP to its inevitable death. Service-based consumption is rapidly increasing, but it can be applied only to so many things. By focusing on this alone, isn't Tien forgetting the business processes around other product segments? From food and energy to health and vehicles, there are simply too many things we cannot subscribe to and consume remotely. All standard functions of an ERP are still required for those sectors, aren't they? A customer may stop buying cars and instead rent from Zipcar, but cars will still have to be made, sold and bought. How would companies manage their businesses and have consolidated views of them without ERP?

ERP modules - Credit (http://www.abouterp.com/)

Tien also mentions companies like SalesForce.com and touts their successes as proof that companies are moving away from ERP. SalesForce doesn't offer anything other than CRM, does it? Does it provide the finance, HR or materials management modules of ERP? I guess not. You can't run a big company effectively by mish-mashing different services from ten different vendors. That's why ERP exists and will keep its market share in the enterprise segment. I do agree, however, that cloudification (I know, I know, it's not a word in the English dictionary) of business functions is an irreversible trend. Oracle's and SAP's acquisitions of Taleo and SuccessFactors, respectively, are an indication of their grudging acceptance of this fact. The key to their success is not the demand for ERP in the cloud, which is ever present, but their ability to integrate the acquired companies and their products to provide the same kind of comprehensive tool set as ERP.

“End of ERP” concludes by highlighting some key business requirements that, according to Tien, are not met by ERP today. Without going into details, suffice it to say that ERP is not meant to be a silver bullet for all business problems. It does what it does while ERP providers and their ecosystem try to find solutions to the unresolved business problems. Doesn't business intelligence (BI) software aim to solve the kind of issues he mentions? The point is that there are a number of ways to mine the information you need. The importance of BI is undeniable, and that's what vendors are investing millions in. The enormous response to SAP's in-memory analytics appliance HANA is just one example of how innovative products will meet the business requirements of today. While the business problems mentioned in the post may be genuine, they simply highlight opportunities for ERP's improvement and do not in any way spell doom for it.

Make your voice heard. Take the survey.

Saturday, February 11, 2012

Complex Event Processing - a beginner's view

Using Complex Event Processing is not so complex. Well, initially at least.
A substantial amount of information is available on the web on CEP products and functionality. But if you are like me, you want to test run a product or application and have little patience for reading detailed documentation. So when I was evaluating CEP as an engine for one of our future products, I decided to just try it out using a business scenario I knew from my past experience working with a financial company. For impatient developers like me, what could be better than using a free and open source product? So I decided to use 'Esper', an open source product based on Java, and was able to write the code (merely 3 Java classes) to address the business case below.
But first a little about CEP and a shameless plug of our product. My apologies. :-)
Complex Event Processing has been gaining significant ground recently. The benefits of CEP are widely understood in some verticals, such as the financial and insurance industries, where it is actively deployed to perform various business-critical tasks. Monitoring, fraud detection and algorithmic trading are some of those critical tasks that depend on CEP to integrate multiple streams of real-time data, identify patterns and generate actionable events for an organization.
My current employer, Sybase Inc., is one of the leading suppliers of CEP. Aleri, the Sybase CEP product, is widely used in the financial services industry and is the main component of Sybase's leading solution, 'RAP - The Trading Edition'. Aleri is also sold as a separate product. Detailed information about the product is available here: http://www.sybase.com/products/financialservicessolutions/complex-event-processing.

The high level architecture of a CEP application is shown in the diagram below.
Figure 1.

Now on to the best part. The business requirement -
The aspect of CEP that fascinates me most is its ability to correlate events or data points from different streams or from within the same data stream. To elaborate, take the example of a retail bank that has a fraud monitoring system in place. The system flags every cash transaction over $10,000 for a manual review. What this means is that a large cash transaction (a deposit or withdrawal) in an account raises an anti-money laundering event from the monitoring system. Such traditional monitoring systems can easily be circumvented or exploited by simple tricks, such as making more than one deposit of a smaller amount. What happens if an account holder deposits 2 checks of $6,000 in a day, or 5 checks of $2,500 in a day? Nothing. The system can't catch it. CEP provides a way to define rules with a time frame criterion. For example, you could specify a rule to raise a flag when someone deposits more than $10,000 in cash in a 12 hour window. Get it?
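
Before bringing in a CEP engine, it may help to see the rule itself spelled out. The following is a minimal plain-Java sketch of "flag an account when its deposits total more than $10,000 within a 12 hour window". It is not Esper code and not one of the listings in this post. The class name DepositMonitor and its method are hypothetical, amounts are whole dollars for simplicity, and timestamps are supplied by the caller.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of a per-account sliding-window sum rule. */
class DepositMonitor {
    private final long windowMillis;
    private final long limit;
    // For each account: deposits still inside the window, stored as {timeMillis, amount}.
    private final Map<String, Deque<long[]>> depositsByAccount = new HashMap<>();

    DepositMonitor(long windowMillis, long limit) {
        this.windowMillis = windowMillis;
        this.limit = limit;
    }

    /**
     * Records a deposit. Timestamps are expected in non-decreasing order per account.
     * Returns true when the account's total inside the window exceeds the limit.
     */
    boolean onDeposit(String account, long amount, long nowMillis) {
        Deque<long[]> window = depositsByAccount.computeIfAbsent(account, k -> new ArrayDeque<>());
        // Forget deposits that are older than the window.
        while (!window.isEmpty() && nowMillis - window.peekFirst()[0] >= windowMillis) {
            window.pollFirst();
        }
        window.addLast(new long[] {nowMillis, amount});
        long total = 0;
        for (long[] deposit : window) {
            total += deposit[1];
        }
        return total > limit;
    }

    public static void main(String[] args) {
        long hour = 60L * 60 * 1000;
        DepositMonitor monitor = new DepositMonitor(12 * hour, 10_000);
        // Two $6,000 deposits one hour apart: the second one trips the rule.
        System.out.println(monitor.onDeposit("AccountA", 6_000, 0));
        System.out.println(monitor.onDeposit("AccountA", 6_000, hour));
    }
}
```

A per-transaction check would let both $6,000 deposits through, while the windowed sum flags the second one. Keeping this state by hand for every rule and stream gets tedious quickly, which is the gap a CEP engine fills.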
Follow the steps below to see how easy it is to implement CEP to meet this business requirement.
Download the latest Esper version (4.5.0 at the time of this writing) from here: http://espertech.com/download/
Unzip the package into a separate folder.
Create a Java project and reference the Esper jar files from this folder.
Create a standard Java bean for an event, which here is a deposit account with name and amount attributes.

Listing 1.

The next listing creates an event type, the SQL-like query that produces an event, and a listener registered on that query. The code generates an event any time one of the two deposit accounts, AccountA and AccountB, receives deposits totaling more than 100000 within a time frame of 10 seconds (this is where you specify the time window). Because this is just a test, I have put the event generation functionality together with the other code, but in real life the deposit amounts would be fed from a deposit transaction processing system based on some messaging framework. The code is easy enough to follow. First we create the initial configuration. Then we add the type of event we want. A query with the criterion for selecting the event is created next. As you can see, the amount is summed up over a sliding window of 10 seconds, and an event is created when the total amount in that time frame for a particular account exceeds 100000. A listener is created next and registered on the query.

 Listing 2

The next listing is the listener. Every time an event is generated in the time window specified in the query, it gets added to the newEvents collection.

 Listing 3

Easy enough, right? The expression language itself is fairly easy to understand because of its similarities to standard SQL syntax. Although a real life implementation could become complex depending on the type and number of feeds and events you want to monitor, the product in itself is simple enough to understand. Many of the commercial CEP products offer excellent user interfaces for creating event types, queries and reports.

Complex event processing is still a growing field, and the pace of its adoption will only increase as companies try to make sense of all the streams of data flowing in. The amount of semi-structured and other types of data (audio, video) has already surpassed the amount of traditional relational data. It's easy to gauge the impact of a good CEP application at a time when stock trading companies are already gleaning clues from Twitter feeds.

Hope this helps the curious. Don't forget to click +1 or Like, if you like it.

Saturday, January 14, 2012

Hibernate or JDBC - play smart

We know how ORM frameworks like Hibernate and JPA make a developer's life easier. Given a choice, we would always code in objects for manipulating data. But once in a while you come across a use case where JDBC code clearly trumps Hibernate in terms of performance. I came across such a use case, where I was required to parse thousands of SQL statements and insert the data into relational tables. What I found was that even after I followed Hibernate's best practices, the Hibernate code still took twice as long as JDBC. This post is meant to generate conversation and to invite expert opinion on how to determine when not to use Hibernate. Is my use case a perfect example of this? Do you know of any other use cases?

Hibernate's best practices for improving performance typically include the following techniques, which are related to my use case.

  • Use batching while performing batch inserts/updates
  • Use second level cache
  • Specify the objects for caching
  • Clearing session to avoid object caching at first level

And so forth .......

I found that the last option, flushing the session often, was the most helpful for my project. But I wanted to post this information to invite help, guidance and advice from Hibernate experts on performance tuning in general, and to verify whether what I did (use JDBC) was the best option for the fastest performance.

For the more curious folks, here are the top 5 URLs in a Google search on 'hibernate performance tuning', but none of those discuss the use case I have here.


The application where I had to use JDBC instead of Hibernate for performance reasons was really very simple. Continuously streamed SQL data was parsed and went into three relational tables: a parent, a child and a child of the child. Let's call them tables A, B and C respectively for simplicity. There was a one-to-one relation between A and B, while C had a variable number of rows for each row of table B.
Table relationship

The code I wrote with Hibernate added a sample of 100K rows to A, 100K rows to B and around 200K rows to table C. There was no need to read the data while it was being written. The 'for loop' I wrote repeated 100K times, with each iteration adding a row to table A first, followed by table B and then table C. Before I tuned the Hibernate code, performance was very sluggish: the program slowed down progressively and was almost crawling before it threw an 'out of memory' error. I had to add session.flush() at the end of each cycle of the loop. Adding the session.flush() statement greatly improved write performance, and because there was no need to read the data while it was being written, flushing the session did not carry a performance penalty.
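
The reason frequent flushing helps is that pending work is pushed out in small chunks instead of piling up in memory for the whole run. The sketch below shows that idea in plain Java, without Hibernate, so it can be run on its own. It is not the code I used: BatchingWriter, its batch size and the sink callback are hypothetical names for illustration. In Hibernate itself, the commonly documented form of this pattern is to call session.flush() followed by session.clear() every N iterations; whether flush() alone is enough will depend on your mappings and settings, so treat that as something to verify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: buffers rows and hands them to a sink every batchSize rows. */
class BatchingWriter<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> pending = new ArrayList<>();

    BatchingWriter(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Adds one row and flushes automatically once the buffer is full. */
    void add(T row) {
        pending.add(row);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Pushes buffered rows to the sink and releases them so memory stays bounded. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending)); // the sink gets its own copy
        pending.clear();
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        // The sink here only prints; in a real program it would write to the database.
        BatchingWriter<String> writer =
                new BatchingWriter<>(3, batch -> System.out.println("writing " + batch));
        for (int i = 0; i < 7; i++) {
            writer.add("row" + i);
        }
        writer.flush(); // push the final partial batch
    }
}
```

With a batch size of 1 this behaves like flushing on every loop cycle, which is what I ended up doing; larger sizes trade memory for fewer round trips.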

Typically, in any real life OLTP application, some user sessions will always be reading the data while other user sessions are actively writing it. Hibernate's best practices suggest the use of the second level cache for better performance, and in most cases it works well when multiple sessions are adding data. Remember, Hibernate caches recently added data automatically. The second level cache is involved when multiple sessions are writing and reading data, but in the scenario I discussed, I had only one session that was writing and no session was reading the data. Hence turning the second level cache on or off had no impact on performance at all.

But a slightly modified scenario could potentially be a real life use case. What are the Hibernate best practices to achieve optimum performance in such a use case? The modified scenario is to have one session writing the data and multiple sessions reading the data. For example, results and statistics of the Summer Olympics are fed to a data store by some feed engine, while participants and viewers are viewing the data in real time. What steps could we take to get the best read/write performance using Hibernate? Please assume, for the sake of discussion, that the data from the feed engine is not a bulk upload but more of a live stream accepted using Hibernate code (I guess it's not really a best practice). Do we turn the second level cache off, flush the session on each write cycle, batch the write operations, or something else?

Of course, performance tests using a simulated workload provide the best answer, but as architects, how do we propose a solution that will ultimately be validated by performance tests? The following table shows the performance numbers I got in my test runs with different configurations. The code was really simple and it performed the operations I mentioned earlier.

Hibernate Vs JDBC
If you are interested, please add a comment and I will send the code to you. Please note that the code was run from within Eclipse on Windows XP with a 512 MB max memory setting in Eclipse. I used a Sybase ASE 15.5 server to host the tables. The relationship between the tables is really simple, each having a primary key and a foreign key, with a one-to-many relation between tables B and C and a one-to-one relation between A and B. It's clear that flushing the session after each write cycle (after adding 1 record in table A, 1 in table B and multiple records in table C) was the most important factor in getting a faster execution time. But the JDBC code still outperformed the Hibernate code. It was twice as fast as Hibernate. See the last row in the table.

Another interesting finding was that batching (with a batch size of 20, as suggested by the documentation) actually performed worse than non-batching. From a complexity perspective, this application was really simple: not many tables, a simple object structure, simple relationships and no complex SQL. So it wasn't too difficult or cumbersome to use JDBC for the best performance.
There is an abundance of technical literature available on the internet on every technical subject. We read and follow guidelines from well regarded sources without thinking twice. But on some occasions it's useful to try things out and validate them for yourself. That's what I was forced to do, and I am glad I did it. Because of it, I got some interesting results to share with the community. All Hibernate enthusiasts, feel free to comment and advise.