Simple & Practical - Ideas and practical tips for software professionals.<br>
ASE, HANA and much more (January 25, 2014)<div dir="ltr" style="text-align: left;" trbidi="on">
After a long time, I am back to writing this blog. Much has happened in the SAP ecosystem since my last post. <a href="http://www.sap.com/pc/tech/cloud/software/hana-cloud-platform-as-a-service/index.html" target="_blank">SAP HANA has gone to the cloud</a>, ODATA has become the <a href="http://scn.sap.com/docs/DOC-44412" target="_blank">standard way of communicating</a> between different layers in SAP products, and SAP Business Suite customers are increasingly using Adaptive Server Enterprise (ASE) as the underlying database.<br />
<br />
As I continue to work on ODATA and ASE, new things continue to surprise me. These things - I call them quirks - may not be that surprising to an ASE veteran, but they are to users of some other databases in the market. These quirks, peculiarities or special features, whatever they are, have been discussed in various forums. One particularly helpful resource I found is the creation of one of my co-workers. Check it out ( <a href="http://www.sypron.nl/main.html">http://www.sypron.nl/main.html</a> ) for a wealth of interesting ASE facts compiled over the years.<br />
<br />
Today, I will discuss a feature of ASE that an average user may not have come across. It's a SQL command - SET CHAINED ON/OFF - that controls transaction behavior. With SET CHAINED ON, if your code has not explicitly started a database transaction, ASE implicitly starts one before your SQL statements and expects an explicit commit; with SET CHAINED OFF (the default, unchained mode), each statement commits on its own unless you explicitly begin a transaction. In JDBC code, connection.setAutoCommit(false) puts the session in chained mode and connection.setAutoCommit(true) in unchained mode. The chained mode is particularly important if you are trying to execute a stored procedure on ASE: a stored procedure executes successfully only in the transaction mode it was registered for. What bugs me is that, as a user of the stored procedure, you need to know which mode was in effect when that stored procedure was created in the database.<br />
<br />
If you happen to use autocommit(true) while executing a stored procedure that was created under 'set chained on', you will get a nasty error that looks like this: "<b>this procedure may be run only in chained transaction mode, SET CHAINED ON will cause the current session to run in chained transaction mode</b>". I have seen countless posts on the internet, and recently in SAP ecosystem discussion forums, about this error, highlighting the frustration experienced by developers new to ASE. Thankfully, ASE provides a system stored procedure that can set the mode of a stored procedure to "anymode", which allows that stored procedure to be executed under both settings - autocommit true and false. Judging from the number of people who hit this problem, however, it is not widely used. If you do not want to get bitten by this nasty problem at runtime, you can do either of these two things.<br />
<br />
1. If you have access to the underlying database (you need to be the owner of the procedure or have the sa_role), you can run the sp_procxmode procedure to make the stored procedure in question run in any mode. Assuming the name of the stored procedure is sp_proc_1, the command would be<br />
execute sp_procxmode sp_proc_1, "anymode"<br />
<br />
2. You can do this at run-time in your JDBC code. Execute the sp_procxmode procedure with only one argument - the name of the stored procedure. The result set of this call tells you which mode the procedure requires. Assuming it requires 'unchained', you would call conn.setAutoCommit(true) and then execute the sp_proc_1 procedure.<br />
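Putting both options together, a hypothetical isql session might look like the sketch below (sp_proc_1 is the placeholder name from above; the error text is abbreviated):

```sql
-- session is in unchained (autocommit) mode
set chained off
exec sp_proc_1                        -- fails: "may be run only in chained transaction mode"

-- option 1: mark the procedure as runnable in either mode (owner or sa_role required)
exec sp_procxmode sp_proc_1, "anymode"
exec sp_proc_1                        -- now succeeds regardless of the session mode

-- option 2: ask for the registered mode first, then match the session to it
exec sp_procxmode sp_proc_1           -- reports 'chained', 'unchained' or 'anymode'
```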
<br />
The second option is the better one and will avoid certain headaches later on whenever you execute a stored procedure in ASE. Regardless of the mode in which the stored procedure was created, I recommend implementing this strategy. There are other interesting facts about stored procedures - and why stored procedures are important again - that I would like to cover in my next post.<br />
<br />
<br /></div>
Adaptive Server IDENTITY quirks (June 8, 2013)<div dir="ltr" style="text-align: left;" trbidi="on">
In my previous posts I tried to provide some key information to stimulate your interest in SAP Sybase products. If you are an application developer on any SAP platform, it will pay to know more about the Adaptive Server Enterprise (the former Sybase ASE) platform, as the SAP Business Suite has now been successfully ported to it. If you are just starting to explore ASE, you will be in a similar boat as I was about two years ago. As I work on it more and more for various projects, more and more things continue to surprise me. Having come to ASE from other popular databases, some of its features totally threw me a curve ball.<br />
<br />
One such feature is called IDENTITY. It is very similar to an AUTOINCREMENT column in other databases: an automatically incremented counter whose value can be assigned to a column of a table. As one would expect, you can assign a starting value for the column, and ASE increments the value with every insert into the table without any problem. In addition, ASE allows you to bulk-add records to any table with an identity column, thereby accepting explicit identity column values.<br />
<br />
It's important to understand that IDENTITY columns are routinely used as primary keys because of their auto-increment capability. The name 'IDENTITY' itself suggests that the value of the column in each row is unique - and it is, as long as you leave it to ASE. The strange thing is that ASE does not create a unique constraint on the column, thereby leaving it to the user to keep it unique when IDENTITY_INSERT is turned ON for the table. With this option, the table owner or DB administrator can insert as many duplicate entries into the IDENTITY column as they want, which in my mind defeats the purpose of having an identity column. By default IDENTITY_INSERT is turned off, so a duplicate entry in the identity column is not the normal case. But once you turn it ON for any reason, you have to be extra careful while entering values manually. You can avoid this problem by making the identity column a primary key column. Which brings me to the next point: how to assign a primary key to a table in ASE.<br />
There are two ways you could do it: through the CREATE TABLE or ALTER TABLE syntax, or through the sp_primarykey stored procedure. One thing to keep in mind while using sp_primarykey is that the stored procedure does not create a unique constraint on the column, which again was a huge surprise to me. Here is what the ASE documentation says about the sp_primarykey stored procedure.<br />
<ul style="font-family: Arial, Helvetica, Verdana, sans-serif; font-size: 12px; margin-bottom: 0px; margin-left: 17px; margin-top: 0px; padding-left: 16px;">
<li class="fi" style="margin-bottom: 0px; margin-top: 0px;"><div style="margin-bottom: 8px; margin-top: 8px;">
Executing <b>sp_primarykey</b> adds the key to the <i>syskeys</i> table. Only the owner of a table or view can define its primary key. <b>sp_primarykey</b> does not enforce referential integrity constraints; use the <b>primary key</b> clause of the <b>create table</b> or <b>alter table</b> command to enforce a primary key relationship.</div>
</li>
<li class="ds" style="margin-bottom: 0px; margin-top: 0px;"><div style="margin-bottom: 8px; margin-top: 8px;">
Define keys with <b>sp_primarykey</b>, <a href="http://infocenter.sybase.com/help/topic/com.sybase.help.ase_15.0.sprocs/html/sprocs/sprocs56.htm" shape="rect" style="color: #0000d0; margin-top: -4px; text-decoration: none;">sp_commonkey</a>, and <a href="http://infocenter.sybase.com/help/topic/com.sybase.help.ase_15.0.sprocs/html/sprocs/sprocs113.htm" shape="rect" style="color: #0000d0; margin-top: -4px; text-decoration: none;">sp_foreignkey</a> to make explicit a logical relationship that is implicit in your database design. An application program can use the information.</div>
</li>
<li class="ds" style="margin-bottom: 0px; margin-top: 0px;"><div style="margin-bottom: 8px; margin-top: 8px;">
A table or view can have only one primary key. To display a report on the keys that have been defined, execute <a href="http://infocenter.sybase.com/help/topic/com.sybase.help.ase_15.0.sprocs/html/sprocs/sprocs135.htm" shape="rect" style="color: #0000d0; margin-top: -4px; text-decoration: none;">sp_helpkey</a>.</div>
</li>
<li class="ds" style="margin-bottom: 0px; margin-top: 0px;"><div style="margin-bottom: 8px; margin-top: 8px;">
The installation process runs <b>sp_primarykey</b> on the appropriate columns of the system tables.</div>
</li>
</ul>
Notice the lack of any mention of a 'unique constraint' above, which might make you believe that a unique constraint is implicit because you are making the column a primary key. But that's not so.<br />
So with the combination of denoting a column as an identity column and declaring it a primary key through sp_primarykey, you can still get the unintended consequence of duplicate values in the identity primary key column. Here is an example.<br />
<br />
CREATE TABLE test (id int IDENTITY, descr varchar(10))<br />
sp_primarykey test, id<br />
SET IDENTITY_INSERT test OFF<br />
INSERT INTO test (descr) VALUES ('abcd')<br />
SET IDENTITY_INSERT test ON<br />
INSERT INTO test (id, descr) VALUES (1, 'fghk')<br />
<br />
Now you have two rows with the duplicate id value of 1. Neither defining the column as IDENTITY nor marking it as the primary key prevented a duplicate value in the id column. This could have a disastrous effect on your business application if it relies on this column to identify a unique row. So, to avoid this type of situation, follow the guidelines below.<br />
<br />
Always define primary keys through the CREATE TABLE or ALTER TABLE syntax.<br />
Check the uniqueness of the identity column after you insert its values manually, OR create a unique index on the identity column right after table creation.<br />
If you turn IDENTITY_INSERT on, turn it off again in the same transaction, so that no other statement in the same session can add a duplicate identity value. It should be somewhat reassuring to developers that the change to the IDENTITY_INSERT setting is not persistent; it is lost once the session ends.<br />
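One concrete way to apply the first two guidelines, sketched against the example above (the table name test2 and the index name are mine, purely illustrative):

```sql
-- Declaring the key in the DDL makes ASE create a unique index on id
CREATE TABLE test2 (id int IDENTITY PRIMARY KEY, descr varchar(10))

-- For an existing table, add the unique index explicitly instead
-- (this will fail if duplicates have already crept in)
CREATE UNIQUE INDEX test_id_uq ON test (id)

-- With the key declared in the DDL, the duplicate insert from the
-- earlier example is rejected instead of silently succeeding
SET IDENTITY_INSERT test2 ON
INSERT INTO test2 (id, descr) VALUES (1, 'abcd')
INSERT INTO test2 (id, descr) VALUES (1, 'fghk')   -- fails: duplicate key
SET IDENTITY_INSERT test2 OFF
```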
<br />
I hope this important point helps you use IDENTITY carefully and saves you some troubleshooting in case of erroneous results from your business application. Happy development.<br />
<br />
<br />
<br />
<br />
<br /></div>
ODATA for Dummies (February 9, 2013)<br>In a world full of open source and open standards, what's so important about one more protocol or specification, right? Why should you know about it, right? Right, but technology professionals are required to know a lot more technologies today than they were in the past because of the rapidly shifting software landscape. Gone are the days when you could master one technology and keep your job forever. At any moment you could be asked to look into some new technology and be expected to work on it, or to create a proof of concept to evaluate its fit for your current project. ODATA may be one such technology you have not worked with or heard of, but it is worth knowing about.<br />
<br />
If you already know the ODATA specification and how it works, you can stop reading. This is very high-level but useful information on ODATA, meant for newcomers.<br />
<br />
Promoted by Microsoft ( do I hear Microsoft haters leaving already? ), <a href="http://www.odata.org/" target="_blank">ODATA</a> is a protocol to access information over the web. ODATA defines a standard way of exposing any information through the industry-standard HTTP protocol. It also follows the architecture popularly known as <a href="http://en.wikipedia.org/wiki/Representational_state_transfer" target="_blank">REST</a>. You can get details of the protocol and much more <a href="http://www.odata.org/" target="_blank">here</a>. If you want to expose your data to your clients using ODATA, you will need a much deeper understanding of the specification. However, if you just want to consume the data as a client, you could be up and running in a few minutes to a few hours, depending on your level of expertise. This post will have you create a sample and test it within minutes.<br />
<br />
Let's take the example of a company that we all know about - Netflix. It uses an ODATA service to publish its movie catalog. Viewing the movie catalog or searching for a title can easily be done by just accessing certain URLs through your browser. You just need to know how to build further URLs based on the information you get from a URL. ODATA provides the response to your HTTP request in a standard format, either <a href="http://en.wikipedia.org/wiki/Atom_(standard)" target="_blank">Atom</a> (XML) or <a href="http://www.json.org/" target="_blank">JSON</a>. Try these URLs yourself and see how easy it is to get to the information. ODATA services are consumed through URLs that are intuitive and easy to understand.<br />
<br />
<a href="http://odata.netflix.com/">http://odata.netflix.com</a> will show what type of information you can get from this website. The URL resolves to <a href="http://odata.netflix.com/v2/Catalog/">http://odata.netflix.com/v2/Catalog/</a>. The results show that you can get a list of Genres, Titles and Languages. Now, suppose you want to see all the titles Netflix has; simply add 'Titles' to the previous URL, as in <a href="http://odata.netflix.com/v2/Catalog/Titles">http://odata.netflix.com/v2/Catalog/Titles</a>. This generates all the titles and associated information. If you want to know all the details (metadata) about all the entities (collections), simply type in this URL: <a href="http://odata.netflix.com/v2/Catalog/$metadata">http://odata.netflix.com/v2/Catalog/$metadata</a>. This returns information on the collections, the entities they are made up of, and the attributes of each entity therein. This just shows how easy it is to get to this basic information. Now you are ready to dig deeper.<br />
<br />
The data that shows up on your screen may have been stored in a table in a relational database, in an XML database or in a file in the file system; it does not matter. The information is available to you through the HTTP GET method (the browser sends a 'get' request to the end point). Irrespective of the type and location of the data storage, specific data is likely identified by a unique key. If you know the unique identifier of the entity you want to access, you can use it to get to more information. The following example shows how Titles are uniquely identified by an 'id' field. This information - that id is the unique identifier of a title - comes from the metadata query we executed earlier. Spend a few minutes reviewing the results of the metadata query.<br />
<a href="http://odata.netflix.com/v2/Catalog/Titles('13bLK')">http://odata.netflix.com/v2/Catalog/Titles('13bLK')</a>.<br />
<br />
When you don't know the unique identifier, you can always search titles by name (Name is an attribute of the Title entity, as revealed by the metadata query):<br />
<a href="http://odata.netflix.com/v2/Catalog/Titles?$filter=(Name%20eq%20'Rod%20Stewart')">http://odata.netflix.com/v2/Catalog/Titles?$filter=(Name%20eq%20'Rod%20Stewart')</a><br />
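Filter URLs like the one above can also be assembled programmatically. Here is a small sketch (the helper name is mine, not part of any OData library); conveniently, JavaScript's encodeURIComponent leaves parentheses and single quotes intact, which is exactly what the $filter syntax needs:

```javascript
// Build an OData $filter URL like the Netflix example above.
function odataFilterUrl(serviceRoot, entitySet, attribute, value) {
  // OData string literals are single-quoted; spaces become %20
  var filter = "(" + attribute + " eq '" + value + "')";
  return serviceRoot + "/" + entitySet + "?$filter=" + encodeURIComponent(filter);
}

// odataFilterUrl("http://odata.netflix.com/v2/Catalog", "Titles", "Name", "Rod Stewart")
// produces the same URL as the link above
```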
<br />
The total number of titles Netflix owns is easily found by using the following URL. As of 2/8/2013 the count is <span style="white-space: pre-wrap;">160050. ODATA defines many such functions: count, length, replace, trim, etc.</span><br />
<a href="http://odata.netflix.com/v2/Catalog/Titles/$count">http://odata.netflix.com/v2/Catalog/Titles/$count</a><br />
<br />
So this is what the ODATA specification provides. It specifies the format of the URL and the functions and operators you can use in the URL to get to the data you want. The REST-based architecture exposes all the information through <a href="http://www.w3.org/Protocols/rfc2616/rfc2616.html" target="_blank">HTTP methods</a> - in this case, 'GET' is what we have seen so far.<br />
<br />
Now a little bit on how providers like Netflix create these services. If a service provider exposes mechanisms for creating, updating and deleting data through an ODATA producer, and if the information being exposed is in a relational database, the standard CRUD methods need to be mapped to HTTP methods: create maps to POST, delete to DELETE, update to PUT and select to GET. The essential tasks of the producer are to parse the request, get the HTTP method, get the content, call the back end where the data is stored, perform the mapped operation, and convert the retrieved data to Atom or JSON format before it is sent back to the client. Vendors like Microsoft and <a href="http://code.google.com/p/odata4j/" target="_blank">Google (odata4j)</a> offer function libraries that make it easy to create your own producer. If you just want to consume ODATA services, there are many client libraries, like the popular datajs.<br />
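The CRUD-to-HTTP mapping just described can be written down as a simple lookup table (purely illustrative; the variable name is mine, not part of any producer library):

```javascript
// CRUD operations mapped to the HTTP methods an OData producer handles
var crudToHttp = {
  create:   "POST",
  read:     "GET",
  update:   "PUT",
  "delete": "DELETE"
};
```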
<br />
Now you are ready to create your own client.<br />
Here is a snippet of code adapted from the datajs tutorial. Save this as an HTML document and open it in any modern browser; you will see all the genres of content from Netflix. Before running the sample, don't forget to download datajs-1.1.0.min.js from <a href="http://datajs.codeplex.com/releases/view/98284" target="_blank">here</a> and save it to your c:\datajs directory.<br />
<br />
<br />
<textarea cols="70" rows="10">
<html>
<head>
<title>My ODATA test page</title>
</head> <body>WELCOME!
<script type="text/javascript" src="C:\datajs\datajs-1.1.0.min.js"></script>
<script>
function displayGenres()
{
document.getElementById("demo").innerHTML=Date();
}
OData.defaultHttpClient.enableJsonpCallback = true;
OData.read("http://odata.netflix.com/v2/Catalog/Genres",
function (data, request) {
var html = "";
for (var i = 0; i < data.results.length; i++) {
html += "<div>" + data.results[i].Name + "</div>";
}
document.getElementById("target-element-id").innerHTML = html;
});
</script>
<p id="demo">
This is ODATA demo.</p>
<button type="button" onclick="displayGenres()">Display Genre from Netflix</button>
<div id="target-element-id">
</div>
</body>
</html>
</textarea>
<br />
<br />
I hope this post will give you enough information to try things on your own. I also hope to have another post some time later on the same subject. Odata is actively used by companies such as Microsoft, SAP, Netflix, Facebook and Ebay. Only time will tell if this standard becomes widely used technology or not, but it already has gathered enough following to attract our attention. <a href="http://odata.org/">Odata.org</a> website. Good luck. Your comments are welcome. Please feel free to share if you like it.Anonymoushttp://www.blogger.com/profile/15092746920092475665noreply@blogger.com11tag:blogger.com,1999:blog-3724346560617208824.post-60475199086220587092012-09-25T10:41:00.002-04:002012-09-25T10:41:41.101-04:00SAP Sybase ASE - An introduction for a newbie<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
Since the day it acquired Sybase, SAP has made great strides in porting its business suite to Sybase ASE (Adaptive Server Enterprise, the Sybase relational database). Last week's announcement from SAP is just one of many milestones that symbolizes its commitment to Sybase products and highlights the successes: the highest number in the SD benchmark on a 2-core HP/Linux machine was achieved by ASE. Check out the details <a href="http://www.sap.com/solutions/benchmark/sd2tier.epx" target="_blank">here</a>.<br />
<br />
<div>
</div>
As SAP works towards its stated goal of becoming a premier database company in the world, it is only natural that more and more SAP practitioners and enthusiasts become interested in ASE and may be looking to use it not only for SAP applications but also for other in-house applications. For a Java programmer, familiarity with the JDBC API is enough to use a standard RDBMS. But once in a while one is required to connect directly to the underlying database to debug an issue. Unlike production systems, where you can depend on a DBA for support, you are often left to deal with problems yourself in a development environment. Quirks or peculiarities in a new relational database system can be a source of frustration. Some features that totally make sense in one database may not work the same way in another. It's those little things, quirks, that can drive one crazy when one is dealing with tight deadlines.<br />
<br />
<div>
</div>
I have been working with SAP Sybase Adaptive Server Enterprise for more than a year now, and I feel comfortable sharing some bits and pieces I have put together over that period. Much of this information is available on the web or in the product documentation, but scouring and digging through it takes time, and some of it is found in the most unlikely places. Having worked with popular databases other than ASE before, these key features seemed interesting or a bit different, or I came across them too often. This is not a comprehensive list of ASE features, but it should save a newcomer some time in quickly getting off the ground. So here it goes. <br />
<div>
</div>
First, SAP allows free downloads of developer versions of most of the Sybase products. You can download the one you want from <a href="http://www.sybase.com/detail_list?id=3610" target="_blank">here</a>. Consider revisiting this post before you install ASE.<br />
<br />
<div>
</div>
1. Concept of devices and databases in ASE - they resemble the concepts of databases and schemas in Oracle. <br />
<div>
</div>
2. Finding the version of the database you are running - Sybase products, including ASE, are evolving fast to cater to the SAP ecosystem, and major features are added fairly regularly. When something isn't working as expected, the first thing you may want to do before calling tech support is to find the version of the database you are running and refer to the product documentation <a href="http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.help.sdk.15.5/title.htm" target="_blank">here</a>.<br />
<div>
</div>
<ul style="text-align: left;">
<li><b>dataserver -v </b>
<div>
</div>
</li>
</ul>
3. Sybase environment variables are set in the sybase.csh file on a Linux installation. The server startup/shutdown scripts can be found in the following directory: <br />
<div>
</div>
<ul style="text-align: left;">
<li><b>your sybase home directory/ASE-15_0/install</b>
<div>
</div>
</li>
</ul>
4. If an installed instance is not responding to your application code, the most likely reason is that the thread is kept waiting due to lack of log space. You can free up the space with the following commands. Refer to the Sybase documentation on what each does, but if you are not concerned about losing the running transaction, either of them should work. You should see your transaction go through as soon as you run the command. <br />
<ul style="text-align: left;">
<li><b>DUMP TRANSACTION your_database_name WITH TRUNCATE_ONLY </b></li>
<li><b>DUMP TRANSACTION your_database_name with no_log </b></li>
</ul>
<div>
Before you try the above commands, you may want to look at all the running connections/threads to ASE and their status by executing the <strong>sp_who</strong> stored procedure. </div>
<div>
</div>
5. One of the most important things you need to decide while installing ASE is the page size. All data from one row of a table is colocated in one page to improve performance. Once defined, the page size <strong>cannot</strong> be changed for the installed ASE instance. Below is a table showing the relationship between page sizes and the corresponding maximum possible database sizes.<br />
<ul style="text-align: left;">
<li><strong>2K page size - 4 TB</strong></li>
<li><strong>4K page size - 8 TB</strong></li>
<li><strong>8K page size - 16 TB</strong></li>
<li><strong>16K page size - 32 TB</strong></li>
</ul>
If you are not sure what page size the installed ASE instance is using, run the following command from the isql utility, the command-line client used to connect to ASE. <br />
<ul style="text-align: left;">
<li><b>select @@maxpagesize</b>
<div>
</div>
</li>
</ul>
6. If you use Hibernate for data access, you may want to visit hibernate-sybase integration page <a href="https://community.jboss.org/wiki/HibernateSybaseintegration" target="_blank">here</a>. Spending a few minutes on the required ASE settings for hibernate could save you from some frustration later on.<br />
<div>
</div>
7. Once the installation is complete, you can run the Sybase Control Center application, which provides a GUI interface to all your ASE instances. You can alternatively use isql, a command-line SQL interface, to interact with ASE. The super admin user name is 'sa', and the password can initially be left blank.<br />
<div>
</div>
8. Use the following stored procedures to add users; note that sp_addlogin adds only a login, not a user.<br />
<ul style="text-align: left;">
<li><strong>sp_addlogin</strong> - adds only a login name </li>
<li><strong>sp_adduser</strong> - adds a new user.</li>
</ul>
9. Make sure to assign a role with adequate privileges to the new user by executing the following procedure.<br />
<ul style="text-align: left;">
<li><strong>sp_role "grant", some_role, username</strong> </li>
</ul>
<div>
</div>
10. Before you try any SQL, you may want to check the 'reserved words' of ASE using the following command. I recommend going through this list if you are porting an existing application to ASE.<br />
<div>
</div>
<ul style="text-align: left;">
<li><b>select name from master..spt_values where type = "W"</b></li>
</ul>
<div>
</div>
11. It's a common practice to add an autoincrement column to a table and use it as a primary key. Autoincrement is achieved in ASE by denoting a column as <strong>IDENTITY</strong>. ASE does not allow more than one autoincrement column in a table, and there is no 'sequence' object as in Oracle. ASE allows adding user-defined values to an autoincrement column after you run the following command.<br />
<ul style="text-align: left;">
<li><strong>set identity_insert tablename on</strong></li>
</ul>
<div>
</div>
12. If you have used MS SQL Server, you will find ASE's T-SQL very similar, as both share the same roots. T-SQL plays the role that PL/SQL does in Oracle.<br />
<div>
</div>
13. ASE's native JDBC driver is called jConnect. jTDS, an open source driver, can also be used. jConnect supports changing connection properties through the connection URL: just append 'property_name=property_value' to the URL. For example,<br />
<div>
</div>
<ul style="text-align: left;">
<li> <b>'jdbc:sybase:tds:servername:portnumber/database?QUOTED_IDENTIFIER=ON'</b>
<div>
</div>
</li>
</ul>
allows quoted identifiers in a sql statement like below.<br />
<ul style="text-align: left;">
<li><strong>SELECT * from "dbo"."User_Table"</strong></li>
</ul>
If you want to run an existing relational database application on ASE, it is really important to pay attention to this connection property. However, if you are building all SQL statements as PreparedStatements, quoted identifiers are handled correctly even when QUOTED_IDENTIFIER is not explicitly turned on. <br />
<div>
</div>
14. Also important to note is the difference in handling table aliases: they are allowed in SELECT statements but not in UPDATE or DELETE statements. For example, the first statement below is valid but the second returns an error.<br />
<div>
</div>
<ul style="text-align: left;">
<li><div style="text-align: left;">
<strong>select * from tablename t1 </strong></div>
</li>
<li><div style="text-align: left;">
<strong>update tablename t1 set columnname=value where column2=value2 </strong><br />
</div>
</li>
</ul>
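A minimal rewrite of the failing statement is simply to drop the alias (tablename, columnname and the values are the placeholders from the bullets above):

```sql
-- the UPDATE from the second bullet, rewritten without the table alias
update tablename set columnname = value where column2 = value2
```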
<div>
15. The Sybase jConnect driver depends on some key metadata information to work correctly. In the absence of this information, you may receive a SQL exception with a description like <strong>'missing metadata'</strong>. If this happens, it means that you missed a step during ASE installation - the step that installs the metadata information in the master database. The error can be eliminated by running a stored procedure after the installation. Refer to the documentation <a href="http://m.sybase.com/detail?id=1009772#MARKER-9-152" target="_blank">here.</a> </div>
16. <strong>'Select for Update'</strong> was introduced in ASE 15.7, and you would expect it to work by default in a new install. Alas, no such luck. I highly recommend reading<a href="http://www.sybase.com/files/Product_Overviews/AppDev-ASE-15.7.pdf" target="_blank"> this document</a> before you try this feature. You need the 'datarows' locking scheme on the table on which the select is performed. You can turn datarows locking ON as the server-wide default or for a single table using the following commands.<br />
<ul style="text-align: left;">
<li><strong>sp_configure "lock scheme", 0, datarows</strong></li>
<li><strong>alter table tablename lock datarows</strong></li>
</ul>
</div>
<div>
</div>
<div>
17. ASE is case-sensitive by default. To display the case sensitivity, sort order and character set settings, use the <strong>sp_helpsort</strong> stored procedure.</div>
<div>
</div>
18. The following information is useful if you want to programmatically access metadata from ASE. The sysobjects table in each database holds information on all entities - tables, indexes, views - in that database. To display all user objects, use the following command.<br />
<ul style="text-align: left;">
<li><strong>Select * from sysobjects where type = 'U'</strong></li>
</ul>
The syscolumns table holds information on all columns of all tables.<br />
The sysreferences and sysconstraints tables hold information on relationships between tables, such as foreign key constraints. <br />
<div>
</div>
19. By joining the above tables you can get pretty much all the information about a user table. Unfortunately, there is no simple command like Oracle's 'describe table'. You can use the following stored procedure to get all the details about a table, but it will include much more than just the column names and their types.<br />
<ul style="text-align: left;">
<li><strong>sp_help tablename</strong></li>
</ul>
If you want just the column names of a user table, use following sql.<br />
<ul style="text-align: left;">
<li><strong>select sc.name from syscolumns sc, sysobjects so where sc.id=so.id and so.name='tablename'</strong></li>
</ul>
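If you also want each column's datatype, the query above can be extended with a join to the systypes table. This is a sketch and assumes the usual usertype join key:

```sql
-- column names, type names and lengths for one user table
select sc.name, st.name as type_name, sc.length
from syscolumns sc, sysobjects so, systypes st
where sc.id = so.id
  and sc.usertype = st.usertype
  and so.name = 'tablename'
```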
<div style="text-align: left;">
20. Unlike Oracle, ASE truncates all trailing blanks when storing variable-length varchar columns. So be careful if your code compares strings pulled from varchar columns. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
I hope this list helps you install and navigate ASE in the early stages. I mentioned these items in particular because I had to use these commands more often than any others. Once you spend a little time on ASE, you will of course come across other things that are more useful. I will try to list those in my next post. </div>
<div style="text-align: left;">
<br /></div>
</div>
Anonymoushttp://www.blogger.com/profile/15092746920092475665noreply@blogger.com4tag:blogger.com,1999:blog-3724346560617208824.post-25000284319684672062012-04-08T17:37:00.000-04:002012-04-08T17:38:24.915-04:00Understanding Aleri - Complex Event Processing - Part II<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
Aleri, the complex event processing platform from Sybase, was reviewed at a high level in my <a href="http://maheshgadgilsblog.blogspot.com/2012/03/aleri-complex-event-processing-part-i.html" target="_blank">last post</a>.<br />
<br />
This week, let's review the Aleri Studio, the user interface to the Aleri platform, and the use of the pub/sub API, one of many ways to interface with the platform. The studio is an integral part of the platform and comes packaged with the free evaluation copy. If you haven't already done so, please download a copy from <a href="http://www.sybase.com/aleriform" target="_blank">here</a>. The fairly easy installation process gets you up and running in a few minutes.<br />
<br />
The Aleri Studio is an authoring environment for building the model that defines the interactions and sequencing between various data streams. It can also merge multiple streams to form one or more new streams. With this Eclipse-based studio, you can test the models you build by feeding them test data and monitor the activity inside the streams in real time. Let's look at the various types of streams you can define in Aleri and their functionality.<br />
<br />
<b>Source Stream </b>- Only this type of stream can handle incoming data. The operations incoming data can perform are insert, update, delete and upsert. Upsert, as the name suggests, updates the row if the key defining it is already present in the stream; otherwise, it inserts a new record into the stream.<br />
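The upsert behavior can be illustrated in plain Java. This is only a toy sketch of the semantics using a keyed map, not the Aleri API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UpsertDemo {
    // Illustrative only: models a keyed stream's upsert behavior with a map.
    private final Map<Integer, String> rows = new LinkedHashMap<>();

    void upsert(int id, String text) {
        rows.put(id, text); // replaces the row if id exists, inserts otherwise
    }

    int size() { return rows.size(); }

    String get(int id) { return rows.get(id); }

    public static void main(String[] args) {
        UpsertDemo stream = new UpsertDemo();
        stream.upsert(1, "first tweet");
        stream.upsert(2, "second tweet");
        stream.upsert(1, "first tweet, edited"); // same key: update, not insert

        System.out.println(stream.size());  // prints 2
        System.out.println(stream.get(1));  // prints first tweet, edited
    }
}
```

The key point is that two upserts with the same key leave one row behind, which is why the key definition of a source stream matters so much.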
<br />
<b>Aggregate Stream</b> - This stream creates a summary record for each group defined by a specific attribute, providing functionality equivalent to GROUP BY in ANSI SQL.<br />
<br />
<b>Copy stream</b> - This stream is created by copying another stream but with a different retention rule.<br />
<br />
<b>Compute Stream</b> - This stream allows you to use a function on each row of data to get a new computed element for each row of the data stream.<br />
<br />
<b>Extend Stream</b> - This stream is derived from another stream by adding column expressions.<br />
<br />
<b>Filter Stream</b> - You can define a filter condition for this stream. Just like extend and compute streams, this stream applies filter conditions on other streams to derive a new stream.<br />
<br />
<b>Flex Stream</b> - Significant flexibility in handling streaming data is achieved through custom coded methods. Only this stream allows you to write your own methods to meet special needs.<br />
<br />
<b>Join Stream</b> - Creates a new stream by joining two or more streams on some condition. Both inner and outer joins can be used to join streams.<br />
<br />
<b>Pattern Stream</b> - Pattern matching rules are applied with this stream.<br />
<br />
<b>Union Stream</b> - As the name suggests, this joins two or more streams with same row data structure. Unlike the join stream, this stream includes all the data from all the participating streams.<br />
<br />
By using some of these streams and the publish API of Aleri, I will demonstrate the segregation of a live Twitter feed into two different streams. The Twitter live feed is consumed by a listener from the Twitter4j library. If you want to try the Twitter4j library first, please follow my earlier post '<a href="http://maheshgadgilsblog.blogspot.com/2012/02/tracking-user-sentiments-on-twitter.html" target="_blank">Tracking user sentiments on Twitter</a>'. The data received by the Twitter4j listener is fed to a source stream in our model using the publication API from Aleri. In this exercise we will try to separate tweets based on their content. Building on the example from my previous post, we will divide the incoming stream into two streams: one gets any tweets that contain 'lol' and the other gets tweets with a smiley ":)" in the text. First, let's list the tasks we need to perform to make this a working example.<br />
<ol style="text-align: left;">
<li><b>Create a model with three streams</b></li>
<li><b>Validate the model is error free</b></li>
<li><b>Create a static data file</b></li>
<li><b>Start the Aleri server and feed the static data file to the stream manually to confirm correct working of the model.</b></li>
<li><b>Write java code to consume twitter feed. Use the publish API to publish the tweets to Aleri platform.</b></li>
<li><b>Run the demo and see the live data as it flows through various streams. </b></li>
</ol>
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilhBGvtp9Dppgapzjmn5EcUn4CS1grRy-rSQ26EKVJxVcjzFzXH0cTIjxykzwaiyEgDp0hj-LMTyeOwoUgnQTC7zcPZUjL5Q1kICTgGCnKu_E91ntxuLfks_JuziCowMdd92STeAFt7zYj/s1600/aleri_studio_1.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilhBGvtp9Dppgapzjmn5EcUn4CS1grRy-rSQ26EKVJxVcjzFzXH0cTIjxykzwaiyEgDp0hj-LMTyeOwoUgnQTC7zcPZUjL5Q1kICTgGCnKu_E91ntxuLfks_JuziCowMdd92STeAFt7zYj/s400/aleri_studio_1.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Image 1 - Aleri Studio - the authoring view</td></tr>
</tbody></table>
<div style="text-align: left;">
This image is a snapshot of the Aleri Studio with the three streams: the one on the left named "tweets" is a source stream, and the two on the right named "lolFilter" and "smileyFilter" are of the filter type. The source stream accepts incoming data while the filter streams receive the filtered data. Here is how I defined the filter conditions -</div>
<div style="text-align: left;">
like (tweets.text, '%lol%').</div>
<div style="text-align: left;">
tweets is the name of the stream and text is the field in the stream we are interested in. '%lol%' means: select any tweet whose content contains the string 'lol'. Each stream has only two fields, id and text, which map to the id and text message sent by Twitter. Once you define the model, you can check it for errors by clicking on the check mark in the ribbon at the top. Errors, if any, will show up in the panel at the bottom right of the image. Once your model is error free, it's time to test it.</div>
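The filter expression behaves like SQL LIKE: '%' matches any run of characters, so '%lol%' passes any text containing 'lol'. A rough Java equivalent of this particular filter condition (my own illustration, not Aleri code):

```java
public class LikeFilterDemo {
    // Rough equivalent of the Aleri filter like(tweets.text, '%lol%'):
    // '%' matches any sequence of characters, so '%lol%' means "contains lol".
    static boolean matchesLol(String text) {
        return text != null && text.contains("lol");
    }

    public static void main(String[] args) {
        System.out.println(matchesLol("lol that was funny"));   // prints true
        System.out.println(matchesLol("nothing to see here"));  // prints false
    }
}
```

Note that whether LIKE matching is case sensitive in ASE/Aleri depends on the configured sort order; this sketch is strictly case sensitive.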
<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi49i5WsDyfvGtCL3LLdcerer6KSFBIVAADiy0eiixPNlH4fDotHNSWL6lhgZHn88HRbHK7E79LD1_joRsopu46C84vuNxSH8NkF9ia1ol3GiUwxUuiy4_GqvCF787GZk6xNbi0R7V79JFD/s1600/aleri_studio_2.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi49i5WsDyfvGtCL3LLdcerer6KSFBIVAADiy0eiixPNlH4fDotHNSWL6lhgZHn88HRbHK7E79LD1_joRsopu46C84vuNxSH8NkF9ia1ol3GiUwxUuiy4_GqvCF787GZk6xNbi0R7V79JFD/s400/aleri_studio_2.bmp" width="360" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Image 2 - Aleri Studio - real time monitoring view<br />
<br /></td></tr>
</tbody></table>
The following image shows the test interface of the studio. Try running your model with a static data file first. The small red square at the top indicates that Aleri server is currently running. The console window at the bottom right shows server messages like successful starts and stops etc. The Run-test tab in the left pane, is where you pick a static data file to feed the source stream. The pane on the right shows all the currently running streams and live data processed by the streams.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
The image below shows the format of the data file used to test the model<br />
<textarea cols="50" rows="10">tweets ALERI_OPS="i" id="1" text="324test 1234" ;
tweets ALERI_OPS="i" id="2" text="test 12345";
tweets ALERI_OPS="i" id="3" text="test 1234666" ;
tweets ALERI_OPS="i" id="4" text="test 1234888" ;
tweets ALERI_OPS="i" id="5" text="test 1234999" ;
</textarea> <br /><br />
<br />
The next image shows the information flow.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQdF0R60qQML36N4drO46YpTjZx6G7xCQMDSwzMPuiWpZ2mWcOg8FrDyw2QBfl49wKpDo7Qge6sTc0wjP_cJ5dlFPGsnZ61ikJc7WzBnCsK4-T8Wzp5BxiwK5eY4ROjT4PQ7jtO0_QmV7P/s1600/Aleri+Publishing.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="182" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQdF0R60qQML36N4drO46YpTjZx6G7xCQMDSwzMPuiWpZ2mWcOg8FrDyw2QBfl49wKpDo7Qge6sTc0wjP_cJ5dlFPGsnZ61ikJc7WzBnCsK4-T8Wzp5BxiwK5eY4ROjT4PQ7jtO0_QmV7P/s320/Aleri+Publishing.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Image 3 - Aleri Publishing - Information flow</td></tr>
</tbody></table>
<br />
The source code for this exercise is at the bottom.<br />
Remember that you need the twitter4j library in the build path and the Aleri server running before you run the program. Because I have not added a timer to the execution thread, the only way to stop it is to abort it. For brevity and to keep the code lines short, I have removed all exception handling and logging. The code uses only the publishing side of Aleri's pub/sub API; I will demonstrate the subscribe side in my next blog post.<br />
<br />
This blog intends to provide something simple but useful to the developer community. Feel free to leave your comment and share this article if you like it.<br />
<br />
<br />
<br /></div>
<textarea cols="70" rows="10">package com.sybase.aleri;
import java.io.IOException;
import twitter4j.TwitterException;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.conf.Configuration;
import twitter4j.conf.ConfigurationBuilder;
import com.aleri.pubsub.SpGatewayConstants;
import com.aleri.pubsub.SpPlatform;
import com.aleri.pubsub.SpPlatformParms;
import com.aleri.pubsub.SpPlatformStatus;
import com.aleri.pubsub.SpPublication;
import com.aleri.pubsub.SpStream;
import com.aleri.pubsub.SpStreamDataRecord;
import com.aleri.pubsub.SpStreamDefinition;
import com.aleri.pubsub.impl.SpFactory;
import java.util.Collection;
import java.util.Vector;
public class TwitterTest_2 {
//make sure that Aleri server is running prior to running this program
static {
//creates the publishing platform
createPlatform();
}
// Important objects from the publish API
static SpStream stream;
static SpPlatformStatus platformStatus;
static SpPublication pub;
public static void main(String[] args) throws TwitterException, IOException {
TwitterTest_2 tt2 = new TwitterTest_2();
ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setDebugEnabled(true);
//use your twitter id and passcode
cb.setUser("Your user name");
cb.setPassword("Your Password");
// creating the twitter4j listener
Configuration cfg = cb.build();
TwitterStream twitterStream = new TwitterStreamFactory(cfg)
.getInstance();
StatusListener_1 listener;
listener = new StatusListener_1();
twitterStream.addListener(listener);
//runs the sample that comes with twitter4j
twitterStream.sample();
}
private static int createPlatform() {
int rc = 0;
//Aleri platform configuration - a better alternative is to read these from a properties file
String host = "localhost";
int port = 22000;
//aleri configured to run with empty userid and pwd strings
String user = "";
String password = "";
//name of the source stream - the one that gets the data from the twitter4j
String streamName = "tweets";
String name = "TwitterTest_2";
SpPlatformParms parms = SpFactory.createPlatformParms(host, port, user,
password, false, false);
platformStatus = SpFactory.createPlatformStatus();
SpPlatform sp = SpFactory.createPlatform(parms, platformStatus);
stream = sp.getStream(streamName);
pub = sp.createPublication(name, platformStatus);
// Then get the stream definition containing the schema information.
SpStreamDefinition sdef = stream.getDefinition();
/*
int numFieldsInRecord = sdef.getNumColumns();
Vector colTypes = sdef.getColumnTypes();
Vector colNames = sdef.getColumnNames();
*/
return 0;
}
static SpStream getStream() {
return stream;
}
static SpPlatformStatus getPlatformStatus() {
return platformStatus;
}
static SpPublication getPublication() {
return pub;
}
// invoked for each batch of tweet data to publish (e.g. from the twitter4j listener)
static int publish(SpStream stream, SpPlatformStatus platformStatus,
SpPublication pub, Collection fieldData) {
int rc = 0;
pub.start(); // open the publication session
SpStreamDataRecord sdr = SpFactory.createStreamDataRecord(stream,
fieldData, SpGatewayConstants.SO_UPSERT,
SpGatewayConstants.SF_NULLFLAG, platformStatus);
Collection dataSet = new Vector();
dataSet.add(sdr);
System.out
.println("\nAttempting to publish the data set to the Platform for stream <"
+ stream.getName() + ">.");
rc = pub.publishTransaction(dataSet, SpGatewayConstants.SO_UPSERT,
SpGatewayConstants.SF_NULLFLAG, 1);
// commit blocks the thread until data is consumed by the platform
System.out.println("before commit() call to the Platform.");
rc = pub.commit();
return 0;
}
}
</textarea></div>Anonymoushttp://www.blogger.com/profile/15092746920092475665noreply@blogger.com0tag:blogger.com,1999:blog-3724346560617208824.post-4043514258370215892012-03-23T23:28:00.000-04:002012-03-23T23:34:57.657-04:00Aleri - Complex Event Processing - Part I<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://encrypted-tbn0.google.com/images?q=tbn:ANd9GcSp5FBmoGhG3hqOzzsSxzAXy6gvtX-ajS3W3EKrNlq9tJ6mk563" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://encrypted-tbn0.google.com/images?q=tbn:ANd9GcSp5FBmoGhG3hqOzzsSxzAXy6gvtX-ajS3W3EKrNlq9tJ6mk563" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Algo Trading - a CEP use case <br />
<a href="http://www.bobsguide.com/guide/news/2011/Feb/17/banks-make-changes-to-algorithmic-trading-processes-tabb-group-claims.html" target="_blank">Trackback URL</a></td></tr>
</tbody></table>
Sybase's Aleri streaming platform is one of the more popular products in the CEP market segment. It is used in Sybase's trading platform, the RAP edition, which is widely used in capital markets to manage positions in a portfolio. Today, in the first part of a multi-part series, I want to provide an overview of the Aleri platform along with some code samples where required. In the second part, I will present the Aleri Studio, the Eclipse-based GUI that simplifies the task of modeling a CEP workflow, and show how to monitor the Aleri server through a dashboard.<br />
<br />
<br />
<br />
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="http://www.arkansas-investigations.com/images/photos/BizFraud1.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="131" src="http://www.arkansas-investigations.com/images/photos/BizFraud1.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Fraud Detection - Another CEP use case. <a href="http://www.arkansas-investigations.com/services_business_fraud.html" target="_blank">Trackback url</a></td></tr>
</tbody></table>
<br />
In my <a href="http://maheshgadgilsblog.blogspot.com/2012/02/tracking-user-sentiments-on-twitter.html" target="_blank">previous blog post</a> on Complex Event Processing, I demonstrated the use of Esper, the open source CEP software, and the Twitter4J API to handle a stream of tweets from Twitter. A CEP product is much more than handling just one stream of data, though. A single stream of data can be handled easily through standard asynchronous messaging platforms and does not pose very challenging scalability or latency issues. But when it comes to consuming more than one real-time stream of data, analyzing it in real time, and correlating the streams with one another, nothing beats a CEP platform. The sources feeding a streaming platform can vary in speed, volume and complexity. A true enterprise-class CEP should deal with equal ease with high-speed real-time data like stock tickers and with slower but voluminous offline batch uploads. Apart from providing standard interfaces, a CEP should also provide an accessible programming language to query the streaming data and to generate continuous intelligence through features such as pattern matching and snapshot querying.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="http://www.sybase.com/image/Diagrams_Charts/Syb_RAP_Architecture_Diagr_TN.gif" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="208" src="http://www.sybase.com/image/Diagrams_Charts/Syb_RAP_Architecture_Diagr_TN.gif" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Sybase Trading Platform - the RAP edition. <a href="http://www.sybase.com/products/financialservicessolutions/rap-thetradingedition" target="_blank">Trackback URL</a></td></tr>
</tbody></table>
To keep it simple and at a high level, CEP can be broken down into three basic parts. The first is the mechanism to grab/consume source data. Next is the process of investigating that data and identifying events and patterns. The last is interacting with target systems by providing them with actionable items. The actionable events take different forms and formats depending on the application the CEP is used for. An action item could be selling an equity position based on calculated risk in a risk-monitoring application, flagging potential fraud events in a money-laundering application, or alerting to a catastrophic event in a monitoring system that reads thousands of sensors in a chemical plant. There are literally thousands of scenarios where a manual, offline inspection of data is simply not an option. After you go through the following section, you may want to try Aleri yourself. This link http://www.sybase.com/aleriform takes you directly to the Aleri download page. An evaluation copy, valid for 90 days, is freely available from Sybase's official website. A good amount of documentation, an excellent tutorial and some sample code on the website should help you get started quickly.<br />
<br />
If you are an existing user of any CEP product, I encourage you to compare Aleri with that product and share your findings with the community or comment on this blog. By somewhat dated estimates, Tibco <a href="http://acc-staging.tibco.com/software/complex-event-processing/" target="_blank">CEP</a> is the biggest CEP vendor in the market. I am not sure how much market share another leading product, <a href="http://www.streambase.com/products/streambasecep/#axzz1pyI58x1i" target="_blank">StreamBase</a>, has. There is also a <a href="http://www.youtube.com/watch?v=3bCXCKq6ZhQ" target="_blank">webinar</a> you can watch on <a href="http://youtube.com/">Youtube.com</a> that explains CEP benefits in general and some key features of StreamBase in particular. For newcomers, it serves as an excellent introduction to CEP and a capital markets use case.<br />
<br />
An application on the Aleri CEP is built by creating a model, using the Studio (the GUI), Splash (the language), or the Aleri Modeling Language (ML) - the final form a model takes before it is deployed.<br />
<br />
<b>Following is a list of the key features of Splash.</b><br />
<br />
<ul style="text-align: left;">
<li><b>Data Types</b> - Supports standard data types and XML . Also supports ‘Typedef ‘ for user defined data types.</li>
<li><b>Access Control</b> – a granular level access control enabling access to a stream or modules (containing many streams)</li>
<li><b>SQL</b> – another way of building a model. Building a model in the Aleri Studio can take longer due to its visual paradigm; someone proficient with SQL should be able to do it much faster using Aleri SQL, which is very similar to the regular SQL we all know.</li>
<li><b>Joins</b> - supported joins are Inner, Left, Right and Full</li>
<li><b>Filter expressions</b> - include Where, having, Group having</li>
<li><b>ML</b> - Aleri SQL produces the data model in the Aleri Modeling Language (ML) – a proficient ML user might use only ML (in place of Aleri Studio and Aleri SQL) to build a model. </li>
<li><b>The pattern matching language</b> - includes constructs such as ‘within’ to indicate interval (sliding window), ‘from’ to indicate the stream of data and the interesting ‘fby’ that indicates a sequence (followed by)</li>
<li><b>User defined functions</b> – the user-defined function interface provided in Splash allows you to create functions in C++ or Java and use them within a Splash expression in the model. </li>
</ul>
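To make the 'fby' (followed-by) construct concrete before looking at Sybase's own examples, here is a toy Java sketch of followed-by matching: the pattern matches when its events appear in the stream in order, though not necessarily adjacent. This is my own illustration, not Splash code, and it ignores the 'within' window and field binding:

```java
import java.util.Arrays;
import java.util.List;

public class FbyDemo {
    // Toy model of Splash's 'fby' operator: returns true when the pattern's
    // events occur in the given order (not necessarily adjacent) in the stream.
    static boolean matchesInOrder(List<String> events, List<String> pattern) {
        int next = 0; // index of the next pattern element we still need to see
        for (String e : events) {
            if (next < pattern.size() && e.equals(pattern.get(next))) {
                next++;
            }
        }
        return next == pattern.size();
    }

    public static void main(String[] args) {
        List<String> stream = Arrays.asList("Buy1", "Quote", "Buy2", "Sell");
        // Buy1 fby Buy2 fby Sell: all three occur in order
        System.out.println(matchesInOrder(stream, Arrays.asList("Buy1", "Buy2", "Sell"))); // prints true
        // Sell fby Buy1: does not occur in this order
        System.out.println(matchesInOrder(stream, Arrays.asList("Sell", "Buy1")));         // prints false
    }
}
```

Real Splash patterns additionally bind fields across events (like the shared sym variable below) and constrain the match to a time window, but the sequencing idea is the same.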
<br />
<br />
Advanced pattern matching capabilities are best explained through examples. The following code segments and their explanations are taken directly from Sybase's documentation on Aleri.<br />
The first example checks to see whether a broker sends a buy order on the same stock as one of his or her customers, then inserts a buy order for the customer, and then sells that stock. It creates a “buy ahead” event when those actions have occurred in that sequence.<br />
<br />
<i><span style="color: blue;">within 5 minutes</span></i><br />
<i><span style="color: blue;">from</span></i><br />
<i><span style="color: blue;">BuyStock[Symbol=sym; Shares=n1; Broker=b; Customer=c0] as Buy1,</span></i><br />
<i><span style="color: blue;">BuyStock[Symbol=sym; Shares=n2; Broker=b; Customer=c1] as Buy2,</span></i><br />
<i><span style="color: blue;">SellStock[Symbol=sym; Shares=n1; Broker=b; Customer=c0] as Sell</span></i><br />
<i><span style="color: blue;">on Buy1 fby Buy2 fby Sell</span></i><br />
<i><span style="color: blue;">{</span></i><br />
<i><span style="color: blue;">if ((b = c0) and (b != c1)) {</span></i><br />
<i><span style="color: blue;">output [Symbol=sym; Shares=n1; Broker=b];</span></i><br />
<i><span style="color: blue;">}</span></i><br />
<i><span style="color: blue;">}</span></i><br />
<i><span style="color: blue;"><br /></span></i><br />
This example checks for three events, one following the other, using the fby relationship. Because the same variable sym is used in all three patterns, the values in the three events must be the same. Different variables may still have the same value (e.g., n1 and n2). It outputs an event if the Broker and Customer from the Buy1 and Sell events are the same, and the Customer from the Buy2 event is different.<br />
<br />
The next example shows Boolean operations on events. The rule describes a possible theft condition, when there has been a product reading on a shelf (possibly through RFID), followed by a non-occurrence of a checkout on that product, followed by a reading of the product at a scanner near the door.<br />
<br />
<i><span style="color: blue;">within 12 hours</span></i><br />
<i><span style="color: blue;">from</span></i><br />
<i><span style="color: blue;">ShelfReading[TagId=tag; ProductName=pname] as onShelf,</span></i><br />
<i><span style="color: blue;">CounterReading[TagId=tag] as checkout,</span></i><br />
<i><span style="color: blue;">ExitReading[TagId=tag; AreaId=area] as exit</span></i><br />
<i><span style="color: blue;">on onShelf fby not(checkout) fby exit</span></i><br />
<i><span style="color: blue;">output [TagId=tag; ProductName=pname; AreaId=area];</span></i><br />
<br />
The next example shows how to raise an alert if a user tries to log in to an account unsuccessfully three times within 5 minutes.<br />
<br />
<i><span style="color: blue;">from</span></i><br />
<i><span style="color: blue;">LoginAttempt[IpAddress=ip; Account=acct; Result=0] as login1,</span></i><br />
<i><span style="color: blue;">LoginAttempt[IpAddress=ip; Account=acct; Result=0] as login2,</span></i><br />
<i><span style="color: blue;">LoginAttempt[IpAddress=ip; Account=acct; Result=0] as login3,</span></i><br />
<i><span style="color: blue;">LoginAttempt[IpAddress=ip; Account=acct; Result=1] as login4</span></i><br />
<i><span style="color: blue;">on (login1 fby login2 fby login3) and not(login4)</span></i><br />
<i><span style="color: blue;">output [Account=acct];</span></i><br />
<br />
People wishing to break into computer systems often scan a number of TCP/IP ports for an open one, and attempt to exploit vulnerabilities in the programs listening on those ports. Here’s a rule that checks whether a single IP address has attempted connections on three ports, and whether those have been followed by the use of the “sendmail” program.<br />
<br />
<i><span style="color: blue;">within 30 minutes</span></i><br />
<i><span style="color: blue;">from</span></i><br />
<i><span style="color: blue;">Connect[Source=ip; Port=22] as c1,</span></i><br />
<i><span style="color: blue;">Connect[Source=ip; Port=23] as c2,</span></i><br />
<i><span style="color: blue;">Connect[Source=ip; Port=25] as c3,</span></i><br />
<i><span style="color: blue;">SendMail[Source=ip] as send</span></i><br />
<i><span style="color: blue;">on (c1 and c2 and c3) fby send</span></i><br />
<i><span style="color: blue;">output [Source=ip];</span></i><br />
<br />
Aleri provides many interfaces out of the box for easy integration with source and target systems. Through these interfaces/adapters the Aleri platform can communicate with standard relational databases, messaging frameworks like IBM MQ, sockets and file system files. Data in various formats like CSV, FIX, Reuters market data, SOAP, HTTP and SMTP is easily consumed by Aleri through standardized interfaces.<br />
<br />
<b>Following are available techniques for integrating Aleri with other systems.</b><br />
<br />
<br />
<ul style="text-align: left;">
<li><b>Pub/sub API</b> is provided in Java, C++ and dot net - A standard pub/sub mechanism</li>
<li><b>SQL interface</b> with SELECT, UPDATE, DELETE and INSERT statements used through ODBC and JDBC connection.</li>
<li><b>Built in adapters</b> for market data and FIX </li>
</ul>
In the next part of this series we will look at the Aleri Studio, the gui that helps us build the CEP application the easy way.<br />
</div>Anonymoushttp://www.blogger.com/profile/15092746920092475665noreply@blogger.com0tag:blogger.com,1999:blog-3724346560617208824.post-14944030793998291522012-03-13T14:02:00.001-04:002012-03-13T14:24:59.103-04:00Sybase IQ 15.4 - for big data analytics<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<b>The big data analytics platform Sybase IQ 15.4 </b>Express Edition is now available for free. This article introduces Sybase IQ's big data features and shares some valuable resources with you. Sybase IQ, with an installation base of more than 4,500 clients all over the world, has long been a leading columnar database for mission-critical analytic functions. With new features like native support for MapReduce and in-database analytics, it positions itself as a premium offering for big data analytics.<br />
<br />
<b>Columnar databases</b> have been in existence for almost two decades now. Row-oriented databases for OLTP transactions and columnar databases for analytics have, more or less, met the requirements of organizations. Extracting, transforming and loading (ETL) transactional data into analytic platforms has been a big business. However, the recent focus on big data and its semi-structured or unstructured nature is changing the landscape of analytic platforms and the way ETL is used. <b>The boundaries between OLTP and analytic platforms </b>are blurring somewhat due to the advent of real-time analytics.<br />
<br />
Due in large part to the tremendous growth of ecommerce in recent years, much of the developer talent pool was attracted to low-latency, high-throughput transactional systems. Without a doubt, the same pool is now gravitating towards the acquisition, management and analysis of semi-structured data. A host of start-ups and established technology companies are creating new tools, methods and methodologies in an effort to address the data deluge we are generating through our use of social networks. Even though there are myriad options available to handle big data and analytics, some clear trends and underlying technologies have come to the fore. Hadoop, the open source implementation for batch operations on big data, and in-memory analytic platforms that provide real-time business intelligence are two of the most significant. Established vendors like Oracle, IBM and Informatica have been adding new products or updating their existing offerings to meet the new demands in this space.<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMcJTMKCucg5v-6ar7tPGXLpnz-ELa7DR6Upefnjcn91RI_Ohy7qPblHpYy0agkwLGjNvLo2sp3esB__yYL_YyxiOR8NljTrgJBvYMP6pYgiQjn1kAWTWq8M8CbiyRM7CmNzTyqMdvbr6V/s1600/BigData_AnalyticsGuide_206x155.gif" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMcJTMKCucg5v-6ar7tPGXLpnz-ELa7DR6Upefnjcn91RI_Ohy7qPblHpYy0agkwLGjNvLo2sp3esB__yYL_YyxiOR8NljTrgJBvYMP6pYgiQjn1kAWTWq8M8CbiyRM7CmNzTyqMdvbr6V/s1600/BigData_AnalyticsGuide_206x155.gif" /></a><b>Business intelligence</b> gathered through only one type of data, transactional or semi-structured from social media or machine generated, is not good enough for today's organization. It's really important to glean information from each type of source and co-relate one with the other to present a comprehensive and accurate intelligence that assists critical decision support systems. So, it's imperative that vendors provide a set of tools with good enough integration that support structured, semi-structured and totally unstructured data like audio and video. "<a href="http://martinfowler.com/bliki/PolyglotPersistence.html" target="_blank">Polyglot Persistence</a>", the buzzword of late and made popular by Martin Fowler, addresses the need and nature of different types of data. You may have multiple systems for storage/persistence but ultimately an enterprise needs a comprehensive view of its business through one lense. This is what Sybase's IQ, the traditional columnar database product, is trying to provide by offering native support for Hadoop and Mapreduce. There are other singificant enhancements to the latest enterprise edition of Sybase IQ. What makes this product attractive to the developer community is, its free availability. 
Please note, it is not an evaluation copy but a full-function enterprise edition, with the only restriction being a database size limit of 5 GB. This blog post is a quick summary of IQ's features and points to important resources that will help you try it out.<br />
<br />
For the uninitiated, this <a href="http://www.b-eye-network.com/blogs/mcknight/archives/2010/04/understanding_c.php" target="_blank">blog entry</a> by William McKnight provides an introduction to the concepts of columnar databases.<br />
<br />
The following paragraph, taken directly from Sybase's website, sums up the most important features.<br />
<br />
<i>Sybase® IQ 15.4 is revolutionizing “Big Data” analytics breaking down silos of data analysis and integrating it into enterprise analytic processes. Sybase IQ offers a single database platform to analyze different data – structured, semi-structured, unstructured – using different algorithms. Sybase IQ 15.4 expands these capabilities with the introduction of a native MapReduce API, advanced and flexible Hadoop integration, Predictive Model Markup Language (PMML) support, and an expanded library of statistical and data mining algorithms that leverage the power of distributed query processing across a PlexQ grid.</i><br />
<br />
Some details on these features are in order.<br />
<br />
<ul style="text-align: left;">
<li><b>User Defined Functions Enabling Map Reduce</b> - One of the best practices in a three-tier application architecture is to physically and logically separate the business logic from the data. But bringing data to the business logic layer and then moving it back to the persistence layer adds latency. Mission-critical applications typically implement many strategies to reduce this latency, but the underlying theme of all those solutions is the same: keep the business logic and the data as close to one another as possible. Sybase IQ's native C++ API allows developers to build user defined functions that implement proprietary algorithms. This means you can build map reduce functions right inside the database, which can yield 10x performance improvements. It also means you can use ordinary SQL from the higher-level business logic, keeping that layer simple while taking advantage of map reduce based parallel processing for higher performance. The map reduce jobs are executed in parallel on a grid of servers, called a Multiplex or PlexQ in Sybase IQ parlance.</li>
</ul>
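To make the idea of "keeping the computation close to the data" concrete, here is a minimal sketch of the map-reduce model in plain Java. This is not the IQ C++ UDF API; it is only an illustration of the map phase (emit key/value pairs) and the reduce phase (aggregate per key), with Java's parallel streams standing in for the parallel workers of a PlexQ grid.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceSketch {

    // Map phase: split each line into words (the emitted keys).
    // Reduce phase: count occurrences per word (the aggregation).
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("big data big analytics", "big deal");
        System.out.println(wordCount(docs).get("big")); // prints 3
    }
}
```

The point of pushing such a function into the database is that only the small aggregated result crosses the wire, instead of every row.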
<ul style="text-align: left;">
<li><b>Hadoop Integration</b> - We discussed the need to analyze semi-structured data and correlate it with structured data earlier. Sybase IQ provides four different ways in which this integration can happen: client-side federation, ETL-based integration, query federation and data federation. Hadoop is used to extract data points from unstructured or semi-structured data, which are then used with OLTP data for further analysis. You can request more in-depth information <a href="http://response.sybase.com/forms/BigDataBloorReport" target="_blank">here</a>.</li>
</ul>
<ul style="text-align: left;">
<li><b>PMML Support</b> - PMML (Predictive Model Markup Language) support allows the user to create predictive models using popular tools like SAS and R. These models can be executed in an automated fashion, extending the already powerful analytics platform further. A <a href="http://www.zementis.com/pmml_tools.htm" target="_blank">plug-in</a> from Zementis provides PMML validation and its transformation to Java UDFs.</li>
</ul>
<ul style="text-align: left;">
<li><b>R Language Support</b> - SQL is woefully inadequate when you need statistical analysis of structured data, but its simplicity and wide adoption make it an attractive query tool. The RJDBC interface in IQ allows an application to use the R programming language to perform statistical functions. R is a very popular open source programming language used in many financial applications today. Please read my blog entry '<a href="http://maheshgadgilsblog.blogspot.com/2011/11/r-programming-its-super.html" target="_blank">Programming with R, it's super</a>' for further information on R.</li>
</ul>
<ul style="text-align: left;">
<li><b>In Database Analytics</b> - Sybase IQ, in its latest version 15.4, uses an in-database analytics library called <a href="http://www.fuzzyl.com/products/in-database-analytics/" target="_blank">DB Lytix</a> from Fuzzy Logix. This analytics engine has the ability to perform advanced analytics through simple SELECT and EXECUTE statements. Sybase claims that some of these analytical functions are able to leverage the MapReduce API in certain data mining algorithms. DB Lytix, according to Fuzzy Logix's website, supports mathematical and statistical functions, Monte Carlo simulations (univariate and multivariate), data mining and pattern recognition, principal component analysis, linear regression, logistic regression, other supervised learning methods, and clustering.</li>
</ul>
<br />
Sybase provides detailed documentation on the features of IQ on <a href="http://www.sybase.com" target="_blank">Sybase</a>'s website. If you would like to try Sybase IQ, use this direct <a href="http://response.sybase.com/forms/WWDWNLDExpressEdition" target="_blank">link</a>. Remember, this edition is not an eval copy. It is a full-featured IQ edition limited only by a database size of 5 GB.<br />
<br />
As you can imagine, Sybase is not the only vendor offering big data solutions. This <a href="http://gigaom.com/cloud/so-much-hadoop-in-so-many-places/" target="_blank">gigaom article</a> lists a few others as well. With so many good options to choose from, users have to look at factors other than just the technology to make a selection. Some of those factors are the reputation of the brand, its standing in the market, the extent of its user base and the ecosystem around the product. In that regard, Sybase IQ scores very high points, which is the reason for its position in the leadership <a href="http://www.sybase.com/Sybase-leader-in-Gartner-MQ" target="_blank">quadrant</a> by Gartner. Sybase IQ has been a leading columnar database in the market since the '90s and has established a robust ecosystem around it. Sybase's other products - <a href="http://www.sybase.com/products/modelingdevelopment/powerdesigner" target="_blank">Power Designer</a>, the leading data modelling workbench, <a href="http://www.sap.com/solutions/sapbusinessobjects/index.epx" target="_blank">SAP Business Objects</a> for reporting and analytics, and Sybase Control Center for administration and monitoring of IQ - support IQ and provide one of the most comprehensive analytics platforms in the industry.<br />
<div>
<br /></div>
</div>Anonymoushttp://www.blogger.com/profile/15092746920092475665noreply@blogger.com34tag:blogger.com,1999:blog-3724346560617208824.post-190441681551345532012-02-27T21:24:00.002-05:002012-03-01T19:37:44.505-05:00Tracking user sentiments on Twitter with Twitter4j and Esper<div dir="ltr" style="text-align: left;" trbidi="on">
For newcomers to Complex Event Processing and the Twitter API, I hope this serves as a short tutorial and helps them get off the ground quickly.
<br />
<br />
Managing big data and mining useful information from it is the hottest discussion topic in technology right now. The explosive growth of semi-structured data flowing from social networks like Twitter, Facebook and LinkedIn is making technologies like Hadoop and Cassandra a part of every technology conversation. So as not to fall behind the competition, all customer-centric organizations are actively engaged in creating social strategies. What can a company get out of data feeds from social networks? Think location-based services, targeted advertisements and algorithmic equity trading, for starters. IDC Insights has some <a href="http://idc-insights-community.com/resources/d9f728b4fa/summary" target="_blank">informative blogs</a> on the relationship between big data and business analytics. Big data in itself will be meaningless unless the right analytic tools are available to sift through it, explains Barb Darrow in her <a href="http://gigaom.com/cloud/its-not-the-big-data-its-the-right-data/?utm_source=twitterfeed&utm_medium=twitter" target="_blank">blog post</a> on gigaom.com.<br />
<br />
<br />
Companies often listen in on social feeds to learn customers' interests in and perceptions of their products. They also try to identify "influencers" - those with the most connections in a social graph - so they can make better offers to such individuals and get better mileage out of their marketing. Companies involved in equity trading want to know which publicly traded companies are discussed on Twitter and what users' sentiments about them are. From big companies like IBM to smaller start-ups, everyone is racing to make the most of the opportunities in big data management and analytics. Much documentation about big data, like this ebook from IBM, <a href="https://www14.software.ibm.com/webapp/iwm/web/signup.do?source=sw-infomgt&S_PKG=500016891&S_CPM=is_bdebook1&cmp=109HF&S_TACT=109HF38W&s_cmp=Google-Search-SWG-IMGeneral-EB-0508" target="_blank">'Big Data Platform'</a>, is freely available on the web. However, a lot of it covers theory only. Jouko Ahvenainen, in reply to Barb Darrow's post above, makes a good point that "many people who talk about the opportunity of big data are on too general level, talk about better customer understanding, better sales, etc. In reality you must be very specific, what you utilize and how". <br />
<br />
It does sound reasonable, doesn't it? So I set out to investigate a bit further by prototyping an idea, the only good option I know. If I could do it, anybody could. The code is remarkably simple, but that's exactly the point. Writing a CEP framework yourself is quite complex, but using one is not. In the same way, Twitter makes it really easy to get at the information through its REST API.<br />
<br />
<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDrtlqU7DwozhT_jiZoAqhkx0llFGOK_a8p4u3RFvukzSgXQab70nBHpXowHbkZaYySoVkJIFGxd7-MD_FAOUPnAlrwWStd-JbE5IseYMA4MLqpiI3_QuTfXfuY0vWdM0NbmgeN3P45NUC/s1600/big-data.jpg" imageanchor="1" style="clear: left; cssfloat: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDrtlqU7DwozhT_jiZoAqhkx0llFGOK_a8p4u3RFvukzSgXQab70nBHpXowHbkZaYySoVkJIFGxd7-MD_FAOUPnAlrwWStd-JbE5IseYMA4MLqpiI3_QuTfXfuY0vWdM0NbmgeN3P45NUC/s320/big-data.jpg" uda="true" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Big Data - http://www.bigdatabytes.com/managing-big-data-starts-here/ <br />
<br /></td></tr>
</tbody></table>
<div style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none;">
Complex Event Processing (CEP), as I blogged previously (<a href="http://maheshgadgilsblog.blogspot.com/2012/02/complex-event-processing-beginners-view.html" target="_blank">click here</a> to read), is a critical component of the big data framework. Along with CEP, frameworks such as Hadoop are used to compile, parse and make sense of the 24x7 stream of data from the social networks. Today, Twitter's streaming API and CEP can be used together to capture the happiness levels of Twitter users. The code I present below listens to live tweets and generates a 'happy' event every time "lol" is found in the text of a tweet. CEP is used to capture happy events, and an alert is raised every time the count of happy events exceeds a pre-determined number in a pre-determined time period. The assumption that a user is happy every time he or she uses "lol" is very simplistic, but it helps get the point across. In practice, gauging users' sentiment is not that easy, because it involves natural language analysis. Consider the example below, which highlights the complexities of analyzing natural language. </div>
<br />
<br />
<b>Iphone has never been good.</b><br />
<b>Iphone has never been </b><b><span style="font-size: large;">so</span> good.</b><br />
<br />
As you can see, the addition of just one word completely changed the meaning of the sentence. For this reason, natural language processing is considered one of the toughest problems in computer science. You can learn natural language processing using free online lectures offered by Stanford University; this <a href="http://see.stanford.edu/player/SEEslplayer.aspx?coll=63480b48-8819-4efd-8412-263f1a472f5a&co=8b0279d7-4874-4833-8bb6-b120f27dd70f&sl=true" target="_blank">link</a> takes you directly to the first lecture on natural language analysis by Christopher Manning. In my opinion, though, the pervasive use of abbreviations in social media, and in modern lingo in general, is making the task a little easier. Abbreviations like "lol" and "AFAIK" project their meaning accurately: the use of "lol" projects "funny", and "AFAIK" may indicate the user is "unsure" of him or herself.<br />
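That abbreviation heuristic can be sketched in a few lines of Java. The lexicon and class names here are hypothetical, purely to illustrate the idea of mapping well-known abbreviations to sentiment labels without any real natural language analysis:

```java
import java.util.HashMap;
import java.util.Map;

public class AbbreviationSentiment {

    // Hypothetical lexicon; a real system would need far more entries
    // and proper NLP to handle negation, sarcasm, and context.
    private static final Map<String, String> LEXICON = new HashMap<>();
    static {
        LEXICON.put("lol", "funny");
        LEXICON.put("afaik", "unsure");
    }

    // Returns the sentiment hinted at by the first known abbreviation
    // found in the tweet, or "unknown" if none is present.
    static String tag(String tweet) {
        for (String token : tweet.toLowerCase().split("\\W+")) {
            String sentiment = LEXICON.get(token);
            if (sentiment != null) {
                return sentiment;
            }
        }
        return "unknown";
    }
}
```

Note that a tweet like "Iphone has never been so good" would still come back as "unknown" here, which is exactly the gap that full natural language processing tries to close.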
<br />
The code presented below uses the Twitter4j API to listen to the live Twitter feed and the Esper CEP engine to listen to events and alert us when a threshold is met. You can download Twitter4j binaries or source from <a href="http://twitter4j.org/en/index.html">http://twitter4j.org/en/index.html</a> and Esper from <a href="http://esper.codehaus.org/">http://esper.codehaus.org</a>/. Before you execute the code, make sure to create a Twitter account if you don't have one, and also read Twitter's guidelines and the concepts of its streaming API <a href="https://dev.twitter.com/docs/streaming-api/concepts" target="_blank">here</a>. Authentication with just a username and password combination is currently allowed by Twitter, but it is going to be phased out in favor of OAuth authentication in the near future. Also pay close attention to the 'Access and Rate Limit' section. The code below uses the streaming API in one thread. Please do not use another thread at the same time, to avoid hitting the rate limit; hitting rate limits consistently can result in Twitter blacklisting your Twitter ID. It is also important to note that the streaming API does not send each and every tweet our way. Twitter typically samples the data, sending about 1 out of every 10 tweets. This is not a problem for us, as long as we are interested in patterns in the data and not in any specific tweet. Twitter offers a paid service for businesses that need streaming data with no rate limits. The following diagram shows the components and the processing of data.<br />
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLjLmQKSVRt_42IiY867NZ25PZyPGNB0eAmD3wKDaqB4JKk3HGWJ19OcKFcozs2ZOcpP6xIo7yWobb6dAfU9xnDJ73KNZZ4r8haexb1MCH_bhh9NZRLnN2d4TecKEywEtx8XiJIJ93q5XX/s1600/twitter+post.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLjLmQKSVRt_42IiY867NZ25PZyPGNB0eAmD3wKDaqB4JKk3HGWJ19OcKFcozs2ZOcpP6xIo7yWobb6dAfU9xnDJ73KNZZ4r8haexb1MCH_bhh9NZRLnN2d4TecKEywEtx8XiJIJ93q5XX/s400/twitter+post.jpg" uda="true" width="287" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Diagram. Charts & DB not yet implemented in the code</td></tr>
</tbody></table>
<br />
<textarea cols="70" rows="10">package com.sybase.simple;
public class HappyMessage {
public String user;
private final int ctr=1;
public String getUser() {
return user;
}
public void setUser(String user) {
this.user = user;
}
public int getCtr() {
return ctr;
}
}
</textarea> <br />
<div style="text-align: center;">
Listing 1. Standard java bean representing a happy event. </div>
<br />
<br />
<br />
<textarea cols="70" rows="10">package com.sybase.simple;
import com.espertech.esper.client.EventBean;
import com.espertech.esper.client.UpdateListener;
public class HappyEventListener implements UpdateListener{
public void update(EventBean[] newEvents, EventBean[] oldEvents) {
try {
if (newEvents == null) {
return;
}
EventBean event = newEvents[0];
System.out.println("exceeded the count, actual " + event.get("sum(ctr)"));
} catch (Exception e) {
e.printStackTrace();
}
}
}</textarea> <br />
<div style="text-align: center;">
Listing 2. Esper listener is defined.</div>
<br />
<br />
<br />
<textarea cols="70" rows="10">package com.sybase.simple;
import java.io.IOException;
import twitter4j.Status;
import twitter4j.StatusDeletionNotice;
import twitter4j.StatusListener;
import twitter4j.TwitterException;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.conf.Configuration;
import twitter4j.conf.ConfigurationBuilder;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
public class TwitterTest {
static EPServiceProvider epService;
public static void main(String[] args) throws TwitterException, IOException {
// Creating and registering the CEP listener
com.espertech.esper.client.Configuration config1 = new com.espertech.esper.client.Configuration();
config1.addEventType("HappyMessage", HappyMessage.class.getName());
epService = EPServiceProviderManager.getDefaultProvider(config1);
String expression = "select user, sum(ctr) from com.sybase.simple.HappyMessage.win:time(10 seconds) having sum(ctr) > 2";
EPStatement statement = epService.getEPAdministrator().createEPL(
expression);
HappyEventListener happyListener = new HappyEventListener();
statement.addListener(happyListener);
ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setDebugEnabled(true);
//simple http form based authentication, you can use oAuth if you have one, check Twitter4j documentation
cb.setUser("your Twitter user name here");
cb.setPassword("Your Twitter password here");
// creating the twitter listener
Configuration cfg = cb.build();
TwitterStream twitterStream = new TwitterStreamFactory(cfg)
.getInstance();
StatusListener listener = new StatusListener() {
public void onStatus(Status status) {
if (status.getText().indexOf("lol") >= 0) {
System.out.println("********* lol found *************");
raiseEvent(epService, status.getUser().getScreenName(),
status);
}
}
public void onDeletionNotice(
StatusDeletionNotice statusDeletionNotice) {
System.out.println("Got a status deletion notice id:"
+ statusDeletionNotice.getStatusId());
}
public void onTrackLimitationNotice(int numberOfLimitedStatuses) {
System.out.println("Got track limitation notice:"
+ numberOfLimitedStatuses);
}
public void onScrubGeo(long userId, long upToStatusId) {
System.out.println("Got scrub_geo event userId:" + userId
+ " upToStatusId:" + upToStatusId);
}
public void onException(Exception ex) {
ex.printStackTrace();
}
};
twitterStream.addListener(listener);
//
twitterStream.sample();
}
private static void raiseEvent(EPServiceProvider epService, String name,
Status status) {
HappyMessage msg = new HappyMessage();
msg.setUser(status.getUser().getScreenName());
epService.getEPRuntime().sendEvent(msg);
}
}
</textarea> <br />
<div style="text-align: center;">
Listing 3.</div>
<br />
A Twitter4j listener is created, and this listener and the CEP listener start listening. Every Twitter post is parsed for 'lol'. Every time 'lol' is found, a happy event is generated. The CEP listener raises an alert every time the total count of 'lol' exceeds 2 in the last 10 seconds. <br />
The code establishes a long-running thread to get Twitter feeds. You will see output on the console every time the threshold is met. Please remember to terminate the program; it doesn't terminate on its own.<br />
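If you want to see the windowing logic without standing up Esper and a live Twitter feed, the same rule (more than 2 events in the last 10 seconds) can be hand-rolled as a small sliding-window counter. This is only a stand-in for Esper's win:time view, written to make the mechanics visible; it is not how Esper implements it:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A hand-rolled stand-in for "win:time(10 seconds) having sum(ctr) > 2":
// keep the timestamps of recent events, evict those older than the window,
// and fire when the count remaining in the window exceeds the threshold.
public class SlidingWindowCounter {
    private final long windowMillis;
    private final int threshold;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    public SlidingWindowCounter(long windowMillis, int threshold) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
    }

    // Record one event at the given time; returns true if the alert fires.
    public boolean onEvent(long now) {
        timestamps.addLast(now);
        // Evict events that have slid out of the window.
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= now - windowMillis) {
            timestamps.removeFirst();
        }
        return timestamps.size() > threshold;
    }
}
```

With a 10-second window and a threshold of 2, three 'lol' events arriving within a couple of seconds would fire the alert, exactly like the EPL statement in Listing 3.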
<br />
Now that you have this basic functionality working, you can extend this prototype in a number of ways. You can handle additional data feeds (from sources other than Twitter) and use Esper to correlate data from the two feeds. For visually appealing output, you can feed the results to a charting library. For example, every time Esper identifies an event, the data point is used to render a point on a line graph. If you track the 'happy event' this way, the graph will essentially show the ever-changing level of happiness of Twitter users over a period of time.<br />
<br />
Please use comment section for your feedback, +1 to share and let me know if you would like to see more postings on this subject.</div>Anonymoushttp://www.blogger.com/profile/15092746920092475665noreply@blogger.com6tag:blogger.com,1999:blog-3724346560617208824.post-74213354247261587722012-02-24T19:37:00.001-05:002012-02-24T19:37:28.338-05:00End of ERP as we know it?<div dir="ltr" style="text-align: left;" trbidi="on">
A friend of mine on Facebook drew my attention to this blog post, <a href="http://www.forbes.com/sites/ciocentral/2012/02/09/the-end-of-erp/" target="_blank">'End of ERP'</a> by Tien Tzuo on Forbes.com. With the professional lives of millions tied to ERP in some way, I can imagine the buzz this post must be creating. With <a href="http://www.sap.com/index.epx" target="_blank">SAP</a> being the biggest ERP software maker in the world and the parent company of my employer, I read the post with interest. So as not to be influenced by others' arguments, I haven't read any responses to it yet.<br />
<br />
<br />
If you haven't already, you can read the original post by Tien Tzuo <a href="http://www.forbes.com/sites/ciocentral/2012/02/09/the-end-of-erp/" target="_blank">here</a>. <br />
To get your opinion on this matter, I have created a short survey of only 5 questions, which you can access by clicking <a href="http://www.surveymonkey.com/s/YWPK5CB" target="_blank">here</a>. I will publish the results of the survey soon. A link to the survey also appears at the bottom of this post for your convenience. In my opinion, this notable post (notability derived from the fact that it appeared on Forbes) is heavily biased, as many posts often are. Could Tien's earlier job at <a href="http://salesforce.com/">SalesForce.com</a> as a marketing officer be the reason? Predicting the end of something epic, or of a most trusted technology, is sure to generate a lot of buzz, which is what bloggers often set out to do. The post would have been a lot better and more valuable had he compared ERP's strengths and weaknesses and explained why the weaknesses are so glaring that ERP customers would be willing to walk away from something so crucial to their existence. There is no success for a case that lacks even a semblance of honest acknowledgment of the other side of the argument.<br />
<br />
In support of his argument, Tien mentions some key changes in consumer behavior and consumption patterns. The change in the way customers engage with a company is driving ERP to its inevitable death; this is the main theme of 'End of ERP'. Services-based consumption is rapidly increasing, but it can be applied to only so many things. By focusing on this alone, isn't Tien forgetting the business processes around other product segments? Food, energy, health, vehicles: there are simply too many things we cannot subscribe to and consume remotely. All the <a href="http://erpselectionhelp.com/reviewing-erp-modules-erp-selection-process/" target="_blank">standard functions of an ERP</a> are still required in those sectors, aren't they? A customer may stop buying cars and instead rent from Zipcar, but cars will still have to be made, sold and bought. How would companies manage their businesses and have consolidated views of them without ERP? <br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://www.abouterp.com/" imageanchor="1" style="margin-left: auto; margin-right: auto;" target="_blank"><img border="0" lda="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6mxKrwdGrPwPFVOwGfR0tqZnuozfr_5ab47Iuf-wy_rJZi9r-jOphyphenhyphen86LqtTTR1CQ13kzMCPUVvE-33eOh21Jn4kYI0tHtcDAJGXC9PyfWDS4MzGmjDtiStAUQFsRcc92JUPInXcyl34O/s1600/ERP.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">ERP modules - Credit (<a href="http://www.abouterp.com/">http://www.abouterp.com/</a>)</td></tr>
</tbody></table>
<br />
Tien also mentions companies like SalesForce.com and touts their successes as proof that companies are moving away from ERP. SalesForce doesn't offer anything other than CRM, does it? Does it provide the finance, HR or materials management modules of an ERP? I guess not. You can't run a big company effectively by mishmashing different services from ten different vendors. That's why ERP exists and will keep its market share in the enterprise segment. I do agree, however, that cloudification (I know, I know, it's not a word in the English dictionary) of business functions is an irreversible trend. Oracle's and SAP's acquisitions of <a href="http://techcrunch.com/2012/02/09/oracle-buys-talent-management-solutions-company-taleo-for-1-9-billion/" target="_blank">Taleo </a>and <a href="http://seekingalpha.com/article/312020-sap-s-successfactors-acquisition-a-1-2-billion-wealth-transfer" target="_blank">SuccessFactors</a>, respectively, are an indication of their grudging acceptance of this fact. The key to their success is not the demand for ERP in the cloud, which is ever present, but their ability to integrate the acquired companies and their products to provide the same kind of comprehensive tool set as ERP. <br />
<br />
“End of ERP” concludes by highlighting some key business requirements that, according to Tien, are not met by ERP today. Without going into details, it suffices to say that ERP is not meant to be a silver bullet for all business problems. It does what it does, while ERP providers and their ecosystems try to find solutions to the unresolved business problems. Doesn't business intelligence (BI) software aim to solve the kind of issues he mentions? The point is, there are a number of ways to mine the information you need. The importance of BI is undeniable, and that's where vendors are investing millions. The enormous response to SAP's in-memory analytics appliance <a href="http://www.sap.com/hana/overview/index.epx" target="_blank">HANA</a> is just one example of how innovative products will meet the business requirements of today. While the business problems mentioned in the post may be genuine, they simply highlight opportunities for ERP's improvement and do not in any way spell doom for it.<br />
<br />
Make your voice heard? <a href="http://www.surveymonkey.com/s/YWPK5CB" target="_blank">Take the Survey</a></div>Anonymoushttp://www.blogger.com/profile/15092746920092475665noreply@blogger.com5tag:blogger.com,1999:blog-3724346560617208824.post-2545744802446226942012-02-11T20:21:00.000-05:002012-02-11T20:23:39.871-05:00Complex Event Processing - a beginner's view<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
Using a Complex Event Processing engine is not so complex. Well, initially at least. </div>
<div dir="ltr" style="text-align: left;" trbidi="on">
</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
A substantial amount of information is available on the web about CEP products and functionality. But, if you are like me, you want to test-run a product or application, with little patience for reading detailed documentation. So when I was evaluating CEP as an engine for one of our future products, I decided to just try it out using a business scenario I knew from my past experience working with a financial company. For impatient developers like me, what could be better than a free and open source product? So I decided to use Esper, an open source product based on Java, and was able to write the code (merely 3 Java classes) to address the business case below.<br />
</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
But first, a little about CEP and a shameless plug for our product. My apologies. :-)<br />
</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
Complex Event Processing has been gaining significant ground recently. The benefits of CEP are widely understood in some verticals, such as the financial and insurance industries, where it is actively deployed to perform various business-critical tasks. Monitoring, fraud detection and algorithmic trading are some of the critical tasks that depend on CEP to integrate multiple streams of real-time data, identify patterns and generate actionable events for an organization. <br />
</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
My current employer, Sybase Inc, is one of the leading suppliers of CEP. Aleri, the Sybase CEP product, is widely used in the financial services industry and is the main component of Sybase's leading solution, 'RAP - The Trading Edition'. Aleri is also sold as a separate product. Detailed information about the product is available here: <a href="http://www.sybase.com/products/financialservicessolutions/complex-event-processing">http://www.sybase.com/products/financialservicessolutions/complex-event-processing</a>. <br />
<br />
The high level architecture of a CEP application is shown in the diagram below. </div>
<div class="separator" dir="ltr" style="clear: both; text-align: center;" trbidi="on">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGEQK83g52_JiAu-kaGpKe1p5RvIfss0npfsGZULUy3Q6lNQVIsqzAwLgqGShmk_POtyvGVwRuZDIT_WCmqH0SYZOPxlRzWWHerhOs0nMJzsTWVwY878bM0q7UBwFiCsMf9OgVWwGHTbyW/s1600/Sybase-ESP-diagram.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="218" sda="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGEQK83g52_JiAu-kaGpKe1p5RvIfss0npfsGZULUy3Q6lNQVIsqzAwLgqGShmk_POtyvGVwRuZDIT_WCmqH0SYZOPxlRzWWHerhOs0nMJzsTWVwY878bM0q7UBwFiCsMf9OgVWwGHTbyW/s320/Sybase-ESP-diagram.gif" width="320" /></a></div>
<div class="separator" dir="ltr" style="clear: both; text-align: center;" trbidi="on">
Figure 1.</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
Now on to the best part. The business requirement - </div>
<div dir="ltr" style="text-align: left;" trbidi="on">
</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
The important aspect of CEP that fascinates me is its ability to correlate events or data points from different </div>
<div dir="ltr" style="text-align: left;" trbidi="on">
streams or from within the same data stream. To elaborate, take an example of a retail bank that has a fraud </div>
<div dir="ltr" style="text-align: left;" trbidi="on">
monitoring system in place. The system flags every cash transaction over $10,000 for manual review. This means a large cash transaction (a deposit or withdrawal) in an account raises an anti-money-laundering event from the monitoring system. Such traditional monitoring systems can easily be circumvented or exploited by simple tricks such as depositing more than one check with smaller amounts. What happens if an account holder deposits 2 checks of $6,000 in a day, or 5 checks of $2,500 in a day? Nothing. The system can't catch it. CEP provides a way to define rules with a time-frame criterion. For example, you could specify a rule to raise a flag when someone deposits more than $10,000 in cash within a 12-hour window. Get it? <br />
</div>
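The 12-hour rule is, at its core, just a windowed aggregation. As a rough sketch of the idea in plain Java (this is not Esper code; the class and field names are my own, and one instance tracks a single account), the check amounts to keeping a sliding window of deposits and flagging when the windowed sum crosses the threshold:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Engine-free sketch of the windowed check described above (plain Java, not
// Esper). Class and field names are assumptions; one instance = one account.
class SlidingWindowFlag {
    private static class Deposit {
        final long timeMillis;
        final int amount;
        Deposit(long timeMillis, int amount) {
            this.timeMillis = timeMillis;
            this.amount = amount;
        }
    }

    private final long windowMillis; // e.g. 12 hours
    private final int threshold;     // e.g. $10,000
    private final Deque<Deposit> window = new ArrayDeque<>();
    private long sum = 0;

    SlidingWindowFlag(long windowMillis, int threshold) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
    }

    // Record a deposit; return true when the deposits still inside the
    // sliding time window exceed the threshold.
    boolean onDeposit(long timeMillis, int amount) {
        // Expire deposits that have fallen out of the window.
        while (!window.isEmpty()
                && window.peekFirst().timeMillis <= timeMillis - windowMillis) {
            sum -= window.pollFirst().amount;
        }
        window.addLast(new Deposit(timeMillis, amount));
        sum += amount;
        return sum > threshold;
    }
}
```

An engine like Esper generalizes exactly this bookkeeping: it keys the window by account, expires events by time and evaluates the having clause for you.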
<div dir="ltr" style="text-align: left;" trbidi="on">
Follow the steps below to see how easy it is to implement CEP to meet this business requirement. <br />
</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
Download the latest Esper version (4.5.0 at the time of this writing) from <a href="http://espertech.com/download/">http://espertech.com/download/</a>.</div>
Unzip the package in a separate folder.<br />
Create a Java project and reference the Esper jar files from this folder.<br />
<div style="text-align: left;">
Create a standard Java bean for an event - which here is a deposit event with account-name and amount attributes.</div>
<br />
<textarea cols="70" rows="10">package com.sybase.testTools.util;

public class DepositEvent {
    private String accountName;
    private int amount;

    public DepositEvent(String accountName, int amount) {
        this.accountName = accountName;
        this.amount = amount;
    }

    public String getAccountName() {
        return accountName;
    }

    public int getAmount() {
        return amount;
    }
}
</textarea></div>
<div dir="ltr" style="text-align: center;" trbidi="on">
Listing 1. </div>
<div dir="ltr" style="text-align: left;" trbidi="on">
<br /></div>
<div style="text-align: left;" trbidi="on">
The next listing creates the event type, defines the SQL-like query that detects the event, and registers a listener on that query. The code generates an event any time one of the two deposit accounts, AccountA or AccountB, receives deposits totaling more than 100000 within a time frame of 10 seconds (this is where you specify the time window). Because this is just a test, I have put the event-generation functionality together with the other code, but in real life the deposit amounts would be fed from a deposit transaction processing system over some messaging framework. The code is easy enough to follow. First we create the initial configuration. Then we add the type of event we want. A query with the criteria for selecting the event is created next. As you can see, the amount is summed up over a sliding window of 10 seconds, and an event is created when the total for a particular account within that time frame exceeds 100000. A listener is created next and registered on the query. </div>
<div dir="ltr" style="text-align: left;" trbidi="on">
<br /></div>
<div style="text-align: center;">
<textarea cols="70" rows="10">package com.sybase.testTools;

import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
import com.sybase.testTools.util.DepositEvent;
import com.sybase.testTools.util.MyListener;

public class EsperTest {
    public static void main(String[] args) {
        try {
            Configuration config = new Configuration();
            config.addEventType("DepositEvent", DepositEvent.class.getName());
            EPServiceProvider epService =
                    EPServiceProviderManager.getDefaultProvider(config);
            String expression =
                    "select accountName, sum(amount) from "
                    + "com.sybase.testTools.util.DepositEvent.win:time(10 seconds)"
                    + " group by accountName having sum(amount) > 100000";
            EPStatement statement =
                    epService.getEPAdministrator().createEPL(expression);
            MyListener listener = new MyListener();
            statement.addListener(listener);
            for (int i = 0; i < 1000; i++) {
                DepositEvent event;
                if (i % 2 == 0) {
                    event = new DepositEvent("AccountA", i);
                } else {
                    event = new DepositEvent("AccountB", i);
                }
                epService.getEPRuntime().sendEvent(event);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}</textarea><br />
Listing 2 </div>
<br />
The next listing is the listener. Every time an event is generated in the time window specified in the query, it gets added to the newEvents collection.<br />
<br />
<div style="text-align: center;">
<textarea cols="70" rows="10">package com.sybase.testTools.util;

import com.espertech.esper.client.EventBean;
import com.espertech.esper.client.UpdateListener;

public class MyListener implements UpdateListener {
    public void update(EventBean[] newEvents, EventBean[] oldEvents) {
        try {
            if (newEvents == null || newEvents.length == 0) {
                return;
            }
            EventBean event = newEvents[0];
            System.out.println("Account: " + event.get("accountName")
                    + ", exceeded the sum, actual " + event.get("sum(amount)"));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
</textarea><br />
Listing 3</div>
<br /></div>
Easy enough, right? The expression language itself is fairly easy to understand because of its similarity to standard SQL syntax. Although a real-life implementation can become complex depending on the type and number of feeds and events you want to monitor, the product itself is simple to understand. Many of the commercial CEP products offer excellent user interfaces for creating event types, queries and reports. <br />
<br />
Complex event processing is still a growing field, and the pace of its adoption will only increase as companies try to make sense of all the streams of data flowing in. The amount of semi-structured and other types of data (audio, video) has already surpassed the amount of traditional relational data. It's easy to gauge the impact of a good CEP application at a time when stock trading companies are already gleaning clues from tweet feeds on Twitter. <br />
<br />
Hope this helps the curious. Don't forget to click +1 or Like, if you like it.</div>Anonymoushttp://www.blogger.com/profile/15092746920092475665noreply@blogger.com3tag:blogger.com,1999:blog-3724346560617208824.post-36083383973459127252012-01-14T11:01:00.000-05:002012-01-16T10:19:59.298-05:00Hibernate or JDBC - play smart<div dir="ltr" style="text-align: left;" trbidi="on">
We know how ORM frameworks like Hibernate and JPA make a developer's life easier. Given a choice, we would always code in objects for manipulating data. But once in a while you come across a use case where JDBC code clearly trumps Hibernate in terms of performance. I came across such a use case, where I was required to parse thousands of SQL statements and insert the data into relational tables. What I found was that even after I followed Hibernate's best practices, the Hibernate code was still half as fast as the JDBC code. This post is meant to generate conversation and invite expert opinion on how to determine when not to use Hibernate. Is my use case a perfect example of this? Do you know of any other use cases? <br />
<br />
Typically, Hibernate's best practices on improving performance include the following techniques related to my use case. <br />
<br />
<ul style="text-align: left;">
<li>Use batching while performing batch inserts/updates </li>
<li>Use the second-level cache </li>
<li>Specify the objects for caching </li>
<li>Clear the session to avoid object accumulation in the first-level cache </li>
</ul>
<br />
And so forth. <br />
<br />
I found that the last option, flushing the session often, was the most helpful for my project. But I wanted to post this information to invite help, guidance and advice from Hibernate experts on performance tuning in general, and to verify whether what I did (use JDBC) was the best option for the fastest performance. <br />
<br />
For the more curious, here are the top 5 URLs in a Google search on 'hibernate performance tuning'; none of them discuss the use case I have here. <br />
<br />
<a href="http://www.javadev.org/files/Hibernate%20Performance%20Tuning.pdf">http://www.javadev.org/files/Hibernate%20Performance%20Tuning.pdf</a> <br />
<a href="http://java.dzone.com/articles/hibernate-performance-tuning">http://java.dzone.com/articles/hibernate-performance-tuning</a> <br />
<a href="http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html">http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html</a> <br />
<a href="http://www.javaperformancetuning.com/news/interview041.shtml">http://www.javaperformancetuning.com/news/interview041.shtml</a> <br />
<a href="http://arnosoftwaredev.blogspot.com/2011/01/hibernate-performance-tips.html">http://arnosoftwaredev.blogspot.com/2011/01/hibernate-performance-tips.html</a> <br />
<br />
The application where I had to use JDBC in favor of Hibernate for performance reasons was really very simple. Continuously streamed SQL data was parsed and went into three relational tables - a parent, a child and a child of the child. Let's call them tables A, B and C respectively, for simplicity. There was a 1-1 relation between A and B, while C had a variable number of rows for each row of table B. <br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipppWTz5HmsTHJeBMzZ0uNkUpyQF_OGmYcKicQofCJaWB5_wlUiRZt9EgECDmdku7C6iXUb73OaEzmDJKZMwZnKtbVz7jUM6QXDUd0looKJyN4VEYnoIbXoQejOf9aAtKGb4PLROYr1TYa/s1600/Table+relations.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="180" kba="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipppWTz5HmsTHJeBMzZ0uNkUpyQF_OGmYcKicQofCJaWB5_wlUiRZt9EgECDmdku7C6iXUb73OaEzmDJKZMwZnKtbVz7jUM6QXDUd0looKJyN4VEYnoIbXoQejOf9aAtKGb4PLROYr1TYa/s320/Table+relations.bmp" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Table relationship</td></tr>
</tbody></table>
<br />
The code I wrote with Hibernate added a sample of 100K rows to A, 100K rows to B and around 200K rows to table C. There was no need to read the data while it was being written. The 'for loop' I wrote repeated 100K times, with each iteration adding a row to table A first, followed by table B and then table C. Before I debugged my Hibernate code, its speed progressively decreased until it was crawling, and eventually it threw an 'out of memory' error. I had to add session.flush() at the end of each cycle of the loop. Adding the session.flush() statement greatly improved the performance of writes, and because there was no need to read the data at the time of writing, flushing the session did not cost a performance penalty. <br />
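The fix can be sketched as follows. This is a hypothetical illustration, not my project code: the stand-in Session interface below only records calls so the cadence is visible without a database, but Hibernate's real Session exposes save(), flush() and clear() with the same shape. A batchSize of 1 reproduces flushing after every loop cycle.

```java
import java.util.List;

// Hypothetical sketch of the flush-and-clear cadence discussed above. The
// Session interface is a stand-in; Hibernate's org.hibernate.Session has
// save(), flush() and clear() methods of the same shape.
class FlushEveryN {
    public interface Session {
        void save(Object entity);
        void flush();
        void clear();
    }

    // Insert all entities, flushing and clearing every batchSize saves so the
    // first-level cache never accumulates the whole data set in memory.
    public static void insertAll(Session session, List<Object> entities,
                                 int batchSize) {
        int count = 0;
        for (Object entity : entities) {
            session.save(entity);
            if (++count % batchSize == 0) {
                session.flush(); // push the pending INSERTs to the database
                session.clear(); // drop the references so memory stays flat
            }
        }
        session.flush(); // flush the final, possibly partial, batch
    }
}
```

Without the clear(), every saved object stays referenced in the first-level cache, which is exactly why my original loop slowed to a crawl and ran out of memory.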
<br />
Typically, in any real-life OLTP application some user sessions will always be reading the data while other user sessions are actively writing it. Hibernate's best practices suggest using the second-level cache for better performance, and in most cases it works well when multiple sessions are adding data. Remember, Hibernate caches recently added data automatically. The second-level cache comes into play when multiple sessions are writing and reading data, but in the scenario I discussed, I had only one session writing and no session reading the data. Hence turning the second-level cache on or off had no impact on performance at all. <br />
<br />
But a slightly modified scenario could well be a real-life use case: one session writing the data and multiple sessions reading it. For example, results and statistics of the Summer Olympics being fed to a data store by some feed engine, while participants and viewers view the data in real time. What steps could we take to get the best read/write performance using Hibernate in such a case? Consider, for discussion's sake, that the data from the feed engine is not a bulk upload but more of a live stream accepted through Hibernate code (I guess that's not really a best practice). Do we turn the second-level cache off, do we flush the session on each write cycle, do we batch the write operations, or something else? <br />
<br />
Of course, performance tests using a simulated workload provide the best answer, but as architects how do we propose a solution that is ultimately validated by performance tests? The following table shows the performance numbers I got in my test runs with different configurations. The code was really simple and performed the operations I mentioned earlier. <br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-p-JWvPkDHom7hkyT_Yn8_Lh0gbtP2deVeUT86OPgSv37hSKIdOsNgOaXiqH5xuvRqfG9XB50qizD-l0kQJCBJhZ1Zu8_sjQA-okwP4-hZuU_2-LmPhv2MYpzvhwO7K3c_h0M2HrfVK6G/s1600/hibernate+compare+jdbc.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="305" kba="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-p-JWvPkDHom7hkyT_Yn8_Lh0gbtP2deVeUT86OPgSv37hSKIdOsNgOaXiqH5xuvRqfG9XB50qizD-l0kQJCBJhZ1Zu8_sjQA-okwP4-hZuU_2-LmPhv2MYpzvhwO7K3c_h0M2HrfVK6G/s640/hibernate+compare+jdbc.bmp" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Hibernate Vs JDBC</td></tr>
</tbody></table>
If you are interested, please add a comment and I will send the code out to you. Please note the code was run from within Eclipse on Windows XP with a 512 MB max memory setting in Eclipse. I used a Sybase ASE 15.5 server to host the tables. The relationship between the tables is really simple, each having a primary key, with a foreign key for the one-to-many relation between tables B and C and a one-to-one relation between A and B. It's clear that flushing the session after each write cycle (after adding 1 record to table A, 1 to table B and multiple records to table C) was the most important factor in getting a faster execution time. <strong>But the JDBC code still outperformed the Hibernate code. It was twice as fast as Hibernate.</strong> See the last row in the table.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
Another interesting finding was that batching (batch size 20, as suggested by the documentation) actually performed worse than no batching. From a complexity perspective this application was really simple - not many tables, a simple object structure, simple relationships, no complex SQL. So it wasn't too difficult or cumbersome to use JDBC for the best performance.<br />
There is an abundance of technical literature available on the internet on every technical subject. We read and follow guidelines from well-regarded sources without thinking twice. But on some occasions it's useful to try things out and validate them for yourself. That's what I was forced to do, and I was glad I did it, because I got some interesting results to share with the community. All Hibernate enthusiasts, feel free to comment and advise.</div>Anonymoushttp://www.blogger.com/profile/15092746920092475665noreply@blogger.com7tag:blogger.com,1999:blog-3724346560617208824.post-91892297262394285712011-12-18T15:36:00.002-05:002011-12-25T22:11:48.770-05:00Playing with Fire for a month.<div dir="ltr" style="text-align: left;" trbidi="on">Now that I have owned a Kindle Fire for a month, it's time to look back and document my own, and only my own, opinion about it. This disclaimer is necessary to avoid upsetting the iPad fan club. This is not meant to compare the Fire with any other mobile device in the market. To summarize my findings in one sentence: it's not very often that you get more than what you paid for. That's the Kindle Fire for you. How much more? It's up to everyone to decide after reading this post. In my opinion the Fire hits the sweet spot of adequate features and superior performance at a very affordable price. It's no wonder the Fire has been flying off the shelves since its launch. <br />
For those with little patience for reading, the summary of my evaluation is presented here in a nice tabular format.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7LWXB9igCZWfqZTBSPTjcgoMoWzK5bwF5JS6dKgcOcQwsx5MlHRMqJWqWFmZxR28kzZseCjwdt44_B8QfL4npdhv93hAjtAD8x0SZS8NNmDsTqpXH1BRZOjc5E9xrpsNuduEq9zX7zErj/s1600/fire+features.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7LWXB9igCZWfqZTBSPTjcgoMoWzK5bwF5JS6dKgcOcQwsx5MlHRMqJWqWFmZxR28kzZseCjwdt44_B8QfL4npdhv93hAjtAD8x0SZS8NNmDsTqpXH1BRZOjc5E9xrpsNuduEq9zX7zErj/s400/fire+features.png" width="400" /></a></div><br />
<br />
<br />
I received the Fire in the mail after a month-long wait. I had placed an order on Amazon immediately after the product announcement, basically trusting the good reviews from analysts. Right from the time I opened the box I was hooked. The Fire comes in a cardboard box with a perforated edge that tears without help from scissors; there are no sticky tapes anywhere outside or inside the box. I was off and running in less than 5 minutes. The interface is intuitive and easy. It has its quirks, but they don't rob you of your quality time. <br />
<br />
First off, let's get out of the way some well-documented and much-commented-on missing features of the Fire. There are no buttons other than the power button; not having a button even for controlling volume is kind of annoying. The power button is at the bottom, together with the micro USB port and a headphone jack. Because of the unusual placement of the power button, it's possible for the Fire to go to sleep while you hold it in your lap. But bringing it back to life is easy enough - you slide a switch on the screen, like on an Apple device. The Fire is not exactly light, and because of its very thin edges it's hard to hold in one hand for a long period of time without touching the screen or letting it slip through your hands. I would recommend Amazon provide grooves or impressions/dents on the edges to make it easier to hold.<br />
The Fire does not have a microphone, nor does it have a camera. So for all the Skype fans this could be a serious issue.<br />
<br />
Now on to the best part.<br />
I have used the Fire so far for almost all my online activities: reading my emails from various personal accounts, using social networks like Facebook, Twitter and LinkedIn, watching videos on YouTube, carrying out banking and investing functions, listening to local radio channels and Pandora, downloading books and apps from Amazon, watching HD-quality movies from Amazon, and sending documents to the Kindle over email. Without a doubt, I am extremely happy and pleasantly surprised at its ease of use. The Fire boots remarkably fast, downloads web content at great speed, and shows crystal-clear images and videos. The screen scrolls at a very fast pace, showing the power of the dual-core processor, and the virtual keyboard is sensitive and fun to operate. The sound quality is acceptable and the volume is loud enough for a noisy room. Streaming videos of movies and TV episodes works super fast and the quality of the display is awesome.<br />
<br />
Now for the things I wish the Fire had. I watch educational videos on YouTube all the time. I really wanted to download them to the Fire, so I could watch them when I want and where I want, which to my knowledge is not possible today. The Silk browser does not have any plug-ins that could perform this task, so you are back to downloading videos to your PC and then uploading them to the Fire.<br />
Another missing feature is the absence of a folder structure. The content is automatically classified and stored in different sections like video, audio, books, magazines etc., but you can't access it other than through the menu at the top. A folder structure like Windows Explorer doesn't exist on the Fire today. This prevents you from organizing your content in your own way. <br />
<br />
Now for the quirks I found a little bit annoying. When you are watching a YouTube video in full-screen mode and want to adjust the volume, the only way to make the volume adjustment bar appear is to tilt the Fire at an angle (sometimes, but not always, clicking the small arrow at the bottom displays the settings; why only sometimes is something I still need to figure out). When the Kindle senses that it needs to go from portrait mode to landscape mode or vice versa, it displays the settings at that time. Once you adjust the volume, the video restarts from the beginning and not from where you left it before adjusting the volume. I am sure there must be some workaround, but I haven't found it yet. <br />
Another feature I think is important to mention is uploading documents to the Kindle. It can be done in two ways: by connecting another device, like the PC that hosts the document, or by sending the document as an attachment in an email to your Kindle email address. This email address is automatically assigned to you when you register the Fire. Getting a document uploaded through this email address is something you need to get used to.<br />
<br />
As for the Kindle applications: although Kindle's ecosystem is still limited, it is expected to grow very fast. So far I could find all the applications that I wanted, like Facebook, Pandora, radio streaming, weather and, most importantly, Angry Birds - yeah!<br />
<br />
So friends, the Fire is an excellent and exciting product that competes with the best in the world while providing the best value in its class. Truly an excellent holiday gift, the Fire will bring you blessings from the recipient.<br />
<br />
<br />
</div>Anonymoushttp://www.blogger.com/profile/15092746920092475665noreply@blogger.com1tag:blogger.com,1999:blog-3724346560617208824.post-14980144116542647282011-12-12T22:28:00.000-05:002011-12-12T22:28:31.082-05:00Mobile manifesto - fundamental right to own a mobile device<div dir="ltr" style="text-align: left;" trbidi="on">The person who never handled a tablet before, the person who didn't own a smart phone, the novice and the illiterate in the mobile world. Someone who never used Skype with an iPhone, who didn't know how to take photos on a phone and upload them to Facebook. The person who didn't know what made one smart phone better than another, or didn't have a clue why everyone thought it was important to interact with their phones but not with the person sitting next to them in a meeting, on a train or at a conference. I was that person until a few days back. None of this made me totally irrelevant or useless to my kids or my friends as yet, but it did deprive me of contributing to their supposedly valuable conversations, like which smart phone is the best in the market or why it's important for everyone to buy one as if their life depended on it. Kidding aside, barring a few disagreeing heads, mobile technology has become an enabling technology for all of us. While we take features such as finding directions, listening to music and sharing pictures for granted on smart phones, we are not far removed from the day when all the appliances in and outside of your home will be controlled from your phone. In fact, a remote car-starter application is already available on the iPhone today. So, albeit reluctantly, I convinced myself to buy an Amazon Fire and an HTC smart phone. What I think about those toys will be posted soon, but the reason for today's post is this: <br />
<br />
<div> </div>While we acknowledge the ubiquitous presence of mobile devices in our lives, how are the corporations most of us work for reacting to this revolution? Just a decade ago buying everything online seemed like an idea in the distant future, but look at what Amazon and other companies have made possible in this short period of time. Do corporations take mobile computing as seriously as the web revolution? Do they think that making their business mobile-friendly is a core requirement for staying relevant in the market? Why would a company enable its enterprise applications for mobile computing? Are legitimate security concerns keeping them from going full steam ahead with mobility?<br />
<br />
<div> </div>These questions are addressed, and basically a case for linking mobile computing to the survival of a company is made, in a wonderful document that can be downloaded <a href="http://images.sybase.com/files/sites/mobility_manifesto/" target="_blank">here</a>. It's a great read for everyone even remotely interested in mobile computing. It's called the mobile manifesto. Eric Lai, who regularly blogs for Sybase, is the editor of this wonderful document, and it's kind of funny how he puts forward mobile computing as a fundamental right of a person/corporation. <br />
<br />
<div>The document helps an organization assess its mobile friendliness, the strategies for mobility adoption and the tactics for transforming the way business is conducted. It includes a scorecard that computes an organization's mobility status based on how you scored on the questions, and the milestones of our journey from analog devices to the mobile platform. The document helps you make some tangible decisions, such as whether to buy mobile apps or build them. The key takeaways for an enterprise are</div><br />
<ul style="text-align: left;"><li>Open the doors to mobile computing to keep your enterprise relevant to the market</li>
<li>Adoption of mobile devices by the masses is an irreversible trend. There are more mobile phones than toothbrushes in the world today.</li>
<li>Mobility projects are typically shorter than other technology projects, allowing you to go full steam ahead. There are low-hanging fruits that can provide ROI immediately.</li>
<li>Many companies that previously absolutely opposed allowing employees' personal devices to connect to the corporate network are now rolling out adoption plans in phases.</li>
<li>The security of such devices, and hence the data protection and application maintenance on them, is vitally important, and companies like Sybase offer these services at very competitive rates.</li>
<li>Last but not least, and easier said than done: eliminate the fear of failure. </li>
</ul><br />
For the companies that do not want to host and support their own infrastructure, the cloud option is the most attractive one. In fact, the number one reason companies go to the cloud is the need to open information access to multiple computing devices. Check <a href="http://mashable.com/2011/12/11/cloud-computing-business-infographic/" target="_blank">this article</a> on mashable.com by Matt Silverman for some interesting statistics on cloud adoption. Most intriguing for me is the last item in the chart, about US government employees. It wasn't the number (48% of US government agencies have moved to the cloud) but the mention of the 'cloud-first' policy of government agencies that caught my attention. Assuming it's true, adopting a cloud strategy is a no-brainer. Because government agencies are typically the last to adopt a new paradigm in technology, their adoption of a 'cloud-first' strategy is proof enough that the cloud has gone mainstream. In many companies it's not a question of if but when to go to the cloud, or which applications to migrate to the cloud. </div>Anonymoushttp://www.blogger.com/profile/15092746920092475665noreply@blogger.com0tag:blogger.com,1999:blog-3724346560617208824.post-37320760466064089352011-11-22T21:51:00.000-05:002011-11-22T21:51:40.404-05:00R programming - It's super!<div dir="ltr" style="text-align: left;" trbidi="on">Last week Sybase announced that its market data analytics platform <a href="http://www.sybase.com/products/financialservicessolutions/rap-thetradingedition">Sybase RAP - The Trading Edition</a> now supports R. The R programming language enables faster algorithm development and the handling of huge amounts of data, delivering analysis to traders, risk managers and quantitative analysts. For more information on the offering, please <a href="http://blogs.sybase.com/tradingandrisk/2011/11/optimizing-market-data-analysis-with-r/">click here</a>. Sybase also sponsored a webinar on this subject with researchers from Yale University. 
To listen to the webcast <a href="http://response.sybase.com/forms/NAO11Q4FSIWBCSTRAPRLaunchWebcast?mc=revo">click here</a>.<br />
<br />
R's official website (http://www.r-project.org/) provides much of the required information if you want to try it, but to save you some time, I would like to summarize what I have learnt so far from my quick review of R.<br />
<br />
Handling and analyzing huge data sets, performing statistical calculations and rendering the results in very professional charts are its strengths. Its performance absolutely blew me away. I tried some of its features on a CSV file with 100,000 records with multiple fields, and the results were generated instantaneously. Remarkable! According to one of my friends, R is used heavily not only in financial companies but also in pharmaceutical and research companies where gene sequencing and analysis take place. Bank of America uses R in quantitative analysis. Having spent 10 years with the bank previously, I was surprised I didn't know about it then. The point is, you may not have heard about R because it is not a mainstream programming language, but it is widely used in some industries. Just to show how powerful this language is, these 2 lines of code are enough to read a CSV file with 100,000 lines with "|" as the separator character and render a scatter plot of the values from 2 columns (Durtn & Curs) of this file.<br />
<br />
<br />
<span style="background-color: white; color: black;">myFile <- read.csv("Your FileName", header=TRUE, sep="|")</span><br />
<span style="background-color: white; color: black;">plot(myFile$Durtn, myFile$Curs, main="Duration vs records", xlab="Duration", ylab="Records retrieved")</span><br />
<br />
Google R for more information, but here is the gist.<br />
Some of the features worth mentioning are -<br />
<br />
<ul style="text-align: left;"><li>R is a language and a development environment built for statistical computing and graphics.</li>
<li>It's an open source project.</li>
<li>It was built to address some of the shortcomings of its predecessor, called "S".</li>
<li>Classes, objects, methods - concepts similar to any object-oriented language.</li>
<li>The extent of its functionality does not come close to Java or C++, but it addresses a niche and provides functionality that's far easier to use than Java or C++. </li>
<li>Lists are ordered collections of objects, with no need for the objects to be of the same type.</li>
<li>Data frames are a fantastic concept. R can handle multi-dimensional data sets through the use of data frames.</li>
<li>Reading a file is as easy as assigning a value to a variable.</li>
<li>Accessing a column in a file and running statistical functions on it is accomplished in just one step.</li>
<li>Charting and graphing are done in one step. </li>
<li>Multivariate analysis, a set of techniques dedicated to the analysis of data sets with more than one variable, can be done effectively through R. One use case of multivariate testing is projecting the most effective user behavior (one that yields multiple clicks) on a website by moving the assets around on a web page. </li>
<li>The output can be sent to the console or a file system resource.</li>
<li>Packages are available to plug R into some popular IDEs.</li>
</ul><div style="text-align: left;"></div><div style="text-align: left;"></div>In case you are wondering what other software/languages are used for statistical analysis, please read this thread. <br />
<a href="http://www.reddit.com/r/programming/comments/7fg6i/why_are_sasstata_the_default_statistical_tools/">http://www.reddit.com/r/programming/comments/7fg6i/why_are_sasstata_the_default_statistical_tools/</a><br />
<br />
Downloading and trying R is really easy. Just give it a try and let me know what you think.<br />
<br />
<br />
<div></div></div>Anonymoushttp://www.blogger.com/profile/15092746920092475665noreply@blogger.com6tag:blogger.com,1999:blog-3724346560617208824.post-64200660207108512882011-11-12T09:42:00.000-05:002011-11-12T09:42:20.859-05:00Why clique problem is important? Think social networks like Facebook.<div dir="ltr" style="text-align: left;" trbidi="on">A question was asked in last Sunday's <a href="http://boston.com/" target="_blank">Boston Globe</a>, in its 'Etiquette at work' section, on how to avoid the appearance of cliques. The person who asked the question had formed a group (clique) at her office that eats lunch together one day every week. Her manager joins the group once in a while without an invitation, and the questioner wanted to know how to resolve the situation. Clearly, cliques are an important part of our social life, but what do they have to do with computer science?<br />
<br />
There was a very interesting <a href="http://msdn.microsoft.com/en-us/magazine/hh456397.aspx" target="_blank">article</a> in the <a href="http://msdn.microsoft.com/en-us/magazine/hh463583.aspx" target="_blank">MSDN Magazine October issue</a> by Dr. McCaffrey. It's the first of a series of articles explaining what a clique is, why finding the maximum clique in a graph is important, and an algorithm to solve the problem. So what is a clique in a graph? A clique is a subset of the nodes of a graph in which every node is connected to every other node. The following paragraph is quoted directly from the MSDN article -<br />
<br />
The maximum clique problem is encountered in a wide range of applications, including social network communication analysis, computer network analysis, computer vision and many others. For graphs of even moderate size, it turns out that the maximum clique problem is one of the most challenging and interesting problems in computer science. The techniques used to solve the maximum clique problem - which include tabu search, greedy search, plateau search, real-time parameter adaptation and dynamic solution history - can be used in many other problem scenarios. In short, code that solves the maximum clique problem can be directly useful to you, and the advanced techniques employed in the algorithm can be helpful for solving other difficult programming problems. In the following figure, the set of nodes 0, 1, 3, 4 forms the maximum clique.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfzzvvfChI5Zbqo3R5IWP6QX1rAquFK6UXgEJ6r8lmvl5gBjLyXKJxuJZSdvw8EbzqPpnZUNM19altcG8ljIAUuXWyQeeEFX-KIgC2FGylusXlnNO1VYPvrww7kUgCwzCUaSlSj0xye51p/s1600/clique+from+msdn.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfzzvvfChI5Zbqo3R5IWP6QX1rAquFK6UXgEJ6r8lmvl5gBjLyXKJxuJZSdvw8EbzqPpnZUNM19altcG8ljIAUuXWyQeeEFX-KIgC2FGylusXlnNO1VYPvrww7kUgCwzCUaSlSj0xye51p/s320/clique+from+msdn.png" width="320" /> </a></div><div class="separator" style="clear: both; text-align: center;"><br />
</div><div style="text-align: center;">Figure 1.</div>You can find more information on the clique problem on <a href="http://en.wikipedia.org/wiki/Clique_problem" target="_blank">Wikipedia</a> and many other web sites. <br />
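To make the definition concrete, here is a brute-force sketch in Java - my own illustration on a small hypothetical 5-node graph, not Dr. McCaffrey's algorithm. It enumerates every subset of nodes and keeps the largest subset in which every pair is connected; this is only feasible for tiny graphs, which is exactly why the advanced techniques quoted above exist.

```java
import java.util.ArrayList;
import java.util.List;

public class MaxCliqueDemo {
    // Brute force: try every subset of nodes, keep the largest that is a clique.
    static List<Integer> maxClique(boolean[][] adj) {
        int n = adj.length;
        List<Integer> best = new ArrayList<>();
        for (int mask = 1; mask < (1 << n); mask++) {
            List<Integer> nodes = new ArrayList<>();
            for (int i = 0; i < n; i++)
                if ((mask & (1 << i)) != 0) nodes.add(i);
            if (nodes.size() > best.size() && isClique(adj, nodes)) best = nodes;
        }
        return best;
    }

    // A clique requires an edge between every pair of nodes in the set.
    static boolean isClique(boolean[][] adj, List<Integer> nodes) {
        for (int i = 0; i < nodes.size(); i++)
            for (int j = i + 1; j < nodes.size(); j++)
                if (!adj[nodes.get(i)][nodes.get(j)]) return false;
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical graph, chosen for illustration only.
        int[][] edges = {{0,1},{0,2},{1,2},{1,3},{2,3},{3,4}};
        boolean[][] adj = new boolean[5][5];
        for (int[] e : edges) { adj[e[0]][e[1]] = true; adj[e[1]][e[0]] = true; }
        System.out.println("Maximum clique: " + maxClique(adj)); // prints: Maximum clique: [0, 1, 2]
    }
}
```

With 8 nodes, the subset enumeration already visits 255 candidate sets; the exponential growth is what makes the problem hard.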
While working towards solving this problem, I found myself reflecting on my school years. <br />
When I grew up in India, I disliked mathematics like everyone else. Most students were true haters, but others loved it only because it provided an opportunity to score perfectly on a test. Achieving the perfect score wasn't that hard in spite of our dislike, because every teacher and parent instilled the notion that a good score in mathematics is the yardstick of your intelligence. Whether you liked math or not, and scored perfectly or not, I seriously doubt any student was interested in knowing why mathematics was needed at all. Anyone who dared to raise the question would typically get the silent treatment from the teacher.<br />
<br />
Throughout my school years, from the day I learnt to count to the days of triple integration in engineering school, I must have asked myself this question countless times. However, by the time I graduated, I was grudgingly admitting to myself that mathematics was, and would forever be, the foundation stone of so many fields of science. Despite my dislike, I thought I wouldn't escape mathematics given my engineering profession, but then personal computers took over the world, India's software industry boomed, and becoming a software professional was all any student could think of. It provided a level playing field for all, engineers and non-engineers alike. Anyone with decent analytical skills could become a programmer. Engineers too became software programmers en masse and marvelled at the thought of not having to do anything with mathematics any more. <br />
<br />
Really? Would the software profession be that exciting or sexy if you took away the search algorithms of Google, the content distribution algorithms of Akamai, or the algorithms developed for ultra-fast equity trading by nerds on Wall Street? Take away this exciting stuff and what remains is mundane 'go to' and 'if then' statements, or the ever-present 'do while' and 'for' loops. So, whether we like it or not, mathematics is the foundation required to solve challenging problems. <br />
<br />
The reason for this post, though, is the Java code I wrote to solve the maximum clique problem for the following graph. Readers are encouraged to see if the code works on other graphs. I didn't want to look at the solution presented by Dr. McCaffrey (which, by the way, is in C#) before I tried it myself. Creating the data set (storing graph information like nodes and segments) is a challenge in itself. I have taken a shortcut by creating a static array of strings to represent the segments of the graph. My co-worker Tyler Perry worked on a C++ solution to this problem while I worked on the Java solution. If you'd like to see more posts like this, click the +1 button at the bottom or use the share widget on the right to share it with others. My code was tested on the following graph.<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3A38C3mmZLFlaTkayrAVtBXIY87NjppYNJyuOZFATNfgSdQ80-8ETvlr9YfFJHZ_bPyDjsY2cIP44y4ytGmOPKyXLvaji1J0K75zdB4RTNEB_VMcrzhf5hnWRUxWW_7dcIeIpiLDrpbXC/s1600/cliq.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="261" nda="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3A38C3mmZLFlaTkayrAVtBXIY87NjppYNJyuOZFATNfgSdQ80-8ETvlr9YfFJHZ_bPyDjsY2cIP44y4ytGmOPKyXLvaji1J0K75zdB4RTNEB_VMcrzhf5hnWRUxWW_7dcIeIpiLDrpbXC/s320/cliq.gif" width="320" /></a></div><br />
<div style="text-align: left;"><br />
</div><div style="text-align: left;">Let's do it a little differently this time. Instead of posting the code here, I will send it to you directly on request. Please comment if you would like to receive the code. </div></div>Anonymoushttp://www.blogger.com/profile/15092746920092475665noreply@blogger.com0tag:blogger.com,1999:blog-3724346560617208824.post-92180840917895025152011-11-05T22:07:00.000-04:002011-11-06T11:36:25.795-05:00More on the Constrained Tree Schema<div dir="ltr" style="text-align: left;" trbidi="on">Last week I posted the Java code that identifies the TPCC schema as a constrained tree schema. I also explained why a constrained tree schema is important for an OLTP application. However, a little more on the subject is in order to make it easier to understand. First, for those who may not have looked at the TPCC schema before, the following ER diagram shows the table relationships.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDvKByj-zAe14a_o8QvVrzSbGCRtWLpUARTObsA2izpsij1k_x4kFWgNBor9r-Ig-MciGvRboHjGNRMezhoBZ5x0fK07Deso35mBbM6hoSrTRgjjKC7vRUN-dWrpvrAMP3uAhvEj6RyRnp/s1600/tpcc+schema.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" ida="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDvKByj-zAe14a_o8QvVrzSbGCRtWLpUARTObsA2izpsij1k_x4kFWgNBor9r-Ig-MciGvRboHjGNRMezhoBZ5x0fK07Deso35mBbM6hoSrTRgjjKC7vRUN-dWrpvrAMP3uAhvEj6RyRnp/s640/tpcc+schema.bmp" width="580" /></a></div><div style="text-align: center;"><br />
</div><div style="text-align: center;">Figure 1. TPCC schema</div><div style="text-align: center;"><br />
</div><div style="text-align: left;">The Warehouse table has w_id as its primary key, while the District table has a composite primary key consisting of d_id and d_w_id. The d_w_id column is a foreign key pointing to the Warehouse table. This pattern continues in other tables: the Customer table's composite primary key consists of c_id, c_d_id and c_w_id, where c_d_id and c_w_id are foreign keys pointing to the District and Warehouse tables respectively. </div><div style="text-align: left;"><br />
</div><div style="text-align: left;">For those who have used any kind of object-relational mapping tool, this type of relationship may look totally out of the ordinary. ORM tools such as Hibernate typically advocate a surrogate key (like a sequence number) as the primary key of a table. Having a composite primary key on a table makes the Hibernate and Java code harder to write, as you have to create a separate Java class for the primary key. But for extreme OLTP applications, composite primary keys make perfect sense.<br />
<br />
In applications that process millions of transactions per minute, it's quite likely that performance, throughput and latency will degrade as the tables holding transaction data start filling up rapidly. Throwing additional hardware at the problem works to an extent. When adding hardware doesn't provide a proportionate performance boost, the application is redesigned through various mechanisms, including horizontal and vertical partitioning of the relational tables. <br />
<br />
Many applications use range partitioning to improve performance. It works somewhat like this. Let's say a company, ABC Inc, sells consumer durable goods in the United States and has 4 operational units, one for each of its 4 regions - east, west, north and south. <br />
<br />
Its customer order entry application generates an order number based on the region the order originated from. ABC Inc has built application logic where an order generated in the east region gets a number between 1 and 1 million, and one in the west gets a number between 1 million and 2 million. To provide better performance, ABC's IT department has partitioned the order table into 4 partitions and hosted each partition on a different server. For many companies this works without much of a problem. However, this scheme has an inherent weakness in dealing with situations such as equal load distribution. If one of ABC Inc's regions is doing twice as well as the others (in terms of its sales numbers), the server capturing that region's orders will be twice as busy as the others, giving inconsistent user response times across regions. Another factor to consider when partitioning tables is co-location of related data. If the orders table is partitioned, what do we do about other tables? Do your queries often join partitioned and non-partitioned tables? Why is this important? <br />
<br />
Continuing with the ABC Inc example - the IT department decides to partition the payments table too. While a salesperson from ABC would like to see orders from his or her customers, he or she would also like to see whether those customers paid for the items sold to them. For the application to keep providing the best scalability and latency, it is imperative that the partitions of the payments table are co-located with the corresponding partitions of the orders table on the same server. Co-locating a customer's order and payment information on the same server is the key, but that adds to the complexity of your application. Add another related table and you can imagine how complex the design is going to be. <br />
<br />
Hash partitioning solves one of the problems, that of uneven load on different servers. Which server particular data resides on is decided by a hash algorithm, which guarantees a more even load across servers. The other problem, co-location of data, is solved by the constrained tree schema and the hash algorithm together. Going back to our TPCC schema, tables such as district, customer and stock all have the warehouse id in their primary key. So including the warehouse id in the hash used to partition all the tables automatically guarantees co-location of data. That means data related to one warehouse and its related districts, customers and stock is located on the same server. You may really want to read "<a href="http://nms.csail.mit.edu/~stavros/pubs/hstore.pdf">The End of an Architectural Era</a>" for further details. It's a great read for anyone remotely interested in database design and performance. <a href="http://low-latency.com/">low-latency.com</a> is another great resource for those interested in extremely high-performing financial applications. <br />
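The routing idea can be sketched in a few lines of Java. The server count and the choice of hash are my own illustrative assumptions, not taken from the paper; the point is only that because every table in the tree carries the warehouse id in its primary key, hashing on that one column routes a warehouse's orders, payments, districts and customers to the same server.

```java
public class WarehousePartitionDemo {
    static final int SERVERS = 4; // assumed cluster size, for illustration only

    // Every table in the tree carries the warehouse id (w_id, d_w_id, c_w_id, ...),
    // so hashing on it alone routes all related rows to the same server.
    static int serverFor(int warehouseId) {
        return Math.floorMod(Integer.hashCode(warehouseId), SERVERS);
    }

    public static void main(String[] args) {
        int wId = 7;
        // Order and payment rows for warehouse 7 land on the same server,
        // so a transaction touching both can be executed at a single site.
        System.out.println("order row   (w_id=7) -> server " + serverFor(wId));
        System.out.println("payment row (w_id=7) -> server " + serverFor(wId));
    }
}
```

A real system would hash on the full partitioning key and handle rebalancing, but even this toy version shows why uneven regional load disappears: the hash spreads warehouses across servers regardless of which region is selling more.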
<br />
I hope this helps clear up some of the concepts behind designing applications and databases for extreme scale and performance. Don't forget to comment if you have any questions. <br />
<br />
</div><div style="text-align: left;"><br />
</div></div>Anonymoushttp://www.blogger.com/profile/15092746920092475665noreply@blogger.com0tag:blogger.com,1999:blog-3724346560617208824.post-13473519776954845542011-10-30T15:33:00.000-04:002011-10-30T15:47:46.953-04:00Detect constrained tree schema programmatically.<div dir="ltr" style="text-align: left;" trbidi="on">A tree schema in a relational database is built on hierarchical relationships between tables. It is one of the most widely used types of schema in OLTP applications today. <br />
<br />
I present here the Java code to identify whether some or all of the tables in a schema belong to a tree structure, and which table sits at the top of the tree (the root, depending on how you view it). The code also identifies the primary keys of the tables in the tree hierarchy. It effectively identifies the TPCC schema (used to measure the <a href="http://tpc.org/tpcc/default.asp"><span style="color: blue;">TPCC</span></a> benchmark by hardware and software vendors) as a constrained tree schema. <br />
<br />
A constrained tree schema is used frequently in OLTP systems to achieve very low latency and high throughput. If you look at the TPCC schema carefully, you will realize that the "w_id" column in the warehouse table is the root key of the tree hierarchy, which also means that all the tables that belong to the tree have composite primary keys that include "w_id", although the column is renamed "d_w_id" in the district table, "c_w_id" in the customer table and so on.<br />
<br />
Table partitioning (breaking the data in a relational table into multiple partitions) is one of the ways an extreme transaction processing application can achieve the desired scalability. The purpose of this post is to present the Java code; if you want more information on constrained tree schemas and the applications built on top of them, <a href="http://www.devwebsphere.com/devwebsphere/2008/01/constrained-tre.html">Billy Newport's article</a> is a great start.<br />
<br />
Identifying a constrained tree schema and its participating tables is important in evaluating whether you can achieve single-sited transactions, which boost the performance of an application dramatically. "<a href="http://nms.csail.mit.edu/~stavros/pubs/hstore.pdf"><span style="background-color: white; color: blue;">The End of an Architectural Era</span></a>" is a wonderful paper written by Michael Stonebraker of MIT and others that presents a change in application paradigm to achieve high scalability.<br />
<br />
The Java code presented below should be used as guidance only. Adding robust exception handling and following best coding practices would help make it production ready. The code assumes that you are using the <a href="http://www.sybase.com/products/databasemanagement/adaptiveserverenterprise"><span style="color: blue;">Sybase ASE database</span></a> as a data source, which has the "sp_fkeys" stored procedure that lists table dependencies through pk-fk constraints. You may have to rewrite that portion of the code, and the database connection information section, for the database you are using. Identifying the TPCC schema as a constrained tree schema is easy enough to do manually because it has only 8 tables, but compare that to an application like <a href="http://www.sap.com/">SAP</a> R3, which has more than 75,000 tables in its schema, and you will quickly appreciate the necessity of automation.<br />
<br />
The code uses a depth-first rather than breadth-first strategy, for no specific reason. It treats the relationships between the tables as a graph, with the tables as nodes and the relationships between them as edges. For more on this subject you may want to read this very interesting chapter from a book, <a href="http://www.artfulsoftware.com/mysqlbook/sampler/mysqled1ch20.html">the artful software</a>.<br />
<br />
Running this code displays the name of the root table, the name of the root key and the depth of the tree. It also displays the list of tables that belong to the tree and their primary keys. There are other, smaller trees in the TPCC schema, but the one with the deepest hierarchy is the one we are interested in, and the table at the top of that hierarchy is our root table. This information, along with other aspects such as table sizes and the volume of different transactions, helps an application owner decide on a table partitioning strategy.<br />
<br />
<div style="text-align: left;"><span style="background-color: #d9ead3;">//The java code</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;">package YOUR PACKAGE NAME</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;"><br />
</span><br />
<span style="background-color: #d9ead3;">import java.sql.Connection;<br />
import java.sql.DriverManager;<br />
import java.sql.ResultSet;<br />
import java.sql.SQLException;<br />
import java.sql.Statement;<br />
import java.util.ArrayList;<br />
import java.util.HashMap;<br />
import java.util.HashSet;<br />
import java.util.Iterator;</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;"><br />
</span><br />
<span style="background-color: #d9ead3;">import com.sybase.jdbc4.jdbc.SybCallableStatement;</span><br />
<span style="background-color: #d9ead3;">import com.sybase.jdbcx.SybDriver;</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;">public class LatestTpcc {</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;"> /**</span><br />
<span style="background-color: #d9ead3;"> * @param args</span><br />
<span style="background-color: #d9ead3;"> * @throws SQLException</span><br />
<span style="background-color: #d9ead3;"> */</span><br />
<span style="background-color: #d9ead3;"> public static void main(String[] args) throws SQLException {</span><br />
<span style="background-color: #d9ead3;"> Connection con = null;</span><br />
<span style="background-color: #d9ead3;"> try {</span><br />
<span style="background-color: #d9ead3;"> SybDriver sybDriver = (SybDriver) Class.forName(</span><br />
<span style="background-color: #d9ead3;"> "com.sybase.jdbc4.jdbc.SybDriver").newInstance();</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;"> sybDriver.setVersion(com.sybase.jdbcx.SybDriver.VERSION_7);</span><br />
<span style="background-color: #d9ead3;"> DriverManager.registerDriver(sybDriver);</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;"> con = DriverManager.getConnection(</span><br />
<span style="background-color: #d9ead3;"> YOUR CONNECTION INFORMATION } catch (Exception e) {</span><br />
<span style="background-color: #d9ead3;"> e.printStackTrace();</span><br />
<span style="background-color: #d9ead3;"> }</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;"> ArrayList<String> allTables = new LatestTpcc().getAllTables(con);</span><br />
<span style="background-color: #d9ead3;"> // loop through all tables bring their pk, fks</span><br />
<span style="background-color: #d9ead3;"> Iterator tableIterator = allTables.iterator();</span><br />
<span style="background-color: #d9ead3;"> String tableName = null;</span><br />
<span style="background-color: #d9ead3;"> ResultSet rs2 = null;</span><br />
<span style="background-color: #d9ead3;"> HashSet<String> allKeysSet = new HashSet<String>();</span><br />
<span style="background-color: #d9ead3;"> ArrayList<String> pair;</span><br />
<span style="background-color: #d9ead3;"> HashSet<ArrayList> allPairsSet = new HashSet<ArrayList>();</span><br />
<span style="background-color: #d9ead3;"> while (tableIterator.hasNext()) {</span><br />
<span style="background-color: #d9ead3;"> tableName = (String) tableIterator.next();</span><br />
<span style="background-color: #d9ead3;"> rs2 = new LatestTpcc().getFKeys(con, tableName);</span><br />
<span style="background-color: #d9ead3;"> // loop through the result set</span><br />
<span style="background-color: #d9ead3;"> while (rs2.next()) {</span><br />
<span style="background-color: #d9ead3;"> pair = new ArrayList<String>();</span><br />
<span style="background-color: #d9ead3;"> pair.add(rs2.getString("pkcolumn_name") + ":"</span><br />
<span style="background-color: #d9ead3;"> + rs2.getString("pktable_name"));</span><br />
<span style="background-color: #d9ead3;"> pair.add(rs2.getString("fkcolumn_name") + ":"</span><br />
<span style="background-color: #d9ead3;"> + rs2.getString("fktable_name"));</span><br />
<span style="background-color: #d9ead3;"> pair.add("notReached");</span><br />
<span style="background-color: #d9ead3;"> allPairsSet.add(pair);</span><br />
<span style="background-color: #d9ead3;"> // System.out.println("adding pair :"+rs2.getString("pkcolumn_name")+":"+rs2.getString("fkcolumn_name"));</span><br />
<span style="background-color: #d9ead3;"> allKeysSet.add(rs2.getString("pkcolumn_name") + ":"</span><br />
<span style="background-color: #d9ead3;"> + rs2.getString("pktable_name"));</span><br />
<span style="background-color: #d9ead3;"> allKeysSet.add(rs2.getString("fkcolumn_name") + ":"</span><br />
<span style="background-color: #d9ead3;"> + rs2.getString("fktable_name"));</span><br />
<span style="background-color: #d9ead3;"> }</span><br />
<span style="background-color: #d9ead3;"> }</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;"> // now that we have all sets and everything let's light up</span><br />
<span style="background-color: #d9ead3;"> Iterator allKeysSetIter = allKeysSet.iterator();</span><br />
<span style="background-color: #d9ead3;"> String theCurrentKey = null;</span><br />
<span style="background-color: #d9ead3;"> HashSet<String> c = new HashSet<String>();</span><br />
<span style="background-color: #d9ead3;"> HashMap<String, HashSet> keysToColumns = new HashMap<String, HashSet>();</span><br />
<span style="background-color: #d9ead3;"> int maxDepth = 0;</span><br />
<span style="background-color: #d9ead3;"> int theDepth = 0;</span><br />
<span style="background-color: #d9ead3;"> String maxDepthKey = "";</span><br />
<span style="background-color: #d9ead3;"> while (allKeysSetIter.hasNext()) {</span><br />
<span style="background-color: #d9ead3;"> theCurrentKey = (String) allKeysSetIter.next();</span><br />
<span style="background-color: #d9ead3;"> theDepth = 0;</span><br />
<span style="background-color: #d9ead3;"> // all pairs need to be set to non reached for every loop</span><br />
<span style="background-color: #d9ead3;"> // so loop through and set to non reached</span><br />
<span style="background-color: #d9ead3;"> Iterator allPairsSetIterTemp = allPairsSet.iterator();</span><br />
<span style="background-color: #d9ead3;"> while (allPairsSetIterTemp.hasNext()) {</span><br />
<span style="background-color: #d9ead3;"> ArrayList<String> keyPair = (ArrayList) allPairsSetIterTemp</span><br />
<span style="background-color: #d9ead3;"> .next();</span><br />
<span style="background-color: #d9ead3;"> if (keyPair.size() > 2) {</span><br />
<span style="background-color: #d9ead3;"> keyPair.set(2, "notReached");</span><br />
<span style="background-color: #d9ead3;"> }</span><br />
<span style="background-color: #d9ead3;"> }</span><br />
<span style="background-color: #d9ead3;"> // which pair has this key? I will add corresponding key in the pair</span><br />
<span style="background-color: #d9ead3;"> // to this set c</span><br />
<span style="background-color: #d9ead3;"> // so loop through allPairsSet</span><br />
<span style="background-color: #d9ead3;"> boolean reached = false;</span><br />
<span style="background-color: #d9ead3;"> do {</span><br />
<span style="background-color: #d9ead3;"> Iterator allPairsSetIter = allPairsSet.iterator();</span><br />
<span style="background-color: #d9ead3;"> reached = false;</span><br />
<span style="background-color: #d9ead3;"> while (allPairsSetIter.hasNext()) {</span><br />
<span style="background-color: #d9ead3;"> ArrayList<String> keyPair = (ArrayList) allPairsSetIter</span><br />
<span style="background-color: #d9ead3;"> .next();</span><br />
<span style="background-color: #d9ead3;"> boolean keyVisited = false;</span><br />
<span style="background-color: #d9ead3;"> if (keyPair.size() > 2 && keyPair.get(2).equals("reached")) {</span><br />
<span style="background-color: #d9ead3;"> keyVisited = true;</span><br />
<span style="background-color: #d9ead3;"> }</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;"> if ((((String) (keyPair.get(0))).equals(theCurrentKey) || new LatestTpcc()</span><br />
<span style="background-color: #d9ead3;"> .gotAMatch(((String) (keyPair.get(0))), c))</span><br />
<span style="background-color: #d9ead3;"> && keyVisited != true) {</span><br />
<span style="background-color: #d9ead3;"> c.add(theCurrentKey);</span><br />
<span style="background-color: #d9ead3;"> c.add((String) (keyPair.get(1)));</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;"> keyPair.set(2, "reached");</span><br />
<span style="background-color: #d9ead3;"> theDepth++;</span><br />
<span style="background-color: #d9ead3;"> // c.add(String.valueOf(theDepth));</span><br />
<span style="background-color: #d9ead3;"> // System.out.println("the depth of the tree :"+theDepth+" for: "+theCurrentKey);</span><br />
<span style="background-color: #d9ead3;"> reached = true;</span><br />
<span style="background-color: #d9ead3;"> // System.out.println("relationship :"+theCurrentKey+" : "+(String)(keyPair.get(1)));</span><br />
<span style="background-color: #d9ead3;"> }</span><br />
<span style="background-color: #d9ead3;"> }</span><br />
<span style="background-color: #d9ead3;"> } while (reached == true);</span><br />
<span style="background-color: #d9ead3;"> keysToColumns.put(theCurrentKey, c);</span><br />
<span style="background-color: #d9ead3;"> if (theDepth > maxDepth) {</span><br />
<span style="background-color: #d9ead3;"> maxDepth = theDepth;</span><br />
<span style="background-color: #d9ead3;"> maxDepthKey = theCurrentKey;</span><br />
<span style="background-color: #d9ead3;"> }</span><br />
<span style="background-color: #d9ead3;"> c = new HashSet<String>();</span><br />
<span style="background-color: #d9ead3;"> }</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;"> new LatestTpcc().displayValues(keysToColumns, maxDepthKey);</span><br />
<span style="background-color: #d9ead3;"> System.out.println("max depth key : " + maxDepthKey + " depth :"</span><br />
<span style="background-color: #d9ead3;"> + maxDepth);</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;"><br />
</span><br />
<span style="background-color: #d9ead3;"> }</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;"> private void displayValues(HashMap map, String maxDepthKey) {</span><br />
<span style="background-color: #d9ead3;"> Iterator mapIter = map.keySet().iterator();</span><br />
<span style="background-color: #d9ead3;"> while (mapIter.hasNext()) {</span><br />
<span style="background-color: #d9ead3;"> String key = (String) mapIter.next();</span><br />
<span style="background-color: #d9ead3;"> HashSet<String> hs = (HashSet<String>) map.get(key);</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;"> // display for only max depth key</span><br />
<span style="background-color: #d9ead3;"> if (key.equals(maxDepthKey)) {</span><br />
<span style="background-color: #d9ead3;"> System.out.println("Tree : " + key);</span><br />
<span style="background-color: #d9ead3;"> Iterator hsIter = hs.iterator();</span><br />
<span style="background-color: #d9ead3;"> while (hsIter.hasNext()) {</span><br />
<span style="background-color: #d9ead3;"> String nextKey = (String) hsIter.next();</span><br />
<span style="background-color: #d9ead3;"> System.out.println("element : " + nextKey);</span><br />
<span style="background-color: #d9ead3;"> }</span><br />
<span style="background-color: #d9ead3;"> }</span><br />
<span style="background-color: #d9ead3;"> }</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;"> // System.out.println("the end");</span><br />
<span style="background-color: #d9ead3;"> }</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;"> private boolean gotAMatch(String fetchedKey, HashSet<String> c) {</span><br />
<span style="background-color: #d9ead3;"> // loop through c and see if you find a match</span><br />
<span style="background-color: #d9ead3;"> Iterator cIter = c.iterator();</span><br />
<span style="background-color: #d9ead3;"> while (cIter.hasNext()) {</span><br />
<span style="background-color: #d9ead3;"> String fromC = (String) cIter.next();</span><br />
<span style="background-color: #d9ead3;"> if (fromC.equals(fetchedKey)) {</span><br />
<span style="background-color: #d9ead3;"> return true;</span><br />
<span style="background-color: #d9ead3;"> }</span><br />
<span style="background-color: #d9ead3;"> }</span><br />
<span style="background-color: #d9ead3;"> return false;</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;"> }</span><br />
<span style="background-color: #d9ead3;">/**</span><br />
<span style="background-color: #d9ead3;"> * </span><br />
<span style="background-color: #d9ead3;"> * @param con</span><br />
<span style="background-color: #d9ead3;"> * @return</span><br />
<span style="background-color: #d9ead3;"> * Returns the list of all user tables in this database</span><br />
<span style="background-color: #d9ead3;"> */</span><br />
<span style="background-color: #d9ead3;"> private ArrayList<String> getAllTables(Connection con) {</span><br />
<span style="background-color: #d9ead3;"> ArrayList<String> al = new ArrayList<String>();</span><br />
<span style="background-color: #d9ead3;"> // try-with-resources closes the statement and result set even on error</span><br />
<span style="background-color: #d9ead3;"> try (Statement stmt = con.createStatement();</span><br />
<span style="background-color: #d9ead3;"> ResultSet rs = stmt</span><br />
<span style="background-color: #d9ead3;"> .executeQuery("select name from sysobjects where type = 'U'")) {</span><br />
<span style="background-color: #d9ead3;"> while (rs.next()) {</span><br />
<span style="background-color: #d9ead3;"> al.add(rs.getString("name"));</span><br />
<span style="background-color: #d9ead3;"> }</span><br />
<span style="background-color: #d9ead3;"> } catch (Exception e) {</span><br />
<span style="background-color: #d9ead3;"> e.printStackTrace();</span><br />
<span style="background-color: #d9ead3;"> }</span><br />
<span style="background-color: #d9ead3;"> return al;</span><br />
<span style="background-color: #d9ead3;"> }</span><br />
<span style="background-color: #d9ead3;">/**</span><br />
<span style="background-color: #d9ead3;"> * @param con an open connection to the database</span><br />
<span style="background-color: #d9ead3;"> * @param tableName name of the primary-key table</span><br />
<span style="background-color: #d9ead3;"> * @return a result set listing the foreign keys that reference this table,</span><br />
<span style="background-color: #d9ead3;"> * including the primary key columns and the names of the dependent tables;</span><br />
<span style="background-color: #d9ead3;"> * the caller is responsible for closing it</span><br />
<span style="background-color: #d9ead3;"> */</span><br />
<span style="background-color: #d9ead3;"> private ResultSet getFKeys(Connection con, String tableName) {</span><br />
<span style="background-color: #d9ead3;"> ResultSet rs = null;</span><br />
<span style="background-color: #d9ead3;"> try {</span><br />
<span style="background-color: #d9ead3;"> // call the sp_fkeys system procedure; the parameter is passed by</span><br />
<span style="background-color: #d9ead3;"> // name (@pktable_name) through jConnect's SybCallableStatement</span><br />
<span style="background-color: #d9ead3;"> String procName = "sp_fkeys";</span><br />
<span style="background-color: #d9ead3;"> String execRPC = "{call " + procName + " (?)}";</span><br />
<span style="background-color: #d9ead3;"> SybCallableStatement scs = (SybCallableStatement) con</span><br />
<span style="background-color: #d9ead3;"> .prepareCall(execRPC);</span><br />
<span style="background-color: #d9ead3;"> scs.setString(1, tableName);</span><br />
<span style="background-color: #d9ead3;"> scs.setParameterName(1, "@pktable_name");</span><br />
<span style="background-color: #d9ead3;"> rs = scs.executeQuery();</span><br />
<span style="background-color: #d9ead3;"> } catch (Exception e) {</span><br />
<span style="background-color: #d9ead3;"> e.printStackTrace();</span><br />
<span style="background-color: #d9ead3;"> }</span><br />
<span style="background-color: #d9ead3;"> return rs;</span><br />
<span style="background-color: #d9ead3;"> }</span></div><div style="text-align: left;"><span style="background-color: #d9ead3;">}</span></div></div>Anonymoushttp://www.blogger.com/profile/15092746920092475665noreply@blogger.com0