Saturday, January 25, 2014

ASE, HANA and much more

After a long time I am back to writing this blog. Much has happened in the SAP ecosystem since my last post. SAP HANA has gone to the cloud, ODATA has become the standard way of communicating between the different layers of SAP products, and SAP Business Suite customers are increasingly using Adaptive Server Enterprise (ASE) as the underlying database.

As I continue to work on ODATA and ASE, new things keep surprising me. These quirks may not be that surprising to an ASE veteran, but they are to users coming from other databases. Whatever you call them - quirks, peculiarities or special features - they have been discussed in various forums. One particularly helpful resource I found is maintained by one of my co-workers; check it out (http://www.sypron.nl/main.html) to find a wealth of interesting ASE facts compiled over the years.

Today, I will discuss a feature of ASE that an average user may not have come across: the SQL command SET CHAINED ON/OFF, which controls transaction behavior. With SET CHAINED OFF (unchained mode, the default), ASE treats each individual statement as its own transaction and commits it, unless you explicitly begin a transaction yourself; in JDBC code you get this behavior with connection.setAutoCommit(true). With SET CHAINED ON, ASE implicitly opens a transaction before the first statement and keeps it open until you commit or roll back, which corresponds to connection.setAutoCommit(false). The chained mode is particularly important if you are trying to execute a stored procedure on ASE, because the successful execution of an ASE stored procedure depends on using the correct chained mode. What bugs me is that, as a user of the stored procedure, you need to know which mode was in effect when that stored procedure was created in the database.
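To make the mapping concrete, here is a tiny JDBC sketch; the connection URL and credentials are placeholders, and the comments describe how jConnect typically maps the autocommit flag onto the session's chained setting.

import java.sql.Connection;
import java.sql.DriverManager;

public class ChainedModeDemo {
    public static void main(String[] args) throws Exception {
        // placeholder jConnect URL, user and password
        Connection conn = DriverManager.getConnection(
                "jdbc:sybase:Tds:localhost:5000/mydb", "sa", "");
        conn.setAutoCommit(false);  // chained mode: ASE keeps a transaction open until commit/rollback
        // ... run a series of statements here ...
        conn.commit();
        conn.setAutoCommit(true);   // unchained mode: every statement commits on its own
        conn.close();
    }
}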

If you happen to use setAutoCommit(true) while executing a stored procedure that was created under the 'set chained on' setting, you will get a nasty error that looks like this: "this procedure may be run only in chained transaction mode, SET CHAINED ON will cause the current session to run in chained transaction mode". I have seen countless posts on the internet, and recently in SAP ecosystem discussion forums, about this error, highlighting the frustration experienced by developers new to ASE. Thankfully, ASE provides a system stored procedure that can set the mode of a stored procedure to "anymode", which allows that stored procedure to be executed under both settings - autocommit true and false. Judging from the number of people who run into this problem, though, it is not widely used. If you do not want to get bitten by this nasty problem at runtime, you can do either of the following two things.

1. If you have access to the underlying database (you need to be the owner of the procedure or have the sa_role), you can run the sp_procxmode command to make the stored procedure in question run in any mode. Assuming the stored procedure is named sp_proc_1, the command would be
execute sp_procxmode sp_proc_1, "anymode"

2. You can handle this at runtime in your JDBC code. Execute the sp_procxmode procedure with only one argument - the name of the stored procedure. The result set tells you which mode the procedure needs. If it requires 'unchained', you execute conn.setAutoCommit(true) and then execute the sp_proc_1 procedure; if it requires 'chained', you call conn.setAutoCommit(false) first.
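Here is a minimal JDBC sketch of that second approach. The connection URL, credentials and procedure name are placeholders, and the way the mode string is picked out of the sp_procxmode result set is an assumption - check the report format on your own server.

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ProcModeCheck {
    public static void main(String[] args) throws SQLException {
        // placeholder jConnect URL and credentials
        String url = "jdbc:sybase:Tds:localhost:5000/mydb";
        try (Connection conn = DriverManager.getConnection(url, "sa", "")) {
            String mode = lookupMode(conn, "sp_proc_1");
            // chained procedures need autocommit off; unchained (or anymode) work with it on
            conn.setAutoCommit(!"chained".equalsIgnoreCase(mode));
            try (CallableStatement cs = conn.prepareCall("{call sp_proc_1}")) {
                cs.execute();
            }
            if (!conn.getAutoCommit()) {
                conn.commit();   // chained mode leaves the transaction open
            }
        }
    }

    // Ask ASE which transaction mode the procedure was registered with.
    static String lookupMode(Connection conn, String procName) throws SQLException {
        try (CallableStatement cs = conn.prepareCall("{call sp_procxmode(?)}")) {
            cs.setString(1, procName);
            try (ResultSet rs = cs.executeQuery()) {
                while (rs.next()) {
                    // scan the row for the mode value instead of assuming a column name
                    for (int i = 1; i <= rs.getMetaData().getColumnCount(); i++) {
                        String value = rs.getString(i);
                        if ("chained".equalsIgnoreCase(value)
                                || "unchained".equalsIgnoreCase(value)
                                || "anymode".equalsIgnoreCase(value)) {
                            return value;
                        }
                    }
                }
            }
        }
        return "unchained";   // assumption: fall back to ASE's default mode
    }
}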

The second option is the better one and will save you headaches later on whenever you need to execute a stored procedure on ASE: regardless of the mode in which the procedure was created, I recommend you implement this strategy. There are other interesting facts about stored procedures, and reasons why stored procedures matter again, that I would like to write about in my next post.


Saturday, June 8, 2013

Adaptive Server IDENTITY quirks

In my previous posts I tried to provide some key information to stimulate your interest in SAP Sybase products. If you are an application developer on any SAP platform, it will pay to know more about Adaptive Server Enterprise (the former Sybase ASE), as the SAP Business Suite has now been successfully ported to it. If you are just starting to explore ASE, you will be in a similar boat to the one I was in about two years ago. As I work on it more and more for various projects, things continue to surprise me. Having come to ASE from other popular databases, some of its features threw me a real curveball.

One such feature is called IDENTITY. It is very similar to an AUTOINCREMENT column in other databases: an automatically incremented counter whose value can be assigned to a column of a table. As you would expect, you can assign a starting value for the column, and ASE increments the value with every insert into the table without any problem. In addition, ASE allows you to bulk add records to any table with an identity column, thereby accepting explicit identity column values.

IDENTITY columns are routinely used as primary keys on a table because of their auto-increment capability. The name 'IDENTITY' suggests that the value of the column in each row is unique, and it is - as long as you leave it to ASE. The strange thing is that ASE does not create a unique constraint on the column, leaving it to the user to keep it unique once IDENTITY_INSERT is turned ON for the table. With that option on, the table owner or DB administrator can insert as many duplicate entries into the IDENTITY column as they want, which in my mind defeats the purpose of having an identity column. IDENTITY_INSERT is turned off by default, so a duplicate value in an identity column is not the normal case. But once you turn it ON for any reason, you have to be extra careful while entering values manually. You can avoid the problem by making the identity column a primary key column, which brings me to the next point: how to assign a primary key to a table in ASE.
There are two ways you can do it: the CREATE TABLE or ALTER TABLE syntax, or the sp_primarykey stored procedure. One thing to keep in mind about sp_primarykey is that it does not create a unique constraint on the column either, which again was a huge surprise to me. Here is what the ASE documentation says about the sp_primarykey stored procedure.
  • Executing sp_primarykey adds the key to the syskeys table. Only the owner of a table or view can define its primary key. sp_primarykey does not enforce referential integrity constraints; use the primary key clause of the create table or alter table command to enforce a primary key relationship.
  • Define keys with sp_primarykey, sp_commonkey, and sp_foreignkey to make explicit a logical relationship that is implicit in your database design. An application program can use the information.
  • A table or view can have only one primary key. To display a report on the keys that have been defined, execute sp_helpkey.
  • The installation process runs sp_primarykey on the appropriate columns of the system tables.
Notice that there is no mention of a 'unique constraint' above. You might believe that a unique constraint is implied because you are making the column a primary key, but that is not so.
So even with the combination of denoting a column as an identity column and making it the primary key through sp_primarykey, you can end up with the unintended consequence of duplicate values in that column. Here is an example.

CREATE TABLE test (id int IDENTITY, descr varchar(10))   -- 'desc' is a reserved word in ASE, so the column is named descr
sp_primarykey test, id
SET IDENTITY_INSERT test OFF                       -- the default: ASE generates the identity value
INSERT INTO test (descr) VALUES ('abcd')           -- ASE assigns id = 1
SET IDENTITY_INSERT test ON                        -- explicit identity values are now accepted
INSERT INTO test (id, descr) VALUES (1, 'fghk')    -- nothing stops the duplicate id

Now you have two rows with the duplicate id value of 1. Neither defining the column as IDENTITY nor marking it as the primary key with sp_primarykey prevented a duplicate value in the id column. This could have a disastrous effect on your business application if it relies on this column to identify a unique row. To avoid this type of situation, follow the guidelines below.

Always define primary keys through the CREATE TABLE or ALTER TABLE syntax.
Check the uniqueness of the identity column after you insert values into it manually, or create a unique constraint on the identity column right after table creation.
If you turn IDENTITY_INSERT on, turn it off again in the same transaction, so that no other transaction in the same session can add a duplicate value to the identity column. It should be somewhat reassuring to developers that the change to the IDENTITY_INSERT setting is not persistent; it is lost once the session ends.
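To illustrate the first guideline, here is a hedged JDBC sketch (server, database, table and column names are made up, and it assumes your ASE version accepts an int identity column as in the earlier example) showing that a primary key declared in the CREATE TABLE statement rejects the duplicate that slipped through above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class IdentityPkDemo {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:sybase:Tds:localhost:5000/testdb";   // placeholder server and database
        try (Connection conn = DriverManager.getConnection(url, "sa", "");
             Statement stmt = conn.createStatement()) {
            // declaring the key in the DDL makes ASE build a unique index on id
            stmt.executeUpdate("CREATE TABLE test2 (id int IDENTITY PRIMARY KEY, descr varchar(10))");
            stmt.executeUpdate("INSERT INTO test2 (descr) VALUES ('abcd')");   // ASE assigns id = 1
            stmt.execute("SET IDENTITY_INSERT test2 ON");
            try {
                stmt.executeUpdate("INSERT INTO test2 (id, descr) VALUES (1, 'fghk')");
            } catch (SQLException e) {
                // unlike sp_primarykey, the declared key rejects the duplicate id
                System.out.println("Duplicate id rejected: " + e.getMessage());
            } finally {
                stmt.execute("SET IDENTITY_INSERT test2 OFF");
            }
        }
    }
}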

I hope this important point helps you use IDENTITY carefully and saves you some troubleshooting in case of erroneous results from your business application. Happy development.





Saturday, February 9, 2013

ODATA for Dummies

In a world full of open source and open standards, what's so important about one more protocol or specification, right? Why should you know about it? Well, technology professionals are required to know far more technologies today than in the past because of the rapidly shifting software landscape. Gone are the days when you could master one technology and keep your job forever. At any moment you could be asked to look into some new technology and be expected to work on it or create a proof of concept to evaluate its fit for your current project. ODATA may be one such technology you have not worked with or even heard of, but it is worth knowing about.

If you already know the ODATA specification and how it works, you can stop reading. This is high-level but useful information on ODATA, meant for newcomers.

Promoted by Microsoft (do I hear the Microsoft haters leaving already?), ODATA is a protocol for accessing information over the web. ODATA defines a standard way of exposing any information through the industry-standard HTTP protocol, and it follows the architectural style popularly known as REST. You can get the details of the protocol and much more here. If you want to expose your data to your clients using ODATA, you will need a much deeper understanding of the specification. However, if you just want to consume the data as a client, you can be up and running in a few minutes to a few hours, depending on your level of expertise. This post will have you create a sample and test it within minutes.

Let's take an example from a company we all know about - Netflix. It uses an ODATA service to publish its movie catalog. Viewing the movie catalog or searching for a title can be done by simply accessing certain URLs from your browser; you just need to know how to build further URLs based on the information you get back from a URL. ODATA provides the response to your HTTP request in a standard format, either Atom (XML) or JSON. Try these URLs yourself and see how easy it is to get to the information; ODATA services are consumed through URLs that are intuitive and easy to understand.

http://odata.netflix.com will show what type of information you can get from this website. The URL resolves to http://odata.netflix.com/v2/Catalog/, and the results show that you can get a list of Genres, Titles and Languages. Now, suppose you want to see all the titles Netflix has; simply add 'Titles' to the previous URL: http://odata.netflix.com/v2/Catalog/Titles. This returns all the titles and their associated information. If you want to know all the details (metadata) about all the entities (collections), simply type in this URL: http://odata.netflix.com/v2/Catalog/$metadata. This returns information on the collections, the entities they are made up of and the attributes of each entity. That is how easy it is to get to the basic information. Now you are ready to dig deeper.

The data that shows up on your screen may be stored in a table in a relational database, in an XML database or in a file on the file system - it does not matter. The information is available to you through the HTTP GET method (your browser sends a 'GET' request to the endpoint). Irrespective of the type and location of the data storage, a specific piece of data is identified by a unique key. If you know the unique identifier of the entity you want to access, you can use it to get more information. The following example shows how Titles are uniquely identified by an 'id' field; the fact that id is the unique identifier of a Title comes from the metadata query we executed earlier. Spend a few minutes reviewing the results of the metadata query.
http://odata.netflix.com/v2/Catalog/Titles('13bLK').

When you don't know the unique identifier, you can always search titles based on a name (Name is an attribute of the Title entity, as revealed by the metadata query).
http://odata.netflix.com/v2/Catalog/Titles?$filter=(Name%20eq%20'Rod%20Stewart')

The total number of titles Netflix owns is easily found using the following URL; as of 2/8/2013 the count is 160050. ODATA defines many such functions and query options - count, length, replace, trim and so on.
http://odata.netflix.com/v2/Catalog/Titles/$count

So this is what the ODATA specification provides: it specifies the format of the URL and the functions and operators you can use in the URL to get to the data you want. The REST-based architecture exposes all the information through HTTP methods - so far we have only used 'GET'.
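If you would rather do this from code than from the browser, a plain HTTP GET is all a client needs. Here is a minimal Java sketch with no OData library involved; it simply fetches the title-count URL used above (assuming the Netflix service is reachable from your machine).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ODataGet {
    public static void main(String[] args) throws Exception {
        // the $count resource returns a plain number rather than an Atom/JSON feed
        URL url = new URL("http://odata.netflix.com/v2/Catalog/Titles/$count");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // prints the raw response body
            }
        } finally {
            conn.disconnect();
        }
    }
}

Swap in any of the other catalog URLs from this post to see the Atom or JSON payloads instead.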

Now a little bit on how providers like Netflix create these services. If the service provider exposes mechanisms for creating, updating and deleting data through an ODATA producer, and the information being exposed lives in a relational database, the producer needs to map the standard CRUD operations to HTTP methods: create maps to POST, delete to DELETE, update to PUT and select to GET. The essential tasks of the producer are to parse the request, get the HTTP method, get the content, call the back end where the data is stored, perform the mapped operation and convert the retrieved data to Atom or JSON format before sending it back to the client. Vendors like Microsoft and Google (odata4j) offer libraries that make it easy to create your own producer. If you just want to consume ODATA services, there are many client libraries, such as the popular datajs.
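Going back to the producer side for a moment, the heart of that mapping is just a dispatch on the HTTP verb. The toy Java sketch below shows only that mapping; the class and method names are mine and are not part of any OData library.

public class ODataVerbMapping {
    // Map an HTTP verb to the SQL operation an OData producer would run against
    // a relational back end (POST->INSERT, GET->SELECT, PUT->UPDATE, DELETE->DELETE).
    static String toSqlVerb(String httpMethod) {
        switch (httpMethod) {
            case "POST":   return "INSERT";
            case "GET":    return "SELECT";
            case "PUT":    return "UPDATE";
            case "DELETE": return "DELETE";
            default: throw new IllegalArgumentException("Unsupported HTTP method: " + httpMethod);
        }
    }

    public static void main(String[] args) {
        for (String verb : new String[] {"POST", "GET", "PUT", "DELETE"}) {
            System.out.println(verb + " -> " + toSqlVerb(verb));
        }
    }
}

A real producer would, of course, also parse the URL into an entity set and key, build the actual SQL, and serialize the result as Atom or JSON.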

Now you are ready to create your own client.
Here is a snippet of code I got from the datajs tutorial. Save it as an HTML document and open it in any modern browser; you will see all the genres of all the content from Netflix. Before running the sample, don't forget to download the datajs-1.1.0.min.js file from here and save it to your c:\datajs directory.




I hope this post gives you enough information to try things on your own. I also hope to write another post on the same subject some time later. ODATA is actively used by companies such as Microsoft, SAP, Netflix, Facebook and eBay. Only time will tell whether this standard becomes a widely used technology, but it has already gathered enough of a following to deserve our attention; see the Odata.org website for more. Good luck. Your comments are welcome, and please feel free to share this post if you like it.

Tuesday, September 25, 2012

SAP Sybase ASE - An introduction for a newbie

Since the day of its acquisition of Sybase, SAP has made great strides in porting its Business Suite to Sybase ASE (Adaptive Server Enterprise, the Sybase relational database). Last week's announcement from SAP is just one of many milestones that symbolize its commitment to Sybase products and highlight the successes: the highest number in the SD benchmark on a 2-core HP/Linux machine was achieved by ASE. Check out the details here.

As SAP works towards its stated goal of becoming a premier database company, it is only natural that more and more SAP practitioners and enthusiasts become interested in ASE and look to use it not only for SAP applications but also for other in-house applications. For a Java programmer, familiarity with the JDBC API is enough to use a standard RDBMS, but once in a while you need to connect directly to the underlying database to debug an issue. Unlike production systems, where you can depend on a DBA for support, in a development environment you are often left to deal with problems yourself. Quirks or peculiarities in an unfamiliar relational database can be a source of frustration: features that make total sense in one database may not work the same way in another. It's those little things that can drive you crazy when you are dealing with tight deadlines.

I have been working with SAP Sybase Adaptive Server Enterprise for more than a year now, so I feel comfortable sharing some bits and pieces I have put together over that period. Much of this information is available on the web or in the product documentation, but scouring and digging through it takes time, and some of it turns up in the most unlikely places. Having worked with popular databases other than ASE before, these are the features that seemed interesting or a bit different, or that I came across most often. This is not a comprehensive list of ASE features, but it should save a newcomer some time in getting off the ground. So here it goes.
First, SAP allows free downloads of developer versions of most of the Sybase products. You can download the one you want from here. Consider revisiting this post before you install ASE.

1. Concept of devices and databases in ASE - They resemble the concepts of databases and schemas in Oracle.
2. Finding the version of the database you are running - Sybase products, including ASE, are evolving fast to cater to the SAP ecosystem, and major features are added fairly regularly. When something isn't working as expected, the first thing you may want to do before calling tech support is to find the version of the database you are running and refer to the product documentation here.
  • dataserver -v 
3. Sybase environment variables are set in the sybase.csh file on a Linux installation. The server startup/shutdown scripts can be found in the following directory:
  • your Sybase home directory/ASE-15_0/install
4. If you believe that an installed instance is not responding to your application code, the most likely reason is that the thread is kept waiting due to lack of log space. To free up space you can use the following commands; refer to the Sybase documentation on what each one does, but if you are not concerned about losing the running transaction, either of them should work. You should see your transaction go through as soon as you run the command.
  • DUMP TRANSACTION your_database_name WITH TRUNCATE_ONLY
  • DUMP TRANSACTION your_database_name with no_log
Before you try the above commands, you may want to look at all the running connections/threads to ASE and their status by executing the sp_who stored procedure.
5. One of the most important things you need to decide while installing ASE is the page size. All data from one row of a table is colocated in one page to improve performance, and once defined, you cannot change the page size of an installed ASE instance. Below is the relationship between page sizes and the corresponding maximum possible database sizes.
  • 2K page size - 4 TB
  • 4K page size - 8 TB
  • 8K page size - 16 TB
  • 16K page size - 32 TB
If you are not sure what page size the installed ASE instance is using, then run the following command from isql utility, the client used to connect to ASE.
  • select @@maxpagesize
6. If you use Hibernate for data access, you may want to visit hibernate-sybase integration page here. Spending a few minutes on the required ASE settings for hibernate could  save you from some frustration later on.
7. Once the installation is complete you can run the Sybase Control Center application, which provides a GUI for all your ASE instances. Alternatively, you can use isql, the command-line SQL interface, to interact with ASE. The super admin user name is 'sa' and the password can initially be left blank.
8. Use the following stored procedures to add users; note that sp_addlogin adds only a login, not a user.
  • sp_addlogin - adds only a login name 
  • sp_adduser - adds a new user.
9. Make sure to assign a role with adequate privileges to the new user by executing the following procedure.
  • sp_role "grant", some_role, username 
10. Before you try any SQL you may want to check the 'reserved words' of ASE with the following command. I recommend going through this list if you are porting an existing application to ASE.
  • select name from master..spt_values where type = "W"
11. It's a common practice to add an autoincrement column to a table and use it as a primary key. Autoincrement is achieved in ASE by denoting a column as IDENTITY. ASE does not allow more than one autoincrement column in a table, and there is no 'sequence' as in Oracle. ASE allows adding user-defined values to an autoincrement column after you run the following command.
  • set identity_insert tablename on
12. If you have used MS SQL Server, you will find ASE's T-SQL very familiar, as both share the same roots. T-SQL plays the role that PL/SQL does in Oracle.
13. ASE's native JDBC driver is called jConnect; jTDS, an open source driver, can also be used. jConnect supports changing connection properties through the connection URL - just append 'property_name=property_value' to it. For example,
  •  'jdbc:sybase:tds:servername:portnumber/database?QUOTED_IDENTIFIER=ON'
allows quoted identifiers in a SQL statement like the one below.
  • SELECT * from  "dbo". "User_Table"
If you want to run an existing relational database application on ASE, it is really important to pay attention to this connection property. However, if you are building all SQL as 'PreparedStatements', then quoted strings are handled correctly even when QUOTED_IDENTIFIER is not explicitly turned on. A short JDBC sketch after this list shows the property in use.
14. Also important to note is the difference in handling table aliases: they are allowed in SELECT statements but not in UPDATE or DELETE. For example, the first SQL below is valid but the second returns an error.
  • select * from tablename t1 
  • update tablename t1 set columnname=value where column2=value2
     
15. The Sybase jConnect driver depends on some key metadata information to work correctly. In the absence of this information you may receive a SQL exception with a description like 'missing metadata'. If this happens, it means you missed a step during ASE installation - the step that installs the metadata information in the master database. The error can be eliminated by running a stored procedure after the installation; refer to the documentation here.
16. 'Select for Update' was introduced in ASE 15.7, and you would expect it to work by default in a new install. Alas, no such luck. I highly recommend reading this document before you try the feature. You need the 'datarows' locking scheme on the table the select is performed on, and you can turn datarows locking on at the server level (as the default) or on a single table using the following commands.
  • sp_configure "lock scheme", 0, datarows
  • alter table tablename lock datarows
17. ASE is case sensitive by default. To display case sensitivity, sort order and character set settings, use sp_helpsort stored procedure.
18. The following information is useful if you want to programmatically access metadata from ASE. The sysobjects table in each database holds information on all entities - tables, indexes, views and so on. To display all user objects, use the following command.
  • Select * from sysobjects where type = 'U'
The syscolumns table holds information on all columns of all tables, and the sysreferences and sysconstraints tables hold information on relationships between tables, such as foreign key constraints.
19. By joining the above tables you can get pretty much all the information about a user table. Unfortunately, there is no simple command like Oracle's 'describe table'. You can use the following stored procedure to get all the details about a table, but it returns much more than just the column names and their types.
  • sp_help tablename
If you want just the column names of a user table, use the following SQL.
  • select sc.name from syscolumns sc, sysobjects so where sc.id=so.id and so.name='tablename'
20. Unlike Oracle, ASE truncates trailing blanks when storing variable-length varchar values. So be careful if your code compares strings pulled from varchar columns.
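As promised in item 13, here is a short JDBC sketch that puts a few of these items together: the jConnect URL with a connection property (item 13), a quoted-identifier query (item 13) and the syscolumns/sysobjects join from item 19. Server name, port, database, credentials and the table name are placeholders, and the sketch assumes a recent jConnect driver on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class AseQuickStart {
    public static void main(String[] args) throws SQLException {
        // jConnect URL with a connection property appended, as in item 13
        String url = "jdbc:sybase:Tds:localhost:5000/mydb?QUOTED_IDENTIFIER=ON";
        try (Connection conn = DriverManager.getConnection(url, "sa", "");
             Statement stmt = conn.createStatement()) {
            // quoted identifiers are legal because of the connection property
            try (ResultSet rs = stmt.executeQuery("select * from \"dbo\".\"User_Table\"")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
            // item 19: list the column names of a user table from the system catalogs
            String colSql = "select sc.name from syscolumns sc, sysobjects so "
                    + "where sc.id = so.id and so.name = 'User_Table'";
            try (ResultSet rs = stmt.executeQuery(colSql)) {
                while (rs.next()) {
                    System.out.println("column: " + rs.getString(1));
                }
            }
        }
    }
}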

I hope this list helps you install and navigate ASE in the early stages. I mentioned these items in particular because I had to use them more often than any others. Once you spend a little time on ASE, you will of course come across other things that are even more useful. I will try to list those in my next post.

Sunday, April 8, 2012

Understanding Aleri - Complex Event Processing - Part II


Aleri, the complex event processing platform from Sybase, was reviewed at a high level in my last post.

This week, let's review the Aleri Studio, the user interface to the Aleri platform, and the use of the pub/sub API, one of many ways to interface with the platform. The Studio is an integral part of the platform and comes packaged with the free evaluation copy. If you haven't already done so, please download a copy from here. The fairly easy installation process gets you up and running in a few minutes.

The Aleri Studio is an authoring platform for building the model that defines the interactions and sequencing between various data streams. It can also merge multiple streams to form one or more new streams. With this Eclipse-based studio, you can test the models you build by feeding them test data and monitor the activity inside the streams in real time. Let's look at the various types of streams you can define in Aleri and their functionality.

Source Stream - Only this type of stream can handle incoming data. The operations that can be performed on the incoming data are insert, update, delete and upsert. Upsert, as the name suggests, updates the data if the key defining a row is already present in the stream; otherwise it inserts a new record.

Aggregate Stream - This stream creates a summary record for each group defined by a specific attribute, providing functionality equivalent to GROUP BY in ANSI SQL.

Copy stream - This stream is created by copying another stream but with a different retention rule.

Compute Stream - This stream allows you to use a function on each row of data to get a new computed element for each row of the data stream.

Extend Stream - This stream is derived from another stream by adding column expressions.

Filter Stream - You can define a filter condition for this stream. Just like extend and compute streams, this stream applies filter conditions on other streams to derive a new stream.

Flex Stream - Significant flexibility in handling streaming data is achieved through custom coded methods. Only this stream allows you to write your own methods to meet special needs.

Join Stream - Creates a new stream by joining two or more streams on some condition. Both inner and outer joins can be used to join streams.

Pattern Stream - Pattern matching rules are applied with this stream

Union Stream - As the name suggests, this stream joins two or more streams with the same row data structure. Unlike the join stream, it includes all the data from all the participating streams.

By using some of these streams and the pub API of Aleri, I will demonstrate the segregation of the Twitter live feed into two different streams. The live feed is consumed by a listener from the Twitter4J library; if you just want to try the Twitter4J library first, please follow my earlier post 'Tracking user sentiments on Twitter'. The data received by the Twitter4J listener is fed to a source stream in our model using the publication API from Aleri. In this exercise we will separate tweets based on their content. Building on the example from my previous post, we will divide the incoming stream into two streams: one receives any tweets that contain 'lol' and the other receives tweets with a smiley ":)" in the text. First, let's list the tasks we need to perform to make this a working example.
  1. Create a model with three streams
  2. Validate the model is error free
  3. Create a static data file
  4. Start the Aleri server and feed the static data file to the stream manually to confirm correct working of the model.
  5. Write java code to consume twitter feed. Use the publish API to publish the tweets to Aleri platform.
  6. Run the demo and see the live data as it flows through various streams. 

Image 1 - Aleri Studio - the authoring view
This image is a snapshot of the Aleri Studio with the three streams: the one on the left, named "tweets", is a source stream, and the two on the right, named "lolFilter" and "smileyFilter", are filter streams. The source stream accepts incoming data while the filter streams receive the filtered data. Here is how I defined the filter condition for the first one:
like (tweets.text, '%lol%')
Here, tweets is the name of the stream and text is the field we are interested in; '%lol%' means: select any tweets that contain the string 'lol'. Each stream has only two fields, id and text, which map to the id and the text message sent by Twitter. Once you define the model, you can check it for errors by clicking on the check mark in the ribbon at the top. Errors, if any, show up in the panel at the bottom right of the image. Once your model is error free, it's time to test it.


Image 2 - Aleri Studio - real time monitoring view

The following image shows the test interface of the Studio. Try running your model with a static data file first. The small red square at the top indicates that the Aleri server is currently running. The console window at the bottom right shows server messages, such as successful starts and stops. The Run-Test tab in the left pane is where you pick a static data file to feed the source stream. The pane on the right shows all the currently running streams and the live data processed by them.

The image below shows the format of the data file used to test the model.



The next image shows the information flow.

Image 3 - Aleri Publishing - Information flow

The source code for this exercise is at the bottom.
Remember that you need to have the Twitter4J library on the build path and the Aleri server running before you run the program. Because I have not added a timer to the execution thread, the only way to stop the run is to abort it. For brevity and to keep the code short, I have removed all exception handling and logging. The code uses only the publishing side of Aleri's pub/sub API; I will demonstrate the subscribing side in my next blog post.
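For reference, here is a minimal sketch of the Twitter4J side of that program. The publishToAleri method is a hypothetical stand-in for the Aleri publication API calls, and the Twitter credentials are assumed to come from a twitter4j.properties file on the classpath.

import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

public class TweetFeeder {
    public static void main(String[] args) {
        // credentials are read from twitter4j.properties
        TwitterStream stream = new TwitterStreamFactory().getInstance();
        stream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                // push the id and text into the "tweets" source stream of the model
                publishToAleri(status.getId(), status.getText());
            }
        });
        stream.sample();   // start the live feed; abort the process to stop it
    }

    // Placeholder: in the real program this method builds an Aleri publication
    // message with the two fields (id, text) and sends it to the running server.
    static void publishToAleri(long id, String text) {
        System.out.println(id + " : " + text);
    }
}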

This blog intends to provide something simple but useful to the developer community.  Feel free to leave your comment and share this article if you like it.



Friday, March 23, 2012

Aleri - Complex Event Processing - Part I


Algo Trading - a CEP use case
Sybase's Aleri streaming platform is one of the more popular products in the CEP market segment. It is used in Sybase's trading platform, the RAP edition, which is widely used in capital markets to manage positions in a portfolio. Today, in the first of a multi-part series, I want to provide an overview of the Aleri platform, with some code samples where useful. In the second part, I will present the Aleri Studio, the Eclipse-based GUI that simplifies the task of modeling a CEP workflow and monitoring the Aleri server through a dashboard.





Fraud Detection - Another CEP use case

In my previous blog post on Complex Event Processing, I demonstrated the use of Esper, the open source CEP software, and the Twitter4J API to handle a stream of tweets from Twitter. A CEP product is much more than handling just one stream of data, though. A single stream of data can easily be handled through standard asynchronous messaging platforms and does not pose very challenging scalability or latency issues. But when it comes to consuming more than one real-time stream of data, analyzing it in real time, and correlating events across streams, nothing beats a CEP platform. The sources feeding a streaming platform can vary in speed, volume and complexity. A true enterprise-class CEP should deal with various real-time, high-speed data like stock tickers and slower but voluminous offline batch uploads with equal ease. Apart from providing standard interfaces, a CEP should also provide an easy programming language to query the streaming data and to generate continuous intelligence through features such as pattern matching and snapshot querying.

Sybase Trading Platform - the RAP edition
To keep it simple and at a high level, CEP can be broken down into three basic parts. The first is the mechanism to grab/consume source data. Next is the process of investigating that data and identifying events and patterns. The last is interacting with target systems by providing them with actionable items. The actionable events take different forms and formats depending on the application you are using CEP for: an action item could be selling an equity position based on calculated risk in a risk-monitoring application, flagging potential fraud events in a money-laundering application, or alerting on a catastrophic event by reading thousands of sensors in a chemical plant. There are literally thousands of scenarios where a manual, offline inspection of data is simply not an option. After you go through the following section, you may want to try Aleri yourself. This link http://www.sybase.com/aleriform takes you directly to the Aleri download page; an evaluation copy valid for 90 days is freely available from Sybase's official website. A good amount of documentation, an excellent tutorial and some sample code on the website should help you get started quickly.

If you are an existing user of any CEP product, I encourage you to compare Aleri with that product and share your findings with the community or comment on this blog. By somewhat dated estimates, Tibco is the biggest CEP vendor in the market; I am not sure how much market share another leading product, StreamBase, has. There is also a webinar on Youtube.com that explains CEP benefits in general and some key features of StreamBase in particular. For newcomers, it serves as an excellent introduction to CEP and to a capital markets use case.

An application on the Aleri CEP platform is built by creating a model using the Studio (the GUI), using Splash (the language), or using the Aleri Modeling Language (ML) - the form a model finally takes before it is deployed.

Following is a list of the key features of Splash.

  • Data Types - Supports standard data types and XML. Also supports 'typedef' for user-defined data types.
  • Access Control - granular access control enables access to a stream or to modules (containing many streams).
  • SQL - another way of building a model. Building a model in the Aleri Studio can take longer because of its visual paradigm; someone proficient in SQL should be able to do it much faster using Aleri SQL, which is very similar to the regular SQL we all know.
  • Joins - supported joins are Inner, Left, Right and Full.
  • Filter expressions - include Where, Having and Group Having.
  • ML - Aleri SQL produces a data model in the Aleri Modeling Language (ML). A proficient ML user might use only ML (in place of the Aleri Studio and Aleri SQL) to build a model.
  • The pattern matching language - includes constructs such as 'within' to indicate an interval (sliding window), 'from' to indicate the stream of data, and the interesting 'fby', which indicates a sequence (followed by).
  • User defined functions - the user-defined function interface provided in Splash allows you to create functions in C++ or Java and use them within a Splash expression in the model.


Advanced pattern matching capabilities are explained through examples here. The following code segments and their explanations are taken directly from Sybase's documentation on Aleri.
The first example checks to see whether a broker sends a buy order on the same stock as one of his or her customers, then inserts a buy order for the customer, and then sells that stock. It creates a “buy ahead” event when those actions have occurred in that sequence.

within 5 minutes
from
BuyStock[Symbol=sym; Shares=n1; Broker=b; Customer=c0] as Buy1,
BuyStock[Symbol=sym; Shares=n2; Broker=b; Customer=c1] as Buy2,
SellStock[Symbol=sym; Shares=n1; Broker=b; Customer=c0] as Sell
on Buy1 fby Buy2 fby Sell
{
if ((b = c0) and (b != c1)) {
output [Symbol=sym; Shares=n1; Broker=b];
}
}


This example checks for three events, one following the other, using the fby relationship. Because the same variable sym is used in three patterns, the values in the three events must be the same. Different variables might have the same value, though (e.g., n1 and n2.) It outputs an event if the Broker and Customer from the Buy1 and Sell events are the same, and the Customer from the Buy2 event is different.

The next example shows Boolean operations on events. The rule describes a possible theft condition, when there has been a product reading on a shelf (possibly through RFID), followed by a non-occurrence of a checkout on that product, followed by a reading of the product at a scanner near the door.

within 12 hours
from
ShelfReading[TagId=tag; ProductName=pname] as onShelf,
CounterReading[TagId=tag] as checkout,
ExitReading[TagId=tag; AreaId=area] as exit
on onShelf fby not(checkout) fby exit
output [TagId=tag; ProductName=pname; AreaId=area];

The next example shows how to raise an alert if a user tries to log in to an account unsuccessfully three times within 5 minutes.

from
LoginAttempt[IpAddress=ip; Account=acct; Result=0] as login1,
LoginAttempt[IpAddress=ip; Account=acct; Result=0] as login2,
LoginAttempt[IpAddress=ip; Account=acct; Result=0] as login3,
LoginAttempt[IpAddress=ip; Account=acct; Result=1] as login4
on (login1 fby login2 fby login3) and not(login4)
output [Account=acct];

People wishing to break into computer systems often scan a number of TCP/IP ports for an open one, and attempt to exploit vulnerabilities in the programs listening on those ports. Here’s a rule that checks whether a single IP address has attempted connections on three ports, and whether those have been followed by the use of the “sendmail” program.

within 30 minutes
from
Connect[Source=ip; Port=22] as c1,
Connect[Source=ip; Port=23] as c2,
Connect[Source=ip; Port=25] as c3,
SendMail[Source=ip] as send
on (c1 and c2 and c3) fby send
output [Source=ip];

Aleri provides many interfaces out of the box for easy integration with source and target systems. Through these interfaces/adapters the Aleri platform can communicate with standard relational databases, messaging frameworks like IBM MQ, sockets and file system files. Data in various formats like CSV, FIX, Reuters market data, SOAP, HTTP and SMTP is easily consumed by Aleri through standardized interfaces.

Following are available techniques for integrating Aleri with other systems.


  • A pub/sub API provided in Java, C++ and .NET - a standard publish/subscribe mechanism
  • A SQL interface with SELECT, UPDATE, DELETE and INSERT statements, used through ODBC and JDBC connections
  • Built-in adapters for market data and FIX
In the next part of this series we will look at the Aleri Studio, the GUI that helps us build a CEP application the easy way.

Tuesday, March 13, 2012

Sybase IQ 15.4 - for big data analytics


The big data analytics platform Sybase IQ 15.4 Express Edition is now available for free. This article introduces Sybase IQ's big data features and shares some valuable resources with you. Sybase IQ, with an installation base of more than 4,500 clients all over the world, has long been a leading columnar database for mission critical analytic functions. With new features like native support for MapReduce and in-database analytics, it positions itself as a premium offering for big data analytics.

Columnar databases have been in existence for almost two decades now. Row-oriented databases for OLTP transactions and columnar databases for analytics more or less met the requirements of organizations, and extracting, transforming and loading (ETL) transactional data into analytic platforms has been a big business. However, the recent focus on big data and its nature (semi-structured, unstructured) is changing the landscape of analytic platforms and the way ETL is used. The boundaries between OLTP and analytic platforms are blurring somewhat with the advent of real-time analytics.

Due in large part to the tremendous growth of ecommerce in recent years, much of the developer talent pool was attracted to low-latency, high-throughput transactional systems. Without a doubt, the same pool is now gravitating towards the acquisition, management and analysis of semi-structured data. A host of start-ups and established technology companies are creating new tools, methods and methodologies in an effort to address the data deluge we are generating through our use of social networks. Even though there are myriad options available to handle big data and analytics, some clear trends and underlying technologies have come to the fore: Hadoop, the open source implementation for batch operations on big data, and in-memory analytic platforms for real-time business intelligence are two of the most significant. Established vendors like Oracle, IBM and Informatica have been adding new products or updating their existing offerings to meet the new demands in this space.

Business intelligence gathered from only one type of data - transactional, semi-structured from social media, or machine generated - is not good enough for today's organization. It is really important to glean information from each type of source and correlate one with the other to present the comprehensive, accurate intelligence that critical decision support systems need. So it is imperative that vendors provide a well-integrated set of tools that supports structured, semi-structured and totally unstructured data like audio and video. "Polyglot persistence", a recent buzzword popularized by Martin Fowler, addresses the need for and nature of different types of data stores. You may have multiple systems for storage/persistence, but ultimately an enterprise needs a comprehensive view of its business through one lens. This is what Sybase IQ, the traditional columnar database product, is trying to provide by offering native support for Hadoop and MapReduce. There are other significant enhancements in the latest enterprise edition of Sybase IQ. What makes this product attractive to the developer community is its free availability. Please note, it is not an evaluation copy but a full-function enterprise edition, with the only restriction being a database size limit of 5 GB. This blog post is a quick summary of IQ's features and aims to point you to important resources that will help you try it out.

For the uninitiated, this blog entry by William McKnight provides an introduction to the concepts of columnar databases.

The following paragraph, taken directly from Sybase's web site, sums up the most important features.

Sybase® IQ 15.4 is revolutionizing “Big Data” analytics breaking down silos of data analysis and integrating it into enterprise analytic processes. Sybase IQ offers a single database platform to analyze different data – structured, semi-structured, unstructured – using different algorithms. Sybase IQ 15.4 expands these capabilities with the introduction of a native MapReduce API, advanced and flexible Hadoop integration, Predictive Model Markup Language (PMML) support, and an expanded library of statistical and data mining algorithms that leverage the power of distributed query processing across a PlexQ grid.

Some details on these features are in order.

  • User Defined Functions Enabling MapReduce - One of the best practices in a three-tier application architecture is to physically and logically separate the business logic from the data. But bringing data to the business logic layer and then moving it back to the persistence layer adds latency. Mission-critical applications typically implement many strategies to reduce that latency, but the underlying theme is always the same: keep the business logic and the data as close to each other as possible. Sybase IQ's native C++ API allows developers to build user defined functions that implement proprietary algorithms. This means you can build MapReduce-style functions right inside the database that can yield 10X performance improvements. It also means you can use ordinary SQL from the higher-level business logic, keeping that layer simpler while taking advantage of MapReduce-based parallel processing for higher performance. The MapReduce jobs are executed in parallel on a grid of servers, called a Multiplex or PlexQ in Sybase IQ parlance.
  • Hadoop Integration - We discussed the need for analyzing and correlating semi-structured data with structured data earlier. Sybase IQ provides four different ways in which this integration can happen: Hadoop is used to extract data points from unstructured or semi-structured data, which are then used with OLTP data for further analysis. Client-side federation, ETL-based integration, query federation and data federation are the ways in which Hadoop integration occurs. You can request more in-depth information here.
  • PMML Support - PMML (Predictive Model Markup Language) support allows the user to create predictive models using popular tools like SAS and R. These models can be executed in an automated fashion, extending the already powerful analytics platform further. A plug-in from Zementis provides the PMML validation and its transformation to Java UDFs.
  • R Language Support - SQL is woefully inadequate when you need statistical analysis on structured data, but its simplicity and wide adoption make it an attractive query tool. The RJDBC interface in IQ allows an application to use the R programming language to perform statistical functions. R is a very popular open source language used in many financial applications today. Please read my blog entry 'Programming with R, it's super' for more information on R.
  • In-Database Analytics - Sybase IQ, in its latest version 15.4, uses an in-database analytics library called DB Lytix from Fuzzy Logix. This analytics engine can perform advanced analytics through simple SELECT and EXECUTE statements, and Sybase claims that some of these analytical functions leverage the MapReduce API in certain data mining algorithms. DB Lytix, according to Fuzzy Logix's website, supports mathematical and statistical functions, Monte Carlo simulations (univariate and multivariate), data mining and pattern recognition, principal component analysis, linear regression, logistic regression, other supervised learning methods, and clustering.

Sybase provides detailed documentation on the features of IQ on its website. If you would like to try Sybase IQ, use this direct link. Remember, this edition is not an eval copy; it is a full-featured IQ edition limited only by a 5 GB database size.

As you can imagine, Sybase is not the only company offering big data solutions; this gigaom article lists a few others as well. With so many good options to choose from, users have to look at factors other than just the technology to make a selection: the reputation of the brand, its standing in the market, the extent of its user base and the ecosystem around the product. In that regard, Sybase IQ scores very high, which is one reason for its position in Gartner's leadership quadrant. Sybase IQ has been a leading columnar database since the '90s and has established a robust ecosystem around it. Sybase's other products - PowerDesigner, the leading data modeling workbench, SAP BusinessObjects for reporting and analytics, and Sybase Control Center for administration and monitoring of IQ - support IQ and provide one of the most comprehensive analytics platforms in the industry.