We know how ORM frameworks like Hibernate, JPA make a developer's life easier. Given a choice, we would always code in objects for manipulating data. But, once in a while you come across a use case where JDBC code clearly trumps the Hibernate in terms of performance. I came across such a use case, where I was required to parse thousands of sql statements and input the data into relational tables.What I found was even after I followed the Hibernate's best practices, Hibernate code was still twice slower than JDBC. This post is to generate conversation and to invite expert opinion on how to determine when to not use Hibernate. Does my use case a perfect example of this? Do you know of any other use cases?
Typically Hibernate's best practices on improving performance include following techniques related to my use case.
And so forth .......
I found that the last option, flushing the session often was most helpful for my project. But I wanted to post this information to invite help, guidance, advice from Hibernate experts on performance tuning in general, and to verify if what I did (use JDBC) was the best option for the fastest performance.
To more curious folks, here are the top 5 urls in google search on 'hibernate performance tuning', but none of those discuss the use case I have here.
http://www.javadev.org/files/Hibernate%20Performance%20Tuning.pdf
http://java.dzone.com/articles/hibernate-performance-tuning
http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html
http://www.javaperformancetuning.com/news/interview041.shtml
http://arnosoftwaredev.blogspot.com/2011/01/hibernate-performance-tips.html
The application where I had to use JDBC in favor of Hibernate due to performance reasons was really very simple. Continuously streamed SQL data was parsed and went into three relational tables - parent, child and a child of child. Let's call them tables A, B and C respectively for simplicity. There was 1-1 relation between A and B while C had variable number of rows for each row of table B.
The code I wrote with Hibernate added a sample of 100K rows to A, 100K rows to B and around 200K rows to table C. There was no need to read the data while it was being written. The 'for loop' I wrote repeated 100K times with each instance of the loop adding row to table A first, followed by table B and table C next. Before debugging my code (Hibernate) due to very very sluggish performance, the speed of the program progressively decreased and almost started crawling before it threw 'out of memory' error. I had to add session.flush() at the end of each cycle of the loop. Adding session.flush() statement greatly improved performance of writes and because there was no need to read the data at the time of writing session flushing did not cost performance penalty.
Typically in any real life OLTP applications some user sessions will always be reading the data while other user sessions are actively writing it. Hibernate's best practices suggest the use of second level cache for better performance and in most cases it works well when multiple sessions are adding data. Remember, Hibernate caches recently added data automatically. Second level cache is involved when multiple sessions are writing and reading data but in the scenario I discussed, I had only one session that was writing and no session was reading the data. Hence turning off or on the second level cache had no impact on performance at all.
But a slightly modified scenario could potentially be a real life use case. What are the Hibernate best practices to achieve optimum performance in such a use case. The modified scenario is to have one session writing the data and multiple sessions reading the data. For example, results and statistics of Summer Olympics being fed to a data store by some feed engine, while participants and viewers are viewing the data in real time. What steps could we take to get the best read/write performance using Hibernate. Please consider for discussion sake that data from feed engine is not a bulk upload but more of a live stream data accepted using Hibernate code (I guess it's not really a best practice). Do we turn second level cache off, do we flush session on each write cycle, batch the write operations or?
Of course, the performance tests using simulated work load provides the best answer, but as an architect how do we propose a solution that ultimately is validated by performance tests. Following table shows the performance numbers I got in my test runs with different configurations. The code was really simple and it performed operations I mentioned earlier.
If you are interested please add a comment and I will send out the code to you. Please note the code was run from within Eclipse on Windows XP with 512 MB max memory settings in Eclipse. I used Sybase 15.5 ASE server to host the tables. The relationship between the tables is really simple each having a primary key and a foreign key for one to many relation between table B and C and one to one relation between A and B. It's clear that flushing the session after each write cycle (after adding 1 record in table A, 1 in table B and multiple records in table C) was most important factor in getting the faster execution time. But the JDBC code still out performed Hibernate code. It was twice as fast Hibernate. See the last row in the table.
Another interesting finding was that batching (20 as suggested by documentation) actually performed worse than non-batching. This application from complexity perspective was really simple - not many tables, simple object structure, simple relationship, no complex SQLs. So it wasn't too difficult or cumbersome to use JDBC for the best performance.
There is abundance of technical literature available on internet on every technical subject. We read and follow guidelines from well regarded sources without thinking twice. But in some occassions it's useful to try things out and validate for yourself. That's what I was forced to do and I was glad I did it. Because of it I got some interesting results to share with the community. All hibernate enthusiasts, feel free to comment and advise.
Typically Hibernate's best practices on improving performance include following techniques related to my use case.
- Use batching while performing batch inserts/updates
- Use second level cache
- Specify the objects for caching
- Clearing session to avoid object caching at first level
And so forth .......
I found that the last option, flushing the session often was most helpful for my project. But I wanted to post this information to invite help, guidance, advice from Hibernate experts on performance tuning in general, and to verify if what I did (use JDBC) was the best option for the fastest performance.
To more curious folks, here are the top 5 urls in google search on 'hibernate performance tuning', but none of those discuss the use case I have here.
http://www.javadev.org/files/Hibernate%20Performance%20Tuning.pdf
http://java.dzone.com/articles/hibernate-performance-tuning
http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html
http://www.javaperformancetuning.com/news/interview041.shtml
http://arnosoftwaredev.blogspot.com/2011/01/hibernate-performance-tips.html
The application where I had to use JDBC in favor of Hibernate due to performance reasons was really very simple. Continuously streamed SQL data was parsed and went into three relational tables - parent, child and a child of child. Let's call them tables A, B and C respectively for simplicity. There was 1-1 relation between A and B while C had variable number of rows for each row of table B.
Table relationship |
The code I wrote with Hibernate added a sample of 100K rows to A, 100K rows to B and around 200K rows to table C. There was no need to read the data while it was being written. The 'for loop' I wrote repeated 100K times with each instance of the loop adding row to table A first, followed by table B and table C next. Before debugging my code (Hibernate) due to very very sluggish performance, the speed of the program progressively decreased and almost started crawling before it threw 'out of memory' error. I had to add session.flush() at the end of each cycle of the loop. Adding session.flush() statement greatly improved performance of writes and because there was no need to read the data at the time of writing session flushing did not cost performance penalty.
Typically in any real life OLTP applications some user sessions will always be reading the data while other user sessions are actively writing it. Hibernate's best practices suggest the use of second level cache for better performance and in most cases it works well when multiple sessions are adding data. Remember, Hibernate caches recently added data automatically. Second level cache is involved when multiple sessions are writing and reading data but in the scenario I discussed, I had only one session that was writing and no session was reading the data. Hence turning off or on the second level cache had no impact on performance at all.
But a slightly modified scenario could potentially be a real life use case. What are the Hibernate best practices to achieve optimum performance in such a use case. The modified scenario is to have one session writing the data and multiple sessions reading the data. For example, results and statistics of Summer Olympics being fed to a data store by some feed engine, while participants and viewers are viewing the data in real time. What steps could we take to get the best read/write performance using Hibernate. Please consider for discussion sake that data from feed engine is not a bulk upload but more of a live stream data accepted using Hibernate code (I guess it's not really a best practice). Do we turn second level cache off, do we flush session on each write cycle, batch the write operations or?
Of course, the performance tests using simulated work load provides the best answer, but as an architect how do we propose a solution that ultimately is validated by performance tests. Following table shows the performance numbers I got in my test runs with different configurations. The code was really simple and it performed operations I mentioned earlier.
Hibernate Vs JDBC |
Another interesting finding was that batching (20 as suggested by documentation) actually performed worse than non-batching. This application from complexity perspective was really simple - not many tables, simple object structure, simple relationship, no complex SQLs. So it wasn't too difficult or cumbersome to use JDBC for the best performance.
There is abundance of technical literature available on internet on every technical subject. We read and follow guidelines from well regarded sources without thinking twice. But in some occassions it's useful to try things out and validate for yourself. That's what I was forced to do and I was glad I did it. Because of it I got some interesting results to share with the community. All hibernate enthusiasts, feel free to comment and advise.
Hi,
ReplyDeleteI'm astonished that your JDBC code is only twice as fast as Hibernate. You cannot beat direct JDBC PreparedStatements with batching, small/no DB transactions etc.
For mass imports/exports we always switch from JPA/Hibernate to pure JDBC and I have seen much better improvements than *2.
Regular flushing is only necessary because the Hibernate session else keeps all persisted Entities as managed Entities. They cannot be Garbage Collectet till the Session/EntityManager is flushed (and the data written). Otherwise you also get too big DB transactions, also slowing down the inserts.
There is nothing wrong in using the right tool for the right job. O/R mappers arn't necessarily the right tool for mass data handling. The object creation etc. cost overhead - they give you huge benefits for building business applications but they come with a price.
The 2nd level cache will do nothing for write performance, quite the contrary, especially if you activate 2nd level query caching (in addition to the default Entity cache). If any datarecord is updatet for a cached Collection/HQL query the whole Cache Collection/HQL Query is flushed. In the end you will get much slower through the additional Caching administration efforts.
Best regards,
André
Thanks Andre.
DeleteThe comparison of JDBC and Hibernate I listed was over a very small set of records of 5000. I am sure the gap would have widened substantially with more records.
Turning the cache off while writing is a better solution as Sam's comment indicate?
Another question still outstanding in my mind is - JDBC batch size setting deteriorated the performance against expectations.
Mahesh - You should try hibernate's StatelessSession. My financial quantitative app loads 2.5 million rows in a single call and performance is equivalent to JDBC.
ReplyDeleteThanks Sam.
DeleteIn the app you mentioned, is any user session reading the records while rows are being uploaded?
So, a stateless session is used to write the records while stateful(to get caching benefits) sessions are used to read the records at the same time?
I've same exact problem and still struggling to find right solution. Let me know if you have find one
ReplyDeleteJay - for an entirely 'data upload utility', you can use stateless sessions as some readers suggested and the performance is comparable, but then why would you write hibernate code for just data upload? I would rather use JDBC
ReplyDeleteThe argument for both sides could be made when your application is a mix of reads and writes. If you want reads to be faster you may not want to use stateless sessions, just to take advantage of default caching hibernate does. Then you may want to periodically flush the cash to avoid performance penalties in data writes. How frequently to flush, it depends (the usual answer but can't help it.) That's the point of my article really. If you are not averse to writing JDBC code, there are instances when you want to use it in favor of Hibernate. Those scenarios will typically involve simple sqls (simple joins, fewer columns in the tables etc). Almost always JDBC will write faster.