Enterprise Systems


3
Feb 12

Hibernate bulk loading: The right way

In the current project we are procesing big amount of data in a distributed cache (an Oracle Coherence data grid). This processing relies on some (also big) amount of reference data (1+ million of objects). After application starts the reference data is loaded from an Oracle database into the cache. The data is fairly complex and is spread over about 10 tables. Hibernate is used for the ORM mapping. The data model is represented basically by a main class with a set of simple properties and 4 one-to-many, 1-to-5 average, associations. This data is to be loaded, transformed into a different OO form and stored in the cache for effective retrieval later. Guess, how long it would take to do this for 1 million records?.. OK, it depends. There are several approaches, though it’s obvious that the bottleneck here is the database and Hibernate (reflection is used for O-R transformation) as it can’t be much parallelized.

1. Load main objects with lazy batch loading of one-to-many associations. One long query for the main objects and multiple (sub)queries for the associations.

2. Load main objects with eager loading of one-to-many associations using join fetch, which means executing single complex query with many outer joins. The performance of such a query, query execution without fetching, is quite poor (20-30 seconds in my case), but it’s only one query.

One obvious thing to do since we need to deal with a very big number of records (it means we can’t load everything into memory first and then start processing), is to fetch data via scrollable cursor with some batch size, e.g. 100.

Unfortunately, both approaches have their drawbacks. First approach with regards to the amount of data will execute many queries to fetch associations, say, 1000 000 / 100 (batch size) = 10000 – this is 100 less, but still lots! In this case, execution time was around 45 minutes (including transformation and putting into cache, but this phases’ time is neglectable).

Second approach is way better in terms of the number of queries (there is only 1! query to execute), and we would expect some significant performance improvement. However, the number of rows to load will be about same, and if we look closer, the amount of data to load and to process by Hibernate is much bigger, because every row will have column from all joint tables: if main table has 10 columns, all 10 values will appear in a row for each combination of associations. Though, in practice this approach showed some improvement, it was insignificant – it ran in around 30 minutes.

I simply could not accept this very suboptimal performance. So I came up with one more idea… and it was not about abandoning Hibernate and implementing everything with pure JDBC. The third approach is to have one query per object type: one query for main objects and one query for each of one-to-many associations – in total 5 quite effective select queries. All these queries should be executed one by one in the beginning and all 5 scrollable resultsets should be processed kvazi-simultaneously in one go. To achieve this, one important requirement should be met by all the queries: they should be ordered by the same key (id of the main object) and have the same set of filters as in the first query. In this case, hibernate mapping of associations is lazy, so we need to take care of populating them in our code while going through the resultsets. This process is illustrated below:

iterate MasterResultSet as m
    iterate DetailResultSet1 as d1 while d1.key = m.key
        m.addDetail1(d1)
    end
    …
    iterate DetailResultSetN as dN while dN.key = m.key
        m.addDetailN(dN)
    end
end

Using this approach the full bulk loading took only, get ready… 4 minutes! This is due to the data being loaded in an optimal way in terms of amount of data, number and complexity of queries, and number of network round trips.


8
Nov 09

System Architecture vs Business Requirements

Business requirements define functionality of a system, whereas system architecture allows to implement this functionality. The more sophisticated business requirements are, the more flexible architecture is required. The more flexible architecture is, the more future business needs it may cover. The more specific selected architecture is, in a better way it serves current business needs. The more focused on current business requirements architecture is, the less time to market is achieved. So there should a compromise be found. I would say, fifty-fifty approach will fit most of small to medium projects, while more flexibility to less dedicated solution will fit big projects or projects that will potentially grow in future. Either extreme will go phut eventually and end up with totally new solution. There is nothing new about it, I just wanted to emphasize the importance of such choice, no matter how philosophic the question is.