February, 2012


4
Feb 12

Why immutable data objects?

This may be obvious to many, but apparently not to some. Consider the following example:

Car bmw = new Car("BMW");

Wheel wheel = new Wheel();

wheel.setSize(19);

bmw.addWheels(wheel, 4);

save(bmw); // the car is not actually written here: the underlying mechanism,
           // say Hibernate, decided not to flush until later

Car trabant = new Car("Trabant");

wheel.setSize(13);

trabant.addWheels(wheel, 4);

save(trabant);

flush(); // here the BMW will be saved with 13 inch wheels

What happens here is that the BMW, instead of the expected fancy 19-inch wheels, ends up with Trabant-size 13-inch wheels (not that the car looks odd, it just won’t go far). The save call only registered the car; by the time the flush reads its state, the shared wheel object has already been changed. Unit tests tend to cover the simple scenario with a single car rather than this one (I don’t believe in the perfection of testing code), so the bug may slip into production.
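To make the effect reproducible without a database, here is a minimal sketch that imitates the deferred flush with an in-memory list. The `save`/`flush` helpers and the bodies of `Car` and `Wheel` are assumptions made for illustration only; they stand in for whatever the real persistence layer does and are not Hibernate code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a persistence session that defers writes until flush().
class DeferredSaveDemo {

    static class Wheel {
        private int size;
        void setSize(int size) { this.size = size; }
        int getSize() { return size; }
    }

    static class Car {
        final String name;
        Wheel wheel;      // all wheels of a car share one configuration object
        int wheelCount;
        Car(String name) { this.name = name; }
        void addWheels(Wheel wheel, int count) { this.wheel = wheel; this.wheelCount = count; }
    }

    // Cars queued by save(); nothing is "written" until flush() runs.
    static final List<Car> pending = new ArrayList<>();

    static void save(Car car) { pending.add(car); }

    // Reads each car's state at flush time, not at save time.
    static List<String> flush() {
        List<String> rows = new ArrayList<>();
        for (Car car : pending) {
            rows.add(car.name + ":" + car.wheel.getSize());
        }
        pending.clear();
        return rows;
    }

    static List<String> run() {
        Wheel wheel = new Wheel();
        wheel.setSize(19);
        Car bmw = new Car("BMW");
        bmw.addWheels(wheel, 4);
        save(bmw);

        wheel.setSize(13); // mutates the very object the BMW still references
        Car trabant = new Car("Trabant");
        trabant.addWheels(wheel, 4);
        save(trabant);

        return flush();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [BMW:13, Trabant:13]
    }
}
```

Both rows carry size 13 because the two cars hold a reference to the same `Wheel` instance, and the value is only read when the queue is flushed.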

However, this problem can be easily prevented/fixed by introducing immutability.

class Wheel {

private final int size;

public Wheel(int size) { this.size = size; }

public int getSize() { return size; }

}

This forces a user of the Wheel class to create a new instance whenever a different configuration is required, instead of reusing an existing object.

Car bmw = new Car("BMW");

bmw.addWheels(new Wheel(19), 4);

Car trabant = new Car("Trabant");

save(bmw);

trabant.addWheels(new Wheel(13), 4);

save(trabant);

flush();

Note, the code becomes slightly shorter as well.
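The immutable variant can be checked the same way. The sketch below is self-contained and assumes a hypothetical in-memory `save`/`flush` pair that reads object state only at flush time, which is how a deferred write behaves; the `Car` body is likewise invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory imitation of a session with deferred writes.
class ImmutableWheelDemo {

    // Immutable: the size is fixed at construction and there is no setter.
    static final class Wheel {
        private final int size;
        Wheel(int size) { this.size = size; }
        int getSize() { return size; }
    }

    static class Car {
        final String name;
        Wheel wheel;
        int wheelCount;
        Car(String name) { this.name = name; }
        void addWheels(Wheel wheel, int count) { this.wheel = wheel; this.wheelCount = count; }
    }

    static final List<Car> pending = new ArrayList<>();

    static void save(Car car) { pending.add(car); }

    // State is read here, long after save() was called.
    static List<String> flush() {
        List<String> rows = new ArrayList<>();
        for (Car car : pending) {
            rows.add(car.name + ":" + car.wheel.getSize());
        }
        pending.clear();
        return rows;
    }

    static List<String> run() {
        Car bmw = new Car("BMW");
        bmw.addWheels(new Wheel(19), 4);
        save(bmw);

        Car trabant = new Car("Trabant");
        trabant.addWheels(new Wheel(13), 4); // a new instance; the BMW's wheel is untouched
        save(trabant);

        return flush();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [BMW:19, Trabant:13]
    }
}
```

Since `Wheel` has no setter, the only way to get a 13-inch wheel is to construct one, so the late flush has nothing stale to pick up.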

One might ask why we don’t do the same with the Car class instead of keeping the `addWheels` method. The reason is that the internal state of a car is managed only through the Car class interface. However, be careful if a Car instance may become a non-root part of a different object hierarchy.


3
Feb 12

Hibernate bulk loading: The right way

In the current project we are processing a large amount of data in a distributed cache (an Oracle Coherence data grid). This processing relies on a similarly large amount of reference data (1+ million objects). After the application starts, the reference data is loaded from an Oracle database into the cache. The data is fairly complex and is spread over about 10 tables. Hibernate is used for the object-relational mapping. The data model is essentially a main class with a set of simple properties and 4 one-to-many associations (of size 10 on average). This data has to be loaded, transformed into a different OO form and stored in the grid for efficient retrieval later.

Guess how long it would take to do this for 1 million records? OK, it depends. There are several approaches, though it’s obvious that the bottleneck here is the database and Hibernate (which uses reflection for the O-R transformation), since that part can’t be parallelized much.

1. Load main objects with lazy batch loading of one-to-many associations. One long query for the main objects and multiple (sub)queries for the associations.

2. Load main objects with eager loading of one-to-many associations using join fetch, which means executing a single complex query with many outer joins. The performance of such a query (execution alone, without fetching) is quite poor (25-30 seconds in my case); on the plus side, it is the only query to run.

In either case, since we are dealing with a very large number of records (we can’t load everything into memory first and only then start processing), the obvious thing to do is to fetch data via a scrollable cursor with some batch size, e.g. 100.

Unfortunately, both approaches have their drawbacks. Given the amount of data, the first approach executes many queries to fetch associations, say 1,000,000 / 100 (batch size) = 10,000. This is 100 times fewer than without batching, but still a lot! In this case, execution time was around 45 minutes (including transformation and putting into the cache, but the time spent in these phases is negligible).

The second approach is much better in terms of the number of queries (there is only one query to execute!), so we would expect a significant performance improvement. However, the number of rows to load will be about the same and, on closer inspection, the amount of data to be loaded and processed by Hibernate is much bigger, because every row carries columns from all joined tables: if the main table has 10 columns, all 10 values appear in a row for each combination of associations. In practice this approach did show some improvement, but it was insignificant: it ran in around 30 minutes.

I simply could not accept this very suboptimal performance. So I came up with one more idea… and it was not about abandoning Hibernate and implementing everything with pure JDBC. The third approach is to have one query per object type: one query for the main objects and one query for each of the one-to-many associations, 5 quite efficient select queries in total. All these queries are executed one after another at the beginning, and all 5 scrollable result sets are processed quasi-simultaneously in a single pass. For this to work, all the queries must meet one important requirement: they must be ordered by the same key (the id of the main object) and apply the same set of filters as the first query. In this case, the Hibernate mapping of the associations is lazy, so we need to populate them in our own code while going through the result sets. This process is illustrated below:

iterate MasterResultSet as m
    iterate DetailResultSet1 as d1 while d1.key = m.key
        m.addDetail1(d1)
    end
    …
    iterate DetailResultSetN as dN while dN.key = m.key
        m.addDetailN(dN)
    end
end
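The pseudocode above might translate into Java roughly as follows. This is only a sketch under stated assumptions: plain `Iterator`s stand in for the scrollable result sets, the `Master`, `Detail` and `Cursor` types are hypothetical, and only two detail streams are shown. The step that skips detail rows with a smaller key is a defensive addition not present in the pseudocode; with identical filters on every query it never fires.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SortedMergeLoader {

    // Hypothetical row types; the real ones would be the mapped entities.
    static final class Master {
        final long id;
        final List<Detail> details1 = new ArrayList<>();
        final List<Detail> details2 = new ArrayList<>();
        Master(long id) { this.id = id; }
    }

    static final class Detail {
        final long masterId;
        final String value;
        Detail(long masterId, String value) { this.masterId = masterId; this.value = value; }
    }

    // Wraps a detail stream sorted by masterId and keeps one row of lookahead.
    static final class Cursor {
        private final Iterator<Detail> rows;
        private Detail next;

        Cursor(Iterator<Detail> rows) { this.rows = rows; advance(); }

        private void advance() { next = rows.hasNext() ? rows.next() : null; }

        // Moves every row that belongs to masterId into target.
        void drainInto(long masterId, List<Detail> target) {
            // Defensive: skip rows whose master never appeared in the master stream.
            while (next != null && next.masterId < masterId) advance();
            while (next != null && next.masterId == masterId) {
                target.add(next);
                advance();
            }
        }
    }

    // Single pass over all streams; each stream is consumed exactly once.
    static List<Master> load(Iterator<Master> masters, Iterator<Detail> d1, Iterator<Detail> d2) {
        Cursor c1 = new Cursor(d1);
        Cursor c2 = new Cursor(d2);
        List<Master> result = new ArrayList<>();
        while (masters.hasNext()) {
            Master m = masters.next();
            c1.drainInto(m.id, m.details1);
            c2.drainInto(m.id, m.details2);
            result.add(m); // a real loader would transform m and put it into the cache here
        }
        return result;
    }

    public static void main(String[] args) {
        List<Master> masters = Arrays.asList(new Master(1), new Master(2), new Master(3));
        List<Detail> d1 = Arrays.asList(new Detail(1, "a"), new Detail(1, "b"), new Detail(3, "c"));
        List<Detail> d2 = Arrays.asList(new Detail(2, "x"), new Detail(3, "y"));
        for (Master m : load(masters.iterator(), d1.iterator(), d2.iterator())) {
            System.out.println(m.id + " " + m.details1.size() + " " + m.details2.size());
        }
        // prints:
        // 1 2 0
        // 2 0 1
        // 3 1 1
    }
}
```

The single row of lookahead is what lets each detail stream stop at the first row belonging to the next master without losing it, and this only works because every stream is ordered by the same key.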

Using this approach the full bulk loading took only… 8 minutes! This is due to the data being loaded in an optimal way in terms of amount of data, number and complexity of queries, and number of network round trips.

Testing environment configuration: CentOS x86_64 GNU/Linux, Intel Xeon x8 Core, 24GB RAM. Run in debug mode from Eclipse IDE.