Performance


3
Feb 12

Hibernate bulk loading: The right way

In the current project we are procesing a big amount of data in a distributed cache (an Oracle Coherence data grid). This processing relies on some (also big) amount of reference data (1+ million of objects). After application starts the reference data is loaded from an Oracle database into the cache. The data is fairly complex and is spread over about 10 tables. Hibernate is used for the ORM mapping. The data model is represented basically by a main class with a set of simple properties and 4 one-to-many (of size 10 on average) associations. This data is to be loaded, transformed into a different OO form and stored in the grid for efficient retrieval later.

Guess, how long it would take to do this for 1 million records?.. OK, it depends. There are several approaches, though it’s obvious that the bottleneck here is the database and Hibernate (reflection is used for O-R transformation) as it can’t be much parallelized.

1. Load main objects with lazy batch loading of one-to-many associations. One long query for the main objects and multiple (sub)queries for the associations.

2. Load main objects with eager loading of one-to-many associations using join fetch, which means executing single complex query with many outer joins. The performance of such a query, query execution without fetching, is quite poor (25-30 seconds in my case), however, on a plus side it becomes the only query to run.

In every case, an obvious thing to do since we need to deal with a very big number of records (it means we can’t load everything into memory first and then start processing), is to fetch data via scrollable cursor with some batch size, e.g. 100.

Unfortunately, both approaches have their drawbacks. First approach with regards to the amount of data will execute many queries to fetch associations, say, 1000 000 / 100 (batch size) = 10000 – this is 100 less, but still lots! In this case, execution time was around 45 minutes (including transformation and putting into cache, but this phases’ time is neglectable).

Second approach is way better in terms of the number of queries (there is only 1! query to execute), and we would expect some significant performance improvement. However, the number of rows to load will be about same, and if we look closer, the amount of data to load and to process by Hibernate is much bigger, because every row will have column from all joint tables: if main table has 10 columns, all 10 values will appear in a row for each combination of associations. Though, in practice this approach showed some improvement, it was insignificant – it ran in around 30 minutes.

I simply could not accept this very suboptimal performance. So I came up with one more idea… and it was not about abandoning Hibernate and implementing everything with pure JDBC. The third approach is to have one query per object type: one query for main objects and one query for each of one-to-many associations – in total 5 quite efficient select queries. All these queries should be executed one by one in the beginning and all 5 scrollable resultsets should be processed kvazi-simultaneously in one go. To achieve this, one important requirement should be met by all the queries: they should be ordered by the same key (id of the main object) and have the same set of filters as in the first query. In this case, hibernate mapping of associations is lazy, so we need to take care of populating them in our code while going through the resultsets. This process is illustrated below:

iterate MasterResultSet as m
    iterate DetailResultSet1 as d1 while d1.key = m.key
        m.addDetail1(d1)
    end
    …
    iterate DetailResultSetN as dN while dN.key = m.key
        m.addDetailN(dN)
    end
end

Using this approach the full bulk loading took only… 8 minutes! This is due to the data being loaded in an optimal way in terms of amount of data, number and complexity of queries, and number of network round trips.

Testing environment configuration: CentOS x86_64 GNU/Linux, Intel Xeon x8 Core, 24GB RAM. Run in debug mode from Eclipse IDE.


14
Aug 11

Hierarchical data fetched in one query

This is a very typical situation when you have a hierarchical data structure of the type like forum posts or comments:

class Comment {
    long id;
    String comment;
    Comment parent;
}

Let’s store it in a relational database (e.g. MySQL). For this we need to create a table:

CREATE TABLE Comment(
  id int PRIMARY KEY,
  comment text,
  parent_id int,
  FOREIGN KEY (parent_id) REFERENCES Comment(id)
);

And you need to show it in a tree view, e.g.:

/Comment#1
   /Comment#2 (Reply to Comment#1)
   /Comment#4 (Reply to Comment#1)
      /Comment#8 (Reply to Comment#4)
   /Comment#7 (Reply to Comment#1)
/Comment#3
/Comment#5
   /Comment#6 (Reply to Comment#5)
      /Comment#9 (Reply to Comment#6)

You can see that comments added in an arbitrary order (by looking at the comment id sequence).

So how are you going to select all the data to build the tree by SQL means? Standard SQL does not provide hierarchical select feature (like Oracle’s CONNECT BY). One would probably end up doing multiple queries, which sounds like a very inefficient thing to do (imagine fetching a 10-level depth tree of 1000 comments). Fortunately, there is a better idea. I would say even more: a solution with one simple select query! :) Excited? It’s very simple, actually.

The idea is to construct a select statement which will return results sorted in an order we need (say, in the tree above each node is a row in the query result set). On the other hand it is very similar to the structure of directories and files in the filesystem. So, we will use a property of files here, which is the path. When creating a new Comment instance we will set its path, and update when setting a parent Comment:

class Comment {
   String path;
   …
   Comment(long id) {
     this.id = id;
     this.path = "/" + id;
   }

   void setParent(Comment parent) {
      this.parent = parent;
      this.path = (parent == null ? "" : parent.path) + "/" + id;
   }
}

If we apply this transformation to the example above the data in the table will look like:

id  parent_id  path
————————————-
1   NULL       /1
2   1          /1/2
3   NULL       /3
4   1          /1/4
5   NULL       /5
6   5          /5/6
7   1          /1/7
8   4          /1/4/8
9   6          /5/6/9

Now we can use the path field to retrieve our comments ordered accordingly. By executing

SELECT * FROM comment ORDER BY path

we will get this:

id  parent_id  path
————————————-
1   NULL       /1
2   1          /1/2
4   1          /1/4
8   4          /1/4/8
7   1          /1/7
3   NULL       /3
5   NULL       /5
6   5          /5/6
9   6          /5/6/9

The rest is only a matter of formatting the results in order to get nicely looking comments tree output. (Note: the level can be computed as a number of slashes, ‘/’, in the path string).

Also, when using ORM like Hibernate (or JPA in general), you are not forced to use SQL and can get your job done with simple criteria-based or HQL queries.

As a free bonus, you get pagination working properly without extra tweaking as you would have to do in case of multiple queries.

P.S. One piece that is missing in the above example is that the sorting for strings based on numbers will be done in the alphabetical order, e.g. 11 will appear before 2 (which is obviously not what we want). So to fix this problem you can prefix each part of the path with a length of the numbers in characters. It will result in smth like: 12, 211 (or 1.2, 2.11)…


13
Aug 11

Server-side cache vs client-side cache of prepared statements

I’ve been asked to explain the difference between server-side and client-side caching of database prepared statements. So here is a brief overview.

First of all, query plans are not cached on the client-side (it would be a bit foolish, because it would not give any performance boost, because we would have to send the query plan over the network each time before execution, and it contains more structured information than just a SQL string).

The parsed query plan is only stored on the server side, in a common session cache, so all sessions can reuse it. This comprises the biggest part of queries optimisation by the database server (in terms of usage of both CPU time and memory).

What is cached on the client side? – It is an identifier of the parsed query (cache then looks like a map of [sql => id]). Another big difference from the server-side cache is that the client-side cache of prepared statements resides in scope of a session (connection). (I’m not sure about the rest of the JDBC drivers, but in the Oracle implementation query parameters are also cached, which would create a memory “leak” problem when pooled connections are actively used for some period of time, hence they don’t get closed, and the cache doesn’t get released.)

In general, the prepared statement execution looks like (in terms on commands sent to the database server):

1. id = PREPARE (sql) <- here on the server there will be an attempt to get the query plan from the cache.
2. EXECUTE (id, params1)
3. (optional) EXECUTE (id, params2) …

You can see that PREPARE for the same query is executed only once throughout the session (the lifetime of the connection). Considering that in most scenarios connection pool is used, it allows you to save relevant amount of time by not performing an additional PREPARE request on the network before each consecutive EXECUTE.


5
Jul 11

mod-pagespeeding up your site

A beautiful and simple way to speed up your site without applying any magic in your code is to install mod_pagespeed module for Apache web server.

Installation:

  • Create /etc/yum.repos.d/mod-pagespeed.repo yum config file for mod_pagespeed with the following content:
    [mod-pagespeed]
    name=mod-pagespeed
    baseurl=http://dl.google.com/linux/mod-pagespeed/rpm/stable/x86_64
    enabled=1
    gpgcheck=0
  • Then install mod-pagespeed:
    yum install mod-pagespeed
  • Restart httpd service to get it all working:
    service httpd restart

Later, I will update this post with the comparison of performance with and without mod_pagespeed (with lori add-on).