Summary: Slow import post-processing with large number of non-leaf entries
Thomas Lackey posted a patch that changes how the ancestorid index is
created. With his patch, large imports with many non-leaf entries speed up
dramatically. I reviewed his code and approved it. I'd like at least one
more review before checking in his patch.
[Description of problem]
Import post-processing of large databases is extremely slow when they contain a
large number of non-leaf entries. This slowness is centered in the creation of
the ancestorid index.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Create a large LDIF to import, something with 1 million entries or more
where many of the entries have children. In this test, approximately 1/10 of
the entries were placed directly beneath the suffix and the other 9/10 were
child entries evenly distributed beneath them, e.g.:
dn: cn=MyEntry0, <suffix>
dn: cn=Child0, cn=MyEntry0, <suffix>
dn: cn=Child1, cn=MyEntry0, <suffix>
2. Import with ldif2db.
3. Watch the amount of time spent in the post-processing phase of the import
and the difference in the final import rate when compared to the last rate from
the import phase.
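One way to produce an LDIF with the layout described in step 1 is sketched
below. This is an illustrative helper, not part of the report's reproduction
harness; the suffix, attribute values, and the make_ldif name are
placeholders.

```python
# Hypothetical helper: generate a test LDIF where 1/10 of the entries are
# parents directly beneath the suffix and 9/10 are children beneath them.
def make_ldif(total, suffix="dc=example,dc=com", children_per_parent=9):
    lines = []
    parents = total // (children_per_parent + 1)
    for p in range(parents):
        parent_dn = f"cn=MyEntry{p},{suffix}"
        lines.append(f"dn: {parent_dn}\nobjectClass: extensibleObject\ncn: MyEntry{p}\n")
        for c in range(children_per_parent):
            lines.append(f"dn: cn=Child{c},{parent_dn}\nobjectClass: extensibleObject\ncn: Child{c}\n")
    return "\n".join(lines)
```

Scaling total to 1 million or more reproduces the non-leaf-heavy shape that
triggers the slow post-processing.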
The post-processing time is inordinate compared to the import time.
[30/Oct/2008:11:22:48 -0500] - import userRoot: Processed 1059199 entries --
average rate 5725.4/sec, recent rate 5565.0/sec, hit ratio 100%
[30/Oct/2008:11:22:56 -0500] - import userRoot: Indexing complete.
[30/Oct/2008:11:31:09 -0500] - import userRoot: Import complete. Processed
1100009 entries in 686 seconds. (1603.51 entries/sec)
This import took 11m31s according to 'time', of which 7m38s was spent in
post-processing. This dropped the average rate from 5700 entries/s to 1600
entries/s.
This difference becomes even more pronounced with larger databases. Importing
a very large database on another machine averaged 11k entries/s until
post-processing, where the final rate ended at only 283 entries/s.
Building the ancestorid index does not need to be so expensive, since the
information is available from the parentid index. The cost is associated with
general overhead in maintaining the IDLists in memory, and in particular to the
constant unions done on them to add children. When these lists may contain
millions of entries, the time spent copying the existing data when inserting
children is prohibitively expensive. This does not affect all layouts equally,
but it does cause problems when large numbers of children are dispersed
throughout the tree.
BDB can usually handle inserts efficiently on its own, so it is not necessary
to maintain complete IDLists in memory for all the entries and write them out
in total. Updates can be performed directly to the DB instead.
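The cost argument above can be made concrete with a small sketch (Python,
purely illustrative, not server code): when each child is added to a growing
in-memory IDList by a union that copies the existing list, inserting n
children costs O(n^2) element copies in total, whereas incremental
per-record inserts stay O(n).

```python
# Count the elements copied when n IDs are added one union at a time to a
# list that is copied on every union (the in-memory IDList pattern).
def union_copy_total(n):
    copied = 0
    size = 0
    for _ in range(n):
        copied += size  # the union copies the existing list before adding one ID
        size += 1
    return copied  # = n*(n-1)/2, quadratic in n
```

For a parent with a million children this is on the order of 5 * 10^11
copied elements, which is why the post-processing phase dominates the run.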
This example is on the same hardware using the same LDIF, but using code that
does not maintain all the IDLists in memory and calls through to
idl_new_store_block() to perform updates incrementally.
[30/Oct/2008:12:02:45 -0500] - import userRoot: Processed 1042117 entries --
average rate 5633.1/sec, recent rate 5752.4/sec, hit ratio 100%
[30/Oct/2008:12:02:56 -0500] - import userRoot: Indexing complete.
[30/Oct/2008:12:03:51 -0500] - import userRoot: Import complete. Processed
1100009 entries in 251 seconds. (4382.51 entries/sec)
The total time for this run was 4m17s, of which only 30s was post-processing.
Most importantly, 'dbscan -r -f ancestorid.db4' shows identical contents for
both runs:
> md5sum ancestorid.db4.dbscan.prefix
Created an attachment (id=322386)
This proposed patch uses a new function,
ldbm_ancestorid_new_idl_create_index(), to create the index when
idl_get_idl_new() is true. The existing code is used otherwise.
This is the code referenced in the description.
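The idea behind the incremental approach can be sketched as follows. This is
a hypothetical Python model, not the actual C patch or the real
ldbm_ancestorid_new_idl_create_index() code: it derives the ancestorid
mapping from the parentid information in a single tree walk, emitting each
ancestor/descendant pair exactly once, the way incremental
idl_new_store_block()-style inserts would, rather than unioning whole
IDLists in memory.

```python
from collections import defaultdict

# Hypothetical model: parentid maps an entry ID to the IDs of its direct
# children; ancestorid maps an entry ID to the IDs of all entries beneath it.
def build_ancestorid(parentid, suffix_id=1):
    ancestorid = defaultdict(set)
    # depth-first walk carrying the list of ancestors seen so far
    stack = [(suffix_id, [])]
    while stack:
        eid, ancestors = stack.pop()
        for anc in ancestors:
            ancestorid[anc].add(eid)  # one incremental insert per pair
        for child in parentid.get(eid, []):
            stack.append((child, ancestors + [eid]))
    return dict(ancestorid)
```

For example, with suffix 1 having children 2 and 3, and 2 having children 4
and 5, the result maps 1 to {2, 3, 4, 5} and 2 to {4, 5}, matching what the
ancestorid index is meant to contain.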