[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan

Tuesday, 15 August 2017

On Tue, 2017-08-15 at 16:56 +1000, William Brown wrote:
...
 On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote:
 > Hi everybody,
 > 
 > Following Ludwig's and Mark's suggestions on how to perform a database dump
 > in LDIF format from dbscan, I have come up with a strategy. I'm talking
 > about ticket #47567: https://pagure.io/389-ds-base/issue/47567
 > 
 > I'd like to discuss this strategy and get some feedback.
 > 
 > The general idea is:
 > 
 > - We are cursing though id2entry.db printing entries in order.
 > - Parents should be printed before children.
 > - Hence if for some entry parenitd > entryid, we have to print its parent
 > first (out of order) and track that we did so.
 > - In order to do the above, we don't need to move the db cursor. We can
 > just go fetch something from a random place in the db and the cursor will
 > remain in its place, so we can continue from where we left off after we're
 > done with printing a parent.
 > - We also need to construct and print the DN of each entry using its RDN
 > and the DN of its father.
 > - Let's use something like a hash map to pair entryids with dns (only for
 > entries that has children), e.g. map[1] = "dc=example,dc=com", map[4] =
 > "ou=People,dc=example,dc=com", etc.
 > 
 > I'll present the algorithm that I came up with in python-like pseudo-code.
 > 
 > First, the following function constructs the entry's DN and updates the
 > hash map if needed. We can know whether an entry is a parent or not, by the
 > presence of the numSubordinates attribute.
 > 
 > # assumes that parentid already exists in the map
 > function display_entry(e, map):
 >     if not e.parentid:
 >         e.dn = e.rdn
 >     else:
 >         e.dn = e.rdn + map[e.parentid]
 >     if isparent(e):
 >         map[e.entryid] = e.dn
 >     print_to_ldif_format(e)
 > 
 > Then, the main loop:
 > 
 > map = new(hashmap)
 > printed_in_advance = []
 > 
 > for e in entries:
 >     if e.entryid in printed_in_advance:
 >         continue # skip this, we have already displayed it
 > 
 >     if e.parentid < e.entryid:
 >         display_entry(e, map)
 > 
 >     else:
 >         # we need to display parents before children
 > 
 >         list_of_parents = []
 > 
 >         p = e
 >         while p.parentid > p.entryid:
 >             p = get_from_db(key = p.parentid)
 >             list_of_parents.append(p) # see note below (*)
 > 
 >         for p in reversed(list_of_parents):
 >             display_entry(p, map)
 >             printed_in_advance.append(p.entryid)
 > 
 > 
 > * here we store the whole entry in the list (aka its data) and not just
 > its id, in order to avoid fetching it from the database again
 > 
 > As a map, we can use a B+ tree implementation from libsds.
 > 
 > I would like to know how the above idea sounds to you before going ahead
 > and implementing it.
 > 

 Looks like a pretty good plan to me.

 What happens if you have this situation?

 rdn: a
 entryid: 1
 parentid: 2

 rdn: b
 entryid: 2
 parentid: 3

 rdn: c
 entryid: 3
 parentid: 4

 etc.

 Imagine we have to continue to work up ids to get to the parent. Can
 your algo handle this case? 

 It seems like the way you could approach this is to sort the id order
 you need to display them in *before* you start parsing entries. We can
 file allids in memory anyway, because it's just a set of 32bit ints, so
 even at 50,000,000 entries, this only takes 190MB of ram to fit allids.
 So I would approach this as an exercise for a comparison function to the
 set.

 universal_entry_set = [....] # entryrdn / id2entry
 # The list of entries to print
 print_order = []

 for e in universal_entry_set:
     printer_order_insert_sorted(e)

 So we can imagine that this would invoke a sorting function. Something
 that would work is:

 compare(e1, e2):
     # if e1 parent is 0, return -1
     # if e2 parent is 0, return 1
     # If e1 is parent to e2, return -1
     # If e2 is parent to e1, return 1
     # return compare of eid.

 Then this would create a list in order of parent relationships, and you
 can just do:

 for e in print_order:

 Which despite jumping about in the cursor, will print in order.

 So you'll need to look at entryrdn once, and then id2entry once for
 this. 

 If you want to do this without entryrdn, you'll need to pass id2entry
 twice. But You'll only need 1 entry in memory at a time to achieve it I
 think. It also doesn't matter about order of ids at all here.

 You could easily demo this algo in py with entry objects and implement
 __cmp__, then do "sorted(list)' on it.

 For the C variant, you would give the cmp function to the btree map, and
 then as it inserts, it orders them, then you can just do
 "sds_bptree_map(print_fn)" and it will apply print to each entry in the
 order.

 Hope that helps,

 _______________________________________________
 389-devel mailing list -- 389-devel(a)lists.fedoraproject.org
 To unsubscribe send an email to 389-devel-leave(a)lists.fedoraproject.org 
[william@rei 17:04] /tmp I0> python sort.py
before [uid=william, ou=People, dc=example,dc=com] 
after [dc=example,dc=com, ou=People, uid=william]

[william@rei 17:05] /tmp I0> cat sort.py 

class SlapiEntry(object):
    def __init__(self, rdn, eid, parent):
        self.rdn = rdn
        self.eid = eid
        self.parent = parent

    def __lt__(self, other):
        if self.parent == 0:
            return True
        elif other.parent == self.eid:
            return True
        else:
            return self.eid < other.eid

    def __gt__(self, other):
        return not self.__lt__(other)

    def __eq__(self, other):
        return False

    def __repr__(self):
        return self.rdn

a = SlapiEntry('dc=example,dc=com', 10, 0)
b = SlapiEntry('ou=People', 5, 10)
c = SlapiEntry('uid=william', 1, 5)

l = [c, b, a]
print("before %s " % l)
print("after %s" % sorted(l))

Have fun :) 

-- 
Sincerely,

William Brown
Software Engineer
Red Hat, Australia/Brisbane

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan