Re: Strategy proposal for making DB dump in LDIF format from dbscan

William Brown <wibrown@xxxxxxxxxx> · Tue, 15 Aug 2017 17:05:36 +1000



On Tue, 2017-08-15 at 16:56 +1000, William Brown wrote:
> On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote:
> > Hi everybody,
> > 
> > Following Ludwig's and Mark's suggestions on how to perform a database dump
> > in LDIF format from dbscan, I have come up with a strategy. I'm talking
> > about ticket #47567: https://pagure.io/389-ds-base/issue/47567
> > 
> > I'd like to discuss this strategy and get some feedback.
> > 
> > The general idea is:
> > 
> > - We are cursing though id2entry.db printing entries in order.
> > - Parents should be printed before children.
> > - Hence if for some entry parenitd > entryid, we have to print its parent
> > first (out of order) and track that we did so.
> > - In order to do the above, we don't need to move the db cursor. We can
> > just go fetch something from a random place in the db and the cursor will
> > remain in its place, so we can continue from where we left off after we're
> > done with printing a parent.
> > - We also need to construct and print the DN of each entry using its RDN
> > and the DN of its father.
> > - Let's use something like a hash map to pair entryids with dns (only for
> > entries that has children), e.g. map[1] = "dc=example,dc=com", map[4] =
> > "ou=People,dc=example,dc=com", etc.
> > 
> > I'll present the algorithm that I came up with in python-like pseudo-code.
> > 
> > First, the following function constructs the entry's DN and updates the
> > hash map if needed. We can know whether an entry is a parent or not, by the
> > presence of the numSubordinates attribute.
> > 
> > # assumes that parentid already exists in the map
> > function display_entry(e, map):
> >     if not e.parentid:
> >         e.dn = e.rdn
> >     else:
> >         e.dn = e.rdn + map[e.parentid]
> >     if isparent(e):
> >         map[e.entryid] = e.dn
> >     print_to_ldif_format(e)
> > 
> > Then, the main loop:
> > 
> > map = new(hashmap)
> > printed_in_advance = []
> > 
> > for e in entries:
> >     if e.entryid in printed_in_advance:
> >         continue # skip this, we have already displayed it
> > 
> >     if e.parentid < e.entryid:
> >         display_entry(e, map)
> > 
> >     else:
> >         # we need to display parents before children
> > 
> >         list_of_parents = []
> > 
> >         p = e
> >         while p.parentid > p.entryid:
> >             p = get_from_db(key = p.parentid)
> >             list_of_parents.append(p) # see note below (*)
> > 
> >         for p in reversed(list_of_parents):
> >             display_entry(p, map)
> >             printed_in_advance.append(p.entryid)
> > 
> > 
> > * here we store the whole entry in the list (aka its data) and not just
> > its id, in order to avoid fetching it from the database again
> > 
> > As a map, we can use a B+ tree implementation from libsds.
> > 
> > I would like to know how the above idea sounds to you before going ahead
> > and implementing it.
> > 
> 
> Looks like a pretty good plan to me.
> 
> What happens if you have this situation?
> 
> rdn: a
> entryid: 1
> parentid: 2
> 
> rdn: b
> entryid: 2
> parentid: 3
> 
> rdn: c
> entryid: 3
> parentid: 4
> 
> etc.
> 
> Imagine we have to continue to work up ids to get to the parent. Can
> your algo handle this case? 
> 
> It seems like the way you could approach this is to sort the id order
> you need to display them in *before* you start parsing entries. We can
> file allids in memory anyway, because it's just a set of 32bit ints, so
> even at 50,000,000 entries, this only takes 190MB of ram to fit allids.
> So I would approach this as an exercise for a comparison function to the
> set.
> 
> universal_entry_set = [....] # entryrdn / id2entry
> # The list of entries to print
> print_order = []
> 
> for e in universal_entry_set:
>     printer_order_insert_sorted(e)
> 
> So we can imagine that this would invoke a sorting function. Something
> that would work is:
> 
> compare(e1, e2):
>     # if e1 parent is 0, return -1
>     # if e2 parent is 0, return 1
>     # If e1 is parent to e2, return -1
>     # If e2 is parent to e1, return 1
>     # return compare of eid.
> 
> Then this would create a list in order of parent relationships, and you
> can just do:
> 
> for e in print_order:
> 
> Which despite jumping about in the cursor, will print in order.
> 
> So you'll need to look at entryrdn once, and then id2entry once for
> this. 
> 
> If you want to do this without entryrdn, you'll need to pass id2entry
> twice. But You'll only need 1 entry in memory at a time to achieve it I
> think. It also doesn't matter about order of ids at all here.
> 
> You could easily demo this algo in py with entry objects and implement
> __cmp__, then do "sorted(list)' on it.
> 
> For the C variant, you would give the cmp function to the btree map, and
> then as it inserts, it orders them, then you can just do
> "sds_bptree_map(print_fn)" and it will apply print to each entry in the
> order.
> 
> Hope that helps,
> 
> 
> _______________________________________________
> 389-devel mailing list -- 389-devel@xxxxxxxxxxxxxxxxxxxxxxx
> To unsubscribe send an email to 389-devel-leave@xxxxxxxxxxxxxxxxxxxxxxx

[william@rei 17:04] /tmp I0> python sort.py
before [uid=william, ou=People, dc=example,dc=com] 
after [dc=example,dc=com, ou=People, uid=william]

[william@rei 17:05] /tmp I0> cat sort.py 

class SlapiEntry(object):
    def __init__(self, rdn, eid, parent):
        self.rdn = rdn
        self.eid = eid
        self.parent = parent

    def __lt__(self, other):
        if self.parent == 0:
            return True
        elif other.parent == self.eid:
            return True
        else:
            return self.eid < other.eid

    def __gt__(self, other):
        return not self.__lt__(other)

    def __eq__(self, other):
        return False

    def __repr__(self):
        return self.rdn


a = SlapiEntry('dc=example,dc=com', 10, 0)
b = SlapiEntry('ou=People', 5, 10)
c = SlapiEntry('uid=william', 1, 5)

l = [c, b, a]
print("before %s " % l)
print("after %s" % sorted(l))


Have fun :) 

-- 
Sincerely,

William Brown
Software Engineer
Red Hat, Australia/Brisbane

Attachment:
signature.asc

Description: This is a digitally signed message part
_______________________________________________
389-devel mailing list -- 389-devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to 389-devel-leave@xxxxxxxxxxxxxxxxxxxxxxx