Re: Strategy proposal for making DB dump in LDIF format from dbscan

William Brown <wibrown@xxxxxxxxxx> · Thu, 17 Aug 2017 10:55:26 +1000

On Tue, 2017-08-15 at 22:03 +0300, Ilias Stamatis wrote:
> 2017-08-15 9:56 GMT+03:00 William Brown <wibrown@xxxxxxxxxx>:
> 
> > On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote:
> > > Hi everybody,
> > >
> > > Following Ludwig's and Mark's suggestions on how to perform a database
> > dump
> > > in LDIF format from dbscan, I have come up with a strategy. I'm talking
> > > about ticket #47567: https://pagure.io/389-ds-base/issue/47567
> > >
> > > I'd like to discuss this strategy and get some feedback.
> > >
> > > The general idea is:
> > >
> > > - We are cursing though id2entry.db printing entries in order.
> > > - Parents should be printed before children.
> > > - Hence if for some entry parenitd > entryid, we have to print its parent
> > > first (out of order) and track that we did so.
> > > - In order to do the above, we don't need to move the db cursor. We can
> > > just go fetch something from a random place in the db and the cursor will
> > > remain in its place, so we can continue from where we left off after
> > we're
> > > done with printing a parent.
> > > - We also need to construct and print the DN of each entry using its RDN
> > > and the DN of its father.
> > > - Let's use something like a hash map to pair entryids with dns (only for
> > > entries that has children), e.g. map[1] = "dc=example,dc=com", map[4] =
> > > "ou=People,dc=example,dc=com", etc.
> > >
> > > I'll present the algorithm that I came up with in python-like
> > pseudo-code.
> > >
> > > First, the following function constructs the entry's DN and updates the
> > > hash map if needed. We can know whether an entry is a parent or not, by
> > the
> > > presence of the numSubordinates attribute.
> > >
> > > # assumes that parentid already exists in the map
> > > function display_entry(e, map):
> > >     if not e.parentid:
> > >         e.dn = e.rdn
> > >     else:
> > >         e.dn = e.rdn + map[e.parentid]
> > >     if isparent(e):
> > >         map[e.entryid] = e.dn
> > >     print_to_ldif_format(e)
> > >
> > > Then, the main loop:
> > >
> > > map = new(hashmap)
> > > printed_in_advance = []
> > >
> > > for e in entries:
> > >     if e.entryid in printed_in_advance:
> > >         continue # skip this, we have already displayed it
> > >
> > >     if e.parentid < e.entryid:
> > >         display_entry(e, map)
> > >
> > >     else:
> > >         # we need to display parents before children
> > >
> > >         list_of_parents = []
> > >
> > >         p = e
> > >         while p.parentid > p.entryid:
> > >             p = get_from_db(key = p.parentid)
> > >             list_of_parents.append(p) # see note below (*)
> > >
> > >         for p in reversed(list_of_parents):
> > >             display_entry(p, map)
> > >             printed_in_advance.append(p.entryid)
> > >
> > >
> > > * here we store the whole entry in the list (aka its data) and not just
> > > its id, in order to avoid fetching it from the database again
> > >
> > > As a map, we can use a B+ tree implementation from libsds.
> > >
> > > I would like to know how the above idea sounds to you before going ahead
> > > and implementing it.
> > >
> >
> > Looks like a pretty good plan to me.
> >
> > What happens if you have this situation?
> >
> > rdn: a
> > entryid: 1
> > parentid: 2
> >
> > rdn: b
> > entryid: 2
> > parentid: 3
> >
> > rdn: c
> > entryid: 3
> > parentid: 4
> >
> > etc.
> >
> > Imagine we have to continue to work up ids to get to the parent. Can
> > your algo handle this case?
> >
> 
> Unless I'm mistaken, yes. Provided of course that there is a root entry
> which has no parentid.
> 
> This is the part that handles this:
> 
>          p = e
>          while p.parentid > p.entryid:
>              p = get_from_db(key = p.parentid)
>              list_of_parents.append(p)
> 
> We may jump at a few random points in the db for this and we will have to
> keep just a few entries in memory. But I think this number of entries is
> always going to be very small e.g. 1-2 or maybe even 7-8 in more "extreme"
> cases, but never more than let's say 15. I might as well be completely
> wrong about this, so, please correct me if that's the case.
> 
> 
> > It seems like the way you could approach this is to sort the id order
> > you need to display them in *before* you start parsing entries. We can
> > file allids in memory anyway, because it's just a set of 32bit ints, so
> > even at 50,000,000 entries, this only takes 190MB of ram to fit allids.
> > So I would approach this as an exercise for a comparison function to the
> > set.
> >
> > universal_entry_set = [....] # entryrdn / id2entry
> > # The list of entries to print
> > print_order = []
> >
> > for e in universal_entry_set:
> >     printer_order_insert_sorted(e)
> >
> > So we can imagine that this would invoke a sorting function. Something
> > that would work is:
> >
> > compare(e1, e2):
> >     # if e1 parent is 0, return -1
> >     # if e2 parent is 0, return 1
> >     # If e1 is parent to e2, return -1
> >     # If e2 is parent to e1, return 1
> >     # return compare of eid.
> >
> > Then this would create a list in order of parent relationships, and you
> > can just do:
> >
> > for e in print_order:
> >
> > Which despite jumping about in the cursor, will print in order.
> >
> > So you'll need to look at entryrdn once, and then id2entry once for
> > this.
> >
> > If you want to do this without entryrdn, you'll need to pass id2entry
> > twice. But You'll only need 1 entry in memory at a time to achieve it I
> > think. It also doesn't matter about order of ids at all here.
> >
> 
> Hmm, I just didn't understand what's wrong with what I proposed previously.
> As you said with the technique you just described we either have to pass
> both entryrdn and id2entry once, or pass id2entry twice in any case. With
> what I described previously we have to pass id2entry only once (and
> probably for a few entries we will have to skip them if we come across them
> a second time).
> 
> Could you explain why do you think the first one would be better / more
> efficient?
> 

I think your algo doesn't handle the case where you have *multiple* outo
of order parents. Even though you have to do two passes, the way I
proposed will guaranteed correct display regardless of id number because
they are solely sorted on parent relationships. 

I think your one will get to 

> >         while p.parentid > p.entryid:
> > >             p = get_from_db(key = p.parentid)
> > >             list_of_parents.append(p) # see note below (*)
> > >
> > >         for p in reversed(list_of_parents):
> > >             display_entry(p, map)
> > >             printed_in_advance.append(p.entryid)

^ Here, what if the parent's parent is not yet printed? display_entry
would need to be recursive to handle this (and it appears to be
iterative).

Does that help? 

-- 
Sincerely,

William Brown
Software Engineer
Red Hat, Australia/Brisbane

Attachment:
signature.asc

Description: This is a digitally signed message part
_______________________________________________
389-devel mailing list -- 389-devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to 389-devel-leave@xxxxxxxxxxxxxxxxxxxxxxx