On Tue, 2017-08-15 at 22:03 +0300, Ilias Stamatis wrote: > 2017-08-15 9:56 GMT+03:00 William Brown <wibrown@xxxxxxxxxx>: > > > On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote: > > > Hi everybody, > > > > > > Following Ludwig's and Mark's suggestions on how to perform a database > > dump > > > in LDIF format from dbscan, I have come up with a strategy. I'm talking > > > about ticket #47567: https://pagure.io/389-ds-base/issue/47567 > > > > > > I'd like to discuss this strategy and get some feedback. > > > > > > The general idea is: > > > > > > - We are cursing though id2entry.db printing entries in order. > > > - Parents should be printed before children. > > > - Hence if for some entry parenitd > entryid, we have to print its parent > > > first (out of order) and track that we did so. > > > - In order to do the above, we don't need to move the db cursor. We can > > > just go fetch something from a random place in the db and the cursor will > > > remain in its place, so we can continue from where we left off after > > we're > > > done with printing a parent. > > > - We also need to construct and print the DN of each entry using its RDN > > > and the DN of its father. > > > - Let's use something like a hash map to pair entryids with dns (only for > > > entries that has children), e.g. map[1] = "dc=example,dc=com", map[4] = > > > "ou=People,dc=example,dc=com", etc. > > > > > > I'll present the algorithm that I came up with in python-like > > pseudo-code. > > > > > > First, the following function constructs the entry's DN and updates the > > > hash map if needed. We can know whether an entry is a parent or not, by > > the > > > presence of the numSubordinates attribute. > > > > > > # assumes that parentid already exists in the map > > > function display_entry(e, map): > > > if not e.parentid: > > > e.dn = e.rdn > > > else: > > > e.dn = e.rdn + map[e.parentid] > > > if isparent(e): > > > map[e.entryid] = e.dn > > > print_to_ldif_format(e) > > > > > > Then, the main loop: > > > > > > map = new(hashmap) > > > printed_in_advance = [] > > > > > > for e in entries: > > > if e.entryid in printed_in_advance: > > > continue # skip this, we have already displayed it > > > > > > if e.parentid < e.entryid: > > > display_entry(e, map) > > > > > > else: > > > # we need to display parents before children > > > > > > list_of_parents = [] > > > > > > p = e > > > while p.parentid > p.entryid: > > > p = get_from_db(key = p.parentid) > > > list_of_parents.append(p) # see note below (*) > > > > > > for p in reversed(list_of_parents): > > > display_entry(p, map) > > > printed_in_advance.append(p.entryid) > > > > > > > > > * here we store the whole entry in the list (aka its data) and not just > > > its id, in order to avoid fetching it from the database again > > > > > > As a map, we can use a B+ tree implementation from libsds. > > > > > > I would like to know how the above idea sounds to you before going ahead > > > and implementing it. > > > > > > > Looks like a pretty good plan to me. > > > > What happens if you have this situation? > > > > rdn: a > > entryid: 1 > > parentid: 2 > > > > rdn: b > > entryid: 2 > > parentid: 3 > > > > rdn: c > > entryid: 3 > > parentid: 4 > > > > etc. > > > > Imagine we have to continue to work up ids to get to the parent. Can > > your algo handle this case? > > > > Unless I'm mistaken, yes. Provided of course that there is a root entry > which has no parentid. > > This is the part that handles this: > > p = e > while p.parentid > p.entryid: > p = get_from_db(key = p.parentid) > list_of_parents.append(p) > > We may jump at a few random points in the db for this and we will have to > keep just a few entries in memory. But I think this number of entries is > always going to be very small e.g. 1-2 or maybe even 7-8 in more "extreme" > cases, but never more than let's say 15. I might as well be completely > wrong about this, so, please correct me if that's the case. > > > > It seems like the way you could approach this is to sort the id order > > you need to display them in *before* you start parsing entries. We can > > file allids in memory anyway, because it's just a set of 32bit ints, so > > even at 50,000,000 entries, this only takes 190MB of ram to fit allids. > > So I would approach this as an exercise for a comparison function to the > > set. > > > > universal_entry_set = [....] # entryrdn / id2entry > > # The list of entries to print > > print_order = [] > > > > for e in universal_entry_set: > > printer_order_insert_sorted(e) > > > > So we can imagine that this would invoke a sorting function. Something > > that would work is: > > > > compare(e1, e2): > > # if e1 parent is 0, return -1 > > # if e2 parent is 0, return 1 > > # If e1 is parent to e2, return -1 > > # If e2 is parent to e1, return 1 > > # return compare of eid. > > > > Then this would create a list in order of parent relationships, and you > > can just do: > > > > for e in print_order: > > > > Which despite jumping about in the cursor, will print in order. > > > > So you'll need to look at entryrdn once, and then id2entry once for > > this. > > > > If you want to do this without entryrdn, you'll need to pass id2entry > > twice. But You'll only need 1 entry in memory at a time to achieve it I > > think. It also doesn't matter about order of ids at all here. > > > > Hmm, I just didn't understand what's wrong with what I proposed previously. > As you said with the technique you just described we either have to pass > both entryrdn and id2entry once, or pass id2entry twice in any case. With > what I described previously we have to pass id2entry only once (and > probably for a few entries we will have to skip them if we come across them > a second time). > > Could you explain why do you think the first one would be better / more > efficient? > I think your algo doesn't handle the case where you have *multiple* outo of order parents. Even though you have to do two passes, the way I proposed will guaranteed correct display regardless of id number because they are solely sorted on parent relationships. I think your one will get to > > while p.parentid > p.entryid: > > > p = get_from_db(key = p.parentid) > > > list_of_parents.append(p) # see note below (*) > > > > > > for p in reversed(list_of_parents): > > > display_entry(p, map) > > > printed_in_advance.append(p.entryid) ^ Here, what if the parent's parent is not yet printed? display_entry would need to be recursive to handle this (and it appears to be iterative). Does that help? -- Sincerely, William Brown Software Engineer Red Hat, Australia/Brisbane
Attachment:
signature.asc
Description: This is a digitally signed message part
_______________________________________________ 389-devel mailing list -- 389-devel@xxxxxxxxxxxxxxxxxxxxxxx To unsubscribe send an email to 389-devel-leave@xxxxxxxxxxxxxxxxxxxxxxx