On Tue, 2017-08-15 at 16:56 +1000, William Brown wrote: > On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote: > > Hi everybody, > > > > Following Ludwig's and Mark's suggestions on how to perform a database dump > > in LDIF format from dbscan, I have come up with a strategy. I'm talking > > about ticket #47567: https://pagure.io/389-ds-base/issue/47567 > > > > I'd like to discuss this strategy and get some feedback. > > > > The general idea is: > > > > - We are cursing though id2entry.db printing entries in order. > > - Parents should be printed before children. > > - Hence if for some entry parenitd > entryid, we have to print its parent > > first (out of order) and track that we did so. > > - In order to do the above, we don't need to move the db cursor. We can > > just go fetch something from a random place in the db and the cursor will > > remain in its place, so we can continue from where we left off after we're > > done with printing a parent. > > - We also need to construct and print the DN of each entry using its RDN > > and the DN of its father. > > - Let's use something like a hash map to pair entryids with dns (only for > > entries that has children), e.g. map[1] = "dc=example,dc=com", map[4] = > > "ou=People,dc=example,dc=com", etc. > > > > I'll present the algorithm that I came up with in python-like pseudo-code. > > > > First, the following function constructs the entry's DN and updates the > > hash map if needed. We can know whether an entry is a parent or not, by the > > presence of the numSubordinates attribute. > > > > # assumes that parentid already exists in the map > > function display_entry(e, map): > > if not e.parentid: > > e.dn = e.rdn > > else: > > e.dn = e.rdn + map[e.parentid] > > if isparent(e): > > map[e.entryid] = e.dn > > print_to_ldif_format(e) > > > > Then, the main loop: > > > > map = new(hashmap) > > printed_in_advance = [] > > > > for e in entries: > > if e.entryid in printed_in_advance: > > continue # skip this, we have already displayed it > > > > if e.parentid < e.entryid: > > display_entry(e, map) > > > > else: > > # we need to display parents before children > > > > list_of_parents = [] > > > > p = e > > while p.parentid > p.entryid: > > p = get_from_db(key = p.parentid) > > list_of_parents.append(p) # see note below (*) > > > > for p in reversed(list_of_parents): > > display_entry(p, map) > > printed_in_advance.append(p.entryid) > > > > > > * here we store the whole entry in the list (aka its data) and not just > > its id, in order to avoid fetching it from the database again > > > > As a map, we can use a B+ tree implementation from libsds. > > > > I would like to know how the above idea sounds to you before going ahead > > and implementing it. > > > > Looks like a pretty good plan to me. > > What happens if you have this situation? > > rdn: a > entryid: 1 > parentid: 2 > > rdn: b > entryid: 2 > parentid: 3 > > rdn: c > entryid: 3 > parentid: 4 > > etc. > > Imagine we have to continue to work up ids to get to the parent. Can > your algo handle this case? > > It seems like the way you could approach this is to sort the id order > you need to display them in *before* you start parsing entries. We can > file allids in memory anyway, because it's just a set of 32bit ints, so > even at 50,000,000 entries, this only takes 190MB of ram to fit allids. > So I would approach this as an exercise for a comparison function to the > set. > > universal_entry_set = [....] # entryrdn / id2entry > # The list of entries to print > print_order = [] > > for e in universal_entry_set: > printer_order_insert_sorted(e) > > So we can imagine that this would invoke a sorting function. Something > that would work is: > > compare(e1, e2): > # if e1 parent is 0, return -1 > # if e2 parent is 0, return 1 > # If e1 is parent to e2, return -1 > # If e2 is parent to e1, return 1 > # return compare of eid. > > Then this would create a list in order of parent relationships, and you > can just do: > > for e in print_order: > > Which despite jumping about in the cursor, will print in order. > > So you'll need to look at entryrdn once, and then id2entry once for > this. > > If you want to do this without entryrdn, you'll need to pass id2entry > twice. But You'll only need 1 entry in memory at a time to achieve it I > think. It also doesn't matter about order of ids at all here. > > You could easily demo this algo in py with entry objects and implement > __cmp__, then do "sorted(list)' on it. > > For the C variant, you would give the cmp function to the btree map, and > then as it inserts, it orders them, then you can just do > "sds_bptree_map(print_fn)" and it will apply print to each entry in the > order. > > Hope that helps, > > > _______________________________________________ > 389-devel mailing list -- 389-devel@xxxxxxxxxxxxxxxxxxxxxxx > To unsubscribe send an email to 389-devel-leave@xxxxxxxxxxxxxxxxxxxxxxx [william@rei 17:04] /tmp I0> python sort.py before [uid=william, ou=People, dc=example,dc=com] after [dc=example,dc=com, ou=People, uid=william] [william@rei 17:05] /tmp I0> cat sort.py class SlapiEntry(object): def __init__(self, rdn, eid, parent): self.rdn = rdn self.eid = eid self.parent = parent def __lt__(self, other): if self.parent == 0: return True elif other.parent == self.eid: return True else: return self.eid < other.eid def __gt__(self, other): return not self.__lt__(other) def __eq__(self, other): return False def __repr__(self): return self.rdn a = SlapiEntry('dc=example,dc=com', 10, 0) b = SlapiEntry('ou=People', 5, 10) c = SlapiEntry('uid=william', 1, 5) l = [c, b, a] print("before %s " % l) print("after %s" % sorted(l)) Have fun :) -- Sincerely, William Brown Software Engineer Red Hat, Australia/Brisbane
Attachment:
signature.asc
Description: This is a digitally signed message part
_______________________________________________ 389-devel mailing list -- 389-devel@xxxxxxxxxxxxxxxxxxxxxxx To unsubscribe send an email to 389-devel-leave@xxxxxxxxxxxxxxxxxxxxxxx