Re: Strategy proposal for making DB dump in LDIF format from dbscan

Ilias Stamatis <stamatis.iliass@xxxxxxxxx> · Mon, 21 Aug 2017 15:03:33 +0300

2017-08-17 3:55 GMT+03:00 William Brown <wibrown@xxxxxxxxxx>:
On Tue, 2017-08-15 at 22:03 +0300, Ilias Stamatis wrote:

> 2017-08-15 9:56 GMT+03:00 William Brown <wibrown@xxxxxxxxxx>:

>

> > On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote:

> > > Hi everybody,

> > >

> > > Following Ludwig's and Mark's suggestions on how to perform a database

> > dump

> > > in LDIF format from dbscan, I have come up with a strategy. I'm talking

> > > about ticket #47567: https://pagure.io/389-ds-base/issue/47567

> > >

> > > I'd like to discuss this strategy and get some feedback.

> > >

> > > The general idea is:

> > >

> > > - We are cursing though id2entry.db printing entries in order.

> > > - Parents should be printed before children.

> > > - Hence if for some entry parenitd > entryid, we have to print its parent

> > > first (out of order) and track that we did so.

> > > - In order to do the above, we don't need to move the db cursor. We can

> > > just go fetch something from a random place in the db and the cursor will

> > > remain in its place, so we can continue from where we left off after

> > we're

> > > done with printing a parent.

> > > - We also need to construct and print the DN of each entry using its RDN

> > > and the DN of its father.

> > > - Let's use something like a hash map to pair entryids with dns (only for

> > > entries that has children), e.g. map[1] = "dc=example,dc=com", map[4] =

> > > "ou=People,dc=example,dc=com", etc.

> > >

> > > I'll present the algorithm that I came up with in python-like

> > pseudo-code.

> > >

> > > First, the following function constructs the entry's DN and updates the

> > > hash map if needed. We can know whether an entry is a parent or not, by

> > the

> > > presence of the numSubordinates attribute.

> > >

> > > # assumes that parentid already exists in the map

> > > function display_entry(e, map):

> > >     if not e.parentid:

> > >         e.dn = e.rdn

> > >     else:

> > >         e.dn = e.rdn + map[e.parentid]

> > >     if isparent(e):

> > >         map[e.entryid] = e.dn

> > >     print_to_ldif_format(e)

> > >

> > > Then, the main loop:

> > >

> > > map = new(hashmap)

> > > printed_in_advance = []

> > >

> > > for e in entries:

> > >     if e.entryid in printed_in_advance:

> > >         continue # skip this, we have already displayed it

> > >

> > >     if e.parentid < e.entryid:

> > >         display_entry(e, map)

> > >

> > >     else:

> > >         # we need to display parents before children

> > >

> > >         list_of_parents = []

> > >

> > >         p = e

> > >         while p.parentid > p.entryid:

> > >             p = get_from_db(key = p.parentid)

> > >             list_of_parents.append(p) # see note below (*)

> > >

> > >         for p in reversed(list_of_parents):

> > >             display_entry(p, map)

> > >             printed_in_advance.append(p.entryid)

> > >

> > >

> > > * here we store the whole entry in the list (aka its data) and not just

> > > its id, in order to avoid fetching it from the database again

> > >

> > > As a map, we can use a B+ tree implementation from libsds.

> > >

> > > I would like to know how the above idea sounds to you before going ahead

> > > and implementing it.

> > >

> >

> > Looks like a pretty good plan to me.

> >

> > What happens if you have this situation?

> >

> > rdn: a

> > entryid: 1

> > parentid: 2

> >

> > rdn: b

> > entryid: 2

> > parentid: 3

> >

> > rdn: c

> > entryid: 3

> > parentid: 4

> >

> > etc.

> >

> > Imagine we have to continue to work up ids to get to the parent. Can

> > your algo handle this case?

> >

>

> Unless I'm mistaken, yes. Provided of course that there is a root entry

> which has no parentid.

>

> This is the part that handles this:

>

>          p = e

>          while p.parentid > p.entryid:

>              p = get_from_db(key = p.parentid)

>              list_of_parents.append(p)

>

> We may jump at a few random points in the db for this and we will have to

> keep just a few entries in memory. But I think this number of entries is

> always going to be very small e.g. 1-2 or maybe even 7-8 in more "extreme"

> cases, but never more than let's say 15. I might as well be completely

> wrong about this, so, please correct me if that's the case.

>

>

> > It seems like the way you could approach this is to sort the id order

> > you need to display them in *before* you start parsing entries. We can

> > file allids in memory anyway, because it's just a set of 32bit ints, so

> > even at 50,000,000 entries, this only takes 190MB of ram to fit allids.

> > So I would approach this as an exercise for a comparison function to the

> > set.

> >

> > universal_entry_set = [....] # entryrdn / id2entry

> > # The list of entries to print

> > print_order = []

> >

> > for e in universal_entry_set:

> >     printer_order_insert_sorted(e)

> >

> > So we can imagine that this would invoke a sorting function. Something

> > that would work is:

> >

> > compare(e1, e2):

> >     # if e1 parent is 0, return -1

> >     # if e2 parent is 0, return 1

> >     # If e1 is parent to e2, return -1

> >     # If e2 is parent to e1, return 1

> >     # return compare of eid.

> >

> > Then this would create a list in order of parent relationships, and you

> > can just do:

> >

> > for e in print_order:

> >

> > Which despite jumping about in the cursor, will print in order.

> >

> > So you'll need to look at entryrdn once, and then id2entry once for

> > this.

> >

> > If you want to do this without entryrdn, you'll need to pass id2entry

> > twice. But You'll only need 1 entry in memory at a time to achieve it I

> > think. It also doesn't matter about order of ids at all here.

> >

>

> Hmm, I just didn't understand what's wrong with what I proposed previously.

> As you said with the technique you just described we either have to pass

> both entryrdn and id2entry once, or pass id2entry twice in any case. With

> what I described previously we have to pass id2entry only once (and

> probably for a few entries we will have to skip them if we come across them

> a second time).

>

> Could you explain why do you think the first one would be better / more

> efficient?

>

I think your algo doesn't handle the case where you have *multiple* outo

of order parents. Even though you have to do two passes, the way I

proposed will guaranteed correct display regardless of id number because

they are solely sorted on parent relationships.

I insist that it does. Please, allow me to use an example case to illustrate this.

I think your one will get to

> >         while p.parentid > p.entryid:

> > >             p = get_from_db(key = p.parentid)

> > >             list_of_parents.append(p) # see note below (*)

> > >

> > >         for p in reversed(list_of_parents):

> > >             display_entry(p, map)

> > >             printed_in_advance.append(p.entryid)

^ Here, what if the parent's parent is not yet printed? display_entry

would need to be recursive to handle this (and it appears to be

iterative).

Say you have the following case:

id 1 rdn: A
entry: 1

id 2 rdn: B
parentid: 1

id 3 rdn: C
parentid: 4

id 4 rdn: D
parentid: 5

id 5 rdn: E
parentid: 1

id 6 rdn: F
parentid: 1

The algorithm will first print the first 2 entries without entering the else clause of the loop because we don't have out of order entries.

Then when it comes to entry C it will enter the else clause. The while loop you quoted will create this list:

list_of_parents = [C, D, E]

Then the next for loop calls display_entry(), but after *reversing* the list. So it will call display_entry() on [E, D, C]. E can be displayed since it's parent has already been displayd. Then D can be displayed since E has already been displayed, etc.

We have also added entries D and E in the printed_in_advance list.

We now exit the else clause and we can continue with the outer for loop.

We have stopped in entry C so now we examine entry D. We see that we have already printed that (because it's in printed_in_advance list) so we skip this entry. We also skip entry E for the same reason. Then we reach entry F and we call display_entry to it since it's not out of order.

Does it make sense?

Does that help?

--

Sincerely,

William Brown

Software Engineer

Red Hat, Australia/Brisbane

_______________________________________________

389-devel mailing list -- 389-devel@lists.fedoraproject.org

To unsubscribe send an email to 389-devel-leave@lists.fedoraproject.org

2017-08-17 3:55 GMT+03:00 William Brown <wibrown@xxxxxxxxxx>:
On Tue, 2017-08-15 at 22:03 +0300, Ilias Stamatis wrote:

> 2017-08-15 9:56 GMT+03:00 William Brown <wibrown@xxxxxxxxxx>:

>

> > On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote:

> > > Hi everybody,

> > >

> > > Following Ludwig's and Mark's suggestions on how to perform a database

> > dump

> > > in LDIF format from dbscan, I have come up with a strategy. I'm talking

> > > about ticket #47567: https://pagure.io/389-ds-base/issue/47567

> > >

> > > I'd like to discuss this strategy and get some feedback.

> > >

> > > The general idea is:

> > >

> > > - We are cursing though id2entry.db printing entries in order.

> > > - Parents should be printed before children.

> > > - Hence if for some entry parenitd > entryid, we have to print its parent

> > > first (out of order) and track that we did so.

> > > - In order to do the above, we don't need to move the db cursor. We can

> > > just go fetch something from a random place in the db and the cursor will

> > > remain in its place, so we can continue from where we left off after

> > we're

> > > done with printing a parent.

> > > - We also need to construct and print the DN of each entry using its RDN

> > > and the DN of its father.

> > > - Let's use something like a hash map to pair entryids with dns (only for

> > > entries that has children), e.g. map[1] = "dc=example,dc=com", map[4] =

> > > "ou=People,dc=example,dc=com", etc.

> > >

> > > I'll present the algorithm that I came up with in python-like

> > pseudo-code.

> > >

> > > First, the following function constructs the entry's DN and updates the

> > > hash map if needed. We can know whether an entry is a parent or not, by

> > the

> > > presence of the numSubordinates attribute.

> > >

> > > # assumes that parentid already exists in the map

> > > function display_entry(e, map):

> > >     if not e.parentid:

> > >         e.dn = e.rdn

> > >     else:

> > >         e.dn = e.rdn + map[e.parentid]

> > >     if isparent(e):

> > >         map[e.entryid] = e.dn

> > >     print_to_ldif_format(e)

> > >

> > > Then, the main loop:

> > >

> > > map = new(hashmap)

> > > printed_in_advance = []

> > >

> > > for e in entries:

> > >     if e.entryid in printed_in_advance:

> > >         continue # skip this, we have already displayed it

> > >

> > >     if e.parentid < e.entryid:

> > >         display_entry(e, map)

> > >

> > >     else:

> > >         # we need to display parents before children

> > >

> > >         list_of_parents = []

> > >

> > >         p = e

> > >         while p.parentid > p.entryid:

> > >             p = get_from_db(key = p.parentid)

> > >             list_of_parents.append(p) # see note below (*)

> > >

> > >         for p in reversed(list_of_parents):

> > >             display_entry(p, map)

> > >             printed_in_advance.append(p.entryid)

> > >

> > >

> > > * here we store the whole entry in the list (aka its data) and not just

> > > its id, in order to avoid fetching it from the database again

> > >

> > > As a map, we can use a B+ tree implementation from libsds.

> > >

> > > I would like to know how the above idea sounds to you before going ahead

> > > and implementing it.

> > >

> >

> > Looks like a pretty good plan to me.

> >

> > What happens if you have this situation?

> >

> > rdn: a

> > entryid: 1

> > parentid: 2

> >

> > rdn: b

> > entryid: 2

> > parentid: 3

> >

> > rdn: c

> > entryid: 3

> > parentid: 4

> >

> > etc.

> >

> > Imagine we have to continue to work up ids to get to the parent. Can

> > your algo handle this case?

> >

>

> Unless I'm mistaken, yes. Provided of course that there is a root entry

> which has no parentid.

>

> This is the part that handles this:

>

>          p = e

>          while p.parentid > p.entryid:

>              p = get_from_db(key = p.parentid)

>              list_of_parents.append(p)

>

> We may jump at a few random points in the db for this and we will have to

> keep just a few entries in memory. But I think this number of entries is

> always going to be very small e.g. 1-2 or maybe even 7-8 in more "extreme"

> cases, but never more than let's say 15. I might as well be completely

> wrong about this, so, please correct me if that's the case.

>

>

> > It seems like the way you could approach this is to sort the id order

> > you need to display them in *before* you start parsing entries. We can

> > file allids in memory anyway, because it's just a set of 32bit ints, so

> > even at 50,000,000 entries, this only takes 190MB of ram to fit allids.

> > So I would approach this as an exercise for a comparison function to the

> > set.

> >

> > universal_entry_set = [....] # entryrdn / id2entry

> > # The list of entries to print

> > print_order = []

> >

> > for e in universal_entry_set:

> >     printer_order_insert_sorted(e)

> >

> > So we can imagine that this would invoke a sorting function. Something

> > that would work is:

> >

> > compare(e1, e2):

> >     # if e1 parent is 0, return -1

> >     # if e2 parent is 0, return 1

> >     # If e1 is parent to e2, return -1

> >     # If e2 is parent to e1, return 1

> >     # return compare of eid.

> >

> > Then this would create a list in order of parent relationships, and you

> > can just do:

> >

> > for e in print_order:

> >

> > Which despite jumping about in the cursor, will print in order.

> >

> > So you'll need to look at entryrdn once, and then id2entry once for

> > this.

> >

> > If you want to do this without entryrdn, you'll need to pass id2entry

> > twice. But You'll only need 1 entry in memory at a time to achieve it I

> > think. It also doesn't matter about order of ids at all here.

> >

>

> Hmm, I just didn't understand what's wrong with what I proposed previously.

> As you said with the technique you just described we either have to pass

> both entryrdn and id2entry once, or pass id2entry twice in any case. With

> what I described previously we have to pass id2entry only once (and

> probably for a few entries we will have to skip them if we come across them

> a second time).

>

> Could you explain why do you think the first one would be better / more

> efficient?

>

I think your algo doesn't handle the case where you have *multiple* outo

of order parents. Even though you have to do two passes, the way I

proposed will guaranteed correct display regardless of id number because

they are solely sorted on parent relationships.

I think your one will get to

> >         while p.parentid > p.entryid:

> > >             p = get_from_db(key = p.parentid)

> > >             list_of_parents.append(p) # see note below (*)

> > >

> > >         for p in reversed(list_of_parents):

> > >             display_entry(p, map)

> > >             printed_in_advance.append(p.entryid)

^ Here, what if the parent's parent is not yet printed? display_entry

would need to be recursive to handle this (and it appears to be

iterative).

Does that help?

--

Sincerely,

William Brown

Software Engineer

Red Hat, Australia/Brisbane

_______________________________________________

389-devel mailing list -- 389-devel@lists.fedoraproject.org

To unsubscribe send an email to 389-devel-leave@lists.fedoraproject.org

_______________________________________________
389-devel mailing list -- 389-devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to 389-devel-leave@xxxxxxxxxxxxxxxxxxxxxxx