On 08/22/2017 01:31 AM, William Brown wrote:
> well, BDB has db_verify to ensure that a db file is consistent in itself
> and can be processed; this should be good enough to decide if it is
> usable.
>
> I have a question / concern though. I thought that we want dbscan-to-LDIF
> for emergency recovery scenarios, when all else has gone bad and assuming
> that id2entry is still readable. In the approach you described we make
> the assumption that the parentid index is readable as well, so we depend
> on two files instead of one for exporting the database. Does this matter,
> or do we not care at all?
>
> There are two scenarios here in my opinion: backup, and emergency
> backup :-)  As I've previously stated, performance is important. It
> should not take forever to process a 100 million entry database. I think
> the tool should use multiple index files (id2entry + friends) if we can
> generate the LDIF faster. But if some of those indexes are corrupted,
> then we need an alternate algorithm to generate it just from id2entry.
> Also, if we are dealing with a corrupted db, then performance is not
> important, recovery is. So if we can do it fast, do it; otherwise grind
> it out.
>
> All that being said, there is something we need to consider, which I
> don't have an answer for: when databases do get corrupted, which files
> typically get corrupted? Is it indexes, or is it id2entry? To be honest,
> database corruption doesn't happen very often, but the tool should be
> smart enough to realize that the data could be inaccurate. Perhaps a
> parent could be missing, etc. So the tool should be robust enough to use
> multiple techniques to complete an entry, and if it can't, it should log
> something, or better yet create a rejects file that an Admin can take
> and repair manually. I know this is getting more complicated, but we
> need to keep these things in mind.
>
> Regards,
> Mark
>
> With the current design of id2entry and friends, we can't automatically
> detect this so easily.
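[The "just from id2entry" fallback above can be sketched without the parentid index at all: since a parent DN always has fewer RDN components than its children, sorting entries by DN depth yields a parent-before-child export order. A minimal illustration, not the actual dbscan/db2ldif code; real code would parse DNs properly rather than count commas, since escaped commas inside an RDN would break this naive key:]

```python
def ldif_order(dns):
    """Order DN strings parent-before-child using only id2entry content.

    dns: list of entry DNs in arbitrary (id2entry cursor) order.
    Sorting on the RDN count ensures every parent precedes its children,
    because a child DN always has at least one more component.
    NOTE: dn.count(",") is a simplification; escaped commas in RDN values
    require a real DN parser.
    """
    return sorted(dns, key=lambda dn: dn.count(","))
```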
> I think we should really just have a flag on dbscan that says "ignore
> everything BUT id2entry" and recover all you can. We should leave this
> to a human to make that call. If our database had proper checksumming of
> content and pages, we could detect this, but today that's not the
> case :(

But, as Mark mentioned backup: if the backup is an online backup, we cannot be sure that id2entry alone is sane; backup/restore relies on backing up the txn logs and running recovery on restore. That said, after a crash we can also have the situation that pages were not flushed from the dbcache. Generating an ldif from id2entry can be a best effort only, and might fail in exactly the situations where it is most needed.

About the general strategy, maybe we can use another approach (yes, always new ideas). In generating the ldif we have to solve the problem that, when cursoring through id2entry, child entries can come before parent entries. And we solve it in different places: total replication init, db2ldif, and now in a utility. Wouldn't it be better to make the ldif import smarter, and stack entries without parents until the parent is read? This would simplify the export and init, and solve the problem in one place.
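The stacking idea above can be sketched roughly as follows. This is a hypothetical illustration, not 389-ds code: entries arrive in id2entry cursor order as (entry id, parent id, DN) tuples (names and shapes are my assumptions); children whose parent has not yet been seen are parked in a pending map keyed by the parent id, and flushed recursively once the parent arrives. Anything still pending at the end goes to a rejects list, matching Mark's suggestion of a rejects file:

```python
from collections import defaultdict

def import_entries(entries):
    """Import entries in cursor order, stacking orphans until their parent appears.

    entries: iterable of (entry_id, parent_id, dn); parent_id is None for roots.
    Returns (imported, orphans): DNs in parent-before-child order, plus DNs
    whose parent never appeared (candidates for a rejects file).
    """
    seen = set()                     # entry ids already imported
    pending = defaultdict(list)      # parent_id -> children waiting for it
    imported = []

    def flush(entry_id):
        # Import every child stacked behind entry_id, recursively, since
        # those children may themselves have waiting descendants.
        for child_id, _pid, child_dn in pending.pop(entry_id, []):
            imported.append(child_dn)
            seen.add(child_id)
            flush(child_id)

    for eid, pid, dn in entries:
        if pid is None or pid in seen:
            imported.append(dn)
            seen.add(eid)
            flush(eid)
        else:
            pending[pid].append((eid, pid, dn))

    orphans = [dn for kids in pending.values() for _eid, _pid, dn in kids]
    return imported, orphans
```

With this in the import path, the export side (total init, db2ldif, the recovery utility) could simply walk id2entry in id order and leave the reordering to the consumer.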
-- Red Hat GmbH, http://www.de.redhat.com/, Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen, HRB 153243, Managing Directors: Charles Cachera, Michael Cunningham, Michael O'Neill, Eric Shander |
_______________________________________________
389-devel mailing list -- 389-devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to 389-devel-leave@xxxxxxxxxxxxxxxxxxxxxxx