Another way to look at this is to enumerate the recovery cases:

Primary starts with head and no snapdir:

A) Recovery sets last_backfill_started to head and sends the head object where needed.
     head: 1.b while backfills are in flight -> 1.a when done
     snapdir: 2

B) Recovery sets last_backfill_started to snapdir and sends the snapdir remove(s); head is handled as in case A.
     head: 1.b while backfills are in flight -> 1.a when done
     snapdir: 1.a

Primary starts with snapdir and no head:

C) Recovery sets last_backfill_started to head and sends a remove of head.
     head: 1.a
     snapdir: 2

D) Recovery sets last_backfill_started to snapdir and sends both a remove of head and a create of snapdir.
     head: 1.a
     snapdir: 1.b while backfills are in flight -> 1.a when done

Cases B and D meet our criteria because they include head/snapdir <= last_backfill_started, and we check both head and snapdir with is_degraded_object(). Also, removes are always processed before creates even if recover_backfill() saw them in the other order (case B). That way, once the head objects are created (1.a), we know that all snapdirs have been removed too. In other words, these two cases do not allow intervening operations that could confuse the head <-> snapdir state.

Case C is tricky. An intervening write to head requires update_range() to determine that snapdir is gone, even though, had it not looked at the log, recovery was going to try to recover (re-create) snapdir.

Case A is the only one with a problem from an intervening deletion of the head object.

David

On Feb 20, 2014, at 12:07 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:

> The current implementation divides the hobject space into two sets:
> 1) oid | oid <= last_backfill_started
> 2) oid | oid > last_backfill_started
>
> Space 1) is further divided into two sets:
> 1.a) oid | oid \notin backfills_in_flight
> 1.b) oid | oid \in backfills_in_flight
>
> The value of this division is that we must send ops in set 1.a to the
> backfill peer because we won't re-backfill those objects and they must
> therefore be kept up to date. Furthermore, we *can* send the op
> because the backfill peer already has all of the dependencies (this
> statement is where we run into trouble).
>
> In set 2), we have not yet backfilled the object, so we are free to
> not send the op to the peer, confident that the object will be
> backfilled later.
>
> In set 1.b), we block operations until the backfill operation is
> complete. This is necessary at the very least because we are in the
> process of reading the object and shouldn't be sending writes anyway.
> Thus, it seems to me like we are blocking, in some sense, the minimum
> possible set of ops, which is good.
>
> The issue is that there is a small category of ops which violate our
> statement above that we can send ops in set 1.a: ops where the
> corresponding snapdir object is in set 2 or set 1.b. The 1.b case we
> currently handle by requiring that snapdir also be
> !is_degraded_object.
>
> The case where the snapdir falls into set 2 should be the problem, but
> now I am wondering. I think the original problem was as follows:
> 1) advance last_backfill_started to head
> 2) complete recovery on head
> 3) accept op on head which deletes head and creates snapdir
> 4) start op
> 5) attempt to recover snapdir
> 6) race with write and get screwed up
>
> Now, however, we have logic to delay backfill on ObjectContexts which
> currently have write locks. It should suffice to take a write lock on
> the new snapdir and use that... which we do since the ECBackend patch
> series.
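To make the 1.a / 1.b / 2 division above concrete, here is a minimal sketch, assuming simplified stand-ins for hobject_t and the PG bookkeeping; the names last_backfill_started, backfills_in_flight, should_send_op, and is_degraded_object follow the discussion, but nothing here is the actual ReplicatedPG code:

#include <set>
#include <string>

// Simplified stand-in: the real hobject_t sort key is much richer.
struct hobject_t {
  std::string name;
  bool operator<(const hobject_t &o) const { return name < o.name; }
  bool operator<=(const hobject_t &o) const { return name <= o.name; }
};

struct BackfillView {
  hobject_t last_backfill_started;          // boundary between set 1 and set 2
  std::set<hobject_t> backfills_in_flight;  // set 1.b

  // Set 1 (oid <= last_backfill_started) has already been backfilled (or is
  // in flight), so ops there must be sent to keep the backfill peer current.
  // Set 2 (oid > last_backfill_started) will be backfilled later, so ops
  // there can be skipped for that peer.
  bool should_send_op(const hobject_t &oid) const {
    return oid <= last_backfill_started;
  }

  // Set 1.b: the object is being read for backfill right now, so writes to
  // it must wait until the backfill op completes.
  bool is_degraded_object(const hobject_t &oid) const {
    return backfills_in_flight.count(oid) > 0;
  }

  // The wrinkle in this thread: an op on head that must be sent (set 1.a) is
  // only safe to start if the matching snapdir is not itself degraded, since
  // the op may create or delete snapdir on the peer.
  bool write_blocked(const hobject_t &head, const hobject_t &snapdir) const {
    return is_degraded_object(head) || is_degraded_object(snapdir);
  }
};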
> The case where we create head and remove snapdir isn't an
> issue since we'll just send the delete, which will work whether snapdir
> exists or not... We can also just include a delete in the snapdir
> creation transaction to make it correctly handle garbage snapdirs on
> backfill peers. The snapdir would then be superfluously recovered,
> but that's probably ok?
>
> The main issue I see is that it would cause the primary's idea of the
> replica's backfill_interval to be slightly incorrect (snapdir would
> have been removed or created on the peer, but not reflected in the
> primary's current backfill_interval, which might contain snapdir). We
> could adjust it in make_writeable, or update_range?
>
> Sidenote: multiple backfill peers complicate the issue only slightly.
> All backfill peers with last_backfill <= last_backfill_started are
> handled uniformly as above. Any backfill peer with last_backfill >
> last_backfill_started we can model as having a private
> last_backfill_started equal to last_backfill. This results in a
> picture for that peer identical to the one above, with an empty set
> 1.b. Because 1.b is empty for these peers, is_degraded_object can
> disregard them; should_send_op accounts for them with the
> MAX(last_backfill, last_backfill_started) adjustment.
>
> Anyone have anything simpler? I'll try to put the explanation part
> into the docs later.
> -Sam
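As a concrete reading of the sidenote about multiple backfill peers, here is a rough sketch of the MAX(last_backfill, last_backfill_started) adjustment, again with simplified types rather than the real OSD data structures:

#include <map>
#include <string>

// Same simplified stand-in for hobject_t as in the sketch above.
struct hobject_t {
  std::string name;
  bool operator<(const hobject_t &o) const { return name < o.name; }
  bool operator<=(const hobject_t &o) const { return name <= o.name; }
};

struct MultiPeerBackfillView {
  hobject_t last_backfill_started;              // primary's global marker
  std::map<int, hobject_t> peer_last_backfill;  // one entry per backfill peer

  // A peer whose last_backfill is ahead of last_backfill_started behaves as
  // if it had a private last_backfill_started equal to its last_backfill
  // (and an empty set 1.b), so an op on oid must be sent to that peer
  // whenever oid <= MAX(last_backfill, last_backfill_started).
  bool should_send_op(int peer, const hobject_t &oid) const {
    const hobject_t &lb = peer_last_backfill.at(peer);
    const hobject_t &bound =
        last_backfill_started < lb ? lb : last_backfill_started;
    return oid <= bound;
  }
};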