Another way to look at this is to enumerate the recovery cases:

Primary starts with head and no snapdir:

A) Recovery sets last_backfill_started to head and sends the head object where needed.
     head: 1.b while backfills are in flight -> 1.a when done
     snapdir: 2

B) Recovery sets last_backfill_started to snapdir and sends the snapdir remove(s); head is handled as in case A.
     head: 1.b while backfills are in flight -> 1.a when done
     snapdir: 1.a

Primary starts with snapdir and no head:

C) Recovery sets last_backfill_started to head and sends a remove of head.
     head: 1.a
     snapdir: 2

D) Recovery sets last_backfill_started to snapdir and sends both a remove of head and a create of snapdir.
     head: 1.a
     snapdir: 1.b while backfills are in flight -> 1.a when done

Cases B and D meet our criteria because they include head/snapdir <= last_backfill_started, and we check both head and snapdir with is_degraded_object(). Also, removes are always processed before creates even if recover_backfill() saw them in the other order (case B). That way, once the head objects are created (1.a), we know that all snapdirs have been removed too. In other words, these two cases do not allow intervening operations that could confuse the head <-> snapdir state.

Case C is tricky. An intervening write to head requires update_range() to determine that snapdir is gone, even though, had it not looked at the log, recovery was going to try to recover (re-create) snapdir.

Case A is the only one with a problem from an intervening deletion of the head object.

David

On Feb 20, 2014, at 12:07 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:

> The current implementation divides the hobject space into two sets:
> 1) oid | oid <= last_backfill_started
> 2) oid | oid > last_backfill_started
>
> Space 1) is further divided into two sets:
> 1.a) oid | oid \notin backfills_in_flight
> 1.b) oid | oid \in backfills_in_flight
>
> The value of this division is that we must send ops in set 1.a to the
> backfill peer because we won't re-backfill those objects and they must
> therefore be kept up to date. Furthermore, we *can* send the op
> because the backfill peer already has all of the dependencies (this
> statement is where we run into trouble).
>
> In set 2), we have not yet backfilled the object, so we are free to
> not send the op to the peer, confident that the object will be
> backfilled later.
>
> In set 1.b), we block operations until the backfill operation is
> complete. This is necessary at the very least because we are in the
> process of reading the object and shouldn't be sending writes anyway.
> Thus, it seems to me like we are blocking, in some sense, the minimum
> possible set of ops, which is good.
>
> The issue is that there is a small category of ops which violate our
> statement above that we can send ops in set 1.a: ops where the
> corresponding snapdir object is in set 2 or set 1.b. The 1.b case we
> currently handle by requiring that snapdir also be
> !is_degraded_object.
>
> The case where the snapdir falls into set 2 should be the problem, but
> now I am wondering. I think the original problem was as follows:
> 1) advance last_backfill_started to head
> 2) complete recovery on head
> 3) accept op on head which deletes head and creates snapdir
> 4) start op
> 5) attempt to recover snapdir
> 6) race with write and get screwed up
>
> Now, however, we have logic to delay backfill on ObjectContexts which
> currently have write locks. It should suffice to take a write lock on
> the new snapdir and use that... which we do since the ECBackend patch
> series.
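To make the 1.a / 1.b / 2 division above concrete, here is a minimal sketch, assuming simplified stand-ins for hobject_t and the PG bookkeeping; the names last_backfill_started, backfills_in_flight, should_send_op, and is_degraded_object follow the discussion, but nothing here is the actual ReplicatedPG code:

#include <set>
#include <string>

// Simplified stand-in: the real hobject_t sort key is much richer.
struct hobject_t {
  std::string name;
  bool operator<(const hobject_t &o) const { return name < o.name; }
  bool operator<=(const hobject_t &o) const { return name <= o.name; }
};

struct BackfillView {
  hobject_t last_backfill_started;          // boundary between set 1 and set 2
  std::set<hobject_t> backfills_in_flight;  // set 1.b

  // Set 1 (oid <= last_backfill_started) has already been backfilled (or is
  // in flight), so ops there must be sent to keep the backfill peer current.
  // Set 2 (oid > last_backfill_started) will be backfilled later, so ops
  // there can be skipped for that peer.
  bool should_send_op(const hobject_t &oid) const {
    return oid <= last_backfill_started;
  }

  // Set 1.b: the object is being read for backfill right now, so writes to
  // it must wait until the backfill op completes.
  bool is_degraded_object(const hobject_t &oid) const {
    return backfills_in_flight.count(oid) > 0;
  }

  // The wrinkle in this thread: an op on head that must be sent (set 1.a) is
  // only safe to start if the matching snapdir is not itself degraded, since
  // the op may create or delete snapdir on the peer.
  bool write_blocked(const hobject_t &head, const hobject_t &snapdir) const {
    return is_degraded_object(head) || is_degraded_object(snapdir);
  }
};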
> The case where we create head and remove snapdir isn't an
> issue since we'll just send the delete, which will work whether snapdir
> exists or not... We can also just include a delete in the snapdir
> creation transaction to make it correctly handle garbage snapdirs on
> backfill peers. The snapdir would then be superfluously recovered,
> but that's probably ok?
>
> The main issue I see is that it would cause the primary's idea of the
> replica's backfill_interval to be slightly incorrect (snapdir would
> have been removed or created on the peer, but not reflected in the
> primary's current backfill_interval, which might contain snapdir). We
> could adjust it in make_writeable, or update_range?
>
> Sidenote: multiple backfill peers complicate the issue only slightly.
> All backfill peers with last_backfill <= last_backfill_started are
> handled uniformly as above. Any backfill peer with last_backfill >
> last_backfill_started we can model as having a private
> last_backfill_started equal to last_backfill. This results in a
> picture for that peer identical to the one above, with an empty set
> 1.b. Because 1.b is empty for these peers, is_degraded_object can
> disregard them; should_send_op accounts for them with the
> MAX(last_backfill, last_backfill_started) adjustment.
>
> Anyone have anything simpler? I'll try to put the explanation part
> into the docs later.
> -Sam
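As a concrete reading of the sidenote about multiple backfill peers, here is a rough sketch of the MAX(last_backfill, last_backfill_started) adjustment, again with simplified types rather than the real OSD data structures:

#include <map>
#include <string>

// Same simplified stand-in for hobject_t as in the sketch above.
struct hobject_t {
  std::string name;
  bool operator<(const hobject_t &o) const { return name < o.name; }
  bool operator<=(const hobject_t &o) const { return name <= o.name; }
};

struct MultiPeerBackfillView {
  hobject_t last_backfill_started;              // primary's global marker
  std::map<int, hobject_t> peer_last_backfill;  // one entry per backfill peer

  // A peer whose last_backfill is ahead of last_backfill_started behaves as
  // if it had a private last_backfill_started equal to its last_backfill
  // (and an empty set 1.b), so an op on oid must be sent to that peer
  // whenever oid <= MAX(last_backfill, last_backfill_started).
  bool should_send_op(int peer, const hobject_t &oid) const {
    const hobject_t &lb = peer_last_backfill.at(peer);
    const hobject_t &bound =
        last_backfill_started < lb ? lb : last_backfill_started;
    return oid <= bound;
  }
};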