On Thu, Feb 20, 2014 at 3:50 PM, David Zafman <david.zafman@xxxxxxxxxxx> wrote:
>
> Another way to look at this is to enumerate the recovery cases:
>
> primary starts with head and no snapdir:
>
> A  Recovery sets last_backfill_started to head and sends head object where needed
>        head (1.b case while backfills in flight -> 1.a when done)
>        snapdir (2)
>
> B  Recovery sets last_backfill_started to snapdir and would send snapdir remove(s) and same as above case for head
>        head (1.b case while backfills in flight -> 1.a when done)
>        snapdir (1.a)
>
> primary starts with snapdir and no head:
>
> C  Recovery sets last_backfill_started to head and sends remove of head
>        head (1.a)
>        snapdir (2)
>
> D  Recovery sets last_backfill_started to snapdir and sends both remove of head and create of snapdir
>        head (1.a)
>        snapdir (1.b case while backfills in flight -> 1.a when done)
>
>
> Cases B and D meet our criteria because they include head/snapdir <= last_backfill_started and we check head and snapdir for is_degraded_object().  Also, removes are always processed before creates even if recover_backfill() saw them in the other order (case B).  That way, once the head objects are created (1.a) we know that all snapdirs have been removed too.  In other words, these 2 cases do not allow intervening operations to occur that confuse the head <-> snapdir state.
>
> Case C is tricky.  An intervening write to head requires update_range() to determine that snapdir is gone even though, had it not looked at the log, it would have tried to recover (re-create) snapdir.
>

I'm not sure what you mean here.  update_range would remove snapdir
from the interval during the next call to recover_backfill before
making any decisions about snapdir.

> Case A is the only one which has a problem with an intervening deletion of the head object.
>

Can you elaborate on this one?
-Sam

>
> David
>
>
> On Feb 20, 2014, at 12:07 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>
>> The current implementation divides the hobject space into two sets:
>> 1) oid | oid <= last_backfill_started
>> 2) oid | oid > last_backfill_started
>>
>> Space 1) is further divided into two sets:
>> 1.a) oid | oid \notin backfills_in_flight
>> 1.b) oid | oid \in backfills_in_flight
>>
>> The value of this division is that we must send ops in set 1.a to the
>> backfill peer because we won't re-backfill those objects and they must
>> therefore be kept up to date.  Furthermore, we *can* send the op
>> because the backfill peer already has all of the dependencies (this
>> statement is where we run into trouble).
>>
>> In set 2), we have not yet backfilled the object, so we are free to
>> not send the op to the peer, confident that the object will be
>> backfilled later.
>>
>> In set 1.b), we block operations until the backfill operation is
>> complete.  This is necessary at the very least because we are in the
>> process of reading the object and shouldn't be sending writes anyway.
>> Thus, it seems to me like we are blocking, in some sense, the minimum
>> possible set of ops, which is good.
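
To make the partition concrete, the per-object handling described above
can be sketched roughly as follows (illustration only: generic types and
a hypothetical classify_for_backfill() helper, not the actual PG code):

    #include <set>

    enum class BackfillSet { SEND_TO_PEER, BLOCK_UNTIL_DONE, SKIP };

    // Oid stands in for hobject_t (ordered by backfill scan position);
    // backfills_in_flight mirrors the set named above.
    template <typename Oid>
    BackfillSet classify_for_backfill(const Oid &oid,
                                      const Oid &last_backfill_started,
                                      const std::set<Oid> &backfills_in_flight)
    {
      if (oid > last_backfill_started)
        return BackfillSet::SKIP;              // set 2: not yet backfilled, skip
      if (backfills_in_flight.count(oid))
        return BackfillSet::BLOCK_UNTIL_DONE;  // set 1.b: backfill in flight, block
      return BackfillSet::SEND_TO_PEER;        // set 1.a: keep the peer up to date
    }
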
>> The issue is that there is a small category of ops which violate our
>> statement above that we can send ops in set 1.a: ops where the
>> corresponding snapdir object is in set 2 or set 1.b.  The 1.b case we
>> currently handle by requiring that snapdir also be
>> !is_degraded_object.
>>
>> The case where the snapdir falls into set 2 should be the problem, but
>> now I am wondering.  I think the original problem was as follows:
>> 1) advance last_backfill_started to head
>> 2) complete recovery on head
>> 3) accept op on head which deletes head and creates snapdir
>> 4) start op
>> 5) attempt to recover snapdir
>> 6) race with write and get screwed up
>>
>> Now, however, we have logic to delay backfill on ObjectContexts which
>> currently have write locks.  It should suffice to take a write lock on
>> the new snapdir and use that... which we do since the ECBackend patch
>> series.  The case where we create head and remove snapdir isn't an
>> issue, since we'll just send the delete, which will work whether snapdir
>> exists or not...  We can also just include a delete in the snapdir
>> creation transaction to make it correctly handle garbage snapdirs on
>> backfill peers.  The snapdir would then be superfluously recovered,
>> but that's probably ok?
>>
>> The main issue I see is that it would cause the primary's idea of the
>> replica's backfill_interval to be slightly incorrect (snapdir would
>> have been removed or created on the peer, but not reflected in the
>> master's current backfill_interval, which might contain snapdir).  We
>> could adjust it in make_writeable, or update_range?
>>
>> Sidenote: multiple backfill peers complicate the issue only slightly.
>> All backfill peers with last_backfill <= last_backfill_started are
>> handled uniformly as above.  Any backfill_peer with last_backfill >
>> last_backfill_started we can model as having a private
>> last_backfill_started equal to last_backfill.  This results in a
>> picture for that peer identical to the one above, with an empty set
>> 1.b.  Because 1.b is empty for these peers, is_degraded_object can
>> disregard them.  should_send_op accounts for them with the
>> MAX(last_backfill, last_backfill_started) adjustment.
>>
>> Anyone have anything simpler?  I'll try to put the explanation part
>> into the docs later.
>> -Sam
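
The per-peer rule in the sidenote, as a rough sketch (illustration only:
generic types and a hypothetical helper name, not the real should_send_op
signature):

    #include <algorithm>

    // Blocking of set 1.b objects (is_degraded_object) is handled
    // separately; this only captures whether the op needs to be sent to a
    // given backfill peer at all.
    template <typename Oid>
    bool should_send_op_sketch(const Oid &oid,
                               const Oid &peer_last_backfill,
                               const Oid &last_backfill_started)
    {
      return oid <= std::max(peer_last_backfill, last_backfill_started);
    }

Peers with last_backfill > last_backfill_started are thereby treated as
having a private last_backfill_started equal to last_backfill, which is
exactly the model described in the sidenote.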