On Tue, Sep 29, 2015 at 12:08 AM, Balázs Kossovics <kossovics@xxxxxxxxx> wrote:
> Hey!
>
> I'm trying to understand the peering algorithm based on [1] and [2]. There
> are things that aren't really clear, or I'm not entirely sure if I understood
> them correctly, so I'd like to ask for some clarification on the points below:
>
> 1, Is it right that the primary writes operations to the PG log
> immediately upon their reception?

The operation is written into the PG log as part of the same transaction
that contains the op itself. The primary ships the operations off to the
replicas concurrently with this happening (it doesn't care about the
ordering of those bits), so while it might happen first on the primary,
there's no particular guarantee of that.

> 2, Is it possible that an operation is persisted, but never acknowledged?
> Imagine this situation: a write arrives for an object, the operation is
> copied to and written into the journal by the replicas, but the primary
> OSD dies and never recovers before it could acknowledge to the user. Upon
> the next peering, will this operation become part of the authoritative
> history?

Operations can be persisted without being acknowledged, yes. Persisted
operations that aren't acknowledged will *usually* end up as part of the
authoritative history, but it depends on which OSDs persisted them and
which are involved in peering as part of the next set.

> 3, Quote from the second step of the peering algorithm: "generate a list of
> past intervals since last epoch started"
> If there was no peering failure, then there is exactly one past interval?

Yes? I'm not quite clear on your question.

> 4, Quote from the same step: "the subset for which peering could have
> completed before the acting set changed to another set of OSDs".
> The other intervals are ignored, because we can be sure that no write
> operations were allowed during those?

I can't find these quotes and don't know which bit you're asking about.

> 5, At any given moment, is the Up set either equal to, or a strict subset
> of, the Acting set?

No. CRUSH calculates the set of OSDs responsible for a PG. That set can
include OSDs which are not currently running, so the filtered set of OSDs
which are both responsible for a PG and currently up (running, not dead,
etc.) is the "up set". However, in some cases it's possible that Ceph has
forcibly remapped the PG to a different set of OSDs. This happens a lot
when rebalancing, for instance: if a PG moves from a,b,c to a,b,d, Ceph
will make a,b,c the "acting set" in order to maintain the requested 3
copies while OSD d gets backfilled.

> 6, When do OSDs repeer? Only when an OSD goes from in -> out, or even if
> an OSD goes down (but is not yet automatically marked out)?

OSDs go through the peering process whenever a member of one of their PGs
changes state (up, down, in, out, whichever). This is usually a fast
process if data doesn't actually have to move.

> 7, For what reasons can the peering fail? If the OSD map changes before the
> peering completes, then it's a failure? If the OSD map doesn't change, then
> a reason for failure is not being able to contact "at least one OSD from
> each past interval's acting set"?

Peering only "fails" if the OSDs can't find enough members of prior acting
sets. OSD map changes won't cause failure; they just might require peering
to re-run a lot.
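To make that concrete, here's a rough sketch of the check described in 7.
This is purely illustrative Python, not the actual implementation (which
is C++ inside the OSD), and all the names below are invented:

    def can_complete_peering(past_intervals, up_osds):
        # Peering can only complete if, for every past interval in which
        # writes could have happened, at least one OSD from that
        # interval's acting set is currently reachable.
        for interval in past_intervals:
            if not interval["maybe_went_rw"]:
                # No writes were possible during this interval, so it
                # can't contribute to the authoritative history; skip it.
                continue
            if not set(interval["acting"]) & set(up_osds):
                # Every OSD that might hold writes from this interval is
                # down; peering has to wait for one of them to come back.
                return False
        return True

    # E.g. OSD 1 from the first interval died, but OSD 2 survived:
    intervals = [
        {"acting": [1, 2, 3], "maybe_went_rw": True},
        {"acting": [4, 5, 6], "maybe_went_rw": False},  # never went active
    ]
    print(can_complete_peering(intervals, up_osds={2, 4}))  # True

The maybe_went_rw flag is the key bit: an interval in which the PG never
went active can't have accepted writes, so peering doesn't need to reach
anybody from it.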
> 8, up_thru: is a per-OSD value in the OSD map, which is updated for the
> primary after successfully agreeing on the authoritative history, but
> before completing the peering. What about the secondaries?

up_thru is an indicator that the PGs on this OSD might have been written
to. It's an optimization (albeit an important one) to keep track of it
(and allow later peering processes to skip any epoch which doesn't have a
high enough up_thru value), and requiring it of the secondaries wouldn't
really improve anything, since the primary OSD doesn't necessarily require
any individual one of them in order to go active.
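Here's the same kind of sketch for how up_thru lets peering skip
intervals (again illustrative Python with invented names, not the actual
implementation). Note that only the primary's up_thru is consulted, which
is why tracking it for the secondaries wouldn't buy anything:

    def interval_maybe_went_rw(interval, up_thru_by_osd):
        # An interval could only have accepted writes if its primary had
        # up_thru pushed to at least the epoch the interval started in;
        # otherwise the PG never went active during that interval, and
        # later peering runs can skip it entirely.
        primary = interval["acting"][0]
        return up_thru_by_osd.get(primary, 0) >= interval["first_epoch"]

-Greg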