On Tue, Sep 29, 2015 at 12:08 AM, Balázs Kossovics <kossovics@xxxxxxxxx> wrote:
> Hey!
>
> I'm trying to understand the peering algorithm based on [1] and [2]. There
> are things that aren't really clear, or I'm not entirely sure if I understood
> them correctly, so I'd like to ask for some clarification on the points below:
>
> 1, Is it right that the primary writes operations to the PG log
> immediately upon their reception?

The operation is written into the PG log as part of the same transaction
that contains the op itself. The primary ships the operations off to the
replicas concurrently with this happening (it doesn't care about the
ordering of those bits), so while it might happen first on the primary,
there's no particular guarantee of that.

> 2, Is it possible that an operation is persisted, but never acknowledged?
> Imagine this situation: a write arrives for an object, the operation is
> copied to and written into the journal by the replicas, but the primary
> OSD dies and never recovers before it could acknowledge to the user. Upon
> the next peering, will this operation become part of the authoritative
> history?

Operations can be persisted without being acknowledged, yes. Persisted
operations that aren't acknowledged will *usually* end up as part of the
authoritative history, but it depends on which OSDs persisted them and
which are involved in peering as part of the next set.

> 3, Quote from the second step of the peering algorithm: "generate a list of
> past intervals since last epoch started"
> If there was no peering failure, then there is exactly one past interval?

Yes? I'm not quite clear on your question.

> 4, Quote from the same step: "the subset for which peering could have
> completed before the acting set changed to another set of OSDs".
> The other intervals are ignored, because we can be sure that no write
> operations were allowed during those?

I can't find these quotes and don't know which bit you're asking about.

> 5, At any given moment, is the Up set either equal to, or a strict subset
> of, the Acting set?

No. CRUSH calculates the set of OSDs responsible for a PG. That set can
include OSDs which are not currently running, so the filtered set of OSDs
which are both responsible for a PG and currently up (running, not dead,
etc.) is the "up set". However, in some cases it's possible that Ceph has
forcibly remapped the PG to a different set of OSDs. This happens a lot
when rebalancing, for instance: if a PG moves from a,b,c to a,b,d, Ceph
will make a,b,c the "acting set" in order to maintain the requested 3
copies while OSD d gets backfilled.

> 6, When do OSDs repeer? Only when an OSD goes from in -> out, or even if
> an OSD goes down (but is not yet automatically marked out)?

OSDs go through the peering process whenever a member of one of their PGs
changes state (up, down, in, out, whichever). This is usually a fast
process if data doesn't actually have to move.

> 7, For what reasons can the peering fail? If the OSD map changes before the
> peering completes, then it's a failure? If the OSD map doesn't change, then
> a reason for failure is not being able to contact "at least one OSD from
> each past interval's acting set"?

Peering only "fails" if the OSDs can't find enough members of prior acting
sets. OSD map changes won't cause failure; they just might require peering
to re-run a lot.
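To make that concrete, here's a rough sketch of the check described in 7.
This is purely illustrative Python, not the actual implementation (which
is C++ inside the OSD), and all the names below are invented:

    def can_complete_peering(past_intervals, up_osds):
        # Peering can only complete if, for every past interval in which
        # writes could have happened, at least one OSD from that
        # interval's acting set is currently reachable.
        for interval in past_intervals:
            if not interval["maybe_went_rw"]:
                # No writes were possible during this interval, so it
                # can't contribute to the authoritative history; skip it.
                continue
            if not set(interval["acting"]) & set(up_osds):
                # Every OSD that might hold writes from this interval is
                # down; peering has to wait for one of them to come back.
                return False
        return True

    # E.g. OSD 1 from the first interval died, but OSD 2 survived:
    intervals = [
        {"acting": [1, 2, 3], "maybe_went_rw": True},
        {"acting": [4, 5, 6], "maybe_went_rw": False},  # never went active
    ]
    print(can_complete_peering(intervals, up_osds={2, 4}))  # True

The maybe_went_rw flag is the key bit: an interval in which the PG never
went active can't have accepted writes, so peering doesn't need to reach
anybody from it.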
> 8, up_thru: is a per-OSD value in the OSD map, which is updated for the
> primary after successfully agreeing on the authoritative history, but
> before completing the peering. What about the secondaries?

up_thru is an indicator that the PGs on this OSD might have been written
to. It's an optimization (albeit an important one) to keep track of it
(and allow later peering processes to skip any epoch which doesn't have a
high enough up_thru value), and requiring it of the secondaries wouldn't
really improve anything, since the primary OSD doesn't necessarily require
any individual one of them in order to go active.
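Here's the same kind of sketch for how up_thru lets peering skip
intervals (again illustrative Python with invented names, not the actual
implementation). Note that only the primary's up_thru is consulted, which
is why tracking it for the secondaries wouldn't buy anything:

    def interval_maybe_went_rw(interval, up_thru_by_osd):
        # An interval could only have accepted writes if its primary had
        # up_thru pushed to at least the epoch the interval started in;
        # otherwise the PG never went active during that interval, and
        # later peering runs can skip it entirely.
        primary = interval["acting"][0]
        return up_thru_by_osd.get(primary, 0) >= interval["first_epoch"]

-Greg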