Re: Regarding Primary affinity configuration

Gregory Farnum <greg@xxxxxxxxxxx> · Thu, 9 Oct 2014 16:51:18 -0700

On Thu, Oct 9, 2014 at 4:24 PM, Johnu George (johnugeo)
<johnugeo@xxxxxxxxx> wrote:
> Hi Greg,
>          Thanks for your extremely informative post. My related questions
> are posted inline
>
> On 10/9/14, 2:21 PM, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:
>
>>On Thu, Oct 9, 2014 at 10:55 AM, Johnu George (johnugeo)
>><johnugeo@xxxxxxxxx> wrote:
>>> Hi All,
>>>           I have a few questions regarding primary affinity. In the
>>> original blueprint
>>> (https://wiki.ceph.com/Planning/Blueprints/Firefly/osdmap%3A_primary_role_affinity),
>>> one example has been given.
>>>
>>> For PG x, CRUSH returns [a, b, c]
>>> If a has primary_affinity of .5, b and c have 1, with 50% probability
>>> we will choose b or c instead of a. (25% for b, 25% for c)
>>>
>>> A) I was browsing through the code, but I could not find this logic of
>>> splitting the rest of the configured primary affinity value between the
>>> other OSDs. How is this handled?
>>>
>>>     if (a < CEPH_OSD_MAX_PRIMARY_AFFINITY &&
>>>         (crush_hash32_2(CRUSH_HASH_RJENKINS1,
>>>                         seed, o) >> 16) >= a) {
>>>       // we chose not to use this primary.  note it anyway as a
>>>       // fallback in case we don't pick anyone else, but keep looking.
>>>       if (pos < 0)
>>>         pos = i;
>>>     } else {
>>>       pos = i;
>>>       break;
>>>     }
>>>   }
>>
>>It's a fallback mechanism: if the chosen primary for a PG has primary
>>affinity less than the default (max), we (probabilistically) look for
>>a different OSD to be the primary. We decide whether to offload by
>>running a hash and discarding the OSD if the output value is greater
>>than or equal to the OSD's affinity, and then we go through the list
>>and run that calculation in order (obviously if the affinity is 1,
>>then it passes without needing to run the hash).
>>If no OSD in the list has a low enough hash value, we take the
>>originally-chosen primary.
> As in the example for [0.5,1,1], I got your point that with 50% probability
> the first OSD will be chosen. But how do we ensure that the second and third
> OSDs get the remaining 25% and 25% respectively? I could see only
> individual primary affinity values, but not a sum value anywhere to ensure
> that.

Well, for any given PG with that pattern where the first OSD is passed
over, the second OSD in the list is going to be chosen. But *which* OSD
is listed second is random, so
if you only have 3 OSDs 0,1,2 (with weights .5, 1, 1, respectively),
then the PGs in total will work in a 1:2:2 ratio because OSDs 1 and 2
will between themselves be first in half of the PG lists.

>
>>
>>> B) Since the primary affinity value is configured independently, there
>>> can be a situation such as [0.1,0.1,0.1] with values that don't add up
>>> to 1. How is this taken care of?
>>
>>These primary affinity values are just compared against the hash
>>output I mentioned, so the sum doesn't matter. In general we simply
>>expect that OSDs which don't have the max weight value will be chosen
>>as primary in proportion to their share of the total weight of their
>>PG membership (ie, if they have a weight of .5 and everybody else has
>>weight 1, they will be primary in half the normal number of PGs. If
>>everybody has a weight of .5, they will be primary in the normal
>>proportions. Etc).
>
> I got your idea but I couldn't figure that out from the code. You said
> that OSDs without the max weight value will be chosen as primary in
> proportion to their share of the total weight of their PG membership.
> But, from what I understood from the code, if it is [0.1,0.1,0.1], the
> first OSD will always be chosen. (Probabilistically, for 10% of reads it
> will choose the first OSD. However, the first OSD will still be chosen
> for the rest of the reads as part of the fallback mechanism, since it is
> the originally chosen primary.) Am I wrong?

If each OSD has affinity of 0.1, then the hash is run until its output
is <0.1 for one of the OSDs in the list. If *none* of the OSDs in the
list hashes out a number smaller than that, then the first one in the
list (which would be the primary by default!) will be selected.
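
To make the walk through the list concrete, here is a minimal,
self-contained C++ sketch of the loop quoted earlier in this thread. It
is an illustration under assumptions, not the Ceph source: toy_hash is a
made-up stand-in for crush_hash32_2(CRUSH_HASH_RJENKINS1, seed, o),
kMaxAffinity = 0x10000 is assumed to play the role of
CEPH_OSD_MAX_PRIMARY_AFFINITY, and choose_primary is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: affinities are 16.16 fixed point, so 0x10000 stands for 1.0.
constexpr uint32_t kMaxAffinity = 0x10000;

// Hypothetical stand-in for the CRUSH hash; any deterministic 32-bit
// mixer is good enough for illustration.
inline uint32_t toy_hash(uint32_t seed, uint32_t osd) {
  uint32_t h = (seed * 0x9E3779B1u) ^ (osd + 0x7F4A7C15u);
  h ^= h >> 16;
  h *= 0x85EBCA6Bu;
  h ^= h >> 13;
  h *= 0xC2B2AE35u;
  h ^= h >> 16;
  return h;
}

// Returns the index within `osds` of the OSD that becomes primary, or -1
// if `osds` is empty. `affinity` is indexed by OSD id.
inline int choose_primary(const std::vector<int>& osds,
                          const std::vector<uint32_t>& affinity,
                          uint32_t seed) {
  int pos = -1;
  for (int i = 0; i < static_cast<int>(osds.size()); ++i) {
    const int o = osds[i];
    assert(o >= 0 && static_cast<std::size_t>(o) < affinity.size());
    const uint32_t a = affinity[o];
    if (a < kMaxAffinity &&
        (toy_hash(seed, static_cast<uint32_t>(o)) >> 16) >= a) {
      // Rejected. Remember the first rejected OSD as the fallback,
      // then keep looking.
      if (pos < 0) pos = i;
    } else {
      // Accepted (max affinity, or the hash came out below the affinity).
      pos = i;
      break;
    }
  }
  return pos;
}
```

With all three affinities at roughly 0.1 the loop usually rejects every
OSD, and the result is then index 0, the original primary. The other
OSDs only win for the seeds where their own hash happens to come out
below the threshold first.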

>
>>
>>>
>>> C) Slightly confused. What happens for a situation with [1,0.5,1]? Is
>>> osd.0 always returned?
>>
>>If the first OSD in the PG list has primary affinity of 1 then it is
>>always the primary for that PG, yes. That's not osd.0, though; just
>>the first OSD in the PG list. ;)
>
> Sorry. I meant the first OSD, but accidentally wrote osd.0. As you
> said, if the first OSD in the PG list is always selected for this
> scenario, doesn't it violate our assumption of probabilistically having
> 25%, 50%, 25% of reads for the first, second and third OSD respectively?

Err, your numbers don't match the code here — we have two OSDs in that
list with affinity 1 and one with affinity 0.5. That would be a 2:1:2
ratio, or 40%, 20%, 40%. In this case the first OSD in the list is
selected because it's got the max affinity. And the ratios don't
actually work out like that if some of your OSDs have the max affinity
and others don't (because a max affinity OSD will happily take
whatever you throw at it) — these are really intended only for
reducing overloaded OSDs, not to be used as another weight system (we
already have normal CRUSH weights for that).

I think maybe part of the problem is that you're trying to extrapolate
from the rules on a single list to the behavior over the whole set of
PGs and it isn't working out for you. Eg, say osd.0_1 means osd.0 with
primary affinity 1. osd.1_0.5 means osd.1 with primary affinity 0.5.
So if we have three OSDs in a cluster: osd.0_1, osd.1_0.5, osd.2_1,
then there are a bunch of possible configurations for the PG selection
code to spit out:
A: osd.0_1, osd.1_0.5, osd.2_1
B: osd.0_1, osd.2_1, osd.1_0.5
C: osd.1_0.5, osd.0_1, osd.2_1
D: osd.1_0.5, osd.2_1, osd.0_1
E: osd.2_1, osd.0_1, osd.1_0.5
F: osd.2_1, osd.1_0.5, osd.0_1

Each of these orders is going to appear roughly 1/6 of the time; let's
say there are 600 PGs because it makes the math easy. Then we have
each configuration A-F 100 times.
Configurations A,B,E,F are simple enough; the first OSD has affinity 1
and so it's selected. So from those configurations the OSDs are
primary like so:
osd.0: 200
osd.1: 0
osd.2: 200
200 PGs remaining
For PGs with configuration C or D, half of each will stay on osd.1.
That's (100/2)*2, or 100. Then the second entry in the list gets the
rest for that configuration, which is 50 per configuration: 50 to osd.0
from C and 50 to osd.2 from D.
osd.0: 250
osd.1: 100
osd.2: 250

So we have a ratio of 5:2:5. This is expected — a primary affinity of
0.5 means that the OSD should offload half of its primary
responsibilities to other OSDs, which increases their effective
weight. Which we've done — each OSD would have had 200 PGs, but now
osd.1 has only 100 PGs and 50 have been offloaded to each of the other
two OSDs in the system.
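
The same arithmetic can be checked mechanically. The C++ sketch below is
only an illustration: expected_primaries is a hypothetical helper, and it
assumes that all six orderings are equally likely and that an OSD with
affinity a is accepted with probability exactly a (that is, a perfectly
uniform hash).

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Expected number of PGs for which each of three OSDs ends up primary.
// affinity[i] is a fraction in [0, 1] for osd.i; total_pgs is split
// evenly across the six possible orderings of the three OSDs.
inline std::array<double, 3> expected_primaries(
    const std::array<double, 3>& affinity, double total_pgs) {
  for (double a : affinity) assert(a >= 0.0 && a <= 1.0);
  std::array<double, 3> count{{0.0, 0.0, 0.0}};
  std::array<int, 3> order{{0, 1, 2}};
  const double per_order = total_pgs / 6.0;
  do {
    double remaining = 1.0;  // probability that nobody was accepted yet
    for (int osd : order) {
      count[osd] += per_order * remaining * affinity[osd];
      remaining *= 1.0 - affinity[osd];
    }
    // Nobody accepted: fall back to the first OSD in the list.
    count[order[0]] += per_order * remaining;
  } while (std::next_permutation(order.begin(), order.end()));
  return count;
}
```

Calling expected_primaries with affinities {1.0, 0.5, 1.0} and 600 PGs
gives 250, 100 and 250, the 5:2:5 split worked out by hand above. With
all three affinities at 0.5 it gives 200 each, i.e. the normal
proportions.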

>>
>>> D) After calculating the primary based on the affinity values, I see a
>>> shift of OSDs so that the primary comes to the front. Why is this
>>> needed? I thought the primary affinity value affects only reads and
>>> hence the OSD ordering need not be changed.
>>
>>Primary affinity impacts which OSD is chosen to be primary; the
>>primary is the ordering point for *all* access to the PG. That
>>includes writes as well as reads, plus coordination of the cluster on
>>map changes. We move the primary to the front of the list...well, I
>>think it's just because we were lazy and there are a bunch of places
>>that assume the first OSD in a replicated pool is the primary.
>
> Does that mean that the OSD set ordering keeps changing (in real time)
> for various object reads in a PG if primary affinity is configured?
> Whenever an OSD set is returned from pg_to_up_acting_osds, can we always
> say that the first OSD is the current primary for reads and writes? Is it
> the same for the OSD set returned by ceph pg dump? However, I am surprised
> that the ordering remains the same when I dump values at different times.

No! These hashes that we're running are deterministic and change only
when the state of the cluster does! They will be static unless you
have flapping OSDs or something.

What are you trying to use the primary affinities for?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com