Thanks for the detailed post, Greg. I was trying to configure primary affinity in my cluster but I didn't see the results I expected. As you said, I was looking at just a single PG and got it wrong. I also had primary affinity values configured for multiple OSDs in a PG, which makes the calculation more complex. As in your example: if osd.0, osd.1, osd.2 have primary affinity values of [1, 0.5, 0.1] and there are 600 PGs, the final distribution comes out to 440:140:20, or 22:7:1, which is slightly skewed from what I expected.

Johnu

On 10/9/14, 4:51 PM, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:

>On Thu, Oct 9, 2014 at 4:24 PM, Johnu George (johnugeo)
><johnugeo@xxxxxxxxx> wrote:
>> Hi Greg,
>>          Thanks for your extremely informative post. My related questions are
>> posted inline.
>>
>> On 10/9/14, 2:21 PM, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:
>>
>>>On Thu, Oct 9, 2014 at 10:55 AM, Johnu George (johnugeo)
>>><johnugeo@xxxxxxxxx> wrote:
>>>> Hi All,
>>>>          I have a few questions regarding primary affinity. In the original
>>>> blueprint
>>>> (https://wiki.ceph.com/Planning/Blueprints/Firefly/osdmap%3A_primary_role_affinity),
>>>> one example is given:
>>>>
>>>> For PG x, CRUSH returns [a, b, c]
>>>> If a has primary_affinity of 0.5 and b and c have 1, then with 50% probability
>>>> we will choose b or c instead of a (25% for b, 25% for c).
>>>>
>>>> A) I was browsing through the code, but I could not find the logic that splits
>>>> the rest of the configured primary affinity value among the other OSDs. How is
>>>> this handled?
>>>>
>>>>   if (a < CEPH_OSD_MAX_PRIMARY_AFFINITY &&
>>>>       (crush_hash32_2(CRUSH_HASH_RJENKINS1, seed, o) >> 16) >= a) {
>>>>     // we chose not to use this primary.  note it anyway as a
>>>>     // fallback in case we don't pick anyone else, but keep looking.
>>>>     if (pos < 0)
>>>>       pos = i;
>>>>   } else {
>>>>     pos = i;
>>>>     break;
>>>>   }
>>>> }
>>>
>>>It's a fallback mechanism — if the chosen primary for a PG has primary affinity
>>>less than the default (max), we (probabilistically) look for a different OSD to be
>>>the primary. We decide whether to offload by running a hash and discarding the OSD
>>>if the output value is greater than the OSD's affinity, and then we go through the
>>>list and run that calculation in order (obviously if the affinity is 1, then it
>>>passes without needing to run the hash).
>>>If no OSD in the list passes its hash check, we take the originally-chosen primary.
>>
>> As in the example for [0.5, 1, 1], I got your point that with 50% probability the
>> first OSD will be chosen. But how do we ensure that the second and third OSDs will
>> have the remaining 25% and 25% respectively? I could see only the individual
>> primary affinity values, but no sum anywhere that would ensure that.
>
>Well, for any given PG with that pattern, whenever the first OSD is passed over the
>second OSD in the list is going to be chosen. But *which* OSD is listed second is
>random, so if you only have 3 OSDs 0,1,2 (with weights .5, 1, 1, respectively), then
>the PGs in total will work out to a 2:5:5 ratio, because OSDs 1 and 2 will between
>themselves pick up the PGs that OSD 0 passes on.
>
>>
>>>
>>>> B) Since primary affinity values are configured independently, there can be a
>>>> situation like [0.1, 0.1, 0.1], where the total doesn't add up to 1. How is this
>>>> taken care of?
>>>
>>>These primary affinity values are just compared against the hash output I
>>>mentioned, so the sum doesn't matter. In general we simply expect that OSDs which
>>>don't have the max weight value will be chosen as primary in proportion to their
>>>share of the total weight of their PG membership (i.e., if they have a weight of .5
>>>and everybody else has weight 1, they will be primary in half the normal number of
>>>PGs. If everybody has a weight of .5, they will be primary in the normal
>>>proportions. Etc.).
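To make sure I was reading that snippet correctly, I rewrote the selection loop as a
small standalone program. This is only my approximation of the logic described above,
not the actual OSDMap code: the CRUSH hash is replaced by a uniform random draw and
affinities are expressed as fractions of the max.

// Rough standalone sketch of the primary-affinity check quoted above.
// Assumptions (mine, not Ceph's): affinities are fractions of the max
// (1.0 == always accepted) and the per-OSD CRUSH hash is modelled as a
// uniform random draw. This is not OSDMap code.
#include <cstdio>
#include <random>
#include <vector>

// Returns the position (index into the PG's OSD list) chosen as primary.
static int pick_primary(const std::vector<double>& affinity, std::mt19937& rng) {
  std::uniform_real_distribution<double> hash(0.0, 1.0);  // stand-in for crush_hash32_2 >> 16
  int pos = -1;
  for (int i = 0; i < (int)affinity.size(); ++i) {
    double a = affinity[i];
    if (a < 1.0 && hash(rng) >= a) {
      // Chose not to use this OSD as primary; remember the first such OSD
      // as a fallback, like the snippet above does.
      if (pos < 0)
        pos = i;
    } else {
      // Either max affinity, or the hash came in below the affinity.
      pos = i;
      break;
    }
  }
  return pos;  // if everybody was rejected, this is the first OSD in the list
}

int main() {
  std::mt19937 rng(42);
  const std::vector<double> affinity = {0.1, 0.1, 0.1};  // the [0.1, 0.1, 0.1] case from B)
  std::vector<int> count(affinity.size(), 0);
  for (int trial = 0; trial < 100000; ++trial)
    ++count[pick_primary(affinity, rng)];
  for (std::size_t i = 0; i < count.size(); ++i)
    std::printf("list position %zu chosen %d times\n", i, count[i]);
  return 0;
}

For a fixed ordering like this, the first position wins about 83% of the time
(0.1 + 0.9^3) because of the fallback; it is only the rotation of orderings across PGs
that keeps the overall shares equal, which is the point Greg makes below.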
>>
>> I got your idea, but I couldn't figure that out from the code. You said that OSDs
>> which don't have the max weight value will be chosen as primary in proportion to
>> their share of the total weight of their PG membership. But from what I understood
>> from the code, if it is [0.1, 0.1, 0.1], the first OSD will always be chosen.
>> (Probabilistically, for 10% of reads it will choose the first OSD; however, the
>> first OSD will still be chosen for the rest of the reads as part of the fallback
>> mechanism, since it is the originally chosen primary.) Am I wrong?
>
>If each OSD has affinity of 0.1, then the hash is run until its output is <0.1 for
>one of the OSDs in the list. If *none* of the OSDs in the list hashes out a number
>smaller than that, then the first one in the list (which would be the primary by
>default!) will be selected.
>
>>
>>>
>>>> C) Slightly confused. What happens in a situation like [1, 0.5, 1]? Is osd.0
>>>> always returned?
>>>
>>>If the first OSD in the PG list has primary affinity of 1 then it is always the
>>>primary for that PG, yes. That's not osd.0, though; just the first OSD in the PG
>>>list. ;)
>>
>> Sorry, I meant the first OSD, but accidentally wrote osd.0. As you said, if the
>> first OSD is always selected in the PG list for this scenario, doesn't it violate
>> our assumption of probabilistically having 25%, 50%, 25% reads for the first,
>> second and third OSDs respectively?
>
>Err, your numbers don't match the code here — we have two OSDs in that list with
>affinity 1 and one with affinity 0.5. That would be a 2:1:2 ratio, or 40%, 20%, 40%.
>In this case the first OSD in the list is selected because it's got the max affinity.
>And the ratios don't actually work out like that if some of your OSDs have the max
>affinity and others don't (because a max-affinity OSD will happily take whatever you
>throw at it) — these are really intended only for reducing the load on overloaded
>OSDs, not to be used as another weight system (we already have normal CRUSH weights
>for that).
>
>I think maybe part of the problem is that you're trying to extrapolate from the rules
>on a single list to the behavior over the whole set of PGs, and it isn't working out
>for you. E.g., say osd.0_1 means osd.0 with primary affinity 1, and osd.1_0.5 means
>osd.1 with primary affinity 0.5. So if we have three OSDs in a cluster: osd.0_1,
>osd.1_0.5, osd.2_1, then there are a bunch of possible configurations for the PG
>selection code to spit out:
>A: osd.0_1, osd.1_0.5, osd.2_1
>B: osd.0_1, osd.2_1, osd.1_0.5
>C: osd.1_0.5, osd.0_1, osd.2_1
>D: osd.1_0.5, osd.2_1, osd.0_1
>E: osd.2_1, osd.0_1, osd.1_0.5
>F: osd.2_1, osd.1_0.5, osd.0_1
>
>Each of these orders is going to appear roughly 1/6 of the time; let's say there are
>600 PGs because it makes the math easy. Then we have each configuration A-F 100
>times.
>Configurations A, B, E, F are simple enough; the first OSD has affinity 1 and so it's
>selected. So from those configurations the OSDs are primary like so:
>osd.0: 200
>osd.1: 0
>osd.2: 200
>200 PGs remaining
>For PGs with configuration C or D, half of each will stay on osd.1. That's 100/2*2,
>or 100. Then the second entry in the list gets the rest for that configuration, which
>is 50 each.
>osd.0: 250
>osd.1: 100
>osd.2: 250
>
>So we have a ratio of 5:2:5. This is expected — a primary affinity of 0.5 means that
>the OSD should offload half of its primary responsibilities to other OSDs, which
>increases their effective weight. Which we've done — each OSD would have had 200 PGs,
>but now osd.1 has only 100 PGs and 50 have been offloaded to each of the other two
>OSDs in the system.
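That worked example was the piece I was missing. To double-check it, I enumerated the
six orderings and the per-position probabilities in a short standalone program (again
only my own reading of the rule above: the first OSD whose hash comes in under its
affinity wins, and the first OSD in the list is the fallback). With 100 PGs per
ordering it reproduces the 250:100:250 split:

// Expected primary counts for the example above: three OSDs with primary
// affinities 1.0, 0.5, 1.0 and 600 PGs, i.e. ~100 PGs per ordering.
// My own enumeration of the rule as described, not Ceph code.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  const std::vector<double> affinity = {1.0, 0.5, 1.0};  // osd.0, osd.1, osd.2
  const double pgs_per_order = 100.0;                    // 600 PGs / 6 orderings
  std::vector<double> expected(affinity.size(), 0.0);

  std::vector<int> order = {0, 1, 2};
  do {
    // P(the OSD at position i becomes primary) = (product of earlier
    // rejections) * its own acceptance probability; the first listed OSD
    // also absorbs the "everybody rejected" fallback case.
    double p_all_rejected_so_far = 1.0;
    for (std::size_t i = 0; i < order.size(); ++i) {
      const double a = affinity[order[i]];
      expected[order[i]] += pgs_per_order * p_all_rejected_so_far * a;
      p_all_rejected_so_far *= (1.0 - a);
    }
    expected[order[0]] += pgs_per_order * p_all_rejected_so_far;  // fallback
  } while (std::next_permutation(order.begin(), order.end()));

  for (std::size_t i = 0; i < expected.size(); ++i)
    std::printf("osd.%zu: expected primary for %.0f PGs\n", i, expected[i]);
  // Prints 250 / 100 / 250, i.e. the 5:2:5 ratio from the example above.
  return 0;
}

Plugging other affinity values into the same enumeration gives the corresponding
expected counts.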
>>>
>>>> D) After calculating the primary based on the affinity values, I see a shift of
>>>> OSDs so that the primary comes to the front. Why is this needed? I thought
>>>> primary affinity only affects reads and hence the OSD ordering need not be
>>>> changed.
>>>
>>>Primary affinity impacts which OSD is chosen to be primary; the primary is the
>>>ordering point for *all* access to the PG. That includes writes as well as reads,
>>>plus coordination of the cluster on map changes. We move the primary to the front
>>>of the list... well, I think it's just because we were lazy and there are a bunch
>>>of places that assume the first OSD in a replicated pool is the primary.
>>
>> Does that mean that the OSD set ordering keeps changing (in real time) for various
>> object reads in a PG if primary affinity is configured? Whenever an OSD set is
>> returned from pg_to_up_acting_osds, can we always say that the first OSD is the
>> current primary for reads and writes? Is it the same for the OSD set returned by
>> ceph pg dump? However, I am surprised that the ordering remains the same when I
>> dump the values at different times.
>
>No! These hashes that we're running are deterministic and change only when the state
>of the cluster does! They will be static unless you have flapping OSDs or something.
>
>What are you trying to use the primary affinities for?
>-Greg
>Software Engineer #42 @ http://inktank.com | http://ceph.com
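PS: Applying the same arithmetic to my own case from the top of this mail
([1, 0.5, 0.1] across 600 PGs, 100 per ordering): the two orderings with osd.0 first
keep all 200 PGs on osd.0; the two with osd.1 first keep 100 on osd.1 and pass 100 on,
of which osd.0 takes 95 and osd.2 takes 5; the two with osd.2 first keep 20 on osd.2
and pass 180 on, of which osd.0 takes 135 and osd.1 takes 45. That works out to
roughly 430:145:25 expected, so unless I've slipped somewhere, the 440:140:20 I
measured looks like ordinary sampling noise from the hash rather than a
misconfiguration.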