Re: Regarding Primary affinity configuration


Hi Greg,
         Thanks for your extremely informative post. My related questions
are posted inline.

On 10/9/14, 2:21 PM, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:

>On Thu, Oct 9, 2014 at 10:55 AM, Johnu George (johnugeo)
><johnugeo@xxxxxxxxx> wrote:
>> Hi All,
>>           I have a few questions regarding primary affinity. In the
>> original blueprint
>> (https://wiki.ceph.com/Planning/Blueprints/Firefly/osdmap%3A_primary_role_affinity),
>> one example has been given.
>>
>> For PG x, CRUSH returns [a, b, c]
>> If a has primary_affinity of .5 and b and c have 1, then with 50%
>> probability we will choose b or c instead of a (25% for b, 25% for c).
>>
>> A) I was browsing through the code, but I could not find the logic that
>> splits the remainder of the configured primary affinity value between the
>> other OSDs. How is this handled?
>>
>>     if (a < CEPH_OSD_MAX_PRIMARY_AFFINITY &&
>>         (crush_hash32_2(CRUSH_HASH_RJENKINS1,
>>                         seed, o) >> 16) >= a) {
>>       // we chose not to use this primary.  note it anyway as a
>>       // fallback in case we don't pick anyone else, but keep looking.
>>       if (pos < 0)
>>         pos = i;
>>     } else {
>>       pos = i;
>>       break;
>>     }
>>   }
>
>It's a fallback mechanism: if the chosen primary for a PG has primary
>affinity less than the default (max), we (probabilistically) look for
>a different OSD to be the primary. We decide whether to offload by
>running a hash and discarding the OSD if the output value is greater
>than the OSD's affinity, and then we go through the list and run that
>calculation in order (obviously if the affinity is 1, then it passes
>without needing to run the hash).
>If no OSD in the list passes the hash check, we take the
>originally-chosen primary.
As in the example for [0.5,1,1], I got your point that with 50% probability
the first OSD will be chosen. But how do we ensure that the second and third
OSDs get the remaining 25% each? I could see only individual primary
affinity values in the code, but no sum anywhere that would ensure that.
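
For reference, here is a minimal sketch of the quoted loop, reduced to something that compiles on its own. Several things in it are assumptions and not taken from the Ceph tree: mix_hash is a hypothetical stand-in for crush_hash32_2 (it is not the rjenkins hash), the names choose_primary and tally are made up for illustration, and affinities are assumed to be fixed-point values where 0x10000 means 1.0, as the comparison against (hash >> 16) suggests. Under those assumptions, for one fixed ordering [0.5,1,1] the loop never reaches the third OSD: whenever the first OSD is rejected, the second one has full affinity and is taken immediately.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 0x10000 represents an affinity of 1.0, matching the
// comparison against (hash >> 16) in the quoted snippet.
constexpr uint32_t kMaxAffinity = 0x10000;

// Hypothetical stand-in for crush_hash32_2(); NOT the real rjenkins hash.
// Any reasonably uniform 32-bit mixer is enough for this illustration.
uint32_t mix_hash(uint32_t seed, uint32_t osd) {
  uint64_t x = (static_cast<uint64_t>(seed) << 32) | osd;
  x ^= x >> 33;
  x *= 0xff51afd7ed558ccdULL;
  x ^= x >> 33;
  x *= 0xc4ceb9fe1a85ec53ULL;
  x ^= x >> 33;
  return static_cast<uint32_t>(x);
}

// Returns the index into `osds` of the OSD picked as primary.
std::size_t choose_primary(const std::vector<uint32_t>& osds,
                           const std::vector<uint32_t>& affinity,
                           uint32_t seed) {
  assert(!osds.empty() && osds.size() == affinity.size());
  int pos = -1;
  for (std::size_t i = 0; i < osds.size(); ++i) {
    const uint32_t a = affinity[i];
    if (a < kMaxAffinity && (mix_hash(seed, osds[i]) >> 16) >= a) {
      // Rejected: remember the first rejected OSD as a fallback, keep looking.
      if (pos < 0) pos = static_cast<int>(i);
    } else {
      // Accepted: this OSD becomes primary.
      pos = static_cast<int>(i);
      break;
    }
  }
  return static_cast<std::size_t>(pos);
}

// Counts how often each list position wins over `trials` different seeds.
std::vector<int> tally(const std::vector<uint32_t>& osds,
                       const std::vector<uint32_t>& affinity,
                       uint32_t trials) {
  std::vector<int> wins(osds.size(), 0);
  for (uint32_t seed = 0; seed < trials; ++seed) {
    ++wins[choose_primary(osds, affinity, seed)];
  }
  return wins;
}
```

Calling tally({10, 11, 12}, {0x8000, 0x10000, 0x10000}, 100000) splits the wins between the first two positions only, so any share for the third OSD would have to come from PGs where CRUSH returns a different ordering.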

>
>> B) Since the primary affinity value is configured independently per OSD,
>> there can be a situation like [0.1,0.1,0.1] where the values don't add
>> up to 1. How is this taken care of?
>
>These primary affinity values are just compared against the hash
>output I mentioned, so the sum doesn't matter. In general we simply
>expect that OSDs which don't have the max weight value will be chosen
>as primary in proportion to their share of the total weight of their
>PG membership (ie, if they have a weight of .5 and everybody else has
>weight 1, they will be primary in half the normal number of PGs. If
>everybody has a weight of .5, they will be primary in the normal
>proportions. Etc).

I got your idea, but I couldn't work that out from the code. You said
that OSDs which don't have the max weight value will be chosen as primary
in proportion to their share of the total weight of their PG membership.
But from what I understood from the code, if it is [0.1,0.1,0.1], the first
OSD will always be chosen. (Probabilistically, for 10% of reads it will
choose the first OSD. However, the first OSD will still be chosen for the
rest of the reads through the fallback mechanism, since it is the
originally chosen primary.) Am I wrong?
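
The arithmetic behind this question can be sketched as follows. This is only a model, not Ceph code: it assumes each OSD passes the hash test independently with probability equal to its affinity (given here as a plain value in [0, 1]), that the fallback is always the first OSD in the list, and, for the second function, that every ordering of the OSDs is equally likely across PGs. The names primary_probabilities and average_share are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Probability that each list position becomes primary for ONE fixed
// ordering, under the independence assumption described above.
std::vector<double> primary_probabilities(const std::vector<double>& affinity) {
  assert(!affinity.empty());
  std::vector<double> p(affinity.size(), 0.0);
  double all_rejected_so_far = 1.0;
  for (std::size_t i = 0; i < affinity.size(); ++i) {
    // Position i wins if everyone before it was rejected and it passes.
    p[i] = all_rejected_so_far * affinity[i];
    all_rejected_so_far *= 1.0 - affinity[i];
  }
  // If every OSD was rejected, the first one is used as the fallback.
  p[0] += all_rejected_so_far;
  return p;
}

// Share of primaries per OSD, averaged over all orderings of the OSDs
// (an idealization of how CRUSH spreads orderings across PGs).
std::vector<double> average_share(const std::vector<double>& affinity) {
  std::vector<std::size_t> order(affinity.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::vector<double> share(affinity.size(), 0.0);
  std::size_t count = 0;
  do {
    std::vector<double> permuted;
    for (std::size_t idx : order) permuted.push_back(affinity[idx]);
    const std::vector<double> p = primary_probabilities(permuted);
    for (std::size_t i = 0; i < order.size(); ++i) share[order[i]] += p[i];
    ++count;
  } while (std::next_permutation(order.begin(), order.end()));
  for (double& s : share) s /= static_cast<double>(count);
  return share;
}
```

Under this model, a fixed ordering [0.1,0.1,0.1] gives the first OSD 0.1 + 0.9^3 = 0.829, the second 0.09 and the third 0.081, so the first OSD dominates within one PG but is not chosen always. Averaged over orderings, all three come out at 1/3, and for [0.5,1,1] the half-affinity OSD gets 1/6 instead of 1/3, which is consistent with the "half the normal number of PGs" description quoted above.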

>
>>
>> C) Slightly confused. What happens in a situation with [1,0.5,1]? Is
>> osd.0 always returned?
>
>If the first OSD in the PG list has primary affinity of 1 then it is
>always the primary for that PG, yes. That's not osd.0, though; just
>the first OSD in the PG list. ;)

Sorry, I meant the first OSD but accidentally wrote osd.0. As you said,
if the first OSD in the PG list is always selected in this scenario,
doesn't that violate our assumption of probabilistically 25%, 50%, 25% of
reads for the first, second and third OSD respectively?
>
>> D) After calculating the primary based on the affinity values, I see a
>> shift of OSDs so that the primary comes to the front. Why is this needed?
>> I thought the primary affinity value affects only reads, and hence the
>> OSD ordering need not be changed.
>
>Primary affinity impacts which OSD is chosen to be primary; the
>primary is the ordering point for *all* access to the PG. That
>includes writes as well as reads, plus coordination of the cluster on
>map changes. We move the primary to the front of the list...well, I
>think it's just because we were lazy and there are a bunch of places
>that assume the first OSD in a replicated pool is the primary.

Does that mean that the OSD set ordering keeps changing (in real time) for
various object reads in a PG if primary affinity is configured? Whenever
an OSD set is returned from pg_to_up_acting_osds, can we always say that the
first OSD is the current primary for reads and writes? Is it the same
for the OSD set returned by ceph pg dump? However, I am surprised that the
ordering remains the same when I dump values at different times.
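
One observation from the quoted snippet that may bear on this: the only inputs to the hash are seed and o, so if both are fixed for a given PG and OSD, the accept/reject outcome would be the same on every evaluation and would not vary per read. The shift itself can be sketched as below. This is an illustration only, not the Ceph implementation; primary_to_front is a hypothetical name, and it assumes the other OSDs keep their relative order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Moves the OSD at index `pos` to the front of the list while keeping the
// relative order of the remaining OSDs, e.g. [a, b, c] with pos = 2
// becomes [c, a, b].
std::vector<int> primary_to_front(std::vector<int> osds, std::size_t pos) {
  assert(pos < osds.size());
  const auto offset = static_cast<std::ptrdiff_t>(pos);
  // Rotate only the prefix [0, pos] so that element `pos` lands at index 0.
  std::rotate(osds.begin(), osds.begin() + offset, osds.begin() + offset + 1);
  return osds;
}
```

With pos = 0 the list is returned unchanged, which matches the case where the first OSD passes the affinity test.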

Thanks,
Johnu 

>-Greg
>Software Engineer #42 @ http://inktank.com | http://ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




