Re: Regarding Primary affinity configuration

On Thu, Oct 9, 2014 at 10:55 AM, Johnu George (johnugeo)
<johnugeo@xxxxxxxxx> wrote:
> Hi All,
>           I have a few questions regarding primary affinity.  In the
> original blueprint
> (https://wiki.ceph.com/Planning/Blueprints/Firefly/osdmap%3A_primary_role_affinity),
> one example is given:
>
> For PG x, CRUSH returns [a, b, c]
> If a has primary_affinity of .5 and b and c have 1, then with 50%
> probability we will choose b or c instead of a (25% for b, 25% for c).
>
> A) I was browsing through the code, but I could not find this logic of
> splitting the rest of configured primary affinity value between other osds.
> How is this handled?
>
>     if (a < CEPH_OSD_MAX_PRIMARY_AFFINITY &&
>         (crush_hash32_2(CRUSH_HASH_RJENKINS1,
>                         seed, o) >> 16) >= a) {
>       // we chose not to use this primary.  note it anyway as a
>       // fallback in case we don't pick anyone else, but keep looking.
>       if (pos < 0)
>         pos = i;
>     } else {
>       pos = i;
>       break;
>     }
>   }

It's a fallback mechanism: if the chosen primary for a PG has primary
affinity less than the default (max), we (probabilistically) look for
a different OSD to be the primary. We decide whether to offload by
running a hash and rejecting the OSD if the output value is greater
than or equal to the OSD's affinity, and we go through the list and run
that calculation in order (obviously if the affinity is 1, then it passes
without needing to run the hash). The first OSD that passes becomes the
primary.
If no OSD in the list passes the check, we take the
originally-chosen primary.
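
To make the walk through the list concrete, here is a minimal standalone sketch of the same loop. It is not the Ceph source. `toy_hash` is a hypothetical stand-in for `crush_hash32_2(CRUSH_HASH_RJENKINS1, seed, o)`, `kMaxAffinity` assumes the same 16.16 fixed-point scale as `CEPH_OSD_MAX_PRIMARY_AFFINITY` (0x10000 for 1.0), and passing the affinities as a vector parallel to the OSD list is a simplification for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 0x10000 represents an affinity of 1.0, mirroring
// CEPH_OSD_MAX_PRIMARY_AFFINITY.
constexpr uint32_t kMaxAffinity = 0x10000;

// Hypothetical stand-in for crush_hash32_2(CRUSH_HASH_RJENKINS1, seed, o).
// This is NOT the real CRUSH hash, only a deterministic 32-bit mixer.
inline uint32_t toy_hash(uint32_t seed, uint32_t osd) {
  uint32_t h = seed * 0x9E3779B1u ^ (osd + 0x7F4A7C15u);
  h ^= h >> 16; h *= 0x85EBCA6Bu;
  h ^= h >> 13; h *= 0xC2B2AE35u;
  h ^= h >> 16;
  return h;
}

// Returns the index into `osds` of the chosen primary, or -1 if the list
// is empty. `affinity[i]` is the fixed-point affinity of `osds[i]`.
inline int pick_primary(const std::vector<uint32_t>& osds,
                        const std::vector<uint32_t>& affinity,
                        uint32_t seed) {
  assert(osds.size() == affinity.size());
  int pos = -1;
  for (std::size_t i = 0; i < osds.size(); ++i) {
    const uint32_t a = affinity[i];
    if (a < kMaxAffinity && (toy_hash(seed, osds[i]) >> 16) >= a) {
      // Rejected. Remember the first rejected OSD as the fallback and
      // keep looking.
      if (pos < 0) pos = static_cast<int>(i);
    } else {
      // Max affinity (hash never consulted) or the hash check passed.
      pos = static_cast<int>(i);
      break;
    }
  }
  return pos;
}
```

With this sketch, `pick_primary({3, 7, 9}, {0x10000, 0x8000, 0x10000}, seed)` returns 0 for every seed, because the first OSD has max affinity and the loop stops immediately. If every OSD has affinity 0, every check fails and the result is again 0, via the fallback.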

> B) Since primary affinity values are configured independently, there can
> be a situation like [0.1,0.1,0.1], where the values don't add up to 1.
> How is this taken care of?

These primary affinity values are only compared against the hash
output I mentioned, so their sum doesn't matter. In general we simply
expect that OSDs which don't have the max affinity value will be chosen
as primary in proportion to their share of the total affinity of their
PG membership. For example, if they have an affinity of .5 and everybody
else has 1, they will be primary in half the normal number of PGs; if
everybody has an affinity of .5, they will be primary in the normal
proportions.
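
As a rough illustration of why the sum doesn't matter, the sketch below computes the per-PG chance that each position in the list ends up as primary. It assumes each OSD's hash check behaves like an independent uniform draw, which is a modelling assumption rather than something the code guarantees, and the function name `primary_probabilities` is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// affinity[i] is the primary affinity (0.0 to 1.0) of the OSD at position i
// in the PG's list. Returns the probability that position i becomes primary,
// under the assumption that the hash checks are independent uniform draws.
inline std::vector<double> primary_probabilities(
    const std::vector<double>& affinity) {
  std::vector<double> p(affinity.size(), 0.0);
  double all_rejected_so_far = 1.0;  // chance that every earlier OSD failed
  for (std::size_t i = 0; i < affinity.size(); ++i) {
    assert(affinity[i] >= 0.0 && affinity[i] <= 1.0);
    p[i] = all_rejected_so_far * affinity[i];    // first OSD to pass wins
    all_rejected_so_far *= (1.0 - affinity[i]);
  }
  // If nobody passed, the first OSD in the list is used as the fallback.
  if (!p.empty()) p[0] += all_rejected_so_far;
  return p;
}
```

For [0.1, 0.1, 0.1] this gives about 0.829, 0.09 and 0.081, which add up to 1 because the fallback absorbs the case where every check fails. For [0.5, 1, 1] it gives 0.5, 0.5 and 0 for that particular PG. Averaged over many PGs, where each OSD shows up in each position about equally often, that works out to the proportions described above.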

>
> C) Slightly confused. What happens for a situation with [1,0.5,1] ? Is osd.0
> always returned?

If the first OSD in the PG list has primary affinity of 1 then it is
always the primary for that PG, yes. That's not osd.0, though; just
the first OSD in the PG list. ;)

> D) After calculating the primary based on the affinity values, I see a
> shift of osds so that the primary comes to the front. Why is this needed?
> I thought the primary affinity value affects only reads, and hence the
> osd ordering need not be changed.

Primary affinity impacts which OSD is chosen to be primary; the
primary is the ordering point for *all* access to the PG. That
includes writes as well as reads, plus coordination of the cluster on
map changes. We move the primary to the front of the list...well, I
think it's just because we were lazy and there are a bunch of places
that assume the first OSD in a replicated pool is the primary.
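
For reference, moving the chosen primary to the front while keeping the other OSDs in their relative order can be sketched as below. This is an illustration using `std::rotate`, not the actual Ceph code, and `move_primary_to_front` is a made-up helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Moves osds[pos] to index 0 and shifts osds[0..pos-1] one slot to the
// right. Elements after pos are left untouched.
inline void move_primary_to_front(std::vector<int>& osds, std::size_t pos) {
  assert(pos < osds.size());
  const auto offset = static_cast<std::ptrdiff_t>(pos);
  std::rotate(osds.begin(), osds.begin() + offset, osds.begin() + offset + 1);
}
```

Calling it on {10, 20, 30, 40} with pos = 2 leaves {30, 10, 20, 40}.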
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com