On Thu, Oct 9, 2014 at 10:55 AM, Johnu George (johnugeo)
<johnugeo@xxxxxxxxx> wrote:
> Hi All,
>       I have few questions regarding the Primary affinity. In the
> original blueprint
> (https://wiki.ceph.com/Planning/Blueprints/Firefly/osdmap%3A_primary_role_affinity),
> one example has been given.
>
> For PG x, CRUSH returns [a, b, c]
> If a has primary_affinity of .5, b and c have 1 , with 50% probability, we
> will choose b or c instead of a. (25% for b, 25% for c)
>
> A) I was browsing through the code, but I could not find this logic of
> splitting the rest of configured primary affinity value between other osds.
> How is this handled?
>
>        if (a < CEPH_OSD_MAX_PRIMARY_AFFINITY &&
>            (crush_hash32_2(CRUSH_HASH_RJENKINS1,
>                            seed, o) >> 16) >= a) {
>          // we chose not to use this primary.  note it anyway as a
>          // fallback in case we don't pick anyone else, but keep looking.
>          if (pos < 0)
>            pos = i;
>        } else {
>          pos = i;
>          break;
>        }
>      }

It's a fallback mechanism: if the chosen primary for a PG has primary
affinity less than the default (max), we (probabilistically) look for
a different OSD to be the primary. We decide whether to offload by
running a hash and discarding the OSD if the output value is at least
the OSD's affinity, and we go through the list and run that
calculation in order (obviously if the affinity is 1, then it passes
without needing to run the hash). If no OSD in the list passes that
check, we take the originally-chosen primary.

> B) Since, primary affinity value is configured independently, there can be a
> situation with [0.1,0.1,0.1]  with total value that don’t add to 1.  How is
> this taken care of?

These primary affinity values are just compared against the hash
output I mentioned, so the sum doesn't matter.
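
To make the loop quoted above concrete, here is a minimal C++ sketch of the same accept/reject walk. It is not the actual OSDMap code. The names choose_primary, primary_probabilities, kMaxAffinity and draw are made up for illustration, affinities are indexed by list position rather than by OSD id, and I am assuming that affinity is stored as 16-bit fixed point with 0x10000 meaning 1.0 and that the hash output is uniformly distributed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Assumption: affinity is fixed point, with 0x10000 representing 1.0.
constexpr std::uint32_t kMaxAffinity = 0x10000;

// Walk the CRUSH-ordered list and return the index of the primary.
// `draw(osd)` is a hypothetical stand-in for
// crush_hash32_2(CRUSH_HASH_RJENKINS1, seed, osd) >> 16 and is expected
// to return a value in [0, 0xFFFF].
int choose_primary(const std::vector<int>& osds,
                   const std::vector<std::uint32_t>& affinity,
                   const std::function<std::uint32_t(int)>& draw) {
  assert(osds.size() == affinity.size());
  int pos = -1;
  for (std::size_t i = 0; i < osds.size(); ++i) {
    const std::uint32_t a = affinity[i];
    if (a < kMaxAffinity && draw(osds[i]) >= a) {
      // Rejected. Remember the first candidate as the fallback and
      // keep looking.
      if (pos < 0) pos = static_cast<int>(i);
    } else {
      // Accepted: either full affinity (no hash needed) or draw < a.
      return static_cast<int>(i);
    }
  }
  return pos;  // first entry if everyone was rejected, -1 if list is empty
}

// Probability that each list position ends up primary, treating the
// draw as uniform. Affinities here are plain fractions in [0, 1].
std::vector<double> primary_probabilities(const std::vector<double>& affinity) {
  std::vector<double> p(affinity.size(), 0.0);
  double all_rejected_so_far = 1.0;
  for (std::size_t i = 0; i < affinity.size(); ++i) {
    p[i] = all_rejected_so_far * affinity[i];
    all_rejected_so_far *= 1.0 - affinity[i];
  }
  // If every entry was rejected, the first one is used as the fallback.
  if (!p.empty()) p[0] += all_rejected_so_far;
  return p;
}
```

Under those assumptions, primary_probabilities({0.5, 1, 1}) comes out as {0.5, 0.5, 0}: for that particular ordering the second OSD picks up the offloaded half, because an affinity of 1 is accepted without running the hash. For {0.1, 0.1, 0.1} it gives roughly {0.829, 0.09, 0.081}, which still adds up to 1 even though the affinities do not.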
In general we simply expect that OSDs which don't have the max
primary affinity will be chosen as primary in proportion to their
share of the total affinity of their PG membership (i.e., if they have
an affinity of .5 and everybody else has 1, they will be primary in
half the normal number of PGs; if everybody has an affinity of .5,
they will be primary in the normal proportions; etc.).

>
> C) Slightly confused. What happens for a situation with [1,0.5,1] ? Is osd.0
> always returned?

If the first OSD in the PG list has primary affinity of 1 then it is
always the primary for that PG, yes. That's not osd.0, though; just
the first OSD in the PG list. ;)

> D) After calculating primary based on the affinity values, I see a shift of
> osds so that primary comes to the front. Why is this needed?. I thought,
> primary affinity value affects only reads and hence, osd ordering need not
> be changed.

Primary affinity impacts which OSD is chosen to be primary; the
primary is the ordering point for *all* access to the PG. That
includes writes as well as reads, plus coordination of the cluster on
map changes.
We move the primary to the front of the list...well, I think it's just
because we were lazy and there are a bunch of places that assume the
first OSD in a replicated pool is the primary.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com