Thanks for the detailed post, Greg. I was trying to configure primary affinity in my cluster but I didn't see the results I expected. As you said, I was looking at just a single PG and got it wrong. I also had primary affinity values configured for multiple OSDs in a PG, which makes the calculation more complex. As in your example: if osd.0, osd.1, osd.2 have primary affinity values of [1, 0.5, 0.1] and there are 600 PGs, the final distribution comes out to 440:140:20, or 22:7:1, which is slightly skewed from what I expected.

Johnu

On 10/9/14, 4:51 PM, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:

>On Thu, Oct 9, 2014 at 4:24 PM, Johnu George (johnugeo)
><johnugeo@xxxxxxxxx> wrote:
>> Hi Greg,
>>          Thanks for your extremely informative post. My related questions are
>> posted inline.
>>
>> On 10/9/14, 2:21 PM, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:
>>
>>>On Thu, Oct 9, 2014 at 10:55 AM, Johnu George (johnugeo)
>>><johnugeo@xxxxxxxxx> wrote:
>>>> Hi All,
>>>>          I have a few questions regarding primary affinity. In the original
>>>> blueprint
>>>> (https://wiki.ceph.com/Planning/Blueprints/Firefly/osdmap%3A_primary_role_affinity),
>>>> one example is given:
>>>>
>>>> For PG x, CRUSH returns [a, b, c]
>>>> If a has primary_affinity of 0.5 and b and c have 1, then with 50% probability
>>>> we will choose b or c instead of a (25% for b, 25% for c).
>>>>
>>>> A) I was browsing through the code, but I could not find the logic that splits
>>>> the rest of the configured primary affinity value among the other OSDs. How is
>>>> this handled?
>>>>
>>>>   if (a < CEPH_OSD_MAX_PRIMARY_AFFINITY &&
>>>>       (crush_hash32_2(CRUSH_HASH_RJENKINS1, seed, o) >> 16) >= a) {
>>>>     // we chose not to use this primary.  note it anyway as a
>>>>     // fallback in case we don't pick anyone else, but keep looking.
>>>>     if (pos < 0)
>>>>       pos = i;
>>>>   } else {
>>>>     pos = i;
>>>>     break;
>>>>   }
>>>> }
>>>
>>>It's a fallback mechanism — if the chosen primary for a PG has primary affinity
>>>less than the default (max), we (probabilistically) look for a different OSD to be
>>>the primary. We decide whether to offload by running a hash and discarding the OSD
>>>if the output value is greater than the OSD's affinity, and then we go through the
>>>list and run that calculation in order (obviously if the affinity is 1, then it
>>>passes without needing to run the hash).
>>>If no OSD in the list passes its hash check, we take the originally-chosen primary.
>>
>> As in the example for [0.5, 1, 1], I got your point that with 50% probability the
>> first OSD will be chosen. But how do we ensure that the second and third OSDs will
>> have the remaining 25% and 25% respectively? I could see only the individual
>> primary affinity values, but no sum anywhere that would ensure that.
>
>Well, for any given PG with that pattern, whenever the first OSD is passed over the
>second OSD in the list is going to be chosen. But *which* OSD is listed second is
>random, so if you only have 3 OSDs 0,1,2 (with weights .5, 1, 1, respectively), then
>the PGs in total will work out to a 2:5:5 ratio, because OSDs 1 and 2 will between
>themselves pick up the PGs that OSD 0 passes on.
>
>>
>>>
>>>> B) Since primary affinity values are configured independently, there can be a
>>>> situation like [0.1, 0.1, 0.1], where the total doesn't add up to 1. How is this
>>>> taken care of?
>>>
>>>These primary affinity values are just compared against the hash output I
>>>mentioned, so the sum doesn't matter. In general we simply expect that OSDs which
>>>don't have the max weight value will be chosen as primary in proportion to their
>>>share of the total weight of their PG membership (i.e., if they have a weight of .5
>>>and everybody else has weight 1, they will be primary in half the normal number of
>>>PGs. If everybody has a weight of .5, they will be primary in the normal
>>>proportions. Etc.).
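To make sure I was reading that snippet correctly, I rewrote the selection loop as a
small standalone program. This is only my approximation of the logic described above,
not the actual OSDMap code: the CRUSH hash is replaced by a uniform random draw and
affinities are expressed as fractions of the max.

// Rough standalone sketch of the primary-affinity check quoted above.
// Assumptions (mine, not Ceph's): affinities are fractions of the max
// (1.0 == always accepted) and the per-OSD CRUSH hash is modelled as a
// uniform random draw. This is not OSDMap code.
#include <cstdio>
#include <random>
#include <vector>

// Returns the position (index into the PG's OSD list) chosen as primary.
static int pick_primary(const std::vector<double>& affinity, std::mt19937& rng) {
  std::uniform_real_distribution<double> hash(0.0, 1.0);  // stand-in for crush_hash32_2 >> 16
  int pos = -1;
  for (int i = 0; i < (int)affinity.size(); ++i) {
    double a = affinity[i];
    if (a < 1.0 && hash(rng) >= a) {
      // Chose not to use this OSD as primary; remember the first such OSD
      // as a fallback, like the snippet above does.
      if (pos < 0)
        pos = i;
    } else {
      // Either max affinity, or the hash came in below the affinity.
      pos = i;
      break;
    }
  }
  return pos;  // if everybody was rejected, this is the first OSD in the list
}

int main() {
  std::mt19937 rng(42);
  const std::vector<double> affinity = {0.1, 0.1, 0.1};  // the [0.1, 0.1, 0.1] case from B)
  std::vector<int> count(affinity.size(), 0);
  for (int trial = 0; trial < 100000; ++trial)
    ++count[pick_primary(affinity, rng)];
  for (std::size_t i = 0; i < count.size(); ++i)
    std::printf("list position %zu chosen %d times\n", i, count[i]);
  return 0;
}

For a fixed ordering like this, the first position wins about 83% of the time
(0.1 + 0.9^3) because of the fallback; it is only the rotation of orderings across PGs
that keeps the overall shares equal, which is the point Greg makes below.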
>>
>> I got your idea, but I couldn't figure that out from the code. You said that OSDs
>> which don't have the max weight value will be chosen as primary in proportion to
>> their share of the total weight of their PG membership. But from what I understood
>> from the code, if it is [0.1, 0.1, 0.1], the first OSD will always be chosen.
>> (Probabilistically, for 10% of reads it will choose the first OSD; however, the
>> first OSD will still be chosen for the rest of the reads as part of the fallback
>> mechanism, since it is the originally chosen primary.) Am I wrong?
>
>If each OSD has affinity of 0.1, then the hash is run until its output is <0.1 for
>one of the OSDs in the list. If *none* of the OSDs in the list hashes out a number
>smaller than that, then the first one in the list (which would be the primary by
>default!) will be selected.
>
>>
>>>
>>>> C) Slightly confused. What happens in a situation like [1, 0.5, 1]? Is osd.0
>>>> always returned?
>>>
>>>If the first OSD in the PG list has primary affinity of 1 then it is always the
>>>primary for that PG, yes. That's not osd.0, though; just the first OSD in the PG
>>>list. ;)
>>
>> Sorry, I meant the first OSD, but accidentally wrote osd.0. As you said, if the
>> first OSD is always selected in the PG list for this scenario, doesn't it violate
>> our assumption of probabilistically having 25%, 50%, 25% reads for the first,
>> second and third OSDs respectively?
>
>Err, your numbers don't match the code here — we have two OSDs in that list with
>affinity 1 and one with affinity 0.5. That would be a 2:1:2 ratio, or 40%, 20%, 40%.
>In this case the first OSD in the list is selected because it's got the max affinity.
>And the ratios don't actually work out like that if some of your OSDs have the max
>affinity and others don't (because a max-affinity OSD will happily take whatever you
>throw at it) — these are really intended only for reducing the load on overloaded
>OSDs, not to be used as another weight system (we already have normal CRUSH weights
>for that).
>
>I think maybe part of the problem is that you're trying to extrapolate from the rules
>on a single list to the behavior over the whole set of PGs, and it isn't working out
>for you. E.g., say osd.0_1 means osd.0 with primary affinity 1, and osd.1_0.5 means
>osd.1 with primary affinity 0.5. So if we have three OSDs in a cluster: osd.0_1,
>osd.1_0.5, osd.2_1, then there are a bunch of possible configurations for the PG
>selection code to spit out:
>A: osd.0_1, osd.1_0.5, osd.2_1
>B: osd.0_1, osd.2_1, osd.1_0.5
>C: osd.1_0.5, osd.0_1, osd.2_1
>D: osd.1_0.5, osd.2_1, osd.0_1
>E: osd.2_1, osd.0_1, osd.1_0.5
>F: osd.2_1, osd.1_0.5, osd.0_1
>
>Each of these orders is going to appear roughly 1/6 of the time; let's say there are
>600 PGs because it makes the math easy. Then we have each configuration A-F 100
>times.
>Configurations A, B, E, F are simple enough; the first OSD has affinity 1 and so it's
>selected. So from those configurations the OSDs are primary like so:
>osd.0: 200
>osd.1: 0
>osd.2: 200
>200 PGs remaining
>For PGs with configuration C or D, half of each will stay on osd.1. That's 100/2*2,
>or 100. Then the second entry in the list gets the rest for that configuration, which
>is 50 each.
>osd.0: 250
>osd.1: 100
>osd.2: 250
>
>So we have a ratio of 5:2:5. This is expected — a primary affinity of 0.5 means that
>the OSD should offload half of its primary responsibilities to other OSDs, which
>increases their effective weight. Which we've done — each OSD would have had 200 PGs,
>but now osd.1 has only 100 PGs and 50 have been offloaded to each of the other two
>OSDs in the system.
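That worked example was the piece I was missing. To double-check it, I enumerated the
six orderings and the per-position probabilities in a short standalone program (again
only my own reading of the rule above: the first OSD whose hash comes in under its
affinity wins, and the first OSD in the list is the fallback). With 100 PGs per
ordering it reproduces the 250:100:250 split:

// Expected primary counts for the example above: three OSDs with primary
// affinities 1.0, 0.5, 1.0 and 600 PGs, i.e. ~100 PGs per ordering.
// My own enumeration of the rule as described, not Ceph code.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  const std::vector<double> affinity = {1.0, 0.5, 1.0};  // osd.0, osd.1, osd.2
  const double pgs_per_order = 100.0;                    // 600 PGs / 6 orderings
  std::vector<double> expected(affinity.size(), 0.0);

  std::vector<int> order = {0, 1, 2};
  do {
    // P(the OSD at position i becomes primary) = (product of earlier
    // rejections) * its own acceptance probability; the first listed OSD
    // also absorbs the "everybody rejected" fallback case.
    double p_all_rejected_so_far = 1.0;
    for (std::size_t i = 0; i < order.size(); ++i) {
      const double a = affinity[order[i]];
      expected[order[i]] += pgs_per_order * p_all_rejected_so_far * a;
      p_all_rejected_so_far *= (1.0 - a);
    }
    expected[order[0]] += pgs_per_order * p_all_rejected_so_far;  // fallback
  } while (std::next_permutation(order.begin(), order.end()));

  for (std::size_t i = 0; i < expected.size(); ++i)
    std::printf("osd.%zu: expected primary for %.0f PGs\n", i, expected[i]);
  // Prints 250 / 100 / 250, i.e. the 5:2:5 ratio from the example above.
  return 0;
}

Plugging other affinity values into the same enumeration gives the corresponding
expected counts.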
>>>
>>>> D) After calculating the primary based on the affinity values, I see a shift of
>>>> OSDs so that the primary comes to the front. Why is this needed? I thought
>>>> primary affinity only affects reads and hence the OSD ordering need not be
>>>> changed.
>>>
>>>Primary affinity impacts which OSD is chosen to be primary; the primary is the
>>>ordering point for *all* access to the PG. That includes writes as well as reads,
>>>plus coordination of the cluster on map changes. We move the primary to the front
>>>of the list... well, I think it's just because we were lazy and there are a bunch
>>>of places that assume the first OSD in a replicated pool is the primary.
>>
>> Does that mean that the OSD set ordering keeps changing (in real time) for various
>> object reads in a PG if primary affinity is configured? Whenever an OSD set is
>> returned from pg_to_up_acting_osds, can we always say that the first OSD is the
>> current primary for reads and writes? Is it the same for the OSD set returned by
>> ceph pg dump? However, I am surprised that the ordering remains the same when I
>> dump the values at different times.
>
>No! These hashes that we're running are deterministic and change only when the state
>of the cluster does! They will be static unless you have flapping OSDs or something.
>
>What are you trying to use the primary affinities for?
>-Greg
>Software Engineer #42 @ http://inktank.com | http://ceph.com
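PS: Applying the same arithmetic to my own case from the top of this mail
([1, 0.5, 0.1] across 600 PGs, 100 per ordering): the two orderings with osd.0 first
keep all 200 PGs on osd.0; the two with osd.1 first keep 100 on osd.1 and pass 100 on,
of which osd.0 takes 95 and osd.2 takes 5; the two with osd.2 first keep 20 on osd.2
and pass 180 on, of which osd.0 takes 135 and osd.1 takes 45. That works out to
roughly 430:145:25 expected, so unless I've slipped somewhere, the 440:140:20 I
measured looks like ordinary sampling noise from the hash rather than a
misconfiguration.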