On Thu, Oct 9, 2014 at 4:24 PM, Johnu George (johnugeo)
<johnugeo@xxxxxxxxx> wrote:
> Hi Greg,
> Thanks for your extremely informative post. My related questions
> are posted inline
>
> On 10/9/14, 2:21 PM, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:
>
>>On Thu, Oct 9, 2014 at 10:55 AM, Johnu George (johnugeo)
>><johnugeo@xxxxxxxxx> wrote:
>>> Hi All,
>>> I have a few questions regarding the Primary affinity. In the
>>> original blueprint
>>> (https://wiki.ceph.com/Planning/Blueprints/Firefly/osdmap%3A_primary_role_affinity),
>>> one example has been given.
>>>
>>> For PG x, CRUSH returns [a, b, c]
>>> If a has primary_affinity of .5, b and c have 1, with 50% probability,
>>> we will choose b or c instead of a. (25% for b, 25% for c)
>>>
>>> A) I was browsing through the code, but I could not find this logic of
>>> splitting the rest of configured primary affinity value between other
>>> osds. How is this handled?
>>>
>>> if (a < CEPH_OSD_MAX_PRIMARY_AFFINITY &&
>>>     (crush_hash32_2(CRUSH_HASH_RJENKINS1,
>>>                     seed, o) >> 16) >= a) {
>>>   // we chose not to use this primary. note it anyway as a
>>>   // fallback in case we don't pick anyone else, but keep looking.
>>>   if (pos < 0)
>>>     pos = i;
>>> } else {
>>>   pos = i;
>>>   break;
>>> }
>>> }
>>
>>It's a fallback mechanism: if the chosen primary for a PG has primary
>>affinity less than the default (max), we (probabilistically) look for
>>a different OSD to be the primary. We decide whether to offload by
>>running a hash and discarding the OSD if the output value is greater
>>than the OSD's affinity, and then we go through the list and run that
>>calculation in order (obviously if the affinity is 1, then it passes
>>without needing to run the hash).
>>If no OSD in the list has a low enough hash value, we take the
>>originally-chosen primary.
> As in example for [0.5,1,1], I got your point that with 50% probability,
> first osd will be chosen.
> But, how do we ensure that second and third osd
> will be having remaining 25% and 25% respectively? I could see only
> individual primary affinity values but not a sum value anywhere to ensure
> that.

Well, for any given PG with that pattern where the first OSD is passed
over, the second OSD in the list is going to be chosen (it has the max
affinity, so it always passes). But *which* OSD is listed second is
random, so if you only have 3 OSDs 0,1,2 (with affinities .5, 1, 1,
respectively), then the PGs that osd.0 gives up are split evenly
between OSDs 1 and 2, because each of them is listed second in half of
those PG lists.

>
>>
>>> B) Since, primary affinity value is configured independently, there can
>>> be a situation with [0.1,0.1,0.1] with total value that don't add to 1.
>>> How is this taken care of?
>>
>>These primary affinity values are just compared against the hash
>>output I mentioned, so the sum doesn't matter. In general we simply
>>expect that OSDs which don't have the max weight value will be chosen
>>as primary in proportion to their share of the total weight of their
>>PG membership (ie, if they have a weight of .5 and everybody else has
>>weight 1, they will be primary in half the normal number of PGs. If
>>everybody has a weight of .5, they will be primary in the normal
>>proportions. Etc).
>
> I got your idea but I couldn't figure out that from the code. You said
> that max weight value will be chosen as primary in proportion to their
> share of the total weight of their
> PG membership. But, from what I understood from code, if it is
> [0.1,0.1,0.1], first osd will be chosen always. (Probabilistically for 10%
> reads, it will choose first osd. However, first osd will still be chosen
> for rest of the reads as part of fallback mechanism which is the
> originally chosen primary.) Am I wrong?

If each OSD has affinity of 0.1, then the hash is run for each OSD in
list order until its output is <0.1 for one of the OSDs in the list.
If *none* of the OSDs in the list hashes out a number smaller than
that, then the first one in the list (which would be the primary by
default!) will be selected.

>
>>
>>>
>>> C) Slightly confused. What happens for a situation with [1,0.5,1]? Is
>>> osd.0 always returned?
>>
>>If the first OSD in the PG list has primary affinity of 1 then it is
>>always the primary for that PG, yes. That's not osd.0, though; just
>>the first OSD in the PG list. ;)
>
> Sorry. I meant the first OSD, but accidentally wrote as osd.0. As you
> said, if first osd is always selected in the PG list for this scenario,
> doesn't it violate our assumption to have probabilistically 25%, 50%, 25%
> reads for first, second and third osd respectively?

Err, your numbers don't match the code here: we have two OSDs in that
list with affinity 1 and one with affinity 0.5. That would be a 2:1:2
ratio, or 40%, 20%, 40%.
In this case the first OSD in the list is selected because it's got
the max affinity. And the ratios don't actually work out like that if
some of your OSDs have the max affinity and others don't (because a
max affinity OSD will happily take whatever you throw at it). These
are really intended only for reducing overloaded OSDs, not to be used
as another weight system (we already have normal CRUSH weights for
that).

I think maybe part of the problem is that you're trying to extrapolate
from the rules on a single list to the behavior over the whole set of
PGs and it isn't working out for you.
Eg, say osd.0_1 means osd.0 with primary affinity 1. osd.1_0.5 means
osd.1 with primary affinity 0.5.
So if we have three OSDs in a cluster: osd.0_1, osd.1_0.5, osd.2_1,
then there are a bunch of possible configurations for the PG selection
code to spit out:
A: osd.0_1, osd.1_0.5, osd.2_1
B: osd.0_1, osd.2_1, osd.1_0.5
C: osd.1_0.5, osd.0_1, osd.2_1
D: osd.1_0.5, osd.2_1, osd.0_1
E: osd.2_1, osd.0_1, osd.1_0.5
F: osd.2_1, osd.1_0.5, osd.0_1

Each of these orders is going to appear roughly 1/6 of the time; let's
say there are 600 PGs because it makes the math easy. Then we have
each configuration A-F 100 times.
Configurations A,B,E,F are simple enough; the first OSD has affinity 1
and so it's selected. So from those configurations the OSDs are
primary like so:
osd.0: 200
osd.1: 0
osd.2: 200
200 PGs remaining

For PGs with configuration C or D, half of each will stay on osd.1.
That's 100/2*2, or 100. Then the second entry in the list gets the
rest for that configuration, which is 50 each.
osd.0: 250
osd.1: 100
osd.2: 250

So we have a ratio of 5:2:5. This is expected: a primary affinity of
0.5 means that the OSD should offload half of its primary
responsibilities to other OSDs, which increases their effective
weight. Which we've done: each OSD would have had 200 PGs, but now
osd.1 has only 100 PGs and 50 have been offloaded to each of the other
two OSDs in the system.

>>
>>> D) After calculating primary based on the affinity values, I see a
>>> shift of osds so that primary comes to the front. Why is this needed?
>>> I thought, primary affinity value affects only reads and hence, osd
>>> ordering need not be changed.
>>
>>Primary affinity impacts which OSD is chosen to be primary; the
>>primary is the ordering point for *all* access to the PG. That
>>includes writes as well as reads, plus coordination of the cluster on
>>map changes. We move the primary to the front of the list...well, I
>>think it's just because we were lazy and there are a bunch of places
>>that assume the first OSD in a replicated pool is the primary.
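
Coming back to the 600-PG arithmetic for a moment: the same bookkeeping
can be checked mechanically. What follows is only a rough sketch, not
Ceph code. The function name expected_primaries is made up for
illustration, and it rests on three assumptions: every PG list contains
every OSD, every ordering of the list is equally likely, and the hash
check for an OSD passes with probability equal to its primary affinity,
independently of the other OSDs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Ceph): expected number of PGs for which
// each OSD ends up primary.
// Assumptions: each PG list holds all OSDs, all orderings are equally
// likely, and an OSD passes the hash check with probability == affinity.
std::vector<double> expected_primaries(const std::vector<double>& affinity,
                                       double num_pgs) {
    const std::size_t n = affinity.size();
    assert(n > 0);
    for (double a : affinity) {
        assert(a >= 0.0 && a <= 1.0);
    }

    // Start from the sorted order 0,1,...,n-1 so next_permutation visits
    // every ordering exactly once.
    std::vector<std::size_t> order(n);
    for (std::size_t i = 0; i < n; ++i) {
        order[i] = i;
    }

    std::vector<double> expected(n, 0.0);
    std::size_t num_orders = 0;
    do {
        ++num_orders;
        // Probability that every OSD earlier in this list was passed over.
        double all_rejected = 1.0;
        for (std::size_t osd : order) {
            expected[osd] += all_rejected * affinity[osd];
            all_rejected *= 1.0 - affinity[osd];
        }
        // Fallback: if nobody passed, the first OSD in the list is kept.
        expected[order.front()] += all_rejected;
    } while (std::next_permutation(order.begin(), order.end()));

    // Each ordering accounts for num_pgs / num_orders PGs.
    for (double& e : expected) {
        e *= num_pgs / static_cast<double>(num_orders);
    }
    return expected;
}
```

Under those assumptions, expected_primaries({1.0, 0.5, 1.0}, 600) comes
out at 250, 100 and 250 PGs, the same 5:2:5 split as the hand
calculation. With affinities {0.1, 0.1, 0.1} every OSD gets 200, because
the fallback to the first OSD in the list makes up for all the rejected
hash checks.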
>
> Does that mean that osd set ordering keeps on changing (in real time) for
> various object reads in a pg if primary affinity is configured? Whenever
> osd set is returned from pg_to_up_acting_osds, can we always say that the
> first osd is the current primary for read and writes? Is it the same
> for osd set returned by ceph pg dump? However, I am surprised that the
> ordering remains same when I dump values at different times.

No! These hashes that we're running are deterministic and change only
when the state of the cluster does! They will be static unless you
have flapping OSDs or something.

What are you trying to use the primary affinities for?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com