Since pgp_num is constraining the placement, making pg_num larger isn't
going to improve the balance. Mapping directly from objects to OSDs would
require a much higher metadata overhead, which is part of the reason we
have PGs.
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Mar 10, 2014 at 2:37 AM, Li Wang <liwang@xxxxxxxxxxxxxxx> wrote:
> pgp_num is the upper bound on the number of OSD combinations, right?
> So we can reduce pgp_num to constrain the possible combinations, and
> the data loss probability depends only on pgp_num -- say,
> pgp_num / P(n, replica_num), where n is the number of OSDs (since
> (a, b) and (b, a) are different PGs, it is a permutation rather than a
> combination). But we could still maintain a big pg_num; would that make
> the object distribution more uniform? Currently object_id is mapped to
> pg_id, and pg_id is then mapped to an OSD combination. Why does it need
> two levels of mapping? Why not map object_id to an OSD combination
> directly -- would that achieve a more uniform distribution?
>
>
> On 2014/3/8 1:43, Sage Weil wrote:
>>
>> On Fri, 7 Mar 2014, Gregory Farnum wrote:
>>>
>>> On Fri, Mar 7, 2014 at 7:10 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>>>>
>>>> On Fri, 7 Mar 2014, Dan van der Ster wrote:
>>>>>
>>>>> On Thu, Mar 6, 2014 at 9:30 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>>>>>>
>>>>>> Sheldon just pointed out a talk from ATC that discusses the basic
>>>>>> problem:
>>>>>>
>>>>>> https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon
>>>>>>
>>>>>> The situation with CRUSH is slightly better, I think, because the
>>>>>> number of peers for a given OSD in a large cluster is bounded
>>>>>> (pg_num / num_osds), but I think we may still be able to improve
>>>>>> things.
>>>>>
>>>>> I'm surprised they didn't cite Ceph -- aren't copysets ~= placement
>>>>> groups?
>>>>
>>>> I think so (I didn't listen to the whole talk :). My ears did perk up
>>>> when Carlos (who was part of the original team at UCSC) asked the
>>>> question about the CRUSH paper at the end, though. :)
>>>>
>>>> Anyway, now I'm thinking that this *is* really just all about tuning
>>>> pg_num/pgp_num. And of course managing failure domains in the CRUSH
>>>> map as best we can to align placement with expected sources of
>>>> correlated failure. But again, I would appreciate any confirmation
>>>> from others' intuitions or (better yet) a proper mathematical model.
>>>> This bit of my brain is full of cobwebs, and wasn't particularly
>>>> strong here to begin with.
>>>
>>> Well, yes and no. They're constraining data sharing in order to reduce
>>> the probability of any given data loss event, and we can reduce data
>>> sharing by reducing the pgp_num. But the example you cited was "place
>>> all copies in the top third of the selected racks", and that's a
>>> little different because it means they can independently scale the
>>> data sharing *within* that grouping to maintain a good data balance,
>>> which CRUSH would have trouble with.
>>> Unfortunately my intuition around probability and stats isn't much
>>> good, so that's about as far as I can take this effectively. ;)
>>
>> Yeah, I'm struggling with this too, but I *think* the top/middle/bottom
>> rack analogy is just an easy way to think about constraining the
>> placement options, which we're doing anyway with the placement group
>> count -- just in a way that looks random but is still sampling a small
>> portion of the possible combinations.
>> In the end, whether you eliminate 8/9 of the options via the rack
>> layers and *then* scale pg_num, or just scale pg_num, I think it still
>> boils down to the number of distinct 3-disk sets out of the total
>> possible 3-disk sets.
>>
>> Also, FWIW, the rack thing is equivalent to making 3 parallel trees, so
>> that the crush hierarchy goes like:
>>
>>   root
>>     layer of rack (top/middle/bottom)
>>       rack
>>         host
>>           osd
>>
>> and making the crush rule first pick 1 layer before doing the
>> chooseleaf over racks.
>>
>> sage
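
To make the two-level mapping and the counting argument concrete, here is
a minimal sketch in Python. It is not Ceph's actual code: the md5 hash
stands in for rjenkins, a plain modulo stands in for ceph_stable_mod, and
a seeded random sample stands in for CRUSH. It only illustrates the point
above: the number of distinct replica sets ("copysets") is capped by
pgp_num, so raising pg_num alone does not create new OSD combinations,
while raising pgp_num does.

import hashlib
import random

def pg_of(obj_name, pg_num):
    # Stand-in for the object->PG hash (rjenkins + stable_mod in Ceph).
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    return h % pg_num

def osds_of(pg, pgp_num, num_osds, replicas=3):
    # Stand-in for CRUSH: the placement seed is the PG id folded onto
    # pgp_num, so PGs beyond pgp_num reuse existing OSD combinations.
    seed = pg % pgp_num
    rng = random.Random(seed)
    return tuple(sorted(rng.sample(range(num_osds), replicas)))

def distinct_replica_sets(pg_num, pgp_num, num_osds, objects=100000):
    # Count how many distinct replica sets the objects actually land on.
    seen = set()
    for i in range(objects):
        pg = pg_of("obj-%d" % i, pg_num)
        seen.add(osds_of(pg, pgp_num, num_osds))
    return len(seen)

print(distinct_replica_sets(pg_num=1024, pgp_num=256, num_osds=100))   # <= 256
print(distinct_replica_sets(pg_num=4096, pgp_num=256, num_osds=100))   # still <= 256
print(distinct_replica_sets(pg_num=1024, pgp_num=1024, num_osds=100))  # up to 1024

In this toy model the first two counts stay at or just under 256 no
matter how large pg_num gets, while the third grows with pgp_num -- which
is the trade-off being discussed: more distinct sets gives a better
balance and fewer peers' worth of data per disk, but also more 3-disk
sets whose simultaneous failure can lose data.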