Re: Pgs distribution evenly across osds


 



You could do tricks like recreating a distribution until an OSD ends up with the number of PGs you want, and you could even keep going until all OSDs match some pre-specified values, but you can't make that persist beyond a topology change (well, you could, but it would become prohibitively expensive very quickly to cover all possible changes).

I still wonder if we should be throwing away really poor initial distributions just as a quality-of-life thing, but we've never done it.
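
For what it's worth, here's a minimal sketch of that "retry until it looks good" idea. Nothing Ceph-specific here; the uniform placement stand-in, the OSD/PG counts, and the acceptance threshold are all made up for illustration:

import random

def place_pgs(num_pgs, num_osds, rng):
    # Stand-in for a CRUSH-like placement: assign each PG's primary to a
    # uniformly random OSD and return the PG count per OSD.
    counts = [0] * num_osds
    for _ in range(num_pgs):
        counts[rng.randrange(num_osds)] += 1
    return counts

def sample_until_balanced(num_pgs, num_osds, max_skew=0.10, max_tries=200):
    # Throw away distributions whose most-loaded OSD is more than max_skew
    # above the average; keep the best one seen if we never get there.
    rng = random.Random()
    avg = num_pgs / num_osds
    best_skew, best_counts = None, None
    for _ in range(max_tries):
        counts = place_pgs(num_pgs, num_osds, rng)
        skew = (max(counts) - avg) / avg
        if best_skew is None or skew < best_skew:
            best_skew, best_counts = skew, counts
        if skew <= max_skew:
            break
    return best_skew, best_counts

skew, counts = sample_until_balanced(49152, 168)
print("worst OSD is %.1f%% above average" % (skew * 100))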

Mark

On 02/13/2017 05:29 PM, hufh wrote:
Hi Mark,

Do you think it's possible to directly specify how many PGs an OSD could
host? Thanks.

Tiger

On Tuesday, February 14, 2017, Mark Nelson <mnelson@xxxxxxxxxx
<mailto:mnelson@xxxxxxxxxx>> wrote:

    Hi James,

    Yes, it's not surprising.  Especially if you are primarily targeting
    a single pool and can reweight to achieve optimal distribution
    specifically for that pool.  It's much harder to do this if you have
    lots of pools that are all used equally though.

    I'm glad to hear that reweighting improved things!  I'm still
    hopeful that some day we can find a way to distribute PGs/data/etc.
    evenly with something CRUSH-like that does better than random
    distributions, but I suspect it will be very difficult.  CRUSH is
    already very clever at what it does.  We may have to trade away
    something to get something.

    Mark

    On 02/13/2017 04:48 PM, LIU, Fei wrote:

        Hi Mark,
           After we periodically even out the PG distribution with
        reweight-by-pgs, we see our cluster performance (IOPS and
        bandwidth) go up by almost 25%~40% and latency drop by almost
        25%.  That more or less proves that PG distribution affects
        whole-cluster performance a great deal.
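
        (For context, here is a rough sketch of what such a reweight pass
        does, assuming you already have per-OSD PG counts; the overload
        threshold, the adjustment rule, and the example mapping are made up
        for illustration and are not the actual CLI behaviour or defaults:)

        def reweight_overloaded(pg_per_osd, overload=1.10):
            # pg_per_osd: {osd_id: pg_count}.  Lower the weight of any OSD
            # carrying more than `overload` times the average PG count,
            # roughly in proportion to how far above average it is.
            avg = sum(pg_per_osd.values()) / len(pg_per_osd)
            new_weights = {}
            for osd, pgs in pg_per_osd.items():
                if pgs > avg * overload:
                    new_weights[osd] = round(avg / pgs, 4)
            return new_weights  # apply with `ceph osd reweight <id> <weight>`

        print(reweight_overloaded({9: 333, 86: 248, 23: 293}))
        # -> {9: 0.8749}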

          Regards,
          James


        On 2/8/17, 4:57 AM, "LIU, Fei" <ceph-devel-owner@xxxxxxxxxxxxxxx
        on behalf of james.liu@xxxxxxxxxxxxxxx> wrote:

            Hi Mark,
               Thanks very much for sharing. As we tested more, we found
        that the oversubscribed OSD nodes somehow perform worse than the
        other OSD nodes with fewer PGs. However, CPU utilization even on
        the worst oversubscribed OSD node is only around ~50%. We also
        observed that apply latency and commit latency are quite high
        compared to the nodes with fewer PGs. Network traffic is
        reasonable. Disk utilization is sometimes around 99%, but most of
        the time it is around 60%. We see latency spikes from time to
        time, which we guess are caused by the oversubscribed OSDs. Any
        suggestions?

            Regards,
            James



            On 2/7/17, 6:48 PM, "Mark Nelson" <mnelson@xxxxxxxxxx> wrote:

                Hi James,

                I'm not Sage, but I'll chime in since I spent some time
                thinking about this stuff a while back when I was playing
                around with Halton distributions for PG placement.  It's
                very difficult to get even distributions using random
                sampling unless you have a *very* high number of samples.
                The following equations give you a reasonable expectation
                of what the min/max should be, assuming an evenly weighted
                random distribution:

                min = (pgs / osds) - sqrt(2*pgs*log(osds)/osds)
                max = (pgs / osds) + sqrt(2*pgs*log(osds)/osds)

                In your case that's:

                min = 49152/168 - sqrt(2*49152*log(168)/168) = 256
                max = 49152/168 + sqrt(2*49152*log(168)/168) = 329

                In terms of performance potential and data distribution
                evenness, I'd argue you really want to know how bad your
                worst oversubscribed OSD is relative to the average:

                Expected: (49152/168)/329 = ~88.9%
                Actual:   (49152/168)/333 = ~87.9%

                Your numbers are a little worse, though typically I see our
                distributions hover right around expected or just slightly
                better.  This particular roll of the dice might have just
                been a little worse.

                If you jumped up to say 100K PGs:

                min = 100000/168 - sqrt(2*100000*log(168)/168) = 544
                max = 100000/168 + sqrt(2*100000*log(168)/168) = 647

                Expected: (100000/168)/647 = ~92%

                Now if you jumped up to 1 million PGs:

                min = 1000000/168 - sqrt(2*1000000*log(168)/168) = 5790
                max = 1000000/168 + sqrt(2*1000000*log(168)/168) = 6115

                Expected: (1000000/168)/6115 = ~97.3%
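
                (If you want to check these numbers yourself, here's a
                tiny sketch of the formulas above; the quoted values work
                out if log() is taken as the base-10 logarithm:)

                import math

                def expected_spread(pgs, osds):
                    # Expected min/max PGs per OSD for an evenly weighted
                    # random placement, per the formulas above.
                    avg = pgs / osds
                    dev = math.sqrt(2 * pgs * math.log10(osds) / osds)
                    worst = avg / (avg + dev)  # average vs. most-loaded OSD
                    return avg - dev, avg + dev, worst

                for pgs in (49152, 100000, 1000000):
                    mn, mx, worst = expected_spread(pgs, 168)
                    print("%8d PGs: min=%.0f max=%.0f expected=%.1f%%"
                          % (pgs, mn, mx, worst * 100))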

                Thanks,
                Mark

                On 02/07/2017 06:14 PM, LIU, Fei wrote:
                > Hi Sage,
                >   We are trying to distribute PGs evenly across OSDs.
                > However, after some tuning, we still see about a 30%
                > difference between the max and min PG counts across OSDs
                > (OSD 9 has 13.8% more PGs than average and OSD 86 has
                > 15.2% fewer PGs than average).  Any good suggestions for
                > making PGs distribute evenly across OSDs?
                >
                >   Thanks,
                >   James
                >
                >
                > SUM :   49152   |
                > Osd :   168     |
                > AVE :   292.57  |
                > Max :   333     |
                > Osdid : osd.9   |
                > per:    13.8%   |
                > ------------------------
                > min :   248     |
                > osdid : osd.86  |
                > per:    -15.2%  |
                >
                > [james.liu@a18d13422.eu13 /home/james.liu]
                > $
                >
                >
                >
                >










