Hi Tiger,
reweight-by-distribution is one option, but my thought is more generally
to modify the pool creation step itself. The idea is that you can't
easily search through the entire space of possible topologies of a pool
given the initial conditions, but you can at least look at the initial
distribution that gets created. A very brute force approach would be to
just throw away distributions and recompute with different seeds until
you hit a distribution you like.
It seems to me, though, that you could create a giant Hadoop cluster whose
sole purpose is to look for combinations of topologies and seeds that
provide useful properties (good initial distribution, good 1-off
distributions, potentially good doubling distributions). Assuming you
could replicate the initial conditions, a database of these seeds might
allow folks to choose seeds with ideal properties.
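A rough, hypothetical sketch of that kind of search (plain Python, with
uniform random placement standing in for CRUSH and made-up parameters, so
it only illustrates the loop, not the real mapping):

#!/usr/bin/env python
# Toy seed search: score pseudo-random placements and keep the most even
# one. Uniform random placement stands in for CRUSH here; a real version
# would map PGs with CRUSH against the actual cluster topology.
import random
from collections import Counter

def worst_osd_load(pgs, osds, seed):
    """Place pgs uniformly at random; return the most loaded OSD's count."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(osds) for _ in range(pgs))
    return max(counts.values())

def find_best_seed(pgs, osds, tries=100):
    mean = float(pgs) / osds
    best_seed, best_ratio = None, 0.0
    for seed in range(tries):
        ratio = mean / worst_osd_load(pgs, osds, seed)  # 1.0 == perfectly even
        if ratio > best_ratio:
            best_seed, best_ratio = seed, ratio
    return best_seed, best_ratio

if __name__ == '__main__':
    seed, ratio = find_best_seed(49152, 168)
    print("best seed %d, mean/max = %.1f%%" % (seed, ratio * 100))

The same scoring could be extended to also check the 1-off and doubling
distributions mentioned above before a seed gets recorded as "good".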
Mark
On 02/13/2017 05:48 PM, hufh wrote:
Mark,
Thanks for the valuable suggestions. Yes, we did redistribute PGs with the
reweight command, but it's a very hard and slow process. I am wondering if
we could add a new parameter to reweight to specify an OSD's PG count. We
could trade some scalability for PG distribution evenness. Thanks.
Tiger
On Tuesday, February 14, 2017, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
You could do tricks like recreate a distribution until an OSD is
associated with the number of PGs you want, and you could even do it
until all OSDs match some pre-specified values, but you can't make
it persist beyond a topology change (well, you could, but it'd
become prohibitively expensive very quickly for all possible changes).
I still wonder if we should be throwing away really poor initial
distributions just as a quality-of-life thing, but we've never done it.
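For that quality-of-life check, something as simple as rejecting a
distribution whose most loaded OSD sits too far above the mean would do.
A minimal sketch (hypothetical 10% tolerance; the per-OSD PG counts would
come from the real map, e.g. parsed out of ceph pg dump):

# Reject an initial distribution whose most loaded OSD exceeds the mean
# PG count by more than `tolerance`.
def distribution_ok(pg_counts, tolerance=0.10):
    mean = float(sum(pg_counts)) / len(pg_counts)
    return (max(pg_counts) - mean) / mean <= tolerance

# e.g. distribution_ok([292] * 167 + [333]) -> False at a 10% tolerance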
Mark
On 02/13/2017 05:29 PM, hufh wrote:
Hi Mark,
Do you think it's possible to directly specify how many PGs an OSD could
host? Thanks.
Tiger
On Tuesday, February 14, 2017, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
Hi James,
Yes, it's not surprising, especially if you are primarily targeting a
single pool and can reweight to achieve optimal distribution specifically
for that pool. It's much harder to do this if you have lots of pools that
are all used equally, though.
I'm glad to hear that reweighting improved things! I am still hopeful that
some day we can find a way to distribute PGs/data/etc evenly with
something CRUSH-like that does better than random distributions, but I
suspect it will be very difficult. CRUSH is already very clever at what it
does. We may have to trade away something to get something.
Mark
On 02/13/2017 04:48 PM, LIU, Fei wrote:
Hi Mark,
After we periodically even out the PG distribution with reweight-by-pgs,
we see our cluster performance (IOPS and bandwidth) go up by almost
25%~40% and latency drop by almost 25%. That somewhat proves that PG
distribution affects whole-cluster performance a lot.
Regards,
James
On 2/8/17, 4:57 AM, "LIU, Fei" <ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of james.liu@xxxxxxxxxxxxxxx> wrote:
Hi Mark,
Thanks very much for sharing. We found, as we tested more, that the
oversubscribed OSD node somehow performs worse than the other OSD nodes
with fewer PGs. However, CPU utilization even on the worst oversubscribed
OSD node is only around ~50%. We also observed that its apply latency and
commit latency are pretty high compared to the others with fewer PGs. The
network traffic is reasonable. Disk utilization is sometimes around 99%,
but most of the time it is around 60%. We see latency spikes from time to
time, which we guess are caused by the oversubscribed OSD. Any suggestions?
Regards,
James
On 2/7/17, 6:48 PM, "Mark Nelson" <mnelson@xxxxxxxxxx> wrote:
Hi James,
I'm not Sage, but I'll chime in since I spent some time thinking about
this stuff a while back when I was playing around with Halton
distributions for PG placement. It's very difficult to get even
distributions using random sampling unless you have a *very* high number
of samples. The following equations give you a reasonable expectation of
what the min/max should be assuming an evenly weighted random
distribution:
min = (pgs / osds) - sqrt(2*pgs*log(osds)/osds)
max = (pgs / osds) + sqrt(2*pgs*log(osds)/osds)
In your case that's:
min = 49152/168 - sqrt(2*49152*log(168)/168) = 256
max = 49152/168 + sqrt(2*49152*log(168)/168) = 329
In terms of performance potential and data distribution evenness, I'd
argue you really want to know how bad your worst oversubscribed OSD is vs
the average:

Expected: (49152/168)/329 = ~88.9%
Actual: (49152/168)/333 = ~87.9%
Your numbers are a little worse, though typically I see our distributions
hover right around expected or just slightly better. This particular roll
of the dice might have just been a little worse.
If you jumped up to say 100K PGs:

min = 100000/168 - sqrt(2*100000*log(168)/168) = 544
max = 100000/168 + sqrt(2*100000*log(168)/168) = 647

Expected: (100000/168)/647 = ~92%
Now if you jumped up to 1 million PGs:

min = 1000000/168 - sqrt(2*1000000*log(168)/168) = 5790
max = 1000000/168 + sqrt(2*1000000*log(168)/168) = 6115

Expected: (1000000/168)/6115 = ~97.3%
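A quick sketch to reproduce the numbers above (assuming a base-10 log,
evenly weighted OSDs, and purely random placement, not the actual CRUSH
behavior):

# Back-of-the-envelope min/max PG counts per OSD for random placement.
from math import sqrt, log10

def expected_bounds(pgs, osds):
    mean = float(pgs) / osds
    spread = sqrt(2.0 * pgs * log10(osds) / osds)
    return mean - spread, mean + spread

for pgs in (49152, 100000, 1000000):
    lo, hi = expected_bounds(pgs, 168)
    print("%7d PGs: min=%d max=%d expected=%.1f%%"
          % (pgs, round(lo), round(hi), 100 * (pgs / 168.0) / round(hi)))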
Thanks,
Mark
On 02/07/2017 06:14 PM, LIU, Fei wrote:
> Hi Sage,
> We are trying to distribute PGs evenly across OSDs. However, after
> certain tuning, we still got a 30% difference between the max and min PG
> counts of the OSDs (OSD 9 has 13.8% more PGs than average and OSD 86 has
> 15.2% fewer PGs than average). Any good suggestions for making PGs
> distribute evenly across OSDs?
>
> Thanks,
> James
>
>
> SUM : 49152 |
> Osd : 168 |
> AVE : 292.57 |
> Max : 333 |
> Osdid : osd.9 |
> per: 13.8% |
> ------------------------
> min : 248 |
> osdid : osd.86 |
> per: -15.2% |
>
> [james.liu@a18d13422.eu13 /home/james.liu]
> $
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html