-Greg
On Thursday, September 19, 2013, Mark Nelson wrote:
Honestly I don't remember, but I would be wary if it's not a test system. :)
Mark
On 09/19/2013 11:28 AM, Warren Wang wrote:
Is this safe to enable on a running cluster?
--
Warren
On Sep 19, 2013, at 9:43 AM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:
On 09/19/2013 08:36 AM, Niklas Goerke wrote:
Hi there
I'm currently evaluating ceph and started filling my cluster for the
first time. After filling it up to about 75%, it reported some OSDs as
"near-full".
After some evaluation I found that the PGs are not distributed evenly
over all the OSDs.
My Setup:
* Two Hosts with 45 Disks each --> 90 OSDs
* Only one newly created pool with 4500 PGs and a Replica Size of 2 -->
should be about 100 PGs per OSD
What I found was that one OSD had only 72 PGs, while another had 123 PGs
[1]. That means that (if I did the math correctly) I can only fill the
cluster to about 81%, because that's when the first OSD is completely
full [2].
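A minimal sketch of that arithmetic, assuming data lands on each OSD in
proportion to its PG count (the counts are the ones reported above):

# Back-of-the-envelope check of the "~81%" figure, assuming data lands on
# each OSD in proportion to its PG count (counts as reported above).
pgs_per_osd_avg = 4500 * 2 / 90.0   # 4500 PGs x 2 replicas / 90 OSDs = 100
pgs_on_fullest_osd = 123            # the most loaded OSD observed

# The fullest OSD fills ~1.23x as fast as the average one, so it hits 100%
# when the cluster as a whole is only at about 100/123 of raw capacity.
usable_fraction = pgs_per_osd_avg / pgs_on_fullest_osd
print("usable fraction of raw capacity: %.1f%%" % (usable_fraction * 100))
# -> usable fraction of raw capacity: 81.3%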
Does distribution improve if you make a pool with significantly more PGs?
I did some experimenting and found that if I add another pool with 4500
PGs, each OSD ends up with exactly double the number of PGs it has with
one pool. So this is not an accident (I tried it multiple times). On
another test cluster with 4 hosts and 15 disks each, the distribution was
similarly poor.
This is a bug that causes each pool to be distributed in more or less the same way across the same hosts. We have a fix, but it impacts backwards compatibility, so it's off by default. If you set:
osd pool default flag hashpspool = true
Theoretically that will cause different pools to be distributed more randomly.
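For reference, a minimal ceph.conf sketch of where that option would go (the [global] placement is an assumption; as a pool-creation default it should only affect pools created after it is set):

[global]
    # assumed placement: makes newly created pools carry the HASHPSPOOL flag
    osd pool default flag hashpspool = true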
To me it looks like the rjenkins algorithm is not working as (in my
opinion) it should.
Am I doing anything wrong?
Is this behaviour to be expected?
Can I do something about it?
Thank you very much in advance
Niklas
[1] I built a small script that parses pg dump and outputs the number
of PGs on each OSD: http://pastebin.com/5ZVqhy5M
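A rough sketch of that kind of counting (not the pastebin script itself; it assumes the JSON output of "ceph pg dump" carries a "pg_stats" list with an "acting" array of OSD ids per PG):

#!/usr/bin/env python
# Rough sketch, not the pastebin script: count how many PG copies land on
# each OSD. Assumes "ceph pg dump --format=json" returns a "pg_stats" list
# whose entries carry an "acting" array of OSD ids.
import json
import subprocess
from collections import Counter

dump = json.loads(subprocess.check_output(
    ["ceph", "pg", "dump", "--format=json"]))

counts = Counter()
for pg in dump["pg_stats"]:
    for osd in pg["acting"]:
        counts[osd] += 1

for osd, n in sorted(counts.items()):
    print("osd.%d: %d PGs" % (osd, n))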
[2] I know I should not fill my cluster completely, but I'm talking about
theory here, and adding a safety margin only makes it worse.
--
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com