Hi
We're facing an uneven distribution of data on our production cluster running Hammer (0.94.2).
We have 1056 OSDs running on 352 hosts in 10 racks, and the failure domain is set to rack. All hosts (except a few in the ssd_root branch, which is not used for RGW data placement) have the same configuration, and the disks are identical.
Disk usage varies from about 8% to 22% in the worst cases, while the average disk utilization is about 15%. I've checked the PG sizes and they are fairly uniform (stddev 0.25 GB, average 5.34 GB, median 5.0 GB).
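For reference, the PG size figures were computed roughly like this (a minimal sketch only: pool id 10 stands in for our RGW data pool, and the bytes value is assumed to be field 7 of "ceph pg dump pgs"; check the header line and adjust the awk field if your release prints the columns differently):

  ceph pg dump pgs 2>/dev/null \
    | awk '$1 ~ /^10\./ { print $7 }' \
    | sort -n \
    | awk '{ v[NR] = $1; s += $1; ss += $1 * $1 }
           END {
             n = NR; avg = s / n; std = sqrt(ss / n - avg * avg);
             med = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2;
             gb = 1024 * 1024 * 1024;
             printf "pgs=%d avg=%.2fGB median=%.2fGB stddev=%.2fGB\n",
                    n, avg / gb, med / gb, std / gb
           }'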
I downloaded the CRUSH map from the cluster and used crushtool to investigate the issue (--test --show-utilization --num-rep 3 --rule 0 --min-x 1 --max-x <PGCOUNT>), and found that our current CRUSH map gives a huge variance in the number of PGs assigned to each device:
min/avg/max: 28 / 46.60 / 71
stddev: 6.65
The numbers reported by crushtool correlate closely with the actual state of the cluster.
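Roughly how the numbers above were produced (a minimal sketch; the file names are arbitrary, and --rule 0 / --num-rep 3 match our replicated RGW data pool):

  ceph osd getcrushmap -o crushmap.bin        # fetch the compiled CRUSH map
  crushtool -d crushmap.bin -o crushmap.txt   # decompile it for inspection
  crushtool -i crushmap.bin --test --show-utilization \
      --rule 0 --num-rep 3 --min-x 1 --max-x <PGCOUNT>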
I've also tried:
* changing the bucket algorithm from straw to straw2, which gives very little benefit (I also tried all the other available algorithms and found that straw2 is the best one; see the sketch after this list);
* changing the PG count via --max-x (no luck; still very unbalanced with every value from 1024, 2048, 4096, 8192, 16384, 16404 to 32768);
* changing the failure domain from rack to host (still unbalanced).
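For the straw2 test in particular, I edited the decompiled map offline and re-ran crushtool against it, roughly like this (a sketch; straw2 requires hammer tunables / CRUSH_V4-capable clients, and the file names are again arbitrary):

  sed 's/alg straw$/alg straw2/' crushmap.txt > crushmap.straw2.txt
  crushtool -c crushmap.straw2.txt -o crushmap.straw2.bin   # recompile the edited map
  crushtool -i crushmap.straw2.bin --test --show-utilization \
      --rule 0 --num-rep 3 --min-x 1 --max-x <PGCOUNT>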
Is there any way to achieve a more balanced cluster?
More information on our cluster's state:
* ceph df detail - http://pastebin.com/0MWqwDXe
* ceph osd df - http://pastebin.com/kcEiSH6g
* ceph osd tree - http://pastebin.com/4n02Pc1J
* ceph pg dump osds - http://pastebin.com/B9Uv73Li
* decompiled CRUSH map - http://pastebin.com/G5Pki8wj
Thanks in advance.
Best regards,
Gleb M Borisov