Re: Issue 15653 and imbalanced clusters

We run multiple host sizes, from 24x 3TB drives (version 1 nodes) to 32x 4TB drives (version 2 nodes).  I've balanced CRUSH maps for this with a host failure domain and with a rack failure domain, where one rack had 10x version 1 nodes and another had 10x version 2 nodes, and it balanced perfectly.
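
For anyone unfamiliar with the rack case: a rack failure domain just means the rule's chooseleaf step selects racks instead of hosts.  A minimal sketch of such a rule in decompiled crushtool text (the rule name and ruleset number here are only illustrative):

  rule replicated_rack {
          ruleset 1
          type replicated
          min_size 1
          max_size 10
          step take default
          # pick N distinct racks, then one OSD under each
          step chooseleaf firstn 0 type rack
          step emit
  }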

We do this with an offline CRUSH map reweighting tool I've been working on.  It lets you generate a new map, test it with nobackfill set, and upload a single map update that moves all data to a balanced state; the general workflow is sketched below.  Unfortunately it currently only handles our use case of one pool holding 99% of the data per root.  Time and a lack of access to different environments limit how much more I can do to make it a more general tool.
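
For anyone who wants to try the same approach by hand, the rough offline loop looks something like the following (file names are placeholders, and the actual reweighting step is whatever your tool or text editor does to the decompiled map):

  # Grab the current osdmap and CRUSH map
  ceph osd getmap -o osdmap.bin
  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt

  # Adjust item weights in crush.txt, then recompile
  crushtool -c crush.txt -o crush.new

  # Sanity-check placement and utilization offline before touching the cluster
  crushtool -i crush.new --test --show-utilization --num-rep 3
  osdmaptool osdmap.bin --import-crush crush.new
  osdmaptool osdmap.bin --test-map-pgs

  # Apply as a single map update, with backfill paused while you verify
  ceph osd set nobackfill
  ceph osd setcrushmap -i crush.new
  # check `ceph -s` and `ceph osd df`, then let the data move
  ceph osd unset nobackfill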

One of our smaller clusters, 58.23% full with 208 OSDs (6x version 1 nodes and 2x version 2 nodes), shows the following in `ceph osd df`:

  MIN/MAX VAR: 0.99/1.01  STDDEV: 0.40
  Lowest  %USE: 57.40% (a 3TB drive)
  Highest %USE: 59.00% (a 3TB drive)

All of the other 3TB and 4TB drives fall somewhere within that 1.60% spread.

Possibly the most helpful part of the tool is adding storage to a mostly full cluster: it brings the new OSDs in already balanced, without having to worry about PGs going backfill_toofull.  Another is that you can specify an OSD or a host (or a comma-delimited combination of them) to weight to 0.00, so they offload their data and are ready for removal (or ready to be removed and re-added as BlueStore once Luminous is out); a quick sketch of the equivalent stock commands follows.
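
For reference, the stock way to drain a single OSD online is `ceph osd crush reweight`; the offline equivalent is just zeroing the item's weight in the decompiled map before recompiling.  A minimal sketch (osd.42 is made up):

  # Online: CRUSH starts moving data off the item immediately
  ceph osd crush reweight osd.42 0

  # Offline: zero the weights in the decompiled map instead, then recompile
  crushtool -d crush.bin -o crush.txt
  #   (edit crush.txt: set osd.42's weight, or every item under a host
  #    bucket, to 0.000)
  crushtool -c crush.txt -o crush.new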

We can get clusters over 75% full without any OSDs going over 80%, regardless of differences in the total sizes of the failure domains.  Adding storage also doesn't push OSDs over 80% in these clusters.

If you feel so inclined, feel free to message me to help develop this tool further.  It is currently written in Bash, using osdmaptool and crushtool.  OTOH, if your troublesome cluster matches our use case, feel free to send me a copy of your osdmap to reweight.


David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943


If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited.


________________________________________
From: Ceph-large [ceph-large-bounces@xxxxxxxxxxxxxx] on behalf of Dan Van Der Ster [daniel.vanderster@xxxxxxx]
Sent: Tuesday, November 22, 2016 8:29 AM
To: Sage Weil
Cc: ceph-large@xxxxxxxxxxxxxx
Subject: Re: Issue 15653 and imbalanced clusters

> On 22 Nov 2016, at 16:17, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Tue, 22 Nov 2016, Dan Van Der Ster wrote:
>> Hi,
>>
>> I have a couple questions about http://tracker.ceph.com/issues/15653
>>
>> In the ticket Sage discusses small/big drives, and the small drives get
>> more data than expected.
>>
>> But we observe this at the rack level: our cluster has four racks, with
>> 7, 8, 8, 4 hosts respectively. The rack with 4 hosts is ~35% more full
>> than the others.
>>
>> So AFAICT, because of #15653, CRUSH does not currently work well if you
>> try to build a pool which is replicated rack/host-wise when your
>> rack/hosts are not all ~identical in size.
>
> Right--it's not about devices, but items within a CRUSH bucket.
> Unfortunately we don't have a good technical solution for this yet.  The
> best proposal so far is Adam's PR at
>
>       https://github.com/ceph/ceph/pull/10218
>
> but it leaves much to be desired.  I think we can do better, hopefully in
> time for luminous.
>
> In the meantime, you can underweight (devices in) small racks.  :(

Thanks Sage. So you confirm that reweighting alone won't solve this?

-- dan

>
> sage
>
>
>
>> Are others noticing this pattern? Or are we unusual in that our clusters
>> are not flat/uniform in structure?
>>
>> Cheers, Dan
>>

_______________________________________________
Ceph-large mailing list
Ceph-large@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com

