Re: Ceph OSD imbalance and performance

On 2/28/23 13:11, Dave Ingram wrote:
On Tue, Feb 28, 2023 at 12:56 PM Reed Dier <reed.dier@xxxxxxxxxxx> wrote:

I think a few other things that could help would be `ceph osd df tree`
which will show the hierarchy across different crush domains.


Good idea: https://pastebin.com/y07TKt52

Yeah, it looks like OSD.147 has over 3x the amount of data on it vs some of the smaller HDD OSDs. I bet it's getting hammered. Are the drives different rotational speeds? That's going to hurt too, especially if the bigger drives are slower and you aren't using flash for Journals/WALs.

You might want to look at the device queue wait times and see which drives are slow to service IOs. I suspect it will be 147 leading the pack with the other 16TB drives following. You never know though, sometimes you see an odd one that's slow but not showing smartctl errors yet.
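
For example (just a sketch: `ceph osd perf` shows per-OSD latency as Ceph sees it, and iostat needs the sysstat package on each OSD host):

    ceph osd perf     # per-OSD commit/apply latency
    iostat -x 5       # per-device await/%util, refreshed every 5 seconds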

Mark




And if you’re doing something like erasure coded pools, or something other
than replication 3, `ceph osd crush rule dump` may provide some further
context alongside the tree output.


No erasure coded pools - all replication.



Also, the cluster is running Luminous (12), which went EOL 3 years ago
tomorrow
<https://docs.ceph.com/en/latest/releases/index.html#archived-releases>.
So there are likely quite a few improvements under the hood to be gained
by moving forward from Luminous.


Yes, nobody here wants to touch upgrading this at all - too terrified of
breaking things. This ceph deployment is serving several hundred VMs.

The general feeling is that we're stuck on Luminous and that it's
destructive to upgrade to anything else. I refuse to believe that is true.
At least if we upgraded everything to 12.2.3 we'd have the 'balancer'
module, which I think came in 12.2.2...
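
For reference, turning it on would look roughly like this (a sketch, assuming the mgr balancer module is present in whatever 12.2.x we land on; upmap mode additionally needs all clients at Luminous or newer):

    ceph mgr module enable balancer
    ceph balancer mode crush-compat   # or 'upmap' once min-compat-client is luminous
    ceph balancer on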

What would you recommend upgrading luminous to?


Though, I would say take care of the scrub errors prior to doing any major
upgrades, as well as checking your upgrade path (can only upgrade two
releases at a time, if you have filestore OSDs, etc).


Yeah, there seems to be a fear that attempting to repair those will
negatively impact performance even more. I disagree and think we should do
them immediately.
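
Roughly, per inconsistent PG (the <pgid> placeholder comes from the health output; this is a sketch, not a guarantee that repair is safe for every kind of inconsistency):

    ceph health detail                                         # lists the inconsistent PGs
    rados list-inconsistent-obj <pgid> --format=json-pretty    # see what's actually wrong
    ceph pg repair <pgid>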

Also, there seems to be a belief that bluestore is an 'all-or-nothing'
proposition and that it's impossible to migrate from filestore to
bluestore. Yet I see that you can run a mixture of both in a deployment
and migrate OSDs from filestore to bluestore one at a time.
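
The per-OSD conversion documented upstream is roughly this (a sketch with placeholder <osd-id>/<device>; do one OSD at a time and let the cluster recover in between):

    ceph osd out <osd-id>
    # wait for the data to rebalance off it, then:
    systemctl stop ceph-osd@<osd-id>
    ceph osd destroy <osd-id> --yes-i-really-mean-it
    ceph-volume lvm zap /dev/<device> --destroy
    ceph-volume lvm create --bluestore --data /dev/<device> --osd-id <osd-id>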

TL;DR -- there is a *lot* of fear of touching this thing because nobody is
truly an 'expert' in it atm. But not touching it is why we've gotten
ourselves into a situation with broken stuff and horrendous performance.

Thanks Reed!
-Dave



-Reed

On Feb 28, 2023, at 11:12 AM, Dave Ingram <dave@xxxxxxxxxxxx> wrote:

There is a
lot of variability in drive sizes - two different sets of admins added
disks sized between 6TB and 16TB, and I suspect this and imbalanced
weighting are to blame.
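
One stop-gap short of the balancer module, since we're on Luminous, might be reweight-by-utilization (a sketch; the dry-run form shows what it would change before committing anything):

    ceph osd test-reweight-by-utilization 115   # dry run, targets OSDs >15% over mean
    ceph osd reweight-by-utilization 115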

`ceph osd df`:

(not going to paste that all in here): https://pastebin.com/CNW5RKWx

What else am I missing in terms of what to share with you all?

Thanks all,
-Dave
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx






