Re: Ceph OSD imbalance and performance

Yeah, there seems to be a fear that attempting to repair those will negatively impact performance even more. I disagree and think we should do them immediately.

There really shouldn’t be much of a noticeable performance hit.
Some good documentation here.
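For what it's worth, here's a rough sketch of the usual repair workflow, assuming replicated pools (the pool and PG names are placeholders):

    ceph health detail                         # lists the PGs flagged inconsistent
    rados list-inconsistent-pg <pool>          # PGs with inconsistencies in a given pool
    rados list-inconsistent-obj <pgid> --format=json-pretty   # which objects/shards disagree
    ceph pg repair <pgid>                      # queue a repair of that one PG

The repair runs as a deep scrub of a single PG at a time, so the IO impact is bounded and you can space them out.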

The general feeling is that we're stuck on luminous and that it's destructive to upgrade to anything else.
I refuse to believe that is true.
At least if we upgraded everything to 12.2.3 we'd have the 'balancer' stuff that came with I think 12.2.2...
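On the balancer point, the module is indeed there in the later 12.2.x releases, so just getting current within Luminous already buys you something. A minimal sketch of turning it on (treat this as a starting point rather than a runbook; crush-compat is the mode that works with pre-Luminous clients):

    ceph mgr module enable balancer
    ceph balancer mode crush-compat     # or 'upmap' if every client is Luminous or newer
    ceph balancer eval                  # score the current distribution before letting it act
    ceph balancer on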

Upgrades are definitely not destructive; however, they also aren’t trivial.
You can upgrade two releases at a time, but the distros those packages are built for may vary from release to release.

For example, to get from Luminous to Quincy you should be able to step from Luminous (12) to Nautilus (14), then to Pacific (16), and on to Quincy (17).
However, your Luminous install may be on Ubuntu 14.04 or 16.04, either of which can move straight to Nautilus.
To get to Pacific, you’ll first need to move to Ubuntu 18.04 (which is still Nautilus compatible), and then upgrade to Pacific.
If you then want Quincy, you’ll need to upgrade to Ubuntu 20.04 before making that final hop.

This probably sounds daunting, and it is certainly non-trivial, but definitely doable if you take things in small steps, and should be possible with no downtime if planned out.
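To give a feel for what each hop looks like, here's a rough sketch (assuming a package-based install managed with systemd; the exact packaging steps depend on your distro and how the cluster was deployed):

    ceph osd set noout                      # avoid rebalancing while daemons restart
    # upgrade packages, then restart mons one host at a time:
    systemctl restart ceph-mon.target
    # then mgrs, then OSDs host by host:
    systemctl restart ceph-mgr.target
    systemctl restart ceph-osd.target
    ceph versions                           # confirm every daemon reports the new release
    ceph osd require-osd-release nautilus   # e.g. after the Luminous -> Nautilus hop
    ceph osd unset noout

Do one release hop end to end, let the cluster sit healthy for a while, and only then start the next one.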

Also, there seems to be a belief that bluestore is an 'all-or-nothing' proposition
Yet I see that you can have a mixture of both in your deployments 

You can mix filestore and bluestore OSDs in your cluster, however — 

[…] and that it's impossible to migrate from filestore to bluestore.
[…] and it's indeed possible to migrate from filestore to bluestore.

If you have filestore OSDs, the only way to migrate them to bluestore is by destroying the OSD and recreating it as bluestore; see here.
This can be a time-consuming process if you drain an OSD, let it backfill off, blow it away, recreate it, and then bring data back.
It can also prove IO-expensive if your Ceph cluster is already IO saturated, due to all of the backfill IO on top of the client IO.
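As a sketch of the per-OSD dance (the OSD id and device are placeholders, and this assumes ceph-volume, which ships with Luminous):

    ceph osd out 12                              # drain; wait for backfill to finish (watch ceph -s)
    systemctl stop ceph-osd@12
    ceph osd destroy 12 --yes-i-really-mean-it   # keeps the OSD id and its CRUSH entry
    ceph-volume lvm zap /dev/sdX --destroy       # wipe the old filestore device
    ceph-volume lvm create --bluestore --data /dev/sdX --osd-id 12
    # backfill then repopulates the new bluestore OSD

Throttling backfill (osd_max_backfills, osd_recovery_max_active) helps keep this from stepping on client IO, at the cost of each OSD taking longer.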

TL;DR -- there is a *lot* of fear of touching this thing because nobody is truly an 'expert' in it atm.
But not touching it is why we've gotten ourselves into a situation with broken stuff and horrendous performance.

Given how critical (and brittle) this infrastructure sounds for your org, it might be best to pull in some experts; I think most on the mailing list would recommend Croit as a good place to start, outside of any existing support contracts.

Hope that's helpful,
Reed

On Feb 28, 2023, at 1:11 PM, Dave Ingram <dave@xxxxxxxxxxxx> wrote:


On Tue, Feb 28, 2023 at 12:56 PM Reed Dier <reed.dier@xxxxxxxxxxx> wrote:
I think a few other things that could help would be `ceph osd df tree`, which will show the hierarchy across different crush domains.

 
And if you’re doing something like erasure-coded pools, or something other than replication 3, `ceph osd crush rule dump` may provide some further context alongside the tree output.

No erasure coded pools - all replication.
 

Also, the cluster is running Luminous (12), which went EOL 3 years ago tomorrow.
So there are also likely a good bit of improvements all around under the hood to be gained by moving forward from Luminous.

Yes, nobody here wants to touch upgrading this at all - too terrified of breaking things. This ceph deployment is serving several hundred VMs.

The general feeling is that we're stuck on luminous and that it's destructive to upgrade to anything else. I refuse to believe that is true. At least if we upgraded everything to 12.2.3 we'd have the 'balancer' stuff that came with I think 12.2.2...

What would you recommend upgrading luminous to?
 
Though I would say take care of the scrub errors prior to doing any major upgrades, and check your upgrade path (you can only upgrade two releases at a time, whether you have filestore OSDs, etc.).

Yeah, there seems to be a fear that attempting to repair those will negatively impact performance even more. I disagree and think we should do them immediately.

Also, there seems to be a belief that bluestore is an 'all-or-nothing' proposition and that it's impossible to migrate from filestore to bluestore. Yet I see that you can have a mixture of both in your deployments and it's indeed possible to migrate from filestore to bluestore.

TL;DR -- there is a *lot* of fear of touching this thing because nobody is truly an 'expert' in it atm. But not touching it is why we've gotten ourselves into a situation with broken stuff and horrendous performance.

Thanks Reed!
-Dave
 

-Reed

On Feb 28, 2023, at 11:12 AM, Dave Ingram <dave@xxxxxxxxxxxx> wrote:

There is a lot of variability in drive sizes - two different sets of admins added disks sized between 6TB and 16TB, and I suspect this and imbalanced weighting are to blame.
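One quick way to sanity-check the weighting theory (the OSD id and weight below are purely illustrative) is to compare the WEIGHT and SIZE columns per OSD and fix any that were added with the wrong CRUSH weight:

    ceph osd df tree                       # CRUSH weight should roughly track the drive's size in TiB
    ceph osd crush reweight osd.42 14.55   # e.g. reset a 16TB OSD to its capacity in TiB

`ceph osd reweight-by-utilization` can also nudge the override reweights of the fullest OSDs, though it's a blunter instrument.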

CEPH OSD DF:

(not going to paste that all in here): https://pastebin.com/CNW5RKWx

What else am I missing in terms of what to share with you all?

Thanks all,
-Dave



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
