Re: Decrepit ceph cluster performance

J David <j.david.lists@xxxxxxxxx> · Sun, 13 Aug 2023 23:55:22 -0400

On Sun, Aug 13, 2023 at 11:34 PM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
> Ahhhhh, from the original phrasing I thought that Quincy correlated with a sharp drop.

It does, but the causation is in the other direction.

We recently started experimenting with Proxmox Backup Server, which is
really cool, but performs enough IO to basically lock out the VM being
backed up, leading to IO timeouts, leading to user complaints. :-(

We had always thought the slow IO was an unavoidable (but
intermittent) problem specific to our workload.  Once the backups
made it much worse, we upgraded to Quincy, because that is the version
Proxmox currently recommends/supports and we were approaching it as a
Proxmox problem.  Then we reproduced it with rados bench within the
Ceph cluster, with no Proxmox involvement, so we no longer think that.

TL;DR: we upgraded to Quincy because IO demands from the new backup
software made the problem worse, not because Quincy made it worse.

> > They have the latest firmware
>
> As per recent isdct/intelmas/sst?  The web site?

Yes.  It's all "Solidigm" now, which has made information harder to
find and firmware harder to get, but these drives aren't exactly
getting regular updates at this point.

> Just for grins, I might suggest -- once you have a fully healthy cluster -- destroying, secure-erasing, and redeploying a few OSDs at a time within a single failure domain. How old are the OSDs?

The SSDs are probably 5-8 years old.  The OSDs were rebuilt to
BlueStore around the Luminous timeframe. (Nautilus, maybe.  It was a
while ago.)

> I suspect at least some might be Filestore and thus would be redeployed with BlueStore.

They are not; we manually converted them all to BlueStore.

> Newer SSD controllers / models are better than older models at housekeeping over time, so the secure-erase might freshen performance.

I mean... I don't have much else to try, so I may give it a shot!  My
only hesitation is that there's not really any problem indicator I
could check afterward. So I don't know how I would tell if it made a
difference unless I did them all and then the problem went away.
Which at the speed this thing rebuilds might well be a 3-month
project. :-/
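
One possible indicator, as a rough sketch only: collect per-OSD latency
samples before and after the redeploy (for example the commit latency
that `ceph osd perf` reports) and compare the redeployed OSDs against
the untouched ones.  The code below assumes the samples have already
been gathered into plain dicts; the function names and data layout are
hypothetical, not anything Ceph provides.

```python
from statistics import median


def latency_change(before_ms, after_ms):
    """Relative change in median latency; negative means it got faster."""
    base = median(before_ms)
    if base == 0:
        raise ValueError("median latency before the change is zero")
    return (median(after_ms) - base) / base


def compare_osds(before, after, redeployed):
    """Compare latency samples per OSD.

    before, after: dict mapping OSD id -> list of latency samples (ms),
                   collected by hand or by a script (assumed layout).
    redeployed:    set of OSD ids that were secure-erased and rebuilt.
    """
    report = {}
    for osd_id in sorted(before):
        if osd_id not in after:
            continue  # OSD has no "after" samples; skip it
        report[osd_id] = {
            "redeployed": osd_id in redeployed,
            "change": latency_change(before[osd_id], after[osd_id]),
        }
    return report
```

If the redeployed OSDs show a clearly larger drop than the untouched
ones under similar load, that would at least hint that the
secure-erase helped, without having to redo the whole cluster first.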

Thanks!