Re: Decrepit ceph cluster performance



> We recently started experimenting with Proxmox Backup Server,
> which is really cool, but performs enough IO to basically lock
> out the VM being backed up, leading to IO timeouts, leading to
> user complaints. :-(

The two most common things I have had to fix over the years in
storage systems I have inherited have been:

* Too low IOPS-per-TB to handle a realistic workload.
* Too few total IOPS to handle the user and sysadmin (checking,
  scrubbing, backup, balancing, backfilling, ...) workloads.
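The first point can be made concrete with a rough calculation; a minimal sketch, using illustrative device numbers rather than measurements:

```python
# Rough IOPS-per-TB sanity check (illustrative numbers, not measurements).

def iops_per_tb(device_iops: float, device_tb: float) -> float:
    """Sustained random-write IOPS available per TB of capacity."""
    return device_iops / device_tb

# A 10 TB nearline HDD at ~100 random IOPS: 10 IOPS per TB.
hdd = iops_per_tb(100, 10)

# A 1.92 TB SATA SSD at ~20,000 sustained random-write IOPS:
# roughly 10,000 IOPS per TB, three orders of magnitude better.
ssd = iops_per_tb(20_000, 1.92)

print(f"HDD: {hdd:.0f} IOPS/TB, SSD: {ssd:.0f} IOPS/TB")
```

The gap is the reason buying capacity without buying IOPS looks cheap now and hurts later: as the cluster fills, the workload per TB stays the same but the IOPS per TB do not grow.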

Both happen because most sysadmins are heavily incentivized to
save money now even if there is a huge price to pay later when
the storage capacity fills up.

An SSD-based storage cluster like the one you have to deal with
has plenty of IOPS, so your case is strange, in particular
because latencies in your tests are low at the same time as IO
rates are low; badly overloaded storage complexes usually show
latencies of 1 second and well above.

That your test reports small average latencies but a max
latency of 37s, and that long pauses with 0 IOPS occur, is
suspicious. It could be that *some* OSD SSDs are not in good
condition and slow down everything, as the Ceph daemons wait
for the slowest OSD to respond. 37s looks like retries on a
failing SSD.
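One quick way to spot such an OSD is the per-OSD latency report from `ceph osd perf -f json`; a sketch of filtering it, using made-up sample output (the exact JSON field names may vary between Ceph releases, so verify against your version):

```python
import json

# Hypothetical output of `ceph osd perf -f json`; field names are
# assumptions based on typical releases, check against your cluster.
sample = json.loads("""
{"osd_perf_infos": [
  {"id": 2,  "perf_stats": {"commit_latency_ms": 3,    "apply_latency_ms": 3}},
  {"id": 10, "perf_stats": {"commit_latency_ms": 4100, "apply_latency_ms": 4100}},
  {"id": 13, "perf_stats": {"commit_latency_ms": 5,    "apply_latency_ms": 5}}
]}
""")

# On an all-SSD cluster, anything far above a few ms deserves a look.
THRESHOLD_MS = 100

slow = [o["id"] for o in sample["osd_perf_infos"]
        if o["perf_stats"]["commit_latency_ms"] > THRESHOLD_MS]
print("suspect OSDs:", slow)
```

In this made-up sample only osd.10 stands out; on a real cluster the suspects would then get SMART checks and possibly a secure-erase.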

In an ideal world the cluster would have a capacity monitor
like Ganglia etc. showing year-long graphs of network
bandwidth, IO rates and latencies, but I guess it was not set
up like that.

> The SSDs are probably 5-8 years old. The OSDs were rebuilt to
> bluestore around the luminous timeframe. (Nautilus, maybe. It
> was a while ago.)

>> Newer SSD controllers / models are better than older models
>> at housekeeping over time, so the secure-erase might freshen
>> performance.

Indeed, 5-8 year old firmware may not be as sophisticated as
more recent firmware, in particular as to needing periodic
explicit TRIMs. On that point I noticed this:

>>> Its primary use is serving RBD VM block devices for Proxmox

A VM workload, and in particular RBD, often involves very small
random writes, and "mixed-use SSD"s are not well suited to
that, in particular if the usual and insane practice of having
VM operating systems log to virtual disks has been followed.

So the physical storage on the SSDs may have become hideously
fragmented, thus indeed requiring TRIMs, especially if the
endurance levels are low (which is dangerous), and especially if
the workload never pauses enough to run the firmware compaction
mechanism (which is likely given that the storage complex cannot
sustain both the user workload and backups).
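Since BlueStore OSDs sit on raw devices, `fstrim` does not apply; BlueStore has to issue the discards itself. A sketch of enabling that, assuming a reasonably recent release (verify the option names for your version with `ceph config help bdev_enable_discard` first, as they have changed across releases):

```shell
# Let BlueStore pass discards down to the SSDs (option names are
# release-dependent; check `ceph config help bdev_enable_discard`).
ceph config set osd bdev_enable_discard true
# Issue the discards asynchronously so they do not stall the IO path.
ceph config set osd bdev_async_discard true
```

Note this only helps future deallocations; space that is already fragmented on the flash may only recover after a secure-erase and rebuild of the OSD, as suggested above.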

In particular, check the logs of these OSDs to see which
specific SSDs are reporting the slowest IOPS.

>>> 36 slow ops, oldest one blocked for 37 sec, daemons [osd.10,osd.12,osd.13,osd.14,osd.15,osd.17,osd.2,osd.25,osd.28,osd.3]...
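A quick way to rank the noisy OSDs is to count slow-op complaints per log file; a sketch, shown here against demo data (in production, point LOGDIR at /var/log/ceph or wherever your OSD logs live):

```shell
# Demo: count "slow op" complaints per OSD log and rank them.
# The path and message wording are assumptions; adjust to your setup.
LOGDIR=$(mktemp -d)
printf 'slow op\nslow op\n' > "$LOGDIR/ceph-osd.10.log"
printf 'ok\n'              > "$LOGDIR/ceph-osd.2.log"

for f in "$LOGDIR"/ceph-osd.*.log; do
    printf '%s %s\n' "$(grep -c 'slow op' "$f")" "$f"
done | sort -rn | head
```

The OSDs at the top of the list are the first candidates for `smartctl` health checks and, if they pass but stay slow, for the secure-erase treatment mentioned earlier in the thread.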
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

