I should have explicitly stated that during the recovery, it was still
quite bumpy for customers. Some snaptrims were very quick, some took
what felt like a really long time. This was, however, a cluster with a
very large number of volumes and a long, long history of snapshots. I'm
not sure how our case will compare to a single large volume with a big
snapshot.
On 2023-01-28 20:45, Victor Rodriguez wrote:
On 1/29/23 00:50, Matt Vandermeulen wrote:
I've observed a similar horror when upgrading a cluster from Luminous
to Nautilus, which had the same effect of an overwhelming amount of
snaptrim making the cluster unusable.
In our case, we held its hand by setting all OSDs to have zero max
trimming PGs, unsetting nosnaptrim, and then slowly enabling snaptrim
a few OSDs at a time. It was painful to babysit but it allowed the
cluster to catch up without falling over.
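The hand-holding described above can be sketched roughly as follows. This is only a sketch, assuming the per-OSD throttle in question is osd_max_trimming_pgs (the number of PGs an OSD will snaptrim concurrently) and a release with the centralized config store (Mimic or later):

```shell
# Sketch only: stop all snaptrim work cluster-wide first by setting the
# per-OSD concurrency throttle (assumed: osd_max_trimming_pgs) to zero.
ceph config set osd osd_max_trimming_pgs 0

# With trimming throttled to zero, it is safe to clear the flag.
ceph osd unset nosnaptrim

# Then re-enable trimming on a few OSDs at a time and watch the cluster
# before moving on to the next batch (OSD ids here are placeholders).
for id in 0 1 2; do
    ceph config set osd.$id osd_max_trimming_pgs 2
done
```

The point of the batching is that only the OSDs you have re-enabled do snaptrim work, so the cluster catches up gradually instead of all at once.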
That's an interesting approach! Thanks!
Preliminary tests seem to show that running snaptrim on even a single PG
of a single OSD still makes the cluster barely usable. I have to
increase osd_snap_trim_sleep_ssd to ~1 for the cluster to remain usable,
at about a third of its normal performance. After a while a few PGs got
trimmed, and it feels like some are harder to trim than others, as some
need a higher osd_snap_trim_sleep_ssd value to let the cluster perform.
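For reference, the throttling above can be applied at runtime like this (osd_snap_trim_sleep_ssd is the sleep, in seconds, inserted between trim operations on SSD-backed OSDs; the value 1.0 is just the one mentioned above):

```shell
# Persist the setting in the cluster config store:
ceph config set osd osd_snap_trim_sleep_ssd 1.0

# Or inject it into running OSD daemons without persisting it:
ceph tell 'osd.*' injectargs '--osd_snap_trim_sleep_ssd 1.0'
```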
I don't know how long this is going to take... Maybe recreating the
OSDs and dealing with the rebalance is a better option?
There's something ugly going on here... I would really like to put my
finger on it.
On 2023-01-28 19:43, Victor Rodriguez wrote:
After some investigation this is what I'm seeing:
- OSD processes get stuck at 100% CPU (at least) if I ceph osd unset
nosnaptrim. They stay at 100% CPU even if I ceph osd set nosnaptrim
again, and have stayed like that for at least 26 hours. Some quick
benchmarks don't show a reduction in the performance of the cluster.
- Restarting an OSD lowers its CPU usage to typical levels, as
expected, but it also usually brings some other OSD on a different host
back to typical levels.
- All OSDs in this cluster take quite a while to start: between 35 and
70 seconds depending on the OSD. Clearly much longer than any other
OSD in any of my clusters.
- I believe that the size of the RocksDB database is dumped in the
OSD log when an automatic compaction is triggered. The "sum"
sizes for these OSDs range between 2.5 and 5.1 GB. That's way bigger
than those in any other cluster I have.
- ceph daemon osd.* calc_objectstore_db_histogram is giving values
for num_pgmeta_omap (I don't know what that counter represents) that
are way bigger than those on any other of my clusters for some OSDs.
Also, values are not similar among the OSDs which hold the same PGs:
osd.0: "num_pgmeta_omap": 17526766,
osd.1: "num_pgmeta_omap": 2653379,
osd.2: "num_pgmeta_omap": 12358703,
osd.3: "num_pgmeta_omap": 6404975,
osd.6: "num_pgmeta_omap": 19845318,
osd.7: "num_pgmeta_omap": 6043083,
osd.12: "num_pgmeta_omap": 18666776,
osd.13: "num_pgmeta_omap": 615846,
osd.14: "num_pgmeta_omap": 13190188,
- Compacting the OSDs barely reduces the RocksDB size and does not
reduce num_pgmeta_omap at all.
- This is the only cluster I have where there are some RBD images that
I mount directly from some clients, that is, they are not disks for
QEMU/Proxmox VMs. Maybe I have something misconfigured related to
this? This cluster is at least two and a half years old and never had
this issue with snaptrims.
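For anyone wanting to reproduce the checks above, these are the admin-socket commands involved, run on the host where the OSD daemon lives (osd.0 here is just a placeholder id):

```shell
# Dump object store key histograms; the output includes the
# num_pgmeta_omap counter quoted above.
ceph daemon osd.0 calc_objectstore_db_histogram

# Ask a running OSD to compact its RocksDB database.
ceph daemon osd.0 compact
```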
Thanks in advance!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx