Snaptrimming speed degrades with PG increase

"Szabo, Istvan (Agoda)" <Istvan.Szabo@xxxxxxxxx> · Fri, 29 Nov 2024 02:30:09 +0000

Hi,

When we increase the number of placement groups on a pool in an all-NVMe cluster, snaptrimming speed degrades a lot.
Currently we are running with the values below so that client ops are not degraded while snaptrimming still makes some progress, but the trimming speed is terrible. (Octopus 15.2.17 on Ubuntu 20.04)

--osd_max_trimming_pgs=2
--osd_snap_trim_sleep=0.1
--osd_pg_max_concurrent_snap_trims=2
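For readers who want to reproduce these throttles, here is a minimal sketch that only builds the command lines and does not run them. How the values were actually applied on this cluster is not stated, so both forms below (`ceph config set` and `ceph tell ... injectargs`) are assumptions, and the helper names are hypothetical.

```python
# Throttle values quoted in this message (Octopus 15.2.17).
SNAPTRIM_SETTINGS = {
    "osd_max_trimming_pgs": "2",
    "osd_snap_trim_sleep": "0.1",
    "osd_pg_max_concurrent_snap_trims": "2",
}


def config_set_commands(settings, who="osd"):
    """Return one `ceph config set` command line per option."""
    return [f"ceph config set {who} {name} {value}"
            for name, value in settings.items()]


def injectargs_command(settings, target="osd.*"):
    """Return a single `ceph tell ... injectargs` command line."""
    args = " ".join(f"--{name}={value}" for name, value in settings.items())
    return f"ceph tell {target} injectargs '{args}'"


if __name__ == "__main__":
    # Print the commands for review; nothing is executed against a cluster.
    for line in config_set_commands(SNAPTRIM_SETTINGS):
        print(line)
    print(injectargs_command(SNAPTRIM_SETTINGS))
```

Printing rather than executing keeps the sketch safe to run anywhere; the output can be reviewed and pasted on an admin node.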

We had a big pool that used to have 128 PGs, and snaptrimming took around 45-60 minutes.
Because it was impossible to do maintenance on the cluster with 600 GB PGs, since that can easily max out a cluster (which we did), we increased the pool to 1024 PGs, and the snaptrimming duration increased to 3.5 hours.
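To put the numbers above side by side, a rough back-of-the-envelope sketch follows. It assumes the pool holds the same data before and after the split and that data is spread evenly across PGs; neither is confirmed here, and the function names are only illustrative.

```python
# Figures quoted in this message.
OLD_PGS = 128
NEW_PGS = 1024
OLD_PG_SIZE_GB = 600              # approximate per-PG size at 128 PGs
OLD_TRIM_MINUTES = (45, 60)       # snaptrim duration range at 128 PGs
NEW_TRIM_MINUTES = 3.5 * 60       # snaptrim duration at 1024 PGs


def per_pg_size_gb(old_size_gb, old_pgs, new_pgs):
    """Per-PG size after the split, assuming evenly spread, unchanged data."""
    return old_size_gb * old_pgs / new_pgs


def slowdown_range(old_minutes, new_minutes):
    """Return (best case, worst case) slowdown factor of the trim duration."""
    fastest, slowest = old_minutes
    return new_minutes / slowest, new_minutes / fastest


if __name__ == "__main__":
    print(per_pg_size_gb(OLD_PG_SIZE_GB, OLD_PGS, NEW_PGS))   # 75.0
    low, high = slowdown_range(OLD_TRIM_MINUTES, NEW_TRIM_MINUTES)
    print(f"PG count x{NEW_PGS // OLD_PGS}, trim time x{low:.1f} to x{high:.1f}")
```

Under these assumptions each PG shrinks from about 600 GB to about 75 GB, while the PG count grows 8x and the total trim time grows roughly 3.5x to 4.7x.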

Is there any good solution that we are missing to fix this?

At the hardware level I've changed the server profile to tune some NUMA settings, but that doesn't seem to have helped.

Thank you
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx