Snaptrim making cluster unusable

Hi all,

We are running a small cluster with three nodes and 6-8 OSDs each.
The OSDs are SSDs ranging from 2 to 4 TB. The CRUSH map is configured so that all data is replicated to each node.
We are on Ceph 15.2.6.
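
For context, the CRUSH rule here is just the standard chooseleaf-by-host replicated rule, which with three hosts and size 3 puts one copy on each node. Roughly (rule name illustrative):

    rule replicated_per_host {
        id 0
        type replicated
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }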

Today I deleted four snapshots belonging to the same two RBD volumes (400 GB and 500 GB).
Shortly after issuing the deletes, the cluster became unresponsive to the point that almost all our services went down due to high I/O latency.

After a while, I noticed about 20 PGs actively in snaptrim plus another 200 or so in snaptrim_wait.
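
For reference, I counted those from the PG states with something like:

    # all PGs with snaptrim in their state (trimming + queued)
    ceph pg dump pgs_brief | grep -c snaptrim
    # just the queued ones
    ceph pg dump pgs_brief | grep -c snaptrim_wait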

I tried setting

    osd_snap_trim_sleep = 3
    osd_pg_max_concurrent_snap_trims = 1
    rbd_balance_snap_reads = true
    rbd_localize_snap_reads = true
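
I applied the OSD-side options roughly like this (note that the two rbd_* options are librbd client settings, so as far as I understand they only affect where clients read snapshot data, not the OSD-side trimming):

    # persist the throttles in the mon config database
    ceph config set osd osd_snap_trim_sleep 3
    ceph config set osd osd_pg_max_concurrent_snap_trims 1

    # and/or inject into the running OSDs for immediate effect
    ceph tell 'osd.*' injectargs '--osd_snap_trim_sleep 3 --osd_pg_max_concurrent_snap_trims 1'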


Still, the only way to make the cluster responsive again was to set osd_pg_max_concurrent_snap_trims to 0 and thereby disable snaptrimming entirely. I tried a few other options, but whenever snaptrims are running on a significant number of PGs, the cluster becomes completely unusable.
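
For reference, that looked like:

    # 0 disables snap trimming on the OSDs (this is what finally restored responsiveness here)
    ceph tell 'osd.*' injectargs '--osd_pg_max_concurrent_snap_trims 0'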

Are there any other options for throttling snaptrimming that I haven't tried yet?


Thank you,

Pascal
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


