Hi,

On 11/10/21 16:14, Christoph Adomeit wrote:
> But the cluster seemed to slowly "eat" storage space. So yesterday I decided to add 3 more NVMEs, 1 for each node. The second I added the first NVME as a Ceph OSD, the cluster started crashing. I had high loads on all OSDs and the OSDs were dying again and again until I set nodown, noout, noscrub and nodeep-scrub and removed the new OSD. Then the cluster recovered, but it had slow IO and lots of PGs in snaptrim and snaptrim_wait state.
You may have hit this issue: https://tracker.ceph.com/issues/52026. AFAIU there could be some untrimmed snapshots (visible in snaptrimq_len with `ceph pg dump pgs`) which are only trimmed once the PG is repeered. We experienced this during testing, but the root cause is not fully understood (at least not by me).
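For example, something along these lines should list the PGs that still have pending snaptrim work; note that the exact JSON layout and field name (assumed here to be pg_stats[].snap_trimq_len) may differ between releases:

    # List PGs with a non-empty snaptrim queue and the queue length
    ceph pg dump pgs -f json 2>/dev/null \
      | jq -r '.pg_stats[] | select(.snap_trimq_len > 0) | "\(.pgid) \(.snap_trimq_len)"'

    # Repeering a PG is what kicked off the pending trim in our tests (recent releases)
    ceph pg repeer <pgid>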
Maybe adding your new OSDs made the snaptrim state appear on various PGs, which then apparently affected your cluster.
> I made this smoother by setting --osd_snap_trim_sleep=3.0. Overnight the snaptrim_wait PGs went down to 0 and I had 15% more free space in the Ceph cluster. But during the day the snaptrim_wait count increased and increased. I then set osd_snap_trim_sleep back to 0.0 and most VMs had extremely high iowait or crashed. Now I did a ceph osd set nosnaptrim and the cluster is flying again: iowait 0 on all VMs, but the snaptrim_wait count is slowly increasing.
>
> How can I get the snaptrims running fast without affecting Ceph IO performance? My theory is that until yesterday the snaptrims were, for some reason, not running at all, and therefore the cluster was "eating" storage space. After the crash yesterday the snaptrims started again.
On our test cluster we actually decreased `osd_snap_trim_sleep` to 0.1s instead of the default 2s for hybrid OSDs, because the snaptrim we had would otherwise have lasted a few weeks, IIRC. We didn't notice any slowdowns, crashing HDD OSDs or anything like that (but this cluster doesn't have any real production workloads, so we may have overlooked some aspects).
In your case the default should come from `osd_snap_trim_sleep_ssd`, which is 0, so maybe on SSD/NVME OSDs snaptrim does affect performance (with the default settings at least)... Therefore you may want to set `osd_snap_trim_sleep` to something other than 0. The 0.1s sleep worked smoothly in our tests, but such a low value was only needed because I was stress-testing snapshots and there were many, many objects waiting for snaptrim. You could probably increase the value to be on the safe side; anything between 0.1s and the 3s you already tested is probably fine!
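As an illustration (the 1.0s value is just an arbitrary pick inside that range), on a recent release you could apply the sleep via the centralized config, or inject it at runtime, and then let trimming resume:

    # Throttle snaptrim on all OSDs; persisted and picked up at runtime
    ceph config set osd osd_snap_trim_sleep 1.0

    # Or inject it at runtime only, without persisting it
    ceph tell osd.* injectargs '--osd_snap_trim_sleep=1.0'

    # Once the throttle is in place, clear the flag you set earlier so snaptrim can run again
    ceph osd unset nosnaptrim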
Cheers,

-- 
Arthur Outhenin-Chalandre