Re: Very slow snaptrim operations blocking client I/O

This might be due to tombstone accumulation in RocksDB. You can try
issuing a compaction on all of your OSDs and see if that helps (ceph tell
osd.XXX compact). I usually prefer to do this one host at a time just
in case it causes issues, though on a reasonably fast RBD cluster you
can often get away with compacting everything at once.
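
If it helps, a rough sketch of the one-host-at-a-time approach, assuming your
release has "ceph osd ls-tree" to list the OSD ids under a host (replace
HOSTNAME with each of your hosts in turn):

    for id in $(ceph osd ls-tree HOSTNAME); do
        ceph tell osd.$id compact
    done

Let the compactions on one host finish before moving on to the next one.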

Josh

On Fri, Jan 27, 2023 at 6:52 AM Victor Rodriguez
<vrodriguez@xxxxxxxxxxxxx> wrote:
>
> Hello,
>
> Asking for help with an issue. Maybe someone has a clue about what's
> going on.
>
> Using ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I removed
> it. A bit later, nearly half of the PGs of the pool entered the snaptrim and
> snaptrim_wait states, as expected. The problem is that these operations
> ran extremely slowly and client I/O dropped to nearly nothing, so all VMs in
> the cluster got stuck because they could not perform I/O to the storage.
> Taking and removing big snapshots is a normal operation that we do often,
> and this is the first time I've seen this issue in any of my clusters.
>
> Disks are all Samsung PM1733 and the network is 25G. This gives us plenty
> of performance for our use case, and we have never had an issue with the
> hardware.
>
> Both disk I/O and network I/O were very low. Still, client I/O seemed to
> get queued forever. Disabling snaptrim (ceph osd set nosnaptrim) stops
> any active snaptrim operation and client I/O returns to normal. Enabling
> snaptrim again makes client I/O almost halt again.
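>
> For reference, toggling the flag is just a set/unset:
>
> ceph osd set nosnaptrim
> ceph osd unset nosnaptrim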
>
> I've been playing with some settings:
>
> ceph tell 'osd.*' injectargs '--osd-max-trimming-pgs 1'
> ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 30'
> ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep-ssd 30'
> ceph tell 'osd.*' injectargs '--osd-pg-max-concurrent-snap-trims 1'
>
> None really seemed to help. Also tried restarting OSD services.
>
> This cluster was upgraded from 14.2.x to 15.2.17 a couple of months ago. Is
> there any setting that must be changed after the upgrade which may be
> causing this problem?
>
> I have scheduled a maintenance window. What should I look for to
> diagnose this problem?
>
> Any help is very appreciated. Thanks in advance.
>
> Victor


