Re: snaptrim number of objects

On 07/08/2023 18:04, Patrick Donnelly wrote:
>> I'm trying to figure out what's happening to my backup cluster that
>> often grinds to a halt when cephfs automatically removes snapshots.
>
> CephFS does not "automatically" remove snapshots. Do you mean the
> snap_schedule mgr module?

Yup.

>> Almost all OSD's go to 100% CPU, ceph complains about slow ops, and
>> CephFS stops doing client i/o.
>
> What health warnings do you see? You can try configuring snap trim:
>
> https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_snap_trim_sleep
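
For reference, applying that suggestion cluster-wide would look roughly like this (the value of 2 seconds is only an illustration, and osd.0 is just an example daemon to verify against):

  # throttle snap trimming; higher values trim more slowly
  ceph config set osd osd_snap_trim_sleep 2
  # check what a running OSD actually uses
  ceph config show osd.0 osd_snap_trim_sleep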

Mostly a looot of SLOW_OPS, and, I guess as a result of those,
MDS_CLIENT_LATE_RELEASE, MDS_CLIENT_OLDEST_TID, MDS_SLOW_METADATA_IO, and MDS_TRIM warnings.
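
(To see exactly which daemons the slow ops implicate, something like the following lists each warning with the OSDs involved:

  ceph health detail
)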


>> That won't explain why my cluster bogs down, but at least it gives
>> some visibility. Running 17.2.6 everywhere by the way.
>
> Please let us know how configuring snaptrim helps or not.
>

When I set nosnaptrim, all I/O immediately resumes. When I unset nosnaptrim, I/O stops again.
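
For clarity, that is the cluster-wide OSD flag:

  ceph osd set nosnaptrim     # pause all snapshot trimming
  ceph osd unset nosnaptrim   # resume trimming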

One of the symptoms is that OSDs go to about 350% CPU per daemon.

For a while I had the feeling that setting osd_snap_trim_sleep_ssd to 1 helped. I have 120 HDD OSDs with WAL/journal on SSD; do they even use this value? Everything seemed stable, but after another few days passed, removing a snapshot suddenly brought the cluster down again. So I guess that setting wasn't what had helped.
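
Something like the following should show which sleep value a given OSD actually runs with (osd.0 is just an example id; the _hdd/_ssd/_hybrid variants, if present in your release, are the device-class-specific defaults documented alongside the option linked above):

  # which sleep value does a hybrid OSD actually apply?
  ceph config show osd.0 | grep snap_trim_sleep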

Now what I'm trying is setting osd_max_trimming_pgs to 0 for all OSDs and then slowly setting it to 1 for a few of them. This seems to work for a while, but the cluster still goes down every now and then, and when it doesn't, it is so slow it's almost unusable.
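
In concrete terms, roughly the following (the OSD ids are only examples):

  # disable trimming everywhere...
  ceph config set osd osd_max_trimming_pgs 0
  # ...then allow one trimming PG at a time on a handful of OSDs
  ceph config set osd.12 osd_max_trimming_pgs 1
  ceph config set osd.47 osd_max_trimming_pgs 1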

This whole troubleshooting process is taking weeks. I just noticed that when the problem occurs, a lot of OSDs on a host (15 OSDs per host) start using a lot of CPU, even though, for example, only 3 OSDs on that machine have osd_max_trimming_pgs set to 1 and the rest to 0. The disks don't seem to be the bottleneck.
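
For what it's worth, which PGs are actually trimming (and therefore which OSDs should be busy) can be checked with something like:

  # PGs currently in a snaptrim / snaptrim_wait state
  ceph pg dump pgs_brief | grep snaptrim
  # per-OSD commit/apply latency, to rule out the disks
  ceph osd perf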

Restarting the daemons seems to solve the problem for a while, although the high CPU usage pops up on a different OSD node every time.

I am at a loss here. I'm starting to think it's some kind of bug in the OSD daemons, but I have no idea how to troubleshoot it.

Angelo.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx