Hi Angelo,

was this cluster upgraded (major version upgrade) before these issues
started? We observed similar problems with certain major-version upgrade
paths, and the only way to fix them was to re-deploy all OSDs step by step.

You can try a RocksDB compaction first. If that doesn't help, rebuilding
the OSDs might be the only way out. You should also confirm that all Ceph
daemons are on the same version and that require-osd-release reports the
same major version as well:

ceph report | jq '.osdmap.require_osd_release'

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Angelo Hongens <angelo@xxxxxxxxxx>
Sent: Saturday, August 19, 2023 9:58 AM
To: Patrick Donnelly
Cc: ceph-users@xxxxxxx
Subject: Re: snaptrim number of objects

On 07/08/2023 18:04, Patrick Donnelly wrote:
>> I'm trying to figure out what's happening to my backup cluster that
>> often grinds to a halt when cephfs automatically removes snapshots.
>
> CephFS does not "automatically" remove snapshots. Do you mean the
> snap_schedule mgr module?

Yup.

>> Almost all OSD's go to 100% CPU, ceph complains about slow ops, and
>> CephFS stops doing client i/o.
>
> What health warnings do you see? You can try configuring snap trim:
>
> https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_snap_trim_sleep

Mostly a lot of SLOW_OPS, and, I guess as a result of that,
MDS_CLIENT_LATE_RELEASE, MDS_CLIENT_OLDEST_TID, MDS_SLOW_METADATA_IO and
MDS_TRIM warnings.

>> That won't explain why my cluster bogs down, but at least it gives
>> some visibility. Running 17.2.6 everywhere by the way.
>
> Please let us know how configuring snaptrim helps or not.

When I set nosnaptrim, all I/O is immediately restored. When I unset
nosnaptrim, I/O stops again. One of the symptoms is that OSDs go to about
350% CPU per daemon.

I got the feeling for a while that setting osd_snap_trim_sleep_ssd to 1
helped. I have 120 HDD OSDs with WAL/journal on SSD; does it even use this
value? Everything seemed stable, but another few days passed and suddenly
removing a snapshot brought the cluster down again, so I guess that wasn't
the cause.

What I'm trying now is setting osd_max_trimming_pgs to 0 for all OSDs and
then slowly setting it back to 1 for a few of them. This seems to work for
a while, but it still brings the cluster down every now and then, and even
when it doesn't, the cluster is so slow it's almost unusable. This whole
troubleshooting process is taking weeks.

I just noticed that when the problem occurs, a lot of OSDs on one host (15
OSDs per host) start using a lot of CPU, even though, for example, only 3
OSDs on that machine have osd_max_trimming_pgs set to 1 and the rest are
set to 0. Disk doesn't seem to be the bottleneck. Restarting the daemons
solves the problem for a while, although the high CPU usage pops up on a
different OSD node every time.

I am at a loss here. I'm almost thinking it's some kind of bug in the OSD
daemons, but I have no idea how to troubleshoot this.

Angelo.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
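
For reference, a rough sketch of the version check and RocksDB compaction
Frank suggests above, assuming a 17.2.x cluster with admin access; the OSD
id (osd.0) and the data path are placeholders, and container-based
deployments will need the equivalent cephadm/podman invocation:

    # Confirm all daemons report the same release
    ceph versions

    # Confirm the osdmap release flag matches (as mentioned above)
    ceph report | jq '.osdmap.require_osd_release'

    # Online RocksDB compaction, one OSD at a time
    ceph tell osd.0 compact

    # Offline compaction with that OSD stopped (path depends on deployment)
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact

Compacting one OSD at a time keeps the extra load localized; online
compaction can itself cause slow ops while it runs.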
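
Similarly, a minimal sketch of the snaptrim throttling knobs discussed in
the thread; the values are illustrative, not recommendations:

    # Pause all snapshot trimming cluster-wide while investigating
    ceph osd set nosnaptrim
    ceph osd unset nosnaptrim

    # Sleep between trim operations; a non-zero osd_snap_trim_sleep should
    # take precedence over the per-device variants (_hdd, _ssd, _hybrid)
    ceph config set osd osd_snap_trim_sleep 2.0

    # Limit how many PGs per OSD may trim concurrently
    ceph config set osd osd_max_trimming_pgs 1

    # Lower the priority of trim work relative to client I/O
    ceph config set osd osd_snap_trim_priority 1

If memory serves, an OSD with HDD data and an SSD WAL/DB counts as
"hybrid", so osd_snap_trim_sleep_hybrid rather than the _ssd variant would
apply unless the generic osd_snap_trim_sleep is set to a non-zero value.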