Re: The snaptrim queue of PGs has not decreased for several days.

I had this problem once in the past and found that it was related to a particular OSD. To identify it, I ran "ceph pg dump | grep snaptrim | grep -v 'snaptrim_wait'" and noticed that the OSD shown in the "UP_PRIMARY" column was almost always the same.
So I restarted that OSD, and that was enough to unblock the snaptrim: the number of snaps in the queue then decreased and, after a few minutes, all snaptrims were completed.
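
In case it helps, the commands boil down to something like this (PG/OSD IDs and the Rook deployment name are placeholders; in your Rook setup, restarting the OSD probably means restarting its deployment):

  # list the snaptrimming PGs together with their primary OSD (UP_PRIMARY column)
  ceph pg dump pgs | grep snaptrim | grep -v snaptrim_wait
  # cross-check a single PG
  ceph pg <PG_ID> query | grep -i up_primary
  # restart the suspect OSD; with Rook, assuming the default rook-ceph namespace
  # and naming, that would be something like:
  kubectl -n rook-ceph rollout restart deployment rook-ceph-osd-<OSD_ID>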

Arnaud

On 20/08/2024 at 09:47, Eugen Block <eblock@xxxxxx> wrote:


Did you reduce the default values I mentioned? You could also look 
into the historic_ops of the primary OSD for one affected PG:


ceph tell osd.<OSD_ID> dump_historic_ops_by_duration


But I'm not sure if that will actually help here. There are plenty of 
places to look; you could also turn on debug logging on one primary OSD 
and inspect the output.
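
If you want to try the debug route, something along these lines should work (the OSD ID is a placeholder; note the current level first so you can revert):

ceph tell osd.<OSD_ID> config get debug_osd
ceph tell osd.<OSD_ID> config set debug_osd 10
# let a snaptrim attempt run for a few minutes, check the OSD log
# (journalctl or the container log, depending on your deployment), then revert:
ceph tell osd.<OSD_ID> config set debug_osd <previous value>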


I just get the feeling that this is one of the corner cases with too 
few OSDs, although the cluster load seems to be low.


Quoting Giovanna Ratini <giovanna.ratini@xxxxxxxxxxxxxxx>:


> Hello Eugen,
>
> yesterday, after stopping and restarting snaptrim, the queue decreased 
> a little and then remained stuck.
> It didn't grow and it didn't decrease.
>
> Is that good or bad?
>
>
> On 19.08.2024 at 15:43, Eugen Block wrote:
>> There's a lengthy thread [0] where several approaches are proposed. 
>> The most drastic is recreating the OSD, but that's the last resort, of course.
>>
>> What are the current values for these configs?
>>
>> ceph config get osd osd_pg_max_concurrent_snap_trims
>> ceph config get osd osd_max_trimming_pgs
>>
>> Maybe decrease them to 1 each while the nosnaptrim flag is set, 
>> then unset it. You could also try online (and/or offline) OSD 
>> compaction before unsetting the flag. Are the OSD processes 
>> utilizing an entire CPU?
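>>
>> Something like this, for example (OSD IDs are placeholders):
>>
>> ceph config set osd osd_pg_max_concurrent_snap_trims 1
>> ceph config set osd osd_max_trimming_pgs 1
>> ceph tell osd.<OSD_ID> compact   # online compaction, one OSD at a time
>> ceph osd unset nosnaptrim
>>
>> For the CPU question, top/htop on the OSD hosts is enough: an osd process 
>> pinned at ~100% of one core while its PG sits in snaptrim would be a hint.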
>>
>> [0] https://www.spinics.net/lists/ceph-users/msg75626.html
>>
>> Quoting Giovanna Ratini <giovanna.ratini@xxxxxxxxxxxxxxx>:
>>
>>> Hello Eugen,
>>>
>>> yes, the load is not too high for now.
>>>
>>> I stopped the snaptrim and this is the output now. No change in the queue.
>>>
>>> root@kube-master02:~# k ceph -s
>>> Info: running 'ceph' command with args: [-s]
>>>   cluster:
>>>     id:     3a35629a-6129-4daf-9db6-36e0eda637c7
>>>     health: HEALTH_WARN
>>>             nosnaptrim flag(s) set
>>>             32 pgs not deep-scrubbed in time
>>>             32 pgs not scrubbed in time
>>>
>>>   services:
>>>     mon: 3 daemons, quorum bx,bz,ca (age 30h)
>>>     mgr: a(active, since 29h), standbys: b
>>>     mds: 1/1 daemons up, 1 hot standby
>>>     osd: 6 osds: 6 up (since 21h), 6 in (since 6d)
>>>          flags nosnaptrim
>>>
>>>   data:
>>>     volumes: 1/1 healthy
>>>     pools:   4 pools, 97 pgs
>>>     objects: 4.21M objects, 2.5 TiB
>>>     usage:   7.7 TiB used, 76 TiB / 84 TiB avail
>>>     pgs:     65 active+clean
>>>              32 active+clean+snaptrim_wait
>>>
>>>   io:
>>>     client: 7.4 MiB/s rd, 7.9 MiB/s wr, 11 op/s rd, 35 op/s wr
>>>
>>> On 19.08.2024 at 14:54, Eugen Block wrote:
>>>> What happens when you disable snaptrimming entirely?
>>>>
>>>> ceph osd set nosnaptrim
>>>>
>>>> So the load on your cluster seems low, but are the OSDs heavily 
>>>> utilized? Have you checked iostat?
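>>>>
>>>> For example, on each OSD host (iostat comes with the sysstat package):
>>>>
>>>> iostat -x 5
>>>>
>>>> and watch %util and the await columns of the OSD data devices while 
>>>> the PGs are in snaptrim.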
>>>>
>>>> Quoting Giovanna Ratini <giovanna.ratini@xxxxxxxxxxxxxxx>:
>>>>
>>>>> Hello Eugen,
>>>>>
>>>>> root@kube-master02:~# k ceph -s
>>>>>
>>>>> Info: running 'ceph' command with args: [-s]
>>>>>   cluster:
>>>>>     id:     3a35629a-6129-4daf-9db6-36e0eda637c7
>>>>>     health: HEALTH_WARN
>>>>>             32 pgs not deep-scrubbed in time
>>>>>             32 pgs not scrubbed in time
>>>>>
>>>>>   services:
>>>>>     mon: 3 daemons, quorum bx,bz,ca (age 13h)
>>>>>     mgr: a(active, since 13h), standbys: b
>>>>>     mds: 1/1 daemons up, 1 hot standby
>>>>>     osd: 6 osds: 6 up (since 5h), 6 in (since 5d)
>>>>>
>>>>>   data:
>>>>>     volumes: 1/1 healthy
>>>>>     pools:   4 pools, 97 pgs
>>>>>     objects: 4.20M objects, 2.5 TiB
>>>>>     usage:   7.7 TiB used, 76 TiB / 84 TiB avail
>>>>>     pgs:     65 active+clean
>>>>>              20 active+clean+snaptrim_wait
>>>>>              12 active+clean+snaptrim
>>>>>
>>>>>   io:
>>>>>     client: 3.5 MiB/s rd, 3.6 MiB/s wr, 6 op/s rd, 12 op/s wr
>>>>>
>>>>> If I understand the documentation correctly, scrubs will never run 
>>>>> unless the PGs (Placement Groups) are active and clean.
>>>>>
>>>>> All 32 PGs of the CephFS pool have been in these states for several days:
>>>>>
>>>>> * 20 active+clean+snaptrim_wait
>>>>> * 12 active+clean+snaptrim
>>>>>
>>>>> Today I restarted the MON, MGR, and MDS, but the queue keeps growing.
>>>>>
>>>>> On 18.08.2024 at 18:39, Eugen Block wrote:
>>>>>> Can you share the current ceph status? Are the OSDs reporting 
>>>>>> anything suspicious? How is the disk utilization?
>>>>>>
>>>>>> Quoting Giovanna Ratini <giovanna.ratini@xxxxxxxxxxxxxxx>:
>>>>>>
>>>>>>> More information:
>>>>>>>
>>>>>>> The snaptrim takes a lot of time, but objects_trimmed is "0":
>>>>>>>
>>>>>>> "objects_trimmed": 0,
>>>>>>> "snaptrim_duration": 500.58076017500002,
>>>>>>>
>>>>>>> That could explain why the queue keeps growing.
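>>>>>>>
>>>>>>> (Those two values can be pulled per PG with something like
>>>>>>> "ceph pg <PG_ID> query | grep -E 'objects_trimmed|snaptrim_duration'",
>>>>>>> assuming the release exposes them in the query output, which seems 
>>>>>>> to be the case here.)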
>>>>>>>
>>>>>>>
>>>>>>> On 17.08.2024 at 14:37, Giovanna Ratini wrote:
>>>>>>>> Hello again,
>>>>>>>>
>>>>>>>> I checked the pg dump. The snapshot queue keeps growing:
>>>>>>>>
>>>>>>>> Query for PG 3.12:
>>>>>>>> {
>>>>>>>> "snap_trimq": 
>>>>>>>> "[5b974~3b,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1]",
>>>>>>>> * "snap_trimq_len": 5421,*
>>>>>>>> "state": "active+clean+snaptrim",
>>>>>>>> "epoch": 734130,
>>>>>>>>
>>>>>>>> Query for PG 3.12 (a bit later):
>>>>>>>> {
>>>>>>>> "snap_trimq": 
>>>>>>>> "[5b976~39,5ba53~1,5ba56~a0,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1,5f875~a1]",
>>>>>>>> * "snap_trimq_len": 5741,*
>>>>>>>> "state": "active+clean+snaptrim",
>>>>>>>> "epoch": 734240,
>>>>>>>> "up": [
>>>>>>>>
>>>>>>>> Do you know a way to see whether the snaptrim "process" is actually working?
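>>>>>>>>
>>>>>>>> Right now all I do is re-run the query and compare, roughly:
>>>>>>>>
>>>>>>>> while true; do ceph pg 3.12 query | grep snap_trimq_len; sleep 300; done
>>>>>>>>
>>>>>>>> assuming that a shrinking snap_trimq_len would mean the trimmer is 
>>>>>>>> making progress (here it only grows).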
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Gio
>>>>>>>>
>>>>>>>>
>>>>>>>> On 17.08.2024 at 12:59, Giovanna Ratini wrote:
>>>>>>>>> Hello Eugen,
>>>>>>>>>
>>>>>>>>> thank you for your answer.
>>>>>>>>>
>>>>>>>>> I restarted all the kube-ceph nodes one after the other. 
>>>>>>>>> Nothing has changed.
>>>>>>>>>
>>>>>>>>> OK, I deactivated the snapshot schedule: ceph fs snap-schedule deactivate /
>>>>>>>>>
>>>>>>>>> Is there a way to see how many snapshots will be deleted per hour?
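>>>>>>>>>
>>>>>>>>> (So far the only things I found are "ceph fs snap-schedule status /",
>>>>>>>>> which shows created_count/pruned_count, and listing the .snap directory
>>>>>>>>> on a CephFS mount, e.g. "ls /mnt/cephfs/.snap", assuming the fs is 
>>>>>>>>> mounted somewhere. Neither gives a per-hour rate.)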
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Gio
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 17.08.2024 at 10:12, Eugen Block wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> have you tried to fail the mgr? Sometimes the PG stats are 
>>>>>>>>>> not correct. You could also temporarily disable snapshots 
>>>>>>>>>> to see if things settle down.
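>>>>>>>>>>
>>>>>>>>>> Failing the mgr is just:
>>>>>>>>>>
>>>>>>>>>> ceph mgr fail
>>>>>>>>>>
>>>>>>>>>> (or "ceph mgr fail <active_mgr_name>" on older releases); the standby 
>>>>>>>>>> takes over and the PG stats get reported freshly.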
>>>>>>>>>>
>>>>>>>>>> Quoting Giovanna Ratini <giovanna.ratini@xxxxxxxxxxxxxxx>:
>>>>>>>>>>
>>>>>>>>>>> Hello all,
>>>>>>>>>>>
>>>>>>>>>>> We use Ceph (v18.2.2) and Rook (1.14.3) as the CSI for a 
>>>>>>>>>>> Kubernetes environment. Last week, we had a problem with 
>>>>>>>>>>> the MDS falling behind on trimming every 4-5 days (GitHub 
>>>>>>>>>>> issue link). We resolved the issue using the steps 
>>>>>>>>>>> outlined in the GitHub issue.
>>>>>>>>>>>
>>>>>>>>>>> We have 3 hosts (I know, I need to increase this as soon 
>>>>>>>>>>> as possible, and I will!) and 6 OSDs. After running the 
>>>>>>>>>>> commands:
>>>>>>>>>>>
>>>>>>>>>>> ceph config set mds mds_dir_max_commit_size 80,
>>>>>>>>>>>
>>>>>>>>>>> ceph fs fail <fs_name>, and
>>>>>>>>>>>
>>>>>>>>>>> ceph fs set <fs_name> joinable true,
>>>>>>>>>>>
>>>>>>>>>>> After that, the snaptrim queue for our PGs has stopped 
>>>>>>>>>>> decreasing. All PGs of our CephFS are in either 
>>>>>>>>>>> active+clean+snaptrim_wait or active+clean+snaptrim 
>>>>>>>>>>> states. For example, the PG 3.12 is in the 
>>>>>>>>>>> active+clean+snaptrim state, and its snap_trimq_len was 
>>>>>>>>>>> 4077 yesterday but has increased to 4538 today.
>>>>>>>>>>>
>>>>>>>>>>> I increased the osd_snap_trim_priority to 10 (ceph config 
>>>>>>>>>>> set osd osd_snap_trim_priority 10), but it didn't help. 
>>>>>>>>>>> Only the PGs of our CephFS have this problem.
>>>>>>>>>>>
>>>>>>>>>>> Do you have any ideas on how we can resolve this issue?
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>> Giovanna
>>>>>>>>>>> P.S.: I'm not a Ceph expert :-).
>>>>>>>>>>> Faulkener asked me for more information, so here it is:
>>>>>>>>>>> MDS Memory: 11GB
>>>>>>>>>>> mds_cache_memory_limit: 11,811,160,064 bytes
>>>>>>>>>>>
>>>>>>>>>>> root@kube-master02:~# ceph fs snap-schedule status /
>>>>>>>>>>> {
>>>>>>>>>>> "fs": "rook-cephfs",
>>>>>>>>>>> "subvol": null,
>>>>>>>>>>> "path": "/",
>>>>>>>>>>> "rel_path": "/",
>>>>>>>>>>> "schedule": "3h",
>>>>>>>>>>> "retention": {"h": 24, "w": 4},
>>>>>>>>>>> "start": "2024-05-05T00:00:00",
>>>>>>>>>>> "created": "2024-05-05T17:28:18",
>>>>>>>>>>> "first": "2024-05-05T18:00:00",
>>>>>>>>>>> "last": "2024-08-15T18:00:00",
>>>>>>>>>>> "last_pruned": "2024-08-15T18:00:00",
>>>>>>>>>>> "created_count": 817,
>>>>>>>>>>> "pruned_count": 817,
>>>>>>>>>>> "active": true
>>>>>>>>>>> }
>>>>>>>>>>> I do not understand whether the snapshots in the PGs are 
>>>>>>>>>>> correlated with the snapshots on CephFS. Until we 
>>>>>>>>>>> encountered the issue with the "MDS falling behind on 
>>>>>>>>>>> trimming every 4-5 days," we didn't have any problems with 
>>>>>>>>>>> snapshots.
>>>>>>>>>>>
>>>>>>>>>>> Could someone explain this to me or point me to the documentation?
>>>>>>>>>>> Thank you
>
> -- 
> Giovanna Ratini
> Mail: ratini@xxxxxxxxxxxxxxxxxxxxxxxxx
> Phone: +49 (0) 7531 88 - 4550
>
> Technical Support
> Data Analysis and Visualization Group
> Department of Computer and Information Science
> University of Konstanz (Box 78)
> Universitätsstr. 10
> 78457 Konstanz, Germany



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



