Just a quick update on this topic. I assisted Giovanna directly
off-list. For now the issue seems resolved, although I don't think we
really fixed anything; rather, we got rid of the current symptoms.
A couple of findings for posterity:
- There's a k8s pod creating new snap-schedules every couple of hours
or so; we removed dozens of them, around 30 in total.
- We removed all existing CephFS snapshots after mounting the root
dir; this hasn't had any effect on the snaptrims yet.
- We increased the number of parallel snaptrim operations to 32, since
the NVMe OSDs were basically idle. That only marked all 32 PGs as
snaptrimming (none remained in snaptrim_wait), but still no real
progress was visible. Inspecting the OSD logs at debug level 10 didn't
reveal anything obvious.
- We then increased pg_num to 64 (and disabled the autoscaler for this
pool), since 'ceph osd df' showed only around 40 PGs per OSD. This
actually did slowly get rid of the snaptrimming PGs while backfilling.
Yay!
- All config changes have been reset to their defaults; rough command
examples are below.
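For reference, the knobs involved were roughly these. The pool name is a
placeholder for your CephFS data pool, and I'm assuming the snaptrim
parallelism maps to osd_max_trimming_pgs (osd_pg_max_concurrent_snap_trims
is the other related option):

  # let more PGs per OSD trim at the same time (default is 2)
  ceph config set osd osd_max_trimming_pgs 32
  # grow the pool and keep the autoscaler from undoing it
  ceph osd pool set <cephfs_data_pool> pg_autoscale_mode off
  ceph osd pool set <cephfs_data_pool> pg_num 64
  # afterwards, drop the override again to fall back to the default
  ceph config rm osd osd_max_trimming_pgs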
My interpretation is that the ever-growing number of snap-schedules
accumulated so many snapshots that trimming couldn't keep up. Here's a
snippet of the queue (from the 'ceph osd pool ls detail' output):
removed_snaps_queue
[5b3ee~1,5be5c~6f,5bf71~a1,5c0b8~a1,5c1fd~2,5c201~9f,5c346~a1,5c48d~a1,5c5d0~1,5c71a~1,5c85d~1,5c85f~1,5c861~1,5c865~1,5c9a6~1,5c9a8~1,5c9aa~1,5c9ad~1,5caef~1,5caf1~1,5caf3~1,5caf6~1,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~4,5ce2a~1,5ce2c~1,5ce2f~a3,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a7,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a9,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a9,5f0bd~a7,5f166~2,5f206~a9,5f34f~a9,5f499~a9,5f5e3~a7,5f68c~2,5f72d~a7,5f875~a1,5f9b7~a7,5fa61~2,5fb01~a7,5fba9~1,5fbab~1,5fc4b~a7,5fcf3~2,5fd95~a9,5fedf~a7,5ff88~2,60028~a1,600ca~6,600d1~1,600d3~1,60173~a7,6021c~2,602bd~bd]
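Each entry is a snapid~count interval (in hex), so the queue above alone
covers thousands of snapshots. A crude way to watch whether it shrinks,
assuming your release prints the queue on the same line as the
removed_snaps_queue keyword:

  ceph osd pool ls detail | grep removed_snaps_queue | tr ',' '\n' | wc -l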
My assumption was that with more PGs, more trimming could happen in
parallel and finally catch up. I also suspect this could have something
to do with mclock, although I have no real evidence except a thread I
found yesterday [1].
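If you want to double-check which scheduler and profile the OSDs are
actually running, something like this should show it (osd.0 is just an
example daemon):

  ceph config get osd osd_op_queue        # mclock_scheduler or wpq
  ceph config show osd.0 osd_op_queue     # what a running OSD really uses
  ceph config get osd osd_mclock_profile  # e.g. balanced, high_client_ops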
I recommended keeping an eye on the removed_snaps_queue, as well as
checking the pod that creates so many snap-schedules (by the way, they
were all exactly the same, same retention time etc.) and modifying it
so it doesn't flood the cluster with unnecessary snapshots.
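To spot duplicated schedules like the ones we removed, the snap_schedule
module can list them; path and arguments below are placeholders:

  ceph fs snap-schedule list / --recursive
  ceph fs snap-schedule status /
  ceph fs snap-schedule remove <path> [<repeat>] [<start>]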
If this situation comes up again, we'll try it with the wpq scheduler
instead of mclock, or search for better mclock settings. But since the
general recommendation to stick with wpq hasn't been revoked yet, it
might be the better approach anyway.
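If we do switch, it would be something like this; note that osd_op_queue
is only read at OSD startup, so the OSDs need to be restarted afterwards:

  ceph config set osd osd_op_queue wpq
  # then restart the OSDs one by one (via Rook in this setup)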
We'll see how it goes.
[1] https://www.spinics.net/lists/ceph-users/msg78514.html
Quoting Giovanna Ratini <giovanna.ratini@xxxxxxxxxxxxxxx>:
Hello Eugen,
Hi (please don't drop the ML from your responses),
Sorry. I didn't pay attention. I will.
All PGs of pool cephfs are affected and they are in all OSDs
Then just pick a random one and check if anything stands out. I'm
not sure if you mentioned it already; did you also try restarting
OSDs?
Yes, I've done everything, including compaction, reducing defaults,
and OSD restarts.
The growth seems to have stopped, but there hasn't been a decrease.
It appears that only the CephFS pool is problematic. I'm an Oracle
admin and I don't have much experience with Ceph, so my questions
might seem a bit naive.
I have a lot of space in this cluster. Could I create a new cephfs
pool (cephfs01) and copy the data over to it?
Then I would change the name of the pool in Rook and hope that the
pods will find their PVs.
Regards,
Gio
Oh, not yesterday. I'll do it now, then I'll compact all OSDs with
nosnaptrim set. Should I add OSDs?
Let's wait for the other results first (compaction, reducing
defaults, OSD restart). If that doesn't change anything, I would
probably try to add three more OSDs. I assume you have three hosts?
Quoting Giovanna Ratini <giovanna.ratini@xxxxxxxxxxxxxxx>:
Hello Eugen,
On 20.08.2024 at 09:44, Eugen Block wrote:
You could also look into the historic_ops of the primary OSD for
one affected PG:
All PGs of pool cephfs are affected and they are in all OSDs :-(
Did you reduce the default values I mentioned?
Oh, not yesterday. I'll do it now, then I'll compact all OSDs with
nosnaptrim set. Should I add OSDs?
Regards,
Gio
ceph tell osd.<OSD_ID> dump_historic_ops_by_duration
But I'm not sure if that can actually help here. There are plenty
of places to look at; you could turn on debug logs on one primary
OSD and inspect the output.
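Something along these lines should do it, and don't forget to revert,
the logs get noisy quickly:

  ceph tell osd.<OSD_ID> config set debug_osd 10
  # reproduce / wait a bit, inspect the OSD log, then revert:
  ceph tell osd.<OSD_ID> config set debug_osd 1/5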
I just get the feeling that this is one of the corner cases with
too few OSDs, although the cluster load seems to be low.
Quoting Giovanna Ratini <giovanna.ratini@xxxxxxxxxxxxxxx>:
Hello Eugen,
yesterday, after stopping and restarting snaptrim, the queue decreased
a little and then remained stuck.
It didn't grow and didn't decrease.
Is that good or bad?
On 19.08.2024 at 15:43, Eugen Block wrote:
There's a lengthy thread [0] where several approaches are
proposed. The worst is an OSD recreation, but that's the last
resort, of course.
What are the current values for these configs?
ceph config get osd osd_pg_max_concurrent_snap_trims
ceph config get osd osd_max_trimming_pgs
Maybe decrease them to 1 each while the nosnaptrim flag is set,
then unset it. You could also try online (and/or offline) OSD
compaction before unsetting the flag. Are the OSD processes
utilizing an entire CPU?
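For the decrease-and-unset part, roughly like this (untested sequence,
adjust to your setup):

  ceph osd set nosnaptrim
  ceph config set osd osd_pg_max_concurrent_snap_trims 1
  ceph config set osd osd_max_trimming_pgs 1
  ceph tell 'osd.*' compact      # online compaction, optional
  ceph osd unset nosnaptrim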
[0] https://www.spinics.net/lists/ceph-users/msg75626.html
Quoting Giovanna Ratini <giovanna.ratini@xxxxxxxxxxxxxxx>:
Hello Eugen,
yes, the load is not too high for now.
I stopped the snaptrim and this is the output now. No changes in the queue.
root@kube-master02:~# k ceph -s
Info: running 'ceph' command with args: [-s]
  cluster:
    id:     3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
            nosnaptrim flag(s) set
            32 pgs not deep-scrubbed in time
            32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 30h)
    mgr: a(active, since 29h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 21h), 6 in (since 6d)
         flags nosnaptrim

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.21M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs:     65 active+clean
             32 active+clean+snaptrim_wait

  io:
    client:   7.4 MiB/s rd, 7.9 MiB/s wr, 11 op/s rd, 35 op/s wr
On 19.08.2024 at 14:54, Eugen Block wrote:
What happens when you disable snaptrimming entirely?
ceph osd set nosnaptrim
So the load on your cluster seems low, but are the OSDs
heavily utilized? Have you checked iostat?
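For the utilization question, plain iostat from sysstat is enough; the
device names are just examples:

  iostat -x 5
  # watch %util and the await columns for the OSD data devices (e.g. nvme0n1)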
Quoting Giovanna Ratini <giovanna.ratini@xxxxxxxxxxxxxxx>:
Hello Eugen,
root@kube-master02:~# k ceph -s
Info: running 'ceph' command with args: [-s]
  cluster:
    id:     3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
            32 pgs not deep-scrubbed in time
            32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 13h)
    mgr: a(active, since 13h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 5h), 6 in (since 5d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.20M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs:     65 active+clean
             20 active+clean+snaptrim_wait
             12 active+clean+snaptrim

  io:
    client:   3.5 MiB/s rd, 3.6 MiB/s wr, 6 op/s rd, 12 op/s wr
If I understand the documentation correctly, I will never
have a scrub unless the PGs (Placement Groups) are active
and clean.
All 32 PGs of the CephFS pool have been in this status for
several days:
* 20 active+clean+snaptrim_wait
* 12 active+clean+snaptrim
Today, I restarted the MON, MGR, and MDS, but there was no change in
the growth.
On 18.08.2024 at 18:39, Eugen Block wrote:
Can you share the current ceph status? Are the OSDs
reporting anything suspicious? How is the disk utilization?
Quoting Giovanna Ratini <giovanna.ratini@xxxxxxxxxxxxxxx>:
More information:
The snaptrim takes a lot of time, but objects_trimmed is 0:
"objects_trimmed": 0,
"snaptrim_duration": 500.58076017500002,
That could explain why the queue keeps growing.
On 17.08.2024 at 14:37, Giovanna Ratini wrote:
Hello again,
I checked the PG dump. The snapshot queue keeps growing.
Query for PG: 3.12
{
"snap_trimq":
"[5b974~3b,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1]",
* "snap_trimq_len": 5421,*
"state": "active+clean+snaptrim",
"epoch": 734130,
Query for PG: 3.12
{
"snap_trimq":
"[5b976~39,5ba53~1,5ba56~a0,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1,5f875~a1]",
* "snap_trimq_len": 5741,*
"state": "active+clean+snaptrim",
"epoch": 734240,
"up": [
Do you know a way to see whether the snaptrim "process" is actually working?
Best regards,
Gio
On 17.08.2024 at 12:59, Giovanna Ratini wrote:
Hello Eugen,
thank you for your answer.
I restarted all the kube-ceph nodes one after the other.
Nothing has changed.
OK, I deactivated the snapshot schedule: ceph fs snap-schedule
deactivate /
Is there a way to see how many snapshots will be deleted
per hour?
Regards,
Gio
On 17.08.2024 at 10:12, Eugen Block wrote:
Hi,
have you tried to fail the mgr? Sometimes the PG stats
are not correct. You could also temporarily disable
snapshots to see if things settle down.
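Failing over the mgr is just (in recent releases, without arguments it
fails the currently active one):

  ceph mgr fail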
Quoting Giovanna Ratini <giovanna.ratini@xxxxxxxxxxxxxxx>:
Hello all,
We use Ceph (v18.2.2) and Rook (1.14.3) as the CSI for
a Kubernetes environment. Last week, we had a problem
with the MDS falling behind on trimming every 4-5 days
(GitHub issue link). We resolved the issue using the
steps outlined in the GitHub issue.
We have 3 hosts (I know, I need to increase this as
soon as possible, and I will!) and 6 OSDs. After
running the following commands:
ceph config set mds mds_dir_max_commit_size 80
ceph fs fail <fs_name>
ceph fs set <fs_name> joinable true
the snaptrim queue for our PGs stopped decreasing.
All PGs of our CephFS are in either
active+clean+snaptrim_wait or active+clean+snaptrim
states. For example, the PG 3.12 is in the
active+clean+snaptrim state, and its snap_trimq_len
was 4077 yesterday but has increased to 4538 today.
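For reference, I'm reading that number from the PG query output,
roughly like this:

  ceph pg 3.12 query | grep snap_trimq_len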
I increased the osd_snap_trim_priority to 10 (ceph
config set osd osd_snap_trim_priority 10), but it
didn't help. Only the PGs of our CephFS have this
problem.
Do you have any ideas on how we can resolve this issue?
Thanks in advance,
Giovanna
P.S.: I'm not a Ceph expert :-).
Faulkener asked me for more information, so here it is:
MDS Memory: 11GB
mds_cache_memory_limit: 11,811,160,064 bytes
root@kube-master02:~# ceph fs snap-schedule status /
{
"fs": "rook-cephfs",
"subvol": null,
"path": "/",
"rel_path": "/",
"schedule": "3h",
"retention": {"h": 24, "w": 4},
"start": "2024-05-05T00:00:00",
"created": "2024-05-05T17:28:18",
"first": "2024-05-05T18:00:00",
"last": "2024-08-15T18:00:00",
"last_pruned": "2024-08-15T18:00:00",
"created_count": 817,
"pruned_count": 817,
"active": true
}
I do not understand if the snapshots in the PGs are
correlated with the snapshots on CephFS. Until we
encountered the issue with the "MDS falling behind on
trimming every 4-5 days," we didn't have any problems
with snapshots.
Could someone explain this to me or point me to the documentation?
Thank you
--
Giovanna Ratini
Mail: ratini@xxxxxxxxxxxxxxxxxxxxxxxxx
Phone: +49 (0) 7531 88 - 4550
Technical Support
Data Analysis and Visualization Group
Department of Computer and Information Science
University of Konstanz (Box 78)
Universitätsstr. 10
78457 Konstanz, Germany
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx