More information:
The snaptrim takes a lot of time, but objects_trimmed is "0":
"objects_trimmed": 0,
"snaptrim_duration": 500.58076017500002,
That could explain why the queues keep growing.
On 17.08.2024 at 14:37, Giovanna Ratini wrote:
Hello again,
I checked the PG dump. The snapshot trim queue keeps growing.
Query for PG 3.12:
{
"snap_trimq":
"[5b974~3b,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1]",
"snap_trimq_len": 5421,
"state": "active+clean+snaptrim",
"epoch": 734130,
Query for PG 3.12:
{
"snap_trimq":
"[5b976~39,5ba53~1,5ba56~a0,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1,5f875~a1]",
"snap_trimq_len": 5741,
"state": "active+clean+snaptrim",
"epoch": 734240,
"up": [
Do you know a way to check whether the snaptrim "process" is actually working?
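One way to compare two such queries is to expand each snap_trimq string into the set of queued snapshot IDs and diff the sets. The sketch below assumes each entry has the form start~length with both values in hexadecimal; that reading of the format is an assumption, and the function names are hypothetical. The sample strings are shortened excerpts of the two dumps above.

```python
def parse_snap_trimq(text):
    """Parse '[start~len,...]' into a list of (start, length) integer pairs.

    Assumption: both fields are hexadecimal.
    """
    body = text.strip().strip("[]")
    intervals = []
    for item in body.split(","):
        if not item:
            continue
        start_hex, length_hex = item.split("~")
        intervals.append((int(start_hex, 16), int(length_hex, 16)))
    return intervals


def queued_snap_ids(text):
    """Expand the intervals into the set of individual snapshot IDs."""
    ids = set()
    for start, length in parse_snap_trimq(text):
        ids.update(range(start, start + length))
    return ids


# Shortened excerpts of the first and second query for PG 3.12.
earlier = "[5b974~3b,5cc3a~1]"
later = "[5b976~39,5ba53~1,5cc3a~1]"

trimmed = queued_snap_ids(earlier) - queued_snap_ids(later)
added = queued_snap_ids(later) - queued_snap_ids(earlier)

print("left the queue:", sorted(hex(s) for s in trimmed))
print("entered the queue:", sorted(hex(s) for s in added))
```

If the format assumption holds, IDs that leave the queue between two queries would indicate that trimming makes some progress even while the total length grows.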
Best regards,
Gio
On 17.08.2024 at 12:59, Giovanna Ratini wrote:
Hello Eugen,
thank you for your answer.
I restarted all the kube-ceph nodes one after the other. Nothing has
changed.
OK, I deactivated the snap ... : ceph fs snap-schedule deactivate /
Is there a way to see how many snapshots will be deleted per hour?
Regards,
Gio
On 17.08.2024 at 10:12, Eugen Block wrote:
Hi,
have you tried to fail the mgr? Sometimes the PG stats are not
correct. You could also temporarily disable snapshots to see if
things settle down.
Quoting Giovanna Ratini <giovanna.ratini@xxxxxxxxxxxxxxx>:
Hello all,
We use Ceph (v18.2.2) and Rook (1.14.3) as the CSI for a Kubernetes
environment. Last week, we had a problem with the MDS falling
behind on trimming every 4-5 days (GitHub issue link). We resolved
the issue using the steps outlined in the GitHub issue.
We have 3 hosts (I know, I need to increase this as soon as
possible, and I will!) and 6 OSDs. We ran the commands:
ceph config set mds mds_dir_max_commit_size 80,
ceph fs fail <fs_name>, and
ceph fs set <fs_name> joinable true.
After that, the snaptrim queue for our PGs stopped decreasing.
All PGs of our CephFS are in either active+clean+snaptrim_wait or
active+clean+snaptrim states. For example, the PG 3.12 is in the
active+clean+snaptrim state, and its snap_trimq_len was 4077
yesterday but has increased to 4538 today.
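As a rough illustration, the net growth of the queue can be estimated from two snap_trimq_len samples. This is a minimal sketch; the 24-hour gap between "yesterday" and "today" is an assumption, and the function name is hypothetical. The result is the net change (entries added minus entries trimmed), not the trim rate itself.

```python
def queue_growth_per_hour(len_before, len_after, hours_between):
    """Net change in snap_trimq_len per hour between two samples."""
    return (len_after - len_before) / hours_between


# Values for PG 3.12 from the text; the 24 h interval is an assumption.
rate = queue_growth_per_hour(4077, 4538, 24.0)
print(f"{rate:.1f} entries/hour")  # 461 / 24, prints "19.2 entries/hour"
```

A positive value means snapshots are being queued faster than they are trimmed over that window.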
I increased the osd_snap_trim_priority to 10 (ceph config set osd
osd_snap_trim_priority 10), but it didn't help. Only the PGs of our
CephFS have this problem.
Do you have any ideas on how we can resolve this issue?
Thanks in advance,
Giovanna
p.s. I'm not a ceph expert :-).
Faulkener asked me for more information, so here it is:
MDS Memory: 11GB
mds_cache_memory_limit: 11,811,160,064 bytes
root@kube-master02:~# ceph fs snap-schedule status /
{
"fs": "rook-cephfs",
"subvol": null,
"path": "/",
"rel_path": "/",
"schedule": "3h",
"retention": {"h": 24, "w": 4},
"start": "2024-05-05T00:00:00",
"created": "2024-05-05T17:28:18",
"first": "2024-05-05T18:00:00",
"last": "2024-08-15T18:00:00",
"last_pruned": "2024-08-15T18:00:00",
"created_count": 817,
"pruned_count": 817,
"active": true
}
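The counters in this status output can be cross-checked against the schedule. The sketch below assumes one snapshot is taken at "first" and then one every 3 hours up to and including "last"; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def expected_snapshot_count(first, last, interval):
    """Number of scheduled snapshots from `first` to `last`, both inclusive."""
    first_dt = datetime.fromisoformat(first)
    last_dt = datetime.fromisoformat(last)
    # timedelta // timedelta yields the whole number of elapsed intervals.
    return (last_dt - first_dt) // interval + 1


count = expected_snapshot_count(
    "2024-05-05T18:00:00",  # "first" from the status output
    "2024-08-15T18:00:00",  # "last" from the status output
    timedelta(hours=3),     # "schedule": "3h"
)
print(count)  # 817, matching "created_count"
```

Under that assumption the schedule accounts for 8 new snapshots per day on the CephFS root.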
I do not understand if the snapshots in the PGs are correlated with
the snapshots on CephFS. Until we encountered the issue with the
"MDS falling behind on trimming every 4-5 days," we didn't have any
problems with snapshots.
Could someone explain this to me or point me to the documentation?
Thank you
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx