Hello,

I have upgraded our Ceph cluster from Nautilus to Octopus (15.2.15) over the weekend. The upgrade went well as far as I can tell.

Earlier today, noticing that our CephFS data pool was approaching capacity, I removed some old CephFS snapshots (taken weekly at the root of the filesystem), keeping only the most recent one (created today, 2022-02-21). As expected, a good fraction of the PGs transitioned from active+clean to active+clean+snaptrim or active+clean+snaptrim_wait. On previous occasions when I removed a snapshot, snaptrimming took a few days to complete, ran without noticeably impacting other workloads, and freed up an appreciable amount of disk space.

This time, after a few hours of snaptrimming, users complained of high IO latency, and indeed Ceph reported "slow ops" on a number of OSDs and on the active MDS. I attributed this to the snaptrimming and decided to throttle it, initially by setting osd_pg_max_concurrent_snap_trims to 1, which didn't seem to help much, so I then set it to 0, which had the surprising effect of transitioning all PGs back to active+clean (is this intended?). I also restarted the MDS, which seemed to be struggling. IO latency went back to normal immediately.

Outside of users' working hours, I decided to resume snaptrimming by setting osd_pg_max_concurrent_snap_trims back to 1. Much to my surprise, nothing happened. All PGs remained (and still remain at the time of writing) in the state active+clean, even after restarting some of the OSDs. This definitely seems abnormal: as I mentioned earlier, snaptrimming this FS previously took on the order of multiple days. Moreover, if snaptrim were truly complete, I would expect pool usage to have dropped by an appreciable amount (at least a dozen terabytes), but that doesn't seem to be the case.

A du on the CephFS root gives:

# du -sh /mnt/pve/cephfs
31T     /mnt/pve/cephfs

But:

# ceph df
<snip>
--- POOLS ---
POOL             ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
cephfs_data       7  512   43 TiB  190.83M  147 TiB  93.22    3.6 TiB
cephfs_metadata   8   32   89 GiB  694.60k  266 GiB   1.32    6.4 TiB
<snip>

ceph pg dump reports a SNAPTRIMQ_LEN of 0 on all PGs.

Did CephFS just leak a massive 12 TiB worth of objects...? It seems to me that the snaptrim operation did not complete at all.

Perhaps relatedly:

# ceph daemon mds.choi dump snaps
{
    "last_created": 93,
    "last_destroyed": 94,
    "snaps": [
        {
            "snapid": 93,
            "ino": 1,
            "stamp": "2022-02-21T00:00:01.245459+0800",
            "name": "2022-02-21"
        }
    ]
}

How can last_destroyed be greater than last_created? The last snapshot taken on this FS is indeed #93, and the removed snapshots were all created in previous weeks.

Could someone shed some light please? Assuming that snaptrim didn't run to completion, how can I manually delete objects belonging to now-removed snapshots? I believe this is what the Ceph documentation calls a "backwards scrub", but I didn't find anything in the Ceph suite that can run such a scrub.

This pool is filling up fast. I'll throw in some more OSDs for the moment to buy some time, but I would certainly appreciate your help! Happy to attach any logs or info you deem necessary.

Regards,
LRT
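P.S. In case the exact invocations matter, this is roughly what I ran (reconstructed from shell history, so please treat the details as approximate rather than verbatim). Throttling, pausing and later resuming snap trimming:

# ceph config set osd osd_pg_max_concurrent_snap_trims 1
# ceph config set osd osd_pg_max_concurrent_snap_trims 0
(later, outside working hours)
# ceph config set osd osd_pg_max_concurrent_snap_trims 1

And this is how I convinced myself that nothing is queued for trimming any more (SNAPTRIMQ_LEN is the last column of the plain-text pg dump on this release, if I'm reading it right; pool 7 is cephfs_data):

# ceph pg stat
# ceph pg dump 2>/dev/null | awk '$1 ~ /^7\./ {print $NF}' | sort | uniq -c

I can also attach the output of "ceph osd pool ls detail" for the data pool if the removed-snapshot intervals recorded there would help with the diagnosis.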