Re: CephFS snaptrim bug?

Kári Bertilsson <karibertils@xxxxxxxxx> · Sat, 25 Jun 2022 19:36:04 +0000

Hello

I am also hitting this issue after previously setting
osd_pg_max_concurrent_snap_trims = 0 to pause snaptrim.
I upgraded to Ceph 17.2.0 and have tried restarting, re-peering, and
deep-scrubbing all OSDs; so far nothing works.

For one of the affected pools, `cephfs_10k`, I have tested removing ALL data
and it still shows 26% usage. All snapshots have been deleted and all
PGs for the pool remain at SNAPTRIMQ_LEN = 0. All PGs are active+clean.

The pool still shows 589k objects in use. When testing `rados get object` on
all the objects, it only works for 2,420 of them. The rest seem to be in
some kind of limbo and cannot be read or deleted using rados.
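
A minimal sketch of that check in Python, under a few assumptions: the object names come from `rados -p <pool> ls`, and `rados stat` is swapped in for `rados get` so that no data is downloaded. The function names and the injectable `run` parameter are hypothetical, added only so the logic can be exercised without a cluster.

```python
import subprocess


def list_objects(pool, run=subprocess.run):
    """Return the object names reported by `rados -p <pool> ls`."""
    result = run(["rados", "-p", pool, "ls"],
                 capture_output=True, text=True, check=True)
    return [line for line in result.stdout.splitlines() if line]


def is_readable(pool, obj, run=subprocess.run):
    """True if `rados stat` exits with 0 for this object."""
    result = run(["rados", "-p", pool, "stat", obj],
                 capture_output=True, text=True)
    return result.returncode == 0


def split_by_readability(pool, run=subprocess.run):
    """Split a pool's objects into (readable, limbo) lists of names."""
    readable, limbo = [], []
    for obj in list_objects(pool, run):
        (readable if is_readable(pool, obj, run) else limbo).append(obj)
    return readable, limbo
```

Calling `split_by_readability("cephfs_10k")` on a cluster node would return the two lists. On a pool with hundreds of thousands of objects this spawns one process per object, so it is slow but simple.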

# rados -p cephfs_10k listsnaps 10010539c22.00000000

10010539c22.00000000:
cloneid snaps   size    overlap
288     288     30767656        []

# rados -p cephfs_10k get 10010539c22.00000000 10010539c22.00000000
error getting cephfs_10k/10010539c22.00000000: (2) No such file or directory

# rados -p cephfs_10k rm 10010539c22.00000000
error removing cephfs_10k>10010539c22.00000000: (2) No such file or directory
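
The listsnaps output above shows only clone 288 and no row for the live object, which would be consistent with `get` and `rm` both returning ENOENT. A small sketch for parsing that output and flagging such objects follows. It assumes a live object appears as a row whose cloneid is the literal `head` (my understanding of the tool's output, not something confirmed in this thread); `parse_listsnaps` and `has_head` are hypothetical helper names.

```python
def parse_listsnaps(text):
    """Parse `rados listsnaps` output into (object_name, clone_rows)."""
    name = None
    clones = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("cloneid"):
            continue  # blank line or column header
        if name is None and line.endswith(":"):
            name = line[:-1]  # first non-blank line is "<object>:"
            continue
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue  # not a data row
        fields += [""] * (4 - len(fields))  # overlap column may be absent
        cloneid, snaps, size, overlap = fields
        clones.append({"cloneid": cloneid, "snaps": snaps,
                       "size": int(size), "overlap": overlap})
    return name, clones


def has_head(clones):
    """True if any row is the live object (assumed to be labelled 'head')."""
    return any(row["cloneid"] == "head" for row in clones)
```

Fed the three lines of output shown above, `parse_listsnaps` yields a single clone row with cloneid 288, and `has_head` is false for it.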

Is there some way to make the snap trimmer rediscover these objects and
remove them?

On Fri, Mar 18, 2022 at 2:21 PM Linkriver Technology <
technology@xxxxxxxxxxxxxxxxxxxxx> wrote:

> Hello,
>
> If I understand my issue correctly, it is in fact unrelated to CephFS itself;
> rather, the problem happens at a lower level (in Ceph itself). IOW, it affects
> all kinds of snapshots, not just CephFS ones. I believe my FS is healthy
> otherwise. In any case, here is the output of the command you asked for:
>
> I ran it a few hours ago:
>
>         "num_strays": 235,
>         "num_strays_delayed": 38,
>         "num_strays_enqueuing": 0,
>         "strays_created": 5414436,
>         "strays_enqueued": 5405983,
>         "strays_reintegrated": 17892,
>         "strays_migrated": 0,
>
> And just now:
>
>         "num_strays": 186,
>         "num_strays_delayed": 0,
>         "num_strays_enqueuing": 0,
>         "strays_created": 5540016,
>         "strays_enqueued": 5531494,
>         "strays_reintegrated": 18128,
>         "strays_migrated": 0,
>
>
> Regards,
>
> LRT
>
> -----Original Message-----
> From: Arnaud M <arnaud.meauzoone@xxxxxxxxx>
> To: Linkriver Technology <technology@xxxxxxxxxxxxxxxxxxxxx>
> Cc: Dan van der Ster <dvanders@xxxxxxxxx>, Ceph Users <ceph-users@xxxxxxx>
> Subject:  Re: CephFS snaptrim bug?
> Date: Thu, 17 Mar 2022 21:48:18 +0100
>
> Hello Linkriver
>
> I might have an issue close to yours.
>
> Can you tell us if your stray dirs are full?
>
> What does this command output for you?
>
> ceph tell mds.0 perf dump | grep strays
>
> Does the value change over time?
>
> All the best
>
> Arnaud
>
> On Wed, 16 Mar 2022 at 15:35, Linkriver Technology <
> technology@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> > Hi,
> >
> > Has anyone figured out whether those "lost" snaps are rediscoverable /
> > trimmable?
> > All PGs in the cluster have been deep scrubbed since my previous email and I'm
> > not seeing any of that wasted space being recovered.
> >
> > Regards,
> >
> > LRT
> >
> > -----Original Message-----
> > From: Dan van der Ster <dvanders@xxxxxxxxx>
> > To: technology@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Ceph Users <ceph-users@xxxxxxx>, Neha Ojha <nojha@xxxxxxxxxx>
> > Subject: Re:  CephFS snaptrim bug?
> > Date: Thu, 24 Feb 2022 09:48:04 +0100
> >
> > See https://tracker.ceph.com/issues/54396
> >
> > I don't know how to tell the osds to rediscover those trimmed snaps.
> > Neha, is that possible?
> >
> > Cheers, Dan
> >
> > On Thu, Feb 24, 2022 at 9:27 AM Dan van der Ster <dvanders@xxxxxxxxx>
> > wrote:
> > >
> > > Hi,
> > >
> > > I had a look at the code -- looks like there's a flaw in the logic:
> > > the snaptrim queue is cleared if osd_pg_max_concurrent_snap_trims = 0.
> > >
> > > I'll open a tracker and send a PR to restrict
> > > osd_pg_max_concurrent_snap_trims to >= 1.
> > >
> > > Cheers, Dan
> > >
> > > On Wed, Feb 23, 2022 at 9:44 PM Linkriver Technology
> > > <technology@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > Hello,
> > > >
> > > > > I have upgraded our Ceph cluster from Nautilus to Octopus (15.2.15) over the
> > > > > weekend. The upgrade went well as far as I can tell.
> > > >
> > > > > Earlier today, noticing that our CephFS data pool was approaching capacity, I
> > > > > removed some old CephFS snapshots (taken weekly at the root of the filesystem),
> > > > > keeping only the most recent one (created today, 2022-02-21). As expected, a
> > > > > good fraction of the PGs transitioned from active+clean to active+clean+snaptrim
> > > > > or active+clean+snaptrim_wait. On previous occasions when I removed a snapshot
> > > > > it took a few days for snaptrimming to complete. This would happen without
> > > > > noticeably impacting other workloads, and would also free up an appreciable
> > > > > amount of disk space.
> > > >
> > > > > This time around, after a few hours of snaptrimming, users complained of high IO
> > > > > latency, and indeed Ceph reported "slow ops" on a number of OSDs and on the
> > > > > active MDS. I attributed this to the snaptrimming and decided to reduce it by
> > > > > initially setting osd_pg_max_concurrent_snap_trims to 1, which didn't seem to
> > > > > help much, so I then set it to 0, which had the surprising effect of
> > > > > transitioning all PGs back to active+clean (is this intended?). I also restarted
> > > > > the MDS, which seemed to be struggling. IO latency went back to normal
> > > > > immediately.
> > > >
> > > > > Outside of users' working hours, I decided to resume snaptrimming by setting
> > > > > osd_pg_max_concurrent_snap_trims back to 1. Much to my surprise, nothing
> > > > > happened. All PGs remained (and still remain at time of writing) in the state
> > > > > active+clean, even after restarting some of them. This definitely seems
> > > > > abnormal: as I mentioned earlier, snaptrimming this FS previously took on
> > > > > the order of multiple days. Moreover, if snaptrim were truly complete, I would
> > > > > expect pool usage to have dropped by appreciable amounts (at least a dozen
> > > > > terabytes), but that doesn't seem to be the case.
> > > >
> > > > A du on the CephFS root gives:
> > > >
> > > > # du -sh /mnt/pve/cephfs
> > > > 31T    /mnt/pve/cephfs
> > > >
> > > > But:
> > > >
> > > > # ceph df
> > > > <snip>
> > > > --- POOLS ---
> > > > > POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> > > > > cephfs_data             7   512   43 TiB  190.83M  147 TiB  93.22    3.6 TiB
> > > > > cephfs_metadata         8    32   89 GiB  694.60k  266 GiB   1.32    6.4 TiB
> > > > <snip>
> > > >
> > > > ceph pg dump reports a SNAPTRIMQ_LEN of 0 on all PGs.
> > > >
> > > > > Did CephFS just leak a massive 12 TiB worth of objects...? It seems to me that
> > > > > the snaptrim operation did not complete at all.
> > > >
> > > > Perhaps relatedly:
> > > >
> > > > # ceph daemon mds.choi dump snaps
> > > > {
> > > >     "last_created": 93,
> > > >     "last_destroyed": 94,
> > > >     "snaps": [
> > > >         {
> > > >             "snapid": 93,
> > > >             "ino": 1,
> > > >             "stamp": "2022-02-21T00:00:01.245459+0800",
> > > >             "name": "2022-02-21"
> > > >         }
> > > >     ]
> > > > }
> > > >
> > > > > How can last_destroyed > last_created? The last snapshot to have been taken on
> > > > > this FS is indeed #93, and the removed snapshots were all created in previous
> > > > > weeks.
> > > >
> > > > > Could someone shed some light please? Assuming that snaptrim didn't run to
> > > > > completion, how can I manually delete objects from now-removed snapshots? I
> > > > > believe this is what the Ceph documentation calls a "backwards scrub" - but I
> > > > > didn't find anything in the Ceph suite that can run such a scrub. This pool is
> > > > > filling up fast; I'll throw in some more OSDs for the moment to buy some time,
> > > > > but I certainly would appreciate your help!
> > > >
> > > > Happy to attach any logs or info you deem necessary.
> > > >
> > > > Regards,
> > > >
> > > > LRT
> > > > _______________________________________________
> > > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > > To unsubscribe send an email to ceph-users-leave@xxxxxxx