Re: CephFS snaptrim bug?

Hello,

We recently upgraded to Quincy (17.2.7) and I can see in the ceph logs
many messages of the form:

1713256584.3135679 osd.28 (osd.28) 66398 : cluster 4 osd.28 found snap
mapper error on pg 7.284 oid 7:214b503b:::100125de9b8.00000000:5c snaps
in mapper: {}, oi: {5a} ...repaired
1713256584.3136106 osd.28 (osd.28) 66399 : cluster 4 osd.28 found snap
mapper error on pg 7.284 oid 7:214b4f95:::1001654390d.00000000:5c snaps
in mapper: {}, oi: {5a} ...repaired
1713256584.3136535 osd.28 (osd.28) 66400 : cluster 4 osd.28 found snap
mapper error on pg 7.284 oid 7:214b4f3f:::1001549ed54.00000000:5c snaps
in mapper: {}, oi: {5a} ...repaired
1713256584.9496887 osd.29 (osd.29) 70001 : cluster 4 osd.29 found snap
mapper error on pg 7.b4 oid 7:2d089bdc:::10016105140.00000000:5c snaps
in mapper: {}, oi: {5a} ...repaired
1713256590.9785151 osd.28 (osd.28) 66401 : cluster 4 osd.28 found snap
mapper error on pg 7.284 oid 7:214b5179:::100128b85a0.00000cfe:5c snaps
in mapper: {}, oi: {5a} ...repaired
1713256598.6286905 osd.29 (osd.29) 70002 : cluster 4 osd.29 found snap
mapper error on pg 7.17c oid 7:3e877f95:::100151d8670.00000000:5c snaps
in mapper: {}, oi: {5a} ...repaired
...

A cursory reading of the code involved suggests that the scrubber in
Quincy has gained the ability to detect and repair the snap mapper
entries for snapshots that were lost under Octopus, if I understand it
correctly.
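
In case it is useful to anyone else hitting this, a rough way to keep an
eye on the repairs and to nudge them along (assuming the data pool is
still the cephfs_data pool with id 7 from earlier in the thread, and a
default cluster log location) is:

# grep -c 'found snap mapper error' /var/log/ceph/ceph.log
# for pg in $(ceph pg ls-by-pool cephfs_data | awk '/^[0-9]+\./ {print $1}'); do ceph pg deep-scrub "$pg"; done

The first command counts the repair messages seen so far in the cluster
log; the loop asks every PG of the data pool for a deep scrub, so the
scrubber gets a chance to fix any remaining stale snap mapper entries.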

Cheers,

Linkriver Technology

On Sat, 2022-06-25 at 19:36 +0000, Kári Bertilsson wrote:
> Hello
> 
> I am also having this issue, after having previously set
> osd_pg_max_concurrent_snap_trims = 0 to pause the snaptrim. I upgraded
> to ceph 17.2.0. I have tried restarting, re-peering and deep-scrubbing
> all OSDs; so far nothing works.
> 
> For one of the affected pools, `cephfs_10k`, I have tried removing ALL
> data and it's still showing 26% usage. All snapshots have been deleted
> and all PGs for the pool remain at SNAPTRIMQ_LEN = 0. All PGs are
> active+clean.
> 
> The pool still shows 589k objects in use. When testing `rados get` on
> all the objects, it only works for 2,420 of them. The rest seem to be
> in some kind of limbo and cannot be read or deleted using rados.
> 
> # rados -p cephfs_10k listsnaps 10010539c22.00000000
> 10010539c22.00000000:
> cloneid snaps   size      overlap
> 288     288     30767656  []
> 
> # rados -p cephfs_10k get 10010539c22.00000000 10010539c22.00000000
> error getting cephfs_10k/10010539c22.00000000: (2) No such file or
> directory
> 
> # rados -p cephfs_10k rm 10010539c22.00000000
> error removing cephfs_10k/10010539c22.00000000: (2) No such file or
> directory
> 
> Is there some way to make the snap trimmer rediscover these objects
> and remove them?
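
Coming back to this thread now: given the scrubber behaviour described at
the top of this mail, one thing worth trying for such a limbo object is to
find the PG that holds it and ask that PG for a deep scrub. Using Kári's
example object (substitute the pgid printed by the first command):

# ceph osd map cephfs_10k 10010539c22.00000000
# ceph pg deep-scrub <pgid>

The first command prints which PG the object maps to; the deep scrub then
gives the OSD a chance to repair the stale snap mapper entry for it.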
> 
> On Fri, Mar 18, 2022 at 2:21 PM Linkriver Technology
> <technology@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > Hello,
> > 
> > If I understand my issue correctly, it is in fact unrelated to CephFS
> > itself; rather, the problem happens at a lower level (in Ceph itself).
> > IOW, it affects all kinds of snapshots, not just CephFS ones. I believe
> > my FS is otherwise healthy. In any case, here is the output of the
> > command you asked for:
> > 
> > I ran it a few hours ago:
> > 
> >         "num_strays": 235,
> >         "num_strays_delayed": 38,
> >         "num_strays_enqueuing": 0,
> >         "strays_created": 5414436,
> >         "strays_enqueued": 5405983,
> >         "strays_reintegrated": 17892,
> >         "strays_migrated": 0,
> > 
> > And just now:
> > 
> >         "num_strays": 186,
> >         "num_strays_delayed": 0,
> >         "num_strays_enqueuing": 0,
> >         "strays_created": 5540016,
> >         "strays_enqueued": 5531494,
> >         "strays_reintegrated": 18128,
> >         "strays_migrated": 0,
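
A convenient way to keep an eye on these counters, for anyone else
watching their stray dirs, is simply to wrap Arnaud's command in watch:

# watch -n 60 "ceph tell mds.0 perf dump | grep strays"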
> > 
> > 
> > Regards,
> > 
> > LRT
> > 
> > -----Original Message-----
> > From: Arnaud M <arnaud.meauzoone@xxxxxxxxx>
> > To: Linkriver Technology <technology@xxxxxxxxxxxxxxxxxxxxx>
> > Cc: Dan van der Ster <dvanders@xxxxxxxxx>, Ceph Users
> > <ceph-users@xxxxxxx>
> > Subject:  Re: CephFS snaptrim bug?
> > Date: Thu, 17 Mar 2022 21:48:18 +0100
> > 
> > Hello Linkriver
> > 
> > I might have an issue close to yours.
> > 
> > Can you tell us if your stray dirs are full?
> > 
> > What does this command output for you?
> > 
> > ceph tell mds.0 perf dump | grep strays
> > 
> > Does the value change over time?
> > 
> > All the best
> > 
> > Arnaud
> > 
> > On Wed, 16 Mar 2022 at 15:35, Linkriver Technology
> > <technology@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > 
> > > Hi,
> > > 
> > > Has anyone figured whether those "lost" snaps are rediscoverable /
> > > trimmable? All pgs in the cluster have been deep scrubbed since my
> > > previous email and I'm not seeing any of that wasted space being
> > > recovered.
> > > 
> > > Regards,
> > > 
> > > LRT
> > > 
> > > -----Original Message-----
> > > From: Dan van der Ster <dvanders@xxxxxxxxx>
> > > To: technology@xxxxxxxxxxxxxxxxxxxxx
> > > Cc: Ceph Users <ceph-users@xxxxxxx>, Neha Ojha <nojha@xxxxxxxxxx>
> > > Subject: Re:  CephFS snaptrim bug?
> > > Date: Thu, 24 Feb 2022 09:48:04 +0100
> > > 
> > > See https://tracker.ceph.com/issues/54396
> > > 
> > > I don't know how to tell the osds to rediscover those trimmed snaps.
> > > Neha, is that possible?
> > > 
> > > Cheers, Dan
> > > 
> > > On Thu, Feb 24, 2022 at 9:27 AM Dan van der Ster <dvanders@xxxxxxxxx>
> > > wrote:
> > > > 
> > > > Hi,
> > > > 
> > > > I had a look at the code -- looks like there's a flaw in the logic:
> > > > the snaptrim queue is cleared if osd_pg_max_concurrent_snap_trims = 0.
> > > > 
> > > > I'll open a tracker and send a PR to restrict
> > > > osd_pg_max_concurrent_snap_trims to >= 1.
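
As an aside, for anyone who finds this thread while looking for a way to
slow snaptrim down without pausing it completely (and therefore without
emptying the queue), raising the trim sleep is a gentler knob, e.g.:

# ceph config set osd osd_snap_trim_sleep 2

Option names and defaults vary a little between releases, so treat the
value above as a starting point rather than a recommendation.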
> > > > 
> > > > Cheers, Dan
> > > > 
> > > > On Wed, Feb 23, 2022 at 9:44 PM Linkriver Technology
> > > > <technology@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > > > > 
> > > > > Hello,
> > > > > 
> > > > > I have upgraded our Ceph cluster from Nautilus to Octopus (15.2.15)
> > > > > over the weekend. The upgrade went well as far as I can tell.
> > > > >
> > > > > Earlier today, noticing that our CephFS data pool was approaching
> > > > > capacity, I removed some old CephFS snapshots (taken weekly at the
> > > > > root of the filesystem), keeping only the most recent one (created
> > > > > today, 2022-02-21). As expected, a good fraction of the PGs
> > > > > transitioned from active+clean to active+clean+snaptrim or
> > > > > active+clean+snaptrim_wait. On previous occasions when I removed a
> > > > > snapshot it took a few days for snaptrimming to complete. This would
> > > > > happen without noticeably impacting other workloads, and would also
> > > > > free up an appreciable amount of disk space.
> > > > > 
> > > > > This time around, after a few hours of snaptrimming, users
> > > > > complained of high IO latency, and indeed Ceph reported "slow ops"
> > > > > on a number of OSDs and on the active MDS. I attributed this to the
> > > > > snaptrimming and decided to reduce it by initially setting
> > > > > osd_pg_max_concurrent_snap_trims to 1, which didn't seem to help
> > > > > much, so I then set it to 0, which had the surprising effect of
> > > > > transitioning all PGs back to active+clean (is this intended?). I
> > > > > also restarted the MDS which seemed to be struggling. IO latency
> > > > > went back to normal immediately.
> > > > > 
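
For anyone following along, a change like this is normally applied at
runtime with either of:

# ceph config set osd osd_pg_max_concurrent_snap_trims 1
# ceph tell 'osd.*' injectargs '--osd_pg_max_concurrent_snap_trims=1'
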
> > > > > Outside of users' working hours, I decided to resume snaptrimming
> > > > > by setting osd_pg_max_concurrent_snap_trims back to 1. Much to my
> > > > > surprise, nothing happened. All PGs remained (and still remain at
> > > > > the time of writing) in the state active+clean, even after
> > > > > restarting some of them. This definitely seems abnormal: as I
> > > > > mentioned earlier, snaptrimming this FS previously would take on the
> > > > > order of multiple days. Moreover, if snaptrim were truly complete, I
> > > > > would expect pool usage to have dropped by an appreciable amount (at
> > > > > least a dozen terabytes), but that doesn't seem to be the case.
> > > > > 
> > > > > A du on the CephFS root gives:
> > > > > 
> > > > > # du -sh /mnt/pve/cephfs
> > > > > 31T    /mnt/pve/cephfs
> > > > > 
> > > > > But:
> > > > > 
> > > > > # ceph df
> > > > > <snip>
> > > > > --- POOLS ---
> > > > > POOL             ID  PGS  STORED  OBJECTS  USED     %USED  MAX AVAIL
> > > > > cephfs_data       7  512  43 TiB  190.83M  147 TiB  93.22    3.6 TiB
> > > > > cephfs_metadata   8   32  89 GiB  694.60k  266 GiB   1.32    6.4 TiB
> > > > > <snip>
> > > > > 
> > > > > ceph pg dump reports a SNAPTRIMQ_LEN of 0 on all PGs.
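
One more number worth looking at here is the CLONES column of rados df:
clone objects left behind by removed snapshots are still counted there,
even when every snaptrim queue is empty.

# rados df | grep -E 'POOL_NAME|cephfs'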
> > > > > 
> > > > > Did CephFS just leak a massive 12 TiB worth of objects...? It seems
> > > > > to me that the snaptrim operation did not complete at all.
> > > > > 
> > > > > Perhaps relatedly:
> > > > > 
> > > > > # ceph daemon mds.choi dump snaps
> > > > > {
> > > > >     "last_created": 93,
> > > > >     "last_destroyed": 94,
> > > > >     "snaps": [
> > > > >         {
> > > > >             "snapid": 93,
> > > > >             "ino": 1,
> > > > >             "stamp": "2022-02-21T00:00:01.245459+0800",
> > > > >             "name": "2022-02-21"
> > > > >         }
> > > > >     ]
> > > > > }
> > > > > 
> > > > > How can last_destroyed > last_created? The last snapshot to have
> > > > > been taken on this FS is indeed #93, and the removed snapshots were
> > > > > all created in previous weeks.
> > > > > 
> > > > > Could someone shed some light please? Assuming that snaptrim didn't
> > > > > run to completion, how can I manually delete objects from now-removed
> > > > > snapshots? I believe this is what the Ceph documentation calls a
> > > > > "backwards scrub" - but I didn't find anything in the Ceph suite that
> > > > > can run such a scrub. This pool is filling up fast; I'll throw in some
> > > > > more OSDs for the moment to buy some time, but I certainly would
> > > > > appreciate your help!
> > > > > 
> > > > > Happy to attach any logs or info you deem necessary.
> > > > > 
> > > > > Regards,
> > > > > 
> > > > > LRT

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



