Hello Linkriver,

I might have an issue close to yours. Can you tell us if your stray dirs are
full? What does this command output for you?

ceph tell mds.0 perf dump | grep strays

Does the value change over time?
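If it helps, a small loop like the one below makes the trend easy to see (just a
sketch: it assumes mds.0 is your active rank-0 MDS, as in the command above, and
that you run it from a node with an admin keyring; it simply samples the
counters once a minute):

    # sample the MDS stray counters once a minute to see whether num_strays moves
    while true; do
        date
        ceph tell mds.0 perf dump | grep strays
        sleep 60
    done

If num_strays keeps climbing and never drops, that would suggest the strays
aren't being purged.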
All the best,
Arnaud

On Wed, 16 Mar 2022 at 15:35, Linkriver Technology
<technology@xxxxxxxxxxxxxxxxxxxxx> wrote:

> Hi,
>
> Has anyone figured out whether those "lost" snaps are rediscoverable /
> trimmable? All PGs in the cluster have been deep scrubbed since my
> previous email and I'm not seeing any of that wasted space being
> recovered.
>
> Regards,
>
> LRT
>
> -----Original Message-----
> From: Dan van der Ster <dvanders@xxxxxxxxx>
> To: technology@xxxxxxxxxxxxxxxxxxxxx
> Cc: Ceph Users <ceph-users@xxxxxxx>, Neha Ojha <nojha@xxxxxxxxxx>
> Subject: Re: CephFS snaptrim bug?
> Date: Thu, 24 Feb 2022 09:48:04 +0100
>
> See https://tracker.ceph.com/issues/54396
>
> I don't know how to tell the OSDs to rediscover those trimmed snaps.
> Neha, is that possible?
>
> Cheers, Dan
>
> On Thu, Feb 24, 2022 at 9:27 AM Dan van der Ster <dvanders@xxxxxxxxx>
> wrote:
> >
> > Hi,
> >
> > I had a look at the code -- it looks like there's a flaw in the logic:
> > the snaptrim queue is cleared if osd_pg_max_concurrent_snap_trims = 0.
> >
> > I'll open a tracker and send a PR to restrict
> > osd_pg_max_concurrent_snap_trims to >= 1.
> >
> > Cheers, Dan
> >
> > On Wed, Feb 23, 2022 at 9:44 PM Linkriver Technology
> > <technology@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > Hello,
> > >
> > > I have upgraded our Ceph cluster from Nautilus to Octopus (15.2.15)
> > > over the weekend. The upgrade went well as far as I can tell.
> > >
> > > Earlier today, noticing that our CephFS data pool was approaching
> > > capacity, I removed some old CephFS snapshots (taken weekly at the
> > > root of the filesystem), keeping only the most recent one (created
> > > today, 2022-02-21). As expected, a good fraction of the PGs
> > > transitioned from active+clean to active+clean+snaptrim or
> > > active+clean+snaptrim_wait. On previous occasions when I removed a
> > > snapshot, it took a few days for snaptrimming to complete. This would
> > > happen without noticeably impacting other workloads, and would also
> > > free up an appreciable amount of disk space.
> > >
> > > This time around, after a few hours of snaptrimming, users complained
> > > of high IO latency, and indeed Ceph reported "slow ops" on a number
> > > of OSDs and on the active MDS. I attributed this to the snaptrimming
> > > and decided to reduce it by initially setting
> > > osd_pg_max_concurrent_snap_trims to 1, which didn't seem to help
> > > much, so I then set it to 0, which had the surprising effect of
> > > transitioning all PGs back to active+clean (is this intended?). I
> > > also restarted the MDS, which seemed to be struggling. IO latency
> > > went back to normal immediately.
> > >
> > > Outside of users' working hours, I decided to resume snaptrimming by
> > > setting osd_pg_max_concurrent_snap_trims back to 1. Much to my
> > > surprise, nothing happened. All PGs remained (and still remain at the
> > > time of writing) in the state active+clean, even after restarting
> > > some of them. This definitely seems abnormal; as I mentioned earlier,
> > > snaptrimming this FS previously would take on the order of multiple
> > > days. Moreover, if snaptrim were truly complete, I would expect pool
> > > usage to have dropped by an appreciable amount (at least a dozen
> > > terabytes), but that doesn't seem to be the case.
> > >
> > > A du on the CephFS root gives:
> > >
> > > # du -sh /mnt/pve/cephfs
> > > 31T     /mnt/pve/cephfs
> > >
> > > But:
> > >
> > > # ceph df
> > > <snip>
> > > --- POOLS ---
> > > POOL             ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> > > cephfs_data       7  512   43 TiB  190.83M  147 TiB  93.22    3.6 TiB
> > > cephfs_metadata   8   32   89 GiB  694.60k  266 GiB   1.32    6.4 TiB
> > > <snip>
> > >
> > > ceph pg dump reports a SNAPTRIMQ_LEN of 0 on all PGs.
> > >
> > > Did CephFS just leak a massive 12 TiB worth of objects...? It seems
> > > to me that the snaptrim operation did not complete at all.
> > >
> > > Perhaps relatedly:
> > >
> > > # ceph daemon mds.choi dump snaps
> > > {
> > >     "last_created": 93,
> > >     "last_destroyed": 94,
> > >     "snaps": [
> > >         {
> > >             "snapid": 93,
> > >             "ino": 1,
> > >             "stamp": "2022-02-21T00:00:01.245459+0800",
> > >             "name": "2022-02-21"
> > >         }
> > >     ]
> > > }
> > >
> > > How can last_destroyed be greater than last_created? The last
> > > snapshot taken on this FS is indeed #93, and the removed snapshots
> > > were all created in previous weeks.
> > >
> > > Could someone shed some light on this, please? Assuming that snaptrim
> > > didn't run to completion, how can I manually delete objects from
> > > now-removed snapshots? I believe this is what the Ceph documentation
> > > calls a "backwards scrub" - but I didn't find anything in the Ceph
> > > suite that can run such a scrub. This pool is filling up fast; I'll
> > > throw in some more OSDs for the moment to buy some time, but I
> > > certainly would appreciate your help!
> > >
> > > Happy to attach any logs or info you deem necessary.
> > >
> > > Regards,
> > >
> > > LRT
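PS: on the space question in the report quoted above -- one way to check whether
cephfs_data is still holding clones from the removed snapshots is to spot-check a
few objects with rados. A rough sketch (the pool name comes from the quoted
ceph df output; the first few objects from "rados ls" are just an arbitrary
sample):

    # list the snapshot clones kept for a handful of data-pool objects;
    # clones referencing old snapids would mean the snapshot data is still there
    rados -p cephfs_data ls | head -n 5 | while read obj; do
        echo "== $obj"
        rados -p cephfs_data listsnaps "$obj"
    done

If most of the objects you sample still show clones, that would back up the
theory that snaptrim never actually ran to completion.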