Re: CephFS snaptrim bug?

Linkriver Technology <technology@xxxxxxxxxxxxxxxxxxxxx> · Wed, 16 Mar 2022 15:34:58 +0100

Hi,

Has anyone figured out whether those "lost" snaps are rediscoverable / trimmable?
All pgs in the cluster have been deep scrubbed since my previous email and I'm
not seeing any of that wasted space being recovered.
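
For reference, a rough estimate of how much space is at stake, using only the
figures from the original report quoted below (`ceph df` and `du`). This is a
back-of-the-envelope sketch: it assumes du's "T" means TiB, and it treats the
whole STORED-minus-du difference as untrimmed snapshot data, which is an upper
bound since the one remaining snapshot legitimately holds some of it.

```python
# Figures copied from the original report quoted below.
stored_tib = 43.0   # STORED for cephfs_data in `ceph df`
used_tib = 147.0    # USED (raw) for cephfs_data in `ceph df`
du_tib = 31.0       # `du -sh` on the CephFS root (assumed to be TiB)

# Logical data in the pool that the live filesystem does not account for.
gap_tib = stored_tib - du_tib

# Raw-to-logical ratio of the pool (replication or EC overhead, allocation).
raw_ratio = used_tib / stored_tib

# Approximate raw capacity tied up by that gap.
raw_gap_tib = gap_tib * raw_ratio

print(f"logical gap:       {gap_tib:.0f} TiB")
print(f"raw/logical ratio: {raw_ratio:.2f}")
print(f"raw space held:    {raw_gap_tib:.0f} TiB")
```

With these inputs the logical gap is the 12 TiB mentioned in the report, and
the raw figure is several times larger because of the pool's redundancy.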

Regards,

LRT

-----Original Message-----
From: Dan van der Ster <dvanders@xxxxxxxxx>
To: technology@xxxxxxxxxxxxxxxxxxxxx
Cc: Ceph Users <ceph-users@xxxxxxx>, Neha Ojha <nojha@xxxxxxxxxx>
Subject: Re:  CephFS snaptrim bug?
Date: Thu, 24 Feb 2022 09:48:04 +0100

See https://tracker.ceph.com/issues/54396

I don't know how to tell the osds to rediscover those trimmed snaps.
Neha, is that possible?
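
A toy model of the flaw described in the quoted message below may help anyone
following along. It is only an illustration of the behaviour as described there
(an empty batch being read as "this snap is done"), not the actual OSD code; all
function and variable names are hypothetical.

```python
def next_objects_to_trim(remaining, max_objects):
    """Return up to max_objects objects still waiting to be trimmed."""
    return remaining[:max_objects]


def trim_pass(snap_queue, max_concurrent):
    """Run one pass over the queue (a dict of snapid -> list of objects).

    Returns the snaps that were dropped from the queue while they still
    had objects left to trim.
    """
    dropped_with_leftovers = []
    for snap in list(snap_queue):
        batch = next_objects_to_trim(snap_queue[snap], max_concurrent)
        if not batch:
            # An empty batch is interpreted as "nothing left to trim",
            # so the snap is removed from the queue.
            if snap_queue[snap]:
                dropped_with_leftovers.append(snap)
            del snap_queue[snap]
            continue
        # Trim the objects in this batch.
        del snap_queue[snap][:len(batch)]
    return dropped_with_leftovers


queue = {91: ["obj_a", "obj_b"], 92: ["obj_c"]}
lost = trim_pass(queue, 0)  # the limit of 0 yields empty batches only
print(lost, queue)
```

In this model a limit of 0 empties the queue on the first pass without
trimming anything, whereas a limit of 1 trims one object per snap and keeps
the rest queued.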

Cheers, Dan

On Thu, Feb 24, 2022 at 9:27 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
> 
> Hi,
> 
> I had a look at the code -- looks like there's a flaw in the logic:
> the snaptrim queue is cleared if osd_pg_max_concurrent_snap_trims = 0.
> 
> I'll open a tracker and send a PR to restrict
> osd_pg_max_concurrent_snap_trims to >= 1.
> 
> Cheers, Dan
> 
> On Wed, Feb 23, 2022 at 9:44 PM Linkriver Technology
> <technology@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > 
> > Hello,
> > 
> > I have upgraded our Ceph cluster from Nautilus to Octopus (15.2.15) over the
> > weekend. The upgrade went well as far as I can tell.
> > 
> > Earlier today, noticing that our CephFS data pool was approaching capacity, I
> > removed some old CephFS snapshots (taken weekly at the root of the filesystem),
> > keeping only the most recent one (created today, 2022-02-21). As expected, a
> > good fraction of the PGs transitioned from active+clean to active+clean+snaptrim
> > or active+clean+snaptrim_wait. On previous occasions when I removed a snapshot
> > it took a few days for snaptrimming to complete. This would happen without
> > noticeably impacting other workloads, and would also free up an appreciable
> > amount of disk space.
> > 
> > This time around, after a few hours of snaptrimming, users complained of high IO
> > latency, and indeed Ceph reported "slow ops" on a number of OSDs and on the
> > active MDS. I attributed this to the snaptrimming and decided to reduce it by
> > initially setting osd_pg_max_concurrent_snap_trims to 1, which didn't seem to
> > help much, so I then set it to 0, which had the surprising effect of
> > transitioning all PGs back to active+clean (is this intended?). I also restarted
> > the MDS which seemed to be struggling. IO latency went back to normal
> > immediately.
> > 
> > Outside of users' working hours, I decided to resume snaptrimming by setting
> > osd_pg_max_concurrent_snap_trims back to 1. Much to my surprise, nothing
> > happened. All PGs remained (and still remain at time of writing) in the state
> > active+clean, even after restarting some of them. This definitely seems
> > abnormal: as I mentioned earlier, snaptrimming this FS would previously take on
> > the order of multiple days. Moreover, if snaptrim were truly complete, I would
> > expect pool usage to have dropped by appreciable amounts (at least a dozen
> > terabytes), but that doesn't seem to be the case.
> > 
> > A du on the CephFS root gives:
> > 
> > # du -sh /mnt/pve/cephfs
> > 31T    /mnt/pve/cephfs
> > 
> > But:
> > 
> > # ceph df
> > <snip>
> > --- POOLS ---
> > POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> > cephfs_data             7   512   43 TiB  190.83M  147 TiB  93.22    3.6 TiB
> > cephfs_metadata         8    32   89 GiB  694.60k  266 GiB   1.32    6.4 TiB
> > <snip>
> > 
> > ceph pg dump reports a SNAPTRIMQ_LEN of 0 on all PGs.
> > 
> > Did CephFS just leak a massive 12 TiB worth of objects...? It seems to me that
> > the snaptrim operation did not complete at all.
> > 
> > Perhaps relatedly:
> > 
> > # ceph daemon mds.choi dump snaps
> > {
> >     "last_created": 93,
> >     "last_destroyed": 94,
> >     "snaps": [
> >         {
> >             "snapid": 93,
> >             "ino": 1,
> >             "stamp": "2022-02-21T00:00:01.245459+0800",
> >             "name": "2022-02-21"
> >         }
> >     ]
> > }
> > 
> > How can last_destroyed > last_created? The last snapshot to have been taken on
> > this FS is indeed #93, and the removed snapshots were all created in previous
> > weeks.
> > 
> > Could someone shed some light please? Assuming that snaptrim didn't run to
> > completion, how can I manually delete objects from now-removed snapshots? I
> > believe this is what the Ceph documentation calls a "backwards scrub" - but I
> > didn't find anything in the Ceph suite that can run such a scrub. This pool is
> > filling up fast; I'll throw in some more OSDs for the moment to buy some time,
> > but I certainly would appreciate your help!
> > 
> > Happy to attach any logs or info you deem necessary.
> > 
> > Regards,
> > 
> > LRT
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx