Re: cephfs: num_stray growing without bounds (octopus)

Hi Frank,

On Sun, Aug 7, 2022 at 6:46 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Dhairya,
>
> I have some new results (below) and also some wishes as an operator that might even help with the decision you mentioned in your e-mails:
>
> - Please implement both: a way to trigger an evaluation manually via a "ceph tell|daemon" command, and a periodic evaluation.
> - For the periodic evaluation, please introduce a tuning parameter, for example, mds_gc_interval (in seconds). If set to 0, periodic evaluation is disabled.

FWIW, reintegration can be triggered with a filesystem scrub on pacific
ceph-mds (16.2.8+) daemons. An octopus backport was planned

        https://github.com/ceph/ceph/pull/44657

but the PR didn't make it into an octopus release.
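
On pacific, the trigger is a recursive scrub of the affected path; something
like the following should do it (assuming your file system is named "fs", as
your ceph status output further down suggests, and rank 0; adjust as needed):

        ceph tell mds.fs:0 scrub start / recursive
        ceph tell mds.fs:0 scrub status

The scrub should reintegrate eligible stray entries as it walks the tree.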

>
> Reasons:
>
> - On most production systems, doing this once per 24 hours seems enough (my benchmark is very special in that it needs to delete aggressively). The default for mds_gc_interval could therefore be 86400 (24h).
> - On my production system I would probably disable periodic evaluation and instead run a single-shot manual evaluation some time after snapshot removal, but before users start working, so that it stays in sync with the snapshot removal (which is where the "lost" entries are created). Roughly as sketched below.
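>
> For illustration only (both the option name and the trigger command are just the proposal above; none of this exists yet):
>
>         # hypothetical periodic evaluation, once per day; 0 would disable it
>         ceph config set mds mds_gc_interval 86400
>         # hypothetical single-shot evaluation, e.g. right after snapshot removal
>         ceph tell mds.fs:0 evaluate_strays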
>
> This follows a general software design principle: whenever there is a choice like this to make, it is best to implement an API that can support all use cases and to leave it to the operators to choose what fits their workloads best. Try not to restrict operators by hard-coding decisions; rather, pick reasonable defaults but also empower operators to tune things to special needs. One size fits all never works.
>
> Now to the results: Indeed, a restart triggers complete removal of all orphaned stray entries:
>
> [root@rit-tceph bench]# ./mds-stray-num
> 962562
> [root@rit-tceph bench]# ceph mds fail 0
> failed mds gid 371425
> [root@rit-tceph bench]# ./mds-stray-num
> 767329
> [root@rit-tceph bench]# ./mds-stray-num
> 766777
> [root@rit-tceph bench]# ./mds-stray-num
> 572430
> [root@rit-tceph bench]# ./mds-stray-num
> 199172
> [root@rit-tceph bench]# ./mds-stray-num
> 0
> # ceph df
> --- RAW STORAGE ---
> CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd    2.4 TiB  2.4 TiB  896 MiB    25 GiB       0.99
> TOTAL  2.4 TiB  2.4 TiB  896 MiB    25 GiB       0.99
>
> --- POOLS ---
> POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> device_health_metrics   1    1  205 KiB        9  616 KiB      0    785 GiB
> fs-meta1                2   64  684 MiB       44  2.0 GiB   0.09    785 GiB
> fs-meta2                3  128      0 B        0      0 B      0    785 GiB
> fs-data                 4  128      0 B        0      0 B      0    1.5 TiB
>
> Good to see that the bookkeeping didn't lose track of anything. I will add a periodic mds fail to my benchmark and report back how all of this works under heavy load.
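>
> (For reference, ./mds-stray-num is just a thin wrapper around the perf dump query quoted further down in this thread, roughly
>
>         ssh tceph-03 "ceph daemon mds.tceph-03 perf dump | jq .mds_cache.num_strays"
>
> with the host/daemon adjusted to whichever MDS is currently active.)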
>
> Best regards and thanks for your help!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dhairya Parmar <dparmar@xxxxxxxxxx>
> Sent: 05 August 2022 22:53:09
> To: Frank Schilder
> Cc: ceph-users@xxxxxxx
> Subject: Re:  cephfs: num_stray growing without bounds (octopus)
>
> On Fri, Aug 5, 2022 at 9:12 PM Frank Schilder <frans@xxxxxx> wrote:
> Hi Dhairya,
>
> thanks for pointing me to this tracker. I can try an MDS fail to see if it clears the stray buckets or if there are still left-overs. Before doing so:
>
> > Thanks for the logs though. It will help me while writing the patch.
>
> I couldn't tell if you were asking for logs. Do you want me to collect something, or do you mean the session logs included in my e-mail? Also, did you leave the ceph-users list out of CC on purpose?
>
> Nah, the session logs included are good enough. I missed CCing ceph-users. Done now.
>
> For my urgent needs, failing the MDS periodically during the benchmark might be an interesting addition anyway, if it helps with the stray count.
>
> Yeah it might be helpful for now. Do let me know if that works for you.
>
> Thanks for your fast reply and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dhairya Parmar <dparmar@xxxxxxxxxx>
> Sent: 05 August 2022 16:10
> To: Frank Schilder
> Subject: Re:  cephfs: num_stray growing without bounds (octopus)
>
> Hi Frank,
>
> This seems to be related to a tracker I'm working on: https://tracker.ceph.com/issues/53724. I've got some rough ideas in mind. A simple solution would be to run a single thread that regularly evaluates strays (maybe every 1 or 2 minutes?); a much better approach would be to evaluate strays whenever a snapshot removal takes place, but that is not as easy as it looks. I'm therefore currently going through the code to understand the whole process (snapshot removal), and I'll try my best to come up with something as soon as possible. Thanks for the logs though. They will help me while writing the patch.
>
> Regards,
> Dhairya
>
> On Fri, Aug 5, 2022 at 6:55 PM Frank Schilder <frans@xxxxxx> wrote:
> Dear Gregory, Dan and Patrick,
>
> this is a reply to an older thread about num_stray growing without limits (thread https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/2NT55RUMD33KLGQCDZ74WINPPQ6WN6CW, message https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/FYEN2W4HGMC6CGOCS2BS4PQDRPGUSNOO/). I'm opening a new thread for a better matching subject line.
>
> I have now started testing octopus and I'm afraid I came across a very serious issue with unlimited growth of stray buckets. I'm running a test that puts constant load on a file system by adding a blob of data, creating a snapshot, deleting a blob of data and deleting a snapshot in a cyclic process. A blob of data contains about 330K hard links to make it more interesting.
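>
> Schematically, one cycle looks roughly like this (simplified; the real script differs in details such as how the blob and its hard links are created):
>
>         cp -al blobs/blob_old blobs/blob_new        # add a new blob (~330K hard links)
>         mkdir /mnt/adm/cephfs/.snap/snap_new        # take a snapshot at the FS root
>         rm -rf blobs/blob_old                       # delete an old blob
>         rmdir /mnt/adm/cephfs/.snap/snap_old        # delete an old snapshot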
>
> The benchmark failed after half a day, with rm reporting "no space left on device", which was due to the stray buckets being too full (see the old thread). OK, so I increased mds_bal_fragment_size_max and cleaned out all data to start fresh. However, this happened:
>
> [root@rit-tceph ~]# df -h /mnt/adm/cephfs
> Filesystem                             Size  Used Avail Use% Mounted on
> 10.41.24.13,10.41.24.14,10.41.24.15:/  2.5T   35G  2.5T   2% /mnt/adm/cephfs
>
> [root@rit-tceph ~]# find /mnt/adm/cephfs/
> /mnt/adm/cephfs/
> /mnt/adm/cephfs/data
> /mnt/adm/cephfs/data/blobs
>
> [root@rit-tceph ~]# find /mnt/adm/cephfs/.snap
> /mnt/adm/cephfs/.snap
>
> [root@rit-tceph ~]# find /mnt/adm/cephfs/data/.snap
> /mnt/adm/cephfs/data/.snap
>
> [root@rit-tceph ~]# find /mnt/adm/cephfs/data/blobs/.snap
> /mnt/adm/cephfs/data/blobs/.snap
>
> All snapshots were taken in /mnt/adm/cephfs/.snap. Snaptrimming finished a long time ago. Now look at this:
>
> [root@rit-tceph ~]# ssh "tceph-03" "ceph daemon mds.tceph-03 perf dump | jq .mds_cache.num_strays"
> 962562
>
> Whaaaaat?
>
> There is data left over in the fs pools and the stray buckets are clogged up.
>
> [root@rit-tceph ~]# ceph df
> --- RAW STORAGE ---
> CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd    2.4 TiB  2.4 TiB  1.4 GiB    35 GiB       1.38
> TOTAL  2.4 TiB  2.4 TiB  1.4 GiB    35 GiB       1.38
>
> --- POOLS ---
> POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> device_health_metrics   1    1  170 KiB        9  509 KiB      0    781 GiB
> fs-meta1                2   64  2.2 GiB  160.25k  6.5 GiB   0.28    781 GiB
> fs-meta2                3  128      0 B  802.40k      0 B      0    781 GiB
> fs-data                 4  128      0 B  802.40k      0 B      0    1.5 TiB
>
> There is either a very serious bug with cleaning up stray entries when their last snapshot is deleted, or I'm missing something important here when deleting data. Just for completeness:
>
> [root@rit-tceph ~]# ceph status
>   cluster:
>     id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum tceph-01,tceph-03,tceph-02 (age 10d)
>     mgr: tceph-01(active, since 10d), standbys: tceph-02, tceph-03
>     mds: fs:1 {0=tceph-03=up:active} 2 up:standby
>     osd: 9 osds: 9 up (since 4d), 9 in (since 4d)
>
>   data:
>     pools:   4 pools, 321 pgs
>     objects: 1.77M objects, 256 MiB
>     usage:   35 GiB used, 2.4 TiB / 2.4 TiB avail
>     pgs:     321 active+clean
>
> I would be most grateful for both an explanation of what happened here and a way to get out of this. To me it looks very much like unlimited growth of garbage that is never cleaned out.
>
> Many thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Gregory Farnum <gfarnum@xxxxxxxxxx>
> Sent: 08 February 2022 18:22
> To: Dan van der Ster
> Cc: Frank Schilder; Patrick Donnelly; ceph-users
> Subject: Re:  Re: cephfs: [ERR] loaded dup inode
>
> On Tue, Feb 8, 2022 at 7:30 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
> >
> > On Tue, Feb 8, 2022 at 1:04 PM Frank Schilder <frans@xxxxxx> wrote:
> > > The reason for this seemingly strange behaviour was an old static snapshot taken in an entirely different directory. Apparently, ceph fs snapshots are not local to an FS directory sub-tree but are always global on the entire FS, despite the fact that you can only access the sub-tree in the snapshot; this easily leads to the wrong conclusion that only data below the directory is in the snapshot. As a consequence, the static snapshot was accumulating the garbage from the rotating snapshots even though these sub-trees were completely disjoint.
> >
> > So are you saying that if I do this I'll have 1M files in stray?
>
> No, happily.
>
> The thing that's happening here post-dates my main previous stretch on
> CephFS and I had forgotten it, but there's a note in the developer
> docs: https://docs.ceph.com/en/latest/dev/cephfs-snapshots/#hard-links
> (I fortuitously stumbled across this from an entirely different
> direction/discussion just after seeing this thread and put the pieces
> together!)
>
> Basically, hard links are *the worst*. For everything in filesystems.
> I spent a lot of time trying to figure out how to handle hard links
> being renamed across snapshots[1] and never managed it, and the
> eventual "solution" was to give up and do the degenerate thing:
> If there's a file with multiple hard links, that file is a member of
> *every* snapshot.
>
> Doing anything about this will take a lot of time. There's probably an
> opportunity to improve it for users of the subvolumes library, as
> those subvolumes do get tagged a bit, so I'll see if we can look into
> that. But for generic CephFS, I'm not sure what the solution will look
> like at all.
>
> Sorry folks. :/
> -Greg
>
> [1]: The issue is that, if you have a hard linked file in two places,
> you would expect it to be snapshotted whenever a snapshot covering
> either location occurs. But in CephFS the file can only live in one
> location, and the other location has to just hold a reference to it
> instead. So say you have inode Y at path A, and then hard link it in
> at path B. Given how snapshots work, when you open up Y from A, you
> would need to check all the snapshots that apply from both A and B's
> trees. But 1) opening up other paths is a challenge all on its own,
> and 2) without an inode and its backtrace to provide a lookup resolve
> point, it's impossible to maintain a lookup that scales and is
> possible to keep consistent.
> (Oh, I did just have one idea, but I'm not sure if it would fix every
> issue or just that scalable backtrace lookup:
> https://tracker.ceph.com/issues/54205)
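>
> To make the scenario concrete (paths purely for illustration):
>
>         touch /A/file            # inode Y lives under /A
>         ln /A/file /B/file       # /B only holds a remote link to Y
>         mkdir /B/.snap/s1        # snapshot covering /B
>         cat /A/file              # opening Y via A would have to honour B's
>                                  # snapshots too, which needs a lookup that
>                                  # doesn't scale or stay consistent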
>
> >
> > mkdir /a
> > cd /a
> > for i in {1..1000000}; do touch $i; done  # create 1M files in /a
> > cd ..
> > mkdir /b
> > mkdir /b/.snap/testsnap  # create a snap in the empty dir /b
> > rm -rf /a/
> >
> >
> > Cheers, Dan
>
>
> --
> Dhairya Parmar
>
> He/Him/His
>
> Associate Software Engineer, CephFS
>
> Red Hat Inc.<https://www.redhat.com/>
>
> dparmar@xxxxxxxxxx
>
>


-- 
Cheers,
Venky

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



