On Sun, Aug 7, 2022 at 6:45 PM Frank Schilder <frans@xxxxxx> wrote:

> Hi Dhairya,
>
> I have some new results (below) and also some wishes as an operator that
> might even help with the decision you mentioned in your e-mails:
>
> - Please implement both ways: a possibility to trigger an evaluation
>   manually via a "ceph tell|daemon" command and a periodic evaluation.
> - For the periodic evaluation, please introduce a tuning parameter, for
>   example mds_gc_interval (in seconds). If set to 0, disable periodic
>   evaluation.

Actually, these are pretty good ideas! It will definitely be better to have
it both ways. I'll bring this up in our next meeting.

> Reasons:
>
> - On most production systems, doing this once per 24 hours seems enough
>   (my benchmark is very special, it needs to delete aggressively). The
>   default for mds_gc_interval could therefore be 86400 (24h).

I was thinking of a much more aggressive number of a minute or two, but if
your tests say that 86400 is a workable value, it might be very good
performance-wise as well. I have discussed this with Greg before and have
personally been brainstorming about a number to settle on; this might
actually be it (or close to it). Either way, it helps for sure. Thanks.

> - On my production system I would probably disable periodic evaluation and
>   rather do a single-shot manual evaluation some time after snapshot removal
>   but before users start working, to synchronise with snapshot removal (where
>   the "lost" entries are created).

I was also thinking about a solution where we evaluate strays as soon as we
delete a snapshot. What do you think about this on production clusters?

> This follows a general software design principle: whenever there is a
> choice like this to make, it is best to implement an API that can
> support all use cases and to leave the choice of what fits best for their
> workloads to the operators. Try not to restrict operators by hard-coding
> decisions. Rather, pick reasonable defaults but also empower operators to
> tune things to special needs. One-size-fits-all never works.

+1

> Now to the results: indeed, a restart triggers complete removal of all
> orphaned stray entries:
>
> [root@rit-tceph bench]# ./mds-stray-num
> 962562
> [root@rit-tceph bench]# ceph mds fail 0
> failed mds gid 371425
> [root@rit-tceph bench]# ./mds-stray-num
> 767329
> [root@rit-tceph bench]# ./mds-stray-num
> 766777
> [root@rit-tceph bench]# ./mds-stray-num
> 572430
> [root@rit-tceph bench]# ./mds-stray-num
> 199172
> [root@rit-tceph bench]# ./mds-stray-num
> 0

Awesome. So far it looks like this might be helpful until we come up with a
robust solution.

> # ceph df
> --- RAW STORAGE ---
> CLASS    SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd      2.4 TiB  2.4 TiB  896 MiB  25 GiB    0.99
> TOTAL    2.4 TiB  2.4 TiB  896 MiB  25 GiB    0.99
>
> --- POOLS ---
> POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> device_health_metrics   1    1  205 KiB        9  616 KiB      0    785 GiB
> fs-meta1                2   64  684 MiB       44  2.0 GiB   0.09    785 GiB
> fs-meta2                3  128      0 B        0      0 B      0    785 GiB
> fs-data                 4  128      0 B        0      0 B      0    1.5 TiB
>
> Good to see that the bookkeeping didn't lose track of anything. I will
> add a periodic mds fail to my benchmark and report back how all of this
> works under heavy load.

Good to hear it keeps track. Yes, that report will be very helpful. Thanks
in advance!

> Best regards and thanks for your help!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
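
For anyone who wants to script this stopgap in the meantime, a rough sketch
follows. It assumes the ./mds-stray-num helper used above is essentially the
perf-counter query quoted further down in this thread; the MDS name and rank
follow the session logs here, and STRAY_LIMIT is an invented threshold to
adapt to your own cluster.

#!/bin/bash
# Stopgap sketch: fail the active MDS when num_strays grows too large, so a
# standby takes over and the orphaned stray entries get purged, as observed
# in the session above. Run on the host with the MDS admin socket (or wrap
# the query in ssh, as shown later in this thread).
MDS_NAME=tceph-03        # active MDS from the session logs - adjust as needed
MDS_RANK=0               # rank to fail
STRAY_LIMIT=500000       # invented threshold - pick one that suits your cluster

NUM_STRAYS=$(ceph daemon mds.${MDS_NAME} perf dump | jq .mds_cache.num_strays)
echo "num_strays: ${NUM_STRAYS}"

if [ "${NUM_STRAYS}" -gt "${STRAY_LIMIT}" ]; then
    ceph mds fail ${MDS_RANK}
fi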
> ________________________________________
> From: Dhairya Parmar <dparmar@xxxxxxxxxx>
> Sent: 05 August 2022 22:53:09
> To: Frank Schilder
> Cc: ceph-users@xxxxxxx
> Subject: Re: cephfs: num_stray growing without bounds (octopus)
>
> On Fri, Aug 5, 2022 at 9:12 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Dhairya,
>
> thanks for pointing me to this tracker. I can try an MDS fail to see if it
> clears the stray buckets or if there are still left-overs. Before doing so:
>
> > Thanks for the logs though. It will help me while writing the patch.
>
> I couldn't tell if you were asking for logs. Do you want me to collect
> something, or do you mean the session logs included in my e-mail? Also, is
> it on purpose to leave out the ceph-users list in CC (e-mail address)?
>
> Nah, the session logs included are good enough. I missed CCing ceph-users.
> Done now.
>
> For my urgent needs, failing the MDS periodically during the benchmark
> might be an interesting addition anyway - if this helps with the stray
> count.
>
> Yeah, it might be helpful for now. Do let me know if that works for you.
>
> Thanks for your fast reply and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dhairya Parmar <dparmar@xxxxxxxxxx>
> Sent: 05 August 2022 16:10
> To: Frank Schilder
> Subject: Re: cephfs: num_stray growing without bounds (octopus)
>
> Hi Frank,
>
> This seems to be related to a tracker <https://tracker.ceph.com/issues/53724>
> that I'm working on. I've got some rough ideas in mind. A simple solution
> would be to run a single thread that regularly evaluates strays (maybe every
> 1 or 2 minutes?); a much better approach would be to evaluate strays whenever
> snapshot removal takes place, but that is not as easy as it looks. I'm
> therefore currently going through the code to understand the whole process
> (snapshot removal), and I'll try my best to come up with something as soon
> as possible. Thanks for the logs though. It will help me while writing the
> patch.
>
> Regards,
> Dhairya
>
> On Fri, Aug 5, 2022 at 6:55 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Dear Gregory, Dan and Patrick,
>
> this is a reply to an older thread about num_stray growing without limits
> (thread
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/2NT55RUMD33KLGQCDZ74WINPPQ6WN6CW,
> message
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/FYEN2W4HGMC6CGOCS2BS4PQDRPGUSNOO/).
> I'm opening a new thread for a better matching subject line.
>
> I have now started testing octopus and am afraid I came across a very
> serious issue with unlimited growth of stray buckets. I'm running a test
> that puts constant load on a file system by adding a blob of data, creating
> a snapshot, deleting a blob of data and deleting a snapshot in a cyclic
> process. A blob of data contains about 330K hard links to make it more
> interesting.
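
(For illustration only: a minimal, scaled-down sketch of the kind of delete
cycle described above. The mount point and snapshot location follow the
session logs below; file names and counts are invented.)

# Sketch of the cyclic load: add a blob with hard links, take a snapshot,
# delete the blob, delete the snapshot. Scaled down from ~330K hard links.
BLOBS=/mnt/adm/cephfs/data/blobs
SNAP=/mnt/adm/cephfs/.snap/bench          # snapshots are taken at the FS root
while true; do
    mkdir -p ${BLOBS}/blob
    for i in $(seq 1 1000); do
        touch ${BLOBS}/blob/f_${i}
        ln ${BLOBS}/blob/f_${i} ${BLOBS}/blob/h_${i}   # hard link
    done
    mkdir ${SNAP}                         # create a snapshot
    rm -rf ${BLOBS}/blob                  # delete the blob
    rmdir ${SNAP}                         # delete the snapshot
done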
> The benchmark crashed after half a day in rm with "no space left on
> device", which was due to the stray buckets being too full (old thread).
> OK, so I increased mds_bal_fragment_size_max and cleaned out all data to
> start fresh. However, this happened:
>
> [root@rit-tceph ~]# df -h /mnt/adm/cephfs
> Filesystem                             Size  Used  Avail  Use%  Mounted on
> 10.41.24.13,10.41.24.14,10.41.24.15:/  2.5T   35G   2.5T     2%  /mnt/adm/cephfs
>
> [root@rit-tceph ~]# find /mnt/adm/cephfs/
> /mnt/adm/cephfs/
> /mnt/adm/cephfs/data
> /mnt/adm/cephfs/data/blobs
>
> [root@rit-tceph ~]# find /mnt/adm/cephfs/.snap
> /mnt/adm/cephfs/.snap
>
> [root@rit-tceph ~]# find /mnt/adm/cephfs/data/.snap
> /mnt/adm/cephfs/data/.snap
>
> [root@rit-tceph ~]# find /mnt/adm/cephfs/data/blobs/.snap
> /mnt/adm/cephfs/data/blobs/.snap
>
> All snapshots were taken in /mnt/adm/cephfs/.snap. Snaptrimming finished a
> long time ago. Now look at this:
>
> [root@rit-tceph ~]# ssh "tceph-03" "ceph daemon mds.tceph-03 perf dump |
> jq .mds_cache.num_strays"
> 962562
>
> Whaaaaat?
>
> There is data left over in the fs pools and the stray buckets are clogged
> up.
>
> [root@rit-tceph ~]# ceph df
> --- RAW STORAGE ---
> CLASS    SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd      2.4 TiB  2.4 TiB  1.4 GiB  35 GiB    1.38
> TOTAL    2.4 TiB  2.4 TiB  1.4 GiB  35 GiB    1.38
>
> --- POOLS ---
> POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> device_health_metrics   1    1  170 KiB        9  509 KiB      0    781 GiB
> fs-meta1                2   64  2.2 GiB  160.25k  6.5 GiB   0.28    781 GiB
> fs-meta2                3  128      0 B  802.40k      0 B      0    781 GiB
> fs-data                 4  128      0 B  802.40k      0 B      0    1.5 TiB
>
> There is either a very serious bug with cleaning up stray entries when
> their last snapshot is deleted, or I'm missing something important here
> when deleting data. Just for completeness:
>
> [root@rit-tceph ~]# ceph status
>   cluster:
>     id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum tceph-01,tceph-03,tceph-02 (age 10d)
>     mgr: tceph-01(active, since 10d), standbys: tceph-02, tceph-03
>     mds: fs:1 {0=tceph-03=up:active} 2 up:standby
>     osd: 9 osds: 9 up (since 4d), 9 in (since 4d)
>
>   data:
>     pools:   4 pools, 321 pgs
>     objects: 1.77M objects, 256 MiB
>     usage:   35 GiB used, 2.4 TiB / 2.4 TiB avail
>     pgs:     321 active+clean
>
> I would be most grateful for both an explanation of what happened here and
> a way to get out of this. To me it looks very much like unlimited growth of
> garbage that is never cleaned out.
>
> Many thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Gregory Farnum <gfarnum@xxxxxxxxxx>
> Sent: 08 February 2022 18:22
> To: Dan van der Ster
> Cc: Frank Schilder; Patrick Donnelly; ceph-users
> Subject: Re: Re: cephfs: [ERR] loaded dup inode
>
> On Tue, Feb 8, 2022 at 7:30 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
> >
> > On Tue, Feb 8, 2022 at 1:04 PM Frank Schilder <frans@xxxxxx> wrote:
> > > The reason for this seemingly strange behaviour was an old static
> > > snapshot taken in an entirely different directory. Apparently, ceph fs
> > > snapshots are not local to an FS directory sub-tree but always global
> > > on the entire FS, despite the fact that you can only access the sub-tree
> > > in the snapshot, which easily leads to the wrong conclusion that only
> > > data below the directory is in the snapshot. As a consequence, the
> > > static snapshot was accumulating the garbage from the rotating snapshots
> > > even though these sub-trees were completely disjoint.
> >
> > So are you saying that if I do this I'll have 1M files in stray?
>
> No, happily.
>
> The thing that's happening here post-dates my main previous stretch on
> CephFS and I had forgotten it, but there's a note in the developer
> docs: https://docs.ceph.com/en/latest/dev/cephfs-snapshots/#hard-links
> (I fortuitously stumbled across this from an entirely different
> direction/discussion just after seeing this thread and put the pieces
> together!)
>
> Basically, hard links are *the worst*. For everything in filesystems.
> I spent a lot of time trying to figure out how to handle hard links
> being renamed across snapshots [1] and never managed it, and the
> eventual "solution" was to give up and do the degenerate thing:
> if there's a file with multiple hard links, that file is a member of
> *every* snapshot.
>
> Doing anything about this will take a lot of time. There's probably an
> opportunity to improve it for users of the subvolumes library, as
> those subvolumes do get tagged a bit, so I'll see if we can look into
> that. But for generic CephFS, I'm not sure what the solution will look
> like at all.
>
> Sorry folks. :/
> -Greg
>
> [1]: The issue is that, if you have a hard-linked file in two places,
> you would expect it to be snapshotted whenever a snapshot covering
> either location occurs. But in CephFS the file can only live in one
> location, and the other location has to just hold a reference to it
> instead. So say you have inode Y at path A, and then hard link it in
> at path B. Given how snapshots work, when you open up Y from A, you
> would need to check all the snapshots that apply from both A's and B's
> trees. But 1) opening up other paths is a challenge all on its own,
> and 2) without an inode and its backtrace to provide a lookup resolve
> point, it's impossible to maintain a lookup that scales and is
> possible to keep consistent.
> (Oh, I did just have one idea, but I'm not sure if it would fix every
> issue or just that scalable backtrace lookup:
> https://tracker.ceph.com/issues/54205)
>
> >
> > mkdir /a
> > cd /a
> > for i in {1..1000000}; do touch $i; done  # create 1M files in /a
> > cd ..
> > mkdir /b
> > mkdir /b/.snap/testsnap  # create a snap in the empty dir /b
> > rm -rf /a/
> >
> > Cheers, Dan
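
For contrast with Dan's quoted example above, here is a minimal sketch of the
multi-hard-link case Greg describes. The paths and names are invented for
illustration and are not taken from the thread; the point is that a second
hard link makes the file a member of every snapshot, even one taken in a
completely disjoint directory:

# A file with multiple hard links is retained by *every* snapshot, so
# deleting all of its names only turns it into a stray entry.
mkdir /a /b
echo data > /a/file
ln /a/file /a/file.hardlink     # the file now has multiple hard links
mkdir /b/.snap/static           # snapshot in an entirely different directory
rm /a/file /a/file.hardlink     # both names are gone, but the inode is kept
                                # as a stray because the snapshot pins it; per
                                # this thread, num_strays only drops again once
                                # the snapshot is removed and the strays are
                                # re-evaluated (e.g. after "ceph mds fail").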
--
Dhairya Parmar
He/Him/His
Associate Software Engineer, CephFS
Red Hat Inc. <https://www.redhat.com/>
dparmar@xxxxxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx