On Sun, Aug 7, 2022 at 6:45 PM Frank Schilder <frans@xxxxxx> wrote:

> Hi Dhairya,
>
> I have some new results (below) and also some wishes as an operator that
> might even help with the decision you mentioned in your e-mails:
>
> - Please implement both ways: a possibility to trigger an evaluation
>   manually via a "ceph tell|daemon" command and a periodic evaluation.
> - For the periodic evaluation, please introduce a tuning parameter, for
>   example mds_gc_interval (in seconds). If set to 0, disable periodic
>   evaluation.

Actually, these are pretty good ideas! It will definitely be better to have
it both ways. I'll bring this up in our next meeting.

> Reasons:
>
> - On most production systems, doing this once per 24 hours seems enough
>   (my benchmark is very special, it needs to delete aggressively). The
>   default for mds_gc_interval could therefore be 86400 (24h).

I was thinking of a much more aggressive number of a minute or two, but if
your tests say that 86400 is a workable value, it might be very good
performance-wise as well. I have discussed this with Greg before and have
personally been brainstorming about a number to settle on; this might
actually be it (or close to it). Either way, it helps for sure. Thanks.

> - On my production system I would probably disable periodic evaluation and
>   rather do a single-shot manual evaluation some time after snapshot removal
>   but before users start working, to synchronise with snapshot removal (where
>   the "lost" entries are created).

I was also thinking about a solution where we evaluate strays as soon as we
delete a snapshot. What do you think about this on production clusters?

> This follows a general software design principle: whenever there is a
> choice like this to make, it is best to implement an API that can
> support all use cases and to leave the choice of what fits best for their
> workloads to the operators. Try not to restrict operators by hard-coding
> decisions. Rather, pick reasonable defaults but also empower operators to
> tune things to special needs. One-size-fits-all never works.

+1

> Now to the results: indeed, a restart triggers complete removal of all
> orphaned stray entries:
>
> [root@rit-tceph bench]# ./mds-stray-num
> 962562
> [root@rit-tceph bench]# ceph mds fail 0
> failed mds gid 371425
> [root@rit-tceph bench]# ./mds-stray-num
> 767329
> [root@rit-tceph bench]# ./mds-stray-num
> 766777
> [root@rit-tceph bench]# ./mds-stray-num
> 572430
> [root@rit-tceph bench]# ./mds-stray-num
> 199172
> [root@rit-tceph bench]# ./mds-stray-num
> 0

Awesome. So far it looks like this might be helpful until we come up with a
robust solution.

> # ceph df
> --- RAW STORAGE ---
> CLASS    SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd      2.4 TiB  2.4 TiB  896 MiB  25 GiB    0.99
> TOTAL    2.4 TiB  2.4 TiB  896 MiB  25 GiB    0.99
>
> --- POOLS ---
> POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> device_health_metrics   1    1  205 KiB        9  616 KiB      0    785 GiB
> fs-meta1                2   64  684 MiB       44  2.0 GiB   0.09    785 GiB
> fs-meta2                3  128      0 B        0      0 B      0    785 GiB
> fs-data                 4  128      0 B        0      0 B      0    1.5 TiB
>
> Good to see that the bookkeeping didn't lose track of anything. I will
> add a periodic mds fail to my benchmark and report back how all of this
> works under heavy load.

Good to hear it keeps track. Yes, that report will be very helpful. Thanks
in advance!

> Best regards and thanks for your help!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
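
For anyone who wants to script this stopgap in the meantime, a rough sketch
follows. It assumes the ./mds-stray-num helper used above is essentially the
perf-counter query quoted further down in this thread; the MDS name and rank
follow the session logs here, and STRAY_LIMIT is an invented threshold to
adapt to your own cluster.

#!/bin/bash
# Stopgap sketch: fail the active MDS when num_strays grows too large, so a
# standby takes over and the orphaned stray entries get purged, as observed
# in the session above. Run on the host with the MDS admin socket (or wrap
# the query in ssh, as shown later in this thread).
MDS_NAME=tceph-03        # active MDS from the session logs - adjust as needed
MDS_RANK=0               # rank to fail
STRAY_LIMIT=500000       # invented threshold - pick one that suits your cluster

NUM_STRAYS=$(ceph daemon mds.${MDS_NAME} perf dump | jq .mds_cache.num_strays)
echo "num_strays: ${NUM_STRAYS}"

if [ "${NUM_STRAYS}" -gt "${STRAY_LIMIT}" ]; then
    ceph mds fail ${MDS_RANK}
fi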
> ________________________________________
> From: Dhairya Parmar <dparmar@xxxxxxxxxx>
> Sent: 05 August 2022 22:53:09
> To: Frank Schilder
> Cc: ceph-users@xxxxxxx
> Subject: Re: cephfs: num_stray growing without bounds (octopus)
>
> On Fri, Aug 5, 2022 at 9:12 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Dhairya,
>
> thanks for pointing me to this tracker. I can try an MDS fail to see if it
> clears the stray buckets or if there are still left-overs. Before doing so:
>
> > Thanks for the logs though. It will help me while writing the patch.
>
> I couldn't tell if you were asking for logs. Do you want me to collect
> something, or do you mean the session logs included in my e-mail? Also, is
> it on purpose to leave out the ceph-users list in CC (e-mail address)?
>
> Nah, the session logs included are good enough. I missed CCing ceph-users.
> Done now.
>
> For my urgent needs, failing the MDS periodically during the benchmark
> might be an interesting addition anyway - if this helps with the stray
> count.
>
> Yeah, it might be helpful for now. Do let me know if that works for you.
>
> Thanks for your fast reply and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dhairya Parmar <dparmar@xxxxxxxxxx>
> Sent: 05 August 2022 16:10
> To: Frank Schilder
> Subject: Re: cephfs: num_stray growing without bounds (octopus)
>
> Hi Frank,
>
> This seems to be related to a tracker <https://tracker.ceph.com/issues/53724>
> that I'm working on. I've got some rough ideas in mind. A simple solution
> would be to run a single thread that regularly evaluates strays (maybe every
> 1 or 2 minutes?); a much better approach would be to evaluate strays whenever
> snapshot removal takes place, but that is not as easy as it looks. I'm
> therefore currently going through the code to understand the whole process
> (snapshot removal), and I'll try my best to come up with something as soon
> as possible. Thanks for the logs though. It will help me while writing the
> patch.
>
> Regards,
> Dhairya
>
> On Fri, Aug 5, 2022 at 6:55 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Dear Gregory, Dan and Patrick,
>
> this is a reply to an older thread about num_stray growing without limits
> (thread
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/2NT55RUMD33KLGQCDZ74WINPPQ6WN6CW,
> message
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/FYEN2W4HGMC6CGOCS2BS4PQDRPGUSNOO/).
> I'm opening a new thread for a better matching subject line.
>
> I have now started testing octopus and am afraid I came across a very
> serious issue with unlimited growth of stray buckets. I'm running a test
> that puts constant load on a file system by adding a blob of data, creating
> a snapshot, deleting a blob of data and deleting a snapshot in a cyclic
> process. A blob of data contains about 330K hard links to make it more
> interesting.
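
(For illustration only: a minimal, scaled-down sketch of the kind of delete
cycle described above. The mount point and snapshot location follow the
session logs below; file names and counts are invented.)

# Sketch of the cyclic load: add a blob with hard links, take a snapshot,
# delete the blob, delete the snapshot. Scaled down from ~330K hard links.
BLOBS=/mnt/adm/cephfs/data/blobs
SNAP=/mnt/adm/cephfs/.snap/bench          # snapshots are taken at the FS root
while true; do
    mkdir -p ${BLOBS}/blob
    for i in $(seq 1 1000); do
        touch ${BLOBS}/blob/f_${i}
        ln ${BLOBS}/blob/f_${i} ${BLOBS}/blob/h_${i}   # hard link
    done
    mkdir ${SNAP}                         # create a snapshot
    rm -rf ${BLOBS}/blob                  # delete the blob
    rmdir ${SNAP}                         # delete the snapshot
done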
> The benchmark crashed after half a day in rm with "no space left on
> device", which was due to the stray buckets being too full (old thread).
> OK, so I increased mds_bal_fragment_size_max and cleaned out all data to
> start fresh. However, this happened:
>
> [root@rit-tceph ~]# df -h /mnt/adm/cephfs
> Filesystem                             Size  Used  Avail  Use%  Mounted on
> 10.41.24.13,10.41.24.14,10.41.24.15:/  2.5T   35G   2.5T     2%  /mnt/adm/cephfs
>
> [root@rit-tceph ~]# find /mnt/adm/cephfs/
> /mnt/adm/cephfs/
> /mnt/adm/cephfs/data
> /mnt/adm/cephfs/data/blobs
>
> [root@rit-tceph ~]# find /mnt/adm/cephfs/.snap
> /mnt/adm/cephfs/.snap
>
> [root@rit-tceph ~]# find /mnt/adm/cephfs/data/.snap
> /mnt/adm/cephfs/data/.snap
>
> [root@rit-tceph ~]# find /mnt/adm/cephfs/data/blobs/.snap
> /mnt/adm/cephfs/data/blobs/.snap
>
> All snapshots were taken in /mnt/adm/cephfs/.snap. Snaptrimming finished a
> long time ago. Now look at this:
>
> [root@rit-tceph ~]# ssh "tceph-03" "ceph daemon mds.tceph-03 perf dump |
> jq .mds_cache.num_strays"
> 962562
>
> Whaaaaat?
>
> There is data left over in the fs pools and the stray buckets are clogged
> up.
>
> [root@rit-tceph ~]# ceph df
> --- RAW STORAGE ---
> CLASS    SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd      2.4 TiB  2.4 TiB  1.4 GiB  35 GiB    1.38
> TOTAL    2.4 TiB  2.4 TiB  1.4 GiB  35 GiB    1.38
>
> --- POOLS ---
> POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> device_health_metrics   1    1  170 KiB        9  509 KiB      0    781 GiB
> fs-meta1                2   64  2.2 GiB  160.25k  6.5 GiB   0.28    781 GiB
> fs-meta2                3  128      0 B  802.40k      0 B      0    781 GiB
> fs-data                 4  128      0 B  802.40k      0 B      0    1.5 TiB
>
> There is either a very serious bug with cleaning up stray entries when
> their last snapshot is deleted, or I'm missing something important here
> when deleting data. Just for completeness:
>
> [root@rit-tceph ~]# ceph status
>   cluster:
>     id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum tceph-01,tceph-03,tceph-02 (age 10d)
>     mgr: tceph-01(active, since 10d), standbys: tceph-02, tceph-03
>     mds: fs:1 {0=tceph-03=up:active} 2 up:standby
>     osd: 9 osds: 9 up (since 4d), 9 in (since 4d)
>
>   data:
>     pools:   4 pools, 321 pgs
>     objects: 1.77M objects, 256 MiB
>     usage:   35 GiB used, 2.4 TiB / 2.4 TiB avail
>     pgs:     321 active+clean
>
> I would be most grateful for both an explanation of what happened here and
> a way to get out of this. To me it looks very much like unlimited growth of
> garbage that is never cleaned out.
>
> Many thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Gregory Farnum <gfarnum@xxxxxxxxxx>
> Sent: 08 February 2022 18:22
> To: Dan van der Ster
> Cc: Frank Schilder; Patrick Donnelly; ceph-users
> Subject: Re: Re: cephfs: [ERR] loaded dup inode
>
> On Tue, Feb 8, 2022 at 7:30 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
> >
> > On Tue, Feb 8, 2022 at 1:04 PM Frank Schilder <frans@xxxxxx> wrote:
> > > The reason for this seemingly strange behaviour was an old static
> > > snapshot taken in an entirely different directory. Apparently, ceph fs
> > > snapshots are not local to an FS directory sub-tree but always global
> > > on the entire FS, despite the fact that you can only access the sub-tree
> > > in the snapshot, which easily leads to the wrong conclusion that only
> > > data below the directory is in the snapshot. As a consequence, the
> > > static snapshot was accumulating the garbage from the rotating snapshots
> > > even though these sub-trees were completely disjoint.
> >
> > So are you saying that if I do this I'll have 1M files in stray?
>
> No, happily.
>
> The thing that's happening here post-dates my main previous stretch on
> CephFS and I had forgotten it, but there's a note in the developer
> docs: https://docs.ceph.com/en/latest/dev/cephfs-snapshots/#hard-links
> (I fortuitously stumbled across this from an entirely different
> direction/discussion just after seeing this thread and put the pieces
> together!)
>
> Basically, hard links are *the worst*. For everything in filesystems.
> I spent a lot of time trying to figure out how to handle hard links
> being renamed across snapshots [1] and never managed it, and the
> eventual "solution" was to give up and do the degenerate thing:
> if there's a file with multiple hard links, that file is a member of
> *every* snapshot.
>
> Doing anything about this will take a lot of time. There's probably an
> opportunity to improve it for users of the subvolumes library, as
> those subvolumes do get tagged a bit, so I'll see if we can look into
> that. But for generic CephFS, I'm not sure what the solution will look
> like at all.
>
> Sorry folks. :/
> -Greg
>
> [1]: The issue is that, if you have a hard-linked file in two places,
> you would expect it to be snapshotted whenever a snapshot covering
> either location occurs. But in CephFS the file can only live in one
> location, and the other location has to just hold a reference to it
> instead. So say you have inode Y at path A, and then hard link it in
> at path B. Given how snapshots work, when you open up Y from A, you
> would need to check all the snapshots that apply from both A's and B's
> trees. But 1) opening up other paths is a challenge all on its own,
> and 2) without an inode and its backtrace to provide a lookup resolve
> point, it's impossible to maintain a lookup that scales and is
> possible to keep consistent.
> (Oh, I did just have one idea, but I'm not sure if it would fix every
> issue or just that scalable backtrace lookup:
> https://tracker.ceph.com/issues/54205)
>
> >
> > mkdir /a
> > cd /a
> > for i in {1..1000000}; do touch $i; done  # create 1M files in /a
> > cd ..
> > mkdir /b
> > mkdir /b/.snap/testsnap  # create a snap in the empty dir /b
> > rm -rf /a/
> >
> > Cheers, Dan
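
For contrast with Dan's quoted example above, here is a minimal sketch of the
multi-hard-link case Greg describes. The paths and names are invented for
illustration and are not taken from the thread; the point is that a second
hard link makes the file a member of every snapshot, even one taken in a
completely disjoint directory:

# A file with multiple hard links is retained by *every* snapshot, so
# deleting all of its names only turns it into a stray entry.
mkdir /a /b
echo data > /a/file
ln /a/file /a/file.hardlink     # the file now has multiple hard links
mkdir /b/.snap/static           # snapshot in an entirely different directory
rm /a/file /a/file.hardlink     # both names are gone, but the inode is kept
                                # as a stray because the snapshot pins it; per
                                # this thread, num_strays only drops again once
                                # the snapshot is removed and the strays are
                                # re-evaluated (e.g. after "ceph mds fail").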
--
Dhairya Parmar
He/Him/His
Associate Software Engineer, CephFS
Red Hat Inc. <https://www.redhat.com/>
dparmar@xxxxxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx