Hi Gregory, thanks for this piece of information. There is still something missing and maybe you can add it: what is actually going to stray and what is handled differently? As described, we had rolling snapshots in 2 directories and one static one in a disjoint tree (there was no overlap even with soft or hard links). Here is the critical snippet of the time series of mds_cache.num_strays:

1643669702 1089195 Mon Jan 31 23:55:02 CET 2022 << snapshot for Feb 1st is created 10 min from now
1643756101 1260646 Tue Feb 1 23:55:01 CET 2022 << user deleted lots of data on Feb 1st
...
1644101702 1279422 Sat Feb 5 23:55:02 CET 2022 << retention ends in 10 minutes, snapshot is deleted
1644188102 1280530 Sun Feb 6 23:55:02 CET 2022 << Why why why ???
1644274502 232589 Mon Feb 7 23:55:02 CET 2022 << after deletion of static snapshot on Feb 7

On Feb 1st a user deleted a large directory tree. I don't think it contained 200k hard links, so the increase in num_strays cannot just be that. On deletion, there is absolutely no reduction in num_strays (snaptrim is finished by the time of recording). Are hard links that fall into a global stray blocking the removal of other stray entries as well? How is the time series explained?
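In case it is useful for anyone who wants to watch the same counter: samples like the ones above can be produced with a small daily cron job along these lines (just a sketch, not our exact script; mds.ceph-01 stands in for the active MDS, the log path is arbitrary, and jq needs to be installed):

  # must run on the host of the active MDS (reads the admin socket)
  # appends "epoch count human-readable date", matching the columns above
  ts=$(date +%s)
  strays=$(ceph daemon mds.ceph-01 perf dump mds_cache | jq -r '.mds_cache.num_strays')
  echo "$ts $strays $(date)" >> /var/log/ceph/num_strays.log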
On a side note, we are now down to mds_cache.num_strays=132514 and snaptrim has finished more than half of the deleted objects. Thanks for your help!

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Gregory Farnum <gfarnum@xxxxxxxxxx>
Sent: 08 February 2022 18:22:15
To: Dan van der Ster
Cc: Frank Schilder; Patrick Donnelly; ceph-users
Subject: Re: Re: cephfs: [ERR] loaded dup inode

On Tue, Feb 8, 2022 at 7:30 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
>
> On Tue, Feb 8, 2022 at 1:04 PM Frank Schilder <frans@xxxxxx> wrote:
> > The reason for this seemingly strange behaviour was an old static snapshot taken in an entirely different directory. Apparently, ceph fs snapshots are not local to an FS directory sub-tree but always global on the entire FS despite the fact that you can only access the sub-tree in the snapshot, which easily leads to the wrong conclusion that only data below the directory is in the snapshot. As a consequence, the static snapshot was accumulating the garbage from the rotating snapshots even though these sub-trees were completely disjoint.
>
> So are you saying that if I do this I'll have 1M files in stray?

No, happily. The thing that's happening here post-dates my main previous stretch on CephFS and I had forgotten it, but there's a note in the developer docs:
https://docs.ceph.com/en/latest/dev/cephfs-snapshots/#hard-links
(I fortuitously stumbled across this from an entirely different direction/discussion just after seeing this thread and put the pieces together!)

Basically, hard links are *the worst*. For everything in filesystems. I spent a lot of time trying to figure out how to handle hard links being renamed across snapshots[1] and never managed it, and the eventual "solution" was to give up and do the degenerate thing: if there's a file with multiple hard links, that file is a member of *every* snapshot.

Doing anything about this will take a lot of time. There's probably an opportunity to improve it for users of the subvolumes library, as those subvolumes do get tagged a bit, so I'll see if we can look into that. But for generic CephFS, I'm not sure what the solution will look like at all.

Sorry folks.
:/
-Greg

[1]: The issue is that, if you have a hard-linked file in two places, you would expect it to be snapshotted whenever a snapshot covering either location occurs. But in CephFS the file can only live in one location, and the other location has to just hold a reference to it instead. So say you have inode Y at path A, and then hard link it in at path B. Given how snapshots work, when you open up Y from A, you would need to check all the snapshots that apply from both A and B's trees. But 1) opening up other paths is a challenge all on its own, and 2) without an inode and its backtrace to provide a lookup resolve point, it's impossible to maintain a lookup that scales and is possible to keep consistent. (Oh, I did just have one idea, but I'm not sure if it would fix every issue or just that scalable backtrace lookup: https://tracker.ceph.com/issues/54205)

>
> mkdir /a
> cd /a
> for i in {1..1000000}; do touch $i; done # create 1M files in /a
> cd ..
> mkdir /b
> mkdir /b/.snap/testsnap # create a snap in the empty dir /b
> rm -rf /a/
>
>
> Cheers, Dan
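If I read Greg's explanation correctly, a small variant of Dan's reproducer above should show exactly the effect we ran into, even though the two trees are disjoint (untested sketch; run on a mount of the FS root, with mds.ceph-01 standing in for the actual active MDS):

  mkdir /a /b
  touch /a/file1
  ln /a/file1 /a/file1-link    # second hard link; per Greg, the file is now a member of every snapshot
  mkdir /b/.snap/testsnap      # snapshot covers only the empty, disjoint tree /b
  rm /a/file1 /a/file1-link    # remove all links while the snapshot exists
  ceph daemon mds.ceph-01 perf dump mds_cache | jq -r '.mds_cache.num_strays'   # stray should linger here
  rmdir /b/.snap/testsnap      # only after the snapshot is gone should the stray become purgeable

If that holds, num_strays would only drop back after the rmdir of the snapshot, which would match our time series.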