Dear Gregory, Dan and Patrick,

this is a reply to an older thread about num_stray growing without limit (thread https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/2NT55RUMD33KLGQCDZ74WINPPQ6WN6CW, message https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/FYEN2W4HGMC6CGOCS2BS4PQDRPGUSNOO/). I'm opening a new thread for a better matching subject line.

I have now started testing Octopus and am afraid I came across a very serious issue with unlimited growth of stray buckets. I'm running a test that puts constant load on a file system by adding a blob of data, creating a snapshot, deleting a blob of data and deleting a snapshot, in a cyclic process. Each blob of data contains about 330K hard links to make it more interesting. (A sketch of one benchmark cycle and the config change mentioned below are at the end of this mail.)

The benchmark crashed after half a day in rm with "no space left on device", which was due to the stray buckets being full (see the old thread). OK, so I increased mds_bal_fragment_size_max and cleaned out all data to start fresh. However, this happened:

[root@rit-tceph ~]# df -h /mnt/adm/cephfs
Filesystem                             Size  Used Avail Use% Mounted on
10.41.24.13,10.41.24.14,10.41.24.15:/  2.5T   35G  2.5T   2% /mnt/adm/cephfs

[root@rit-tceph ~]# find /mnt/adm/cephfs/
/mnt/adm/cephfs/
/mnt/adm/cephfs/data
/mnt/adm/cephfs/data/blobs

[root@rit-tceph ~]# find /mnt/adm/cephfs/.snap
/mnt/adm/cephfs/.snap
[root@rit-tceph ~]# find /mnt/adm/cephfs/data/.snap
/mnt/adm/cephfs/data/.snap
[root@rit-tceph ~]# find /mnt/adm/cephfs/data/blobs/.snap
/mnt/adm/cephfs/data/blobs/.snap

All snapshots were taken in /mnt/adm/cephfs/.snap. Snap trimming finished a long time ago. Now look at this:

[root@rit-tceph ~]# ssh "tceph-03" "ceph daemon mds.tceph-03 perf dump | jq .mds_cache.num_strays"
962562

Whaaaaat? There is data left over in the fs pools and the stray buckets are clogged up:

[root@rit-tceph ~]# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    2.4 TiB  2.4 TiB  1.4 GiB    35 GiB       1.38
TOTAL  2.4 TiB  2.4 TiB  1.4 GiB    35 GiB       1.38

--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1  170 KiB        9  509 KiB      0    781 GiB
fs-meta1                2   64  2.2 GiB  160.25k  6.5 GiB   0.28    781 GiB
fs-meta2                3  128      0 B  802.40k      0 B      0    781 GiB
fs-data                 4  128      0 B  802.40k      0 B      0    1.5 TiB

Either there is a very serious bug with cleaning up stray entries when their last snapshot is deleted, or I'm missing something important about how data gets deleted. Just for completeness:

[root@rit-tceph ~]# ceph status
  cluster:
    id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum tceph-01,tceph-03,tceph-02 (age 10d)
    mgr: tceph-01(active, since 10d), standbys: tceph-02, tceph-03
    mds: fs:1 {0=tceph-03=up:active} 2 up:standby
    osd: 9 osds: 9 up (since 4d), 9 in (since 4d)

  data:
    pools:   4 pools, 321 pgs
    objects: 1.77M objects, 256 MiB
    usage:   35 GiB used, 2.4 TiB / 2.4 TiB avail
    pgs:     321 active+clean

I would be most grateful for both an explanation of what happened here and a way to get out of it. To me it looks very much like unlimited growth of garbage that is never cleaned up.
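P.S. For reference, one cycle of the benchmark looks roughly like the sketch below. The paths match the mount point above, but the blob generator is a simplified stand-in for the actual benchmark code, so take it as an illustration of the access pattern rather than the real thing:

# one benchmark cycle (sketch); the real test runs this in a loop,
# keeping several blobs and snapshots in rotation
FS=/mnt/adm/cephfs
TAG=$(date +%s)

# add a blob of data: one base file plus ~330K hard links to it
mkdir -p "$FS/data/blobs/blob-$TAG"
dd if=/dev/urandom of="$FS/data/blobs/blob-$TAG/base" bs=4M count=1
for i in $(seq 1 330000); do
    ln "$FS/data/blobs/blob-$TAG/base" "$FS/data/blobs/blob-$TAG/link-$i"
done

# create a snapshot at the fs root
mkdir "$FS/.snap/snap-$TAG"

# delete the oldest blob and the oldest snapshot
rm -rf "$(ls -d "$FS"/data/blobs/blob-* | head -n 1)"
rmdir "$(ls -d "$FS"/.snap/snap-* | head -n 1)"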
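And this is how mds_bal_fragment_size_max was raised, via the central config (the value shown here is just an example, not a recommendation):

ceph config set mds mds_bal_fragment_size_max 1000000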
Many thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Gregory Farnum <gfarnum@xxxxxxxxxx>
Sent: 08 February 2022 18:22
To: Dan van der Ster
Cc: Frank Schilder; Patrick Donnelly; ceph-users
Subject: Re: Re: cephfs: [ERR] loaded dup inode

On Tue, Feb 8, 2022 at 7:30 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
>
> On Tue, Feb 8, 2022 at 1:04 PM Frank Schilder <frans@xxxxxx> wrote:
> > The reason for this seemingly strange behaviour was an old static snapshot taken in an entirely different directory. Apparently, ceph fs snapshots are not local to an FS directory sub-tree but always global to the entire FS, despite the fact that you can only access the sub-tree in the snapshot, which easily leads to the wrong conclusion that only data below the directory is in the snapshot. As a consequence, the static snapshot was accumulating the garbage from the rotating snapshots even though these sub-trees were completely disjoint.
>
> So are you saying that if I do this I'll have 1M files in stray?

No, happily. The thing that's happening here post-dates my main previous stretch on CephFS and I had forgotten it, but there's a note in the developer docs:
https://docs.ceph.com/en/latest/dev/cephfs-snapshots/#hard-links
(I fortuitously stumbled across this from an entirely different direction/discussion just after seeing this thread and put the pieces together!)

Basically, hard links are *the worst*. For everything in filesystems. I spent a lot of time trying to figure out how to handle hard links being renamed across snapshots[1] and never managed it, and the eventual "solution" was to give up and do the degenerate thing: if there's a file with multiple hard links, that file is a member of *every* snapshot.

Doing anything about this will take a lot of time. There's probably an opportunity to improve it for users of the subvolumes library, as those subvolumes do get tagged a bit, so I'll see if we can look into that. But for generic CephFS, I'm not sure what the solution will look like at all.

Sorry folks. :/
-Greg

[1]: The issue is that, if you have a hard-linked file in two places, you would expect it to be snapshotted whenever a snapshot covering either location occurs. But in CephFS the file can only live in one location, and the other location has to just hold a reference to it instead. So say you have inode Y at path A, and then hard link it in at path B. Given how snapshots work, when you open up Y from A, you would need to check all the snapshots that apply from both A's and B's trees. But 1) opening up other paths is a challenge all on its own, and 2) without an inode and its backtrace to provide a lookup resolve point, it's impossible to maintain a lookup that scales and is possible to keep consistent. (Oh, I did just have one idea, but I'm not sure if it would fix every issue or just that scalable backtrace lookup: https://tracker.ceph.com/issues/54205)

>
> mkdir /a
> cd /a
> for i in {1..1000000}; do touch $i; done   # create 1M files in /a
> cd ..
> mkdir /b
> mkdir /b/.snap/testsnap   # create a snap in the empty dir /b
> rm -rf /a/
>
>
> Cheers, Dan
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx