Hi Dan and Patrick,

I found the culprit. It originates from the counter-intuitive non-locality of ceph fs snapshots. For the benefit of future readers, I include my findings below. Question to you: looking at these findings, does it ever make sense to create snapshots in any location other than the file system root itself?

The situation in a nutshell: we have a script rotating snapshots in two directories with a retention time of 5 days. This should imply that data deleted by users under these directories is cleared out by snap trim after 5 days. During these 5 days, the deleted data populates the so-called stray buckets. Contrary to this expectation, our observation was that the number of elements in the stray buckets was increasing continuously; see the data below.

The reason for this seemingly strange behaviour was an old static snapshot taken in an entirely different directory. Apparently, ceph fs snapshots are not local to an FS directory sub-tree but always global on the entire FS, even though you can only access the sub-tree in the snapshot, which easily leads to the wrong conclusion that only data below the directory is in the snapshot. As a consequence, the static snapshot was accumulating the garbage from the rotating snapshots even though these sub-trees are completely disjoint.

For anyone who wants to preserve the state of a sub-tree for a long time: create a snapshot, copy the tree and then delete the snapshot (see the sketch at the end of this mail). With the current behaviour, ceph fs snapshots are strictly for temporary use only.

Here is the data collected over the last couple of weeks:

date (epoch sec) | stray count | date (human readable)

1642114502   1090655   Thu Jan 13 23:55:02 CET 2022
1642200902   1032090   Fri Jan 14 23:55:02 CET 2022
1642287302   1026299   Sat Jan 15 23:55:02 CET 2022
1642373701   1026636   Sun Jan 16 23:55:01 CET 2022
1642460102   1020264   Mon Jan 17 23:55:02 CET 2022
1642546502   1040033   Tue Jan 18 23:55:02 CET 2022
1642632902   1052328   Wed Jan 19 23:55:02 CET 2022
1642719302   1064312   Thu Jan 20 23:55:02 CET 2022
1642805702   1067976   Fri Jan 21 23:55:02 CET 2022
1642892101   1064525   Sat Jan 22 23:55:01 CET 2022
1642978502   1049225   Sun Jan 23 23:55:02 CET 2022
1643064902   1054401   Mon Jan 24 23:55:02 CET 2022
1643151302   1104128   Tue Jan 25 23:55:02 CET 2022
1643237702   1116269   Wed Jan 26 23:55:02 CET 2022
1643324102   1118656   Thu Jan 27 23:55:02 CET 2022
1643410502   1129084   Fri Jan 28 23:55:02 CET 2022
1643496902   1127124   Sat Jan 29 23:55:02 CET 2022
1643583302   1090388   Sun Jan 30 23:55:02 CET 2022
1643669702   1089195   Mon Jan 31 23:55:02 CET 2022   << snapshot for Feb 1st is created 10 min from now
1643756101   1260646   Tue Feb  1 23:55:01 CET 2022   << user deleted lots of data on Feb 1st
1643842502   1269309   Wed Feb  2 23:55:02 CET 2022
1643928902   1275916   Thu Feb  3 23:55:02 CET 2022
1644015302   1279648   Fri Feb  4 23:55:02 CET 2022
1644101702   1279422   Sat Feb  5 23:55:02 CET 2022   << retention ends in 10 minutes, snapshot is deleted
1644188102   1280530   Sun Feb  6 23:55:02 CET 2022   << Why why why ???
1644274502    232589   Mon Feb  7 23:55:02 CET 2022   << after deletion of static snapshot on Feb 7

After deleting the static snapshot, the snap trim process is churning with great courage through the garbage of at least half a year of deleted files. So far it has removed about 100M objects from the file system pools.

Thanks for your help!
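PS: to spell out the work-around for preserving a sub-tree long-term (snapshot, copy, delete the snapshot), it is roughly the following; paths and the snapshot name are made up, the .snap mechanism is the usual way to create and remove ceph fs snapshots:

# create the snapshot via the .snap directory of the sub-tree
mkdir /mnt/cephfs/shares/project/.snap/archive-2022-02-07

# copy the frozen state out of the snapshot to a regular directory (or any other storage)
rsync -a /mnt/cephfs/shares/project/.snap/archive-2022-02-07/ /mnt/cephfs/archive/project-2022-02-07/

# delete the snapshot again so it does not pin deleted data from the rest of the FS
rmdir /mnt/cephfs/shares/project/.snap/archive-2022-02-07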
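And for anyone who wants to watch the stray count on their own cluster: the MDS exposes it as a perf counter, so a small daily cron job along these lines should be enough (just a sketch; the MDS name and the log path are placeholders, adjust to your deployment):

#!/bin/bash
# append "epoch  stray count  human readable date" to a log file
# assumes the admin socket of the active MDS is reachable on this host and jq is installed
MDS_NAME="ceph-08"   # placeholder, use your MDS
STRAYS=$(ceph daemon "mds.${MDS_NAME}" perf dump | jq '.mds_cache.num_strays')
echo "$(date +%s) ${STRAYS} $(date)" >> /var/log/cephfs-stray-count.log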
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Sent: 20 January 2022 19:00
To: Frank Schilder
Cc: Dan van der Ster; ceph-users
Subject: Re: Re: cephfs: [ERR] loaded dup inode

Hi Frank,

On Tue, Jan 18, 2022 at 4:54 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Dan and Patrick,
>
> this problem seems to develop into a nightmare. I executed a find on the file system and had some initial success. The number of stray files dropped by about 8%. Unfortunately, this is about it. I'm running a find now also on snap dirs, but I don't have much hope. There must be a way to find out what is accumulating in the stray buckets. As I wrote in another reply to this thread, I can't dump the trees:
>
> > I seem to have a problem. I cannot dump the mds tree:
> >
> > [root@ceph-08 ~]# ceph daemon mds.ceph-08 dump tree '~mdsdir/stray0'
> > root inode is not in cache
> > [root@ceph-08 ~]# ceph daemon mds.ceph-08 dump tree '~mds0/stray0'
> > root inode is not in cache
> > [root@ceph-08 ~]# ceph daemon mds.ceph-08 dump tree '~mds0' 0
> > root inode is not in cache
> > [root@ceph-08 ~]# ceph daemon mds.ceph-08 dump tree '~mdsdir' 0
> > root inode is not in cache
> >
> > [root@ceph-08 ~]# ceph daemon mds.ceph-08 get subtrees | grep path
> >     "path": "",
> >     "path": "~mds0",
> >
> However, this information is somewhere in rados objects and it should be possible to figure something out similar to
>
> # rados getxattr --pool=con-fs2-meta1 <OBJ_ID> parent | ceph-dencoder type inode_backtrace_t import - decode dump_json
> # rados listomapkeys --pool=con-fs2-meta1 <OBJ_ID>
>
> What OBJ_IDs am I looking for? How and where can I start to traverse the structure? Version is mimic latest stable.

You mentioned you have snapshots? If you've deleted the directories that have been snapshotted then they stick around in the stray directory until the snapshot is deleted. There's no way to force purging until the snapshot is also deleted. For this reason, the stray directory size can grow without bound.

You need to either upgrade to Pacific where the stray directory will be fragmented or remove the snapshots.

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx