Hi Dan and Patrick,

I found the culprit. It originates from the counter-intuitive non-locality of ceph fs snapshots. For the benefit of future readers, I include my findings below. Question to you: looking at these findings, does it ever make sense to create snapshots in any location other than the file system root itself?

The situation in a nutshell: we have a script rotating snapshots in two directories with a retention time of 5 days. This should imply that data deleted by users under these directories is cleared out by snap trim after 5 days. During these 5 days, the deleted data populates the so-called stray buckets. Contrary to this expectation, our observation was that the number of elements in the stray buckets was increasing continuously; see the data below.

The reason for this seemingly strange behaviour was an old static snapshot taken in an entirely different directory. Apparently, ceph fs snapshots are not local to an FS directory sub-tree but always global on the entire FS, even though you can only access the sub-tree in the snapshot, which easily leads to the wrong conclusion that only data below the directory is in the snapshot. As a consequence, the static snapshot was accumulating the garbage from the rotating snapshots even though these sub-trees are completely disjoint.

For anyone who wants to preserve the state of a sub-tree for a long time: create a snapshot, copy the tree and then delete the snapshot (see the sketch at the end of this mail). With the current behaviour, ceph fs snapshots are strictly for temporary use only.

Here is the data collected over the last couple of weeks:

date (epoch sec) | stray count | date (human readable)

1642114502   1090655   Thu Jan 13 23:55:02 CET 2022
1642200902   1032090   Fri Jan 14 23:55:02 CET 2022
1642287302   1026299   Sat Jan 15 23:55:02 CET 2022
1642373701   1026636   Sun Jan 16 23:55:01 CET 2022
1642460102   1020264   Mon Jan 17 23:55:02 CET 2022
1642546502   1040033   Tue Jan 18 23:55:02 CET 2022
1642632902   1052328   Wed Jan 19 23:55:02 CET 2022
1642719302   1064312   Thu Jan 20 23:55:02 CET 2022
1642805702   1067976   Fri Jan 21 23:55:02 CET 2022
1642892101   1064525   Sat Jan 22 23:55:01 CET 2022
1642978502   1049225   Sun Jan 23 23:55:02 CET 2022
1643064902   1054401   Mon Jan 24 23:55:02 CET 2022
1643151302   1104128   Tue Jan 25 23:55:02 CET 2022
1643237702   1116269   Wed Jan 26 23:55:02 CET 2022
1643324102   1118656   Thu Jan 27 23:55:02 CET 2022
1643410502   1129084   Fri Jan 28 23:55:02 CET 2022
1643496902   1127124   Sat Jan 29 23:55:02 CET 2022
1643583302   1090388   Sun Jan 30 23:55:02 CET 2022
1643669702   1089195   Mon Jan 31 23:55:02 CET 2022   << snapshot for Feb 1st is created 10 min from now
1643756101   1260646   Tue Feb  1 23:55:01 CET 2022   << user deleted lots of data on Feb 1st
1643842502   1269309   Wed Feb  2 23:55:02 CET 2022
1643928902   1275916   Thu Feb  3 23:55:02 CET 2022
1644015302   1279648   Fri Feb  4 23:55:02 CET 2022
1644101702   1279422   Sat Feb  5 23:55:02 CET 2022   << retention ends in 10 minutes, snapshot is deleted
1644188102   1280530   Sun Feb  6 23:55:02 CET 2022   << Why why why ???
1644274502    232589   Mon Feb  7 23:55:02 CET 2022   << after deletion of static snapshot on Feb 7

After deleting the static snapshot, the snap trim process is churning with great courage through the garbage of at least half a year of deleted files. So far it has removed about 100M objects from the file system pools.

Thanks for your help!
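PS: to spell out the work-around for preserving a sub-tree long-term (snapshot, copy, delete the snapshot), it is roughly the following; paths and the snapshot name are made up, the .snap mechanism is the usual way to create and remove ceph fs snapshots:

# create the snapshot via the .snap directory of the sub-tree
mkdir /mnt/cephfs/shares/project/.snap/archive-2022-02-07

# copy the frozen state out of the snapshot to a regular directory (or any other storage)
rsync -a /mnt/cephfs/shares/project/.snap/archive-2022-02-07/ /mnt/cephfs/archive/project-2022-02-07/

# delete the snapshot again so it does not pin deleted data from the rest of the FS
rmdir /mnt/cephfs/shares/project/.snap/archive-2022-02-07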
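And for anyone who wants to watch the stray count on their own cluster: the MDS exposes it as a perf counter, so a small daily cron job along these lines should be enough (just a sketch; the MDS name and the log path are placeholders, adjust to your deployment):

#!/bin/bash
# append "epoch  stray count  human readable date" to a log file
# assumes the admin socket of the active MDS is reachable on this host and jq is installed
MDS_NAME="ceph-08"   # placeholder, use your MDS
STRAYS=$(ceph daemon "mds.${MDS_NAME}" perf dump | jq '.mds_cache.num_strays')
echo "$(date +%s) ${STRAYS} $(date)" >> /var/log/cephfs-stray-count.log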
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Sent: 20 January 2022 19:00
To: Frank Schilder
Cc: Dan van der Ster; ceph-users
Subject: Re: Re: cephfs: [ERR] loaded dup inode

Hi Frank,

On Tue, Jan 18, 2022 at 4:54 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Dan and Patrick,
>
> this problem seems to develop into a nightmare. I executed a find on the file system and had some initial success. The number of stray files dropped by about 8%. Unfortunately, this is about it. I'm running a find now also on snap dirs, but I don't have much hope. There must be a way to find out what is accumulating in the stray buckets. As I wrote in another reply to this thread, I can't dump the trees:
>
> > I seem to have a problem. I cannot dump the mds tree:
> >
> > [root@ceph-08 ~]# ceph daemon mds.ceph-08 dump tree '~mdsdir/stray0'
> > root inode is not in cache
> > [root@ceph-08 ~]# ceph daemon mds.ceph-08 dump tree '~mds0/stray0'
> > root inode is not in cache
> > [root@ceph-08 ~]# ceph daemon mds.ceph-08 dump tree '~mds0' 0
> > root inode is not in cache
> > [root@ceph-08 ~]# ceph daemon mds.ceph-08 dump tree '~mdsdir' 0
> > root inode is not in cache
> >
> > [root@ceph-08 ~]# ceph daemon mds.ceph-08 get subtrees | grep path
> >     "path": "",
> >     "path": "~mds0",
> >
> However, this information is somewhere in rados objects and it should be possible to figure something out similar to
>
> # rados getxattr --pool=con-fs2-meta1 <OBJ_ID> parent | ceph-dencoder type inode_backtrace_t import - decode dump_json
> # rados listomapkeys --pool=con-fs2-meta1 <OBJ_ID>
>
> What OBJ_IDs am I looking for? How and where can I start to traverse the structure? Version is mimic latest stable.

You mentioned you have snapshots? If you've deleted the directories that have been snapshotted then they stick around in the stray directory until the snapshot is deleted. There's no way to force purging until the snapshot is also deleted. For this reason, the stray directory size can grow without bound.

You need to either upgrade to Pacific where the stray directory will be fragmented or remove the snapshots.

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx