Hi Gregory, thanks for this piece of information. There is still something missing and maybe you can add it: what is actually going to stray and what is handled differently? As described, we had rolling snapshots in 2 directories and one static one in a disjoint tree (there was no overlap even with soft or hard links). Here is the critical snippet of the time series of mds_cache.num_strays:

1643669702 1089195 Mon Jan 31 23:55:02 CET 2022 << snapshot for Feb 1st is created 10 min from now
1643756101 1260646 Tue Feb 1 23:55:01 CET 2022 << user deleted lots of data on Feb 1st
...
1644101702 1279422 Sat Feb 5 23:55:02 CET 2022 << retention ends in 10 minutes, snapshot is deleted
1644188102 1280530 Sun Feb 6 23:55:02 CET 2022 << Why why why ???
1644274502 232589 Mon Feb 7 23:55:02 CET 2022 << after deletion of static snapshot on Feb 7

On Feb 1st a user deleted a large directory tree. I don't think it contained 200k hard links, so the increase in num_strays cannot just be that. On deletion, there is absolutely no reduction in num_strays (snaptrim is finished by the time of recording). Are hard links that fall into a global stray blocking the removal of other stray entries as well? How is the time series explained?
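In case it is useful for anyone who wants to watch the same counter: samples like the ones above can be produced with a small daily cron job along these lines (just a sketch, not our exact script; mds.ceph-01 stands in for the active MDS, the log path is arbitrary, and jq needs to be installed):

  # must run on the host of the active MDS (reads the admin socket)
  # appends "epoch count human-readable date", matching the columns above
  ts=$(date +%s)
  strays=$(ceph daemon mds.ceph-01 perf dump mds_cache | jq -r '.mds_cache.num_strays')
  echo "$ts $strays $(date)" >> /var/log/ceph/num_strays.log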
On a side note, we are now down to mds_cache.num_strays=132514 and snaptrim has finished more than half of the deleted objects. Thanks for your help!

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Gregory Farnum <gfarnum@xxxxxxxxxx>
Sent: 08 February 2022 18:22:15
To: Dan van der Ster
Cc: Frank Schilder; Patrick Donnelly; ceph-users
Subject: Re: Re: cephfs: [ERR] loaded dup inode

On Tue, Feb 8, 2022 at 7:30 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
>
> On Tue, Feb 8, 2022 at 1:04 PM Frank Schilder <frans@xxxxxx> wrote:
> > The reason for this seemingly strange behaviour was an old static snapshot taken in an entirely different directory. Apparently, ceph fs snapshots are not local to an FS directory sub-tree but always global on the entire FS despite the fact that you can only access the sub-tree in the snapshot, which easily leads to the wrong conclusion that only data below the directory is in the snapshot. As a consequence, the static snapshot was accumulating the garbage from the rotating snapshots even though these sub-trees were completely disjoint.
>
> So are you saying that if I do this I'll have 1M files in stray?

No, happily. The thing that's happening here post-dates my main previous stretch on CephFS and I had forgotten it, but there's a note in the developer docs:
https://docs.ceph.com/en/latest/dev/cephfs-snapshots/#hard-links
(I fortuitously stumbled across this from an entirely different direction/discussion just after seeing this thread and put the pieces together!)

Basically, hard links are *the worst*. For everything in filesystems. I spent a lot of time trying to figure out how to handle hard links being renamed across snapshots[1] and never managed it, and the eventual "solution" was to give up and do the degenerate thing: if there's a file with multiple hard links, that file is a member of *every* snapshot.

Doing anything about this will take a lot of time. There's probably an opportunity to improve it for users of the subvolumes library, as those subvolumes do get tagged a bit, so I'll see if we can look into that. But for generic CephFS, I'm not sure what the solution will look like at all.

Sorry folks.
:/
-Greg

[1]: The issue is that, if you have a hard-linked file in two places, you would expect it to be snapshotted whenever a snapshot covering either location occurs. But in CephFS the file can only live in one location, and the other location has to just hold a reference to it instead. So say you have inode Y at path A, and then hard link it in at path B. Given how snapshots work, when you open up Y from A, you would need to check all the snapshots that apply from both A and B's trees. But 1) opening up other paths is a challenge all on its own, and 2) without an inode and its backtrace to provide a lookup resolve point, it's impossible to maintain a lookup that scales and is possible to keep consistent. (Oh, I did just have one idea, but I'm not sure if it would fix every issue or just that scalable backtrace lookup: https://tracker.ceph.com/issues/54205)

>
> mkdir /a
> cd /a
> for i in {1..1000000}; do touch $i; done # create 1M files in /a
> cd ..
> mkdir /b
> mkdir /b/.snap/testsnap # create a snap in the empty dir /b
> rm -rf /a/
>
>
> Cheers, Dan
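If I read Greg's explanation correctly, a small variant of Dan's reproducer above should show exactly the effect we ran into, even though the two trees are disjoint (untested sketch; run on a mount of the FS root, with mds.ceph-01 standing in for the actual active MDS):

  mkdir /a /b
  touch /a/file1
  ln /a/file1 /a/file1-link    # second hard link; per Greg, the file is now a member of every snapshot
  mkdir /b/.snap/testsnap      # snapshot covers only the empty, disjoint tree /b
  rm /a/file1 /a/file1-link    # remove all links while the snapshot exists
  ceph daemon mds.ceph-01 perf dump mds_cache | jq -r '.mds_cache.num_strays'   # stray should linger here
  rmdir /b/.snap/testsnap      # only after the snapshot is gone should the stray become purgeable

If that holds, num_strays would only drop back after the rmdir of the snapshot, which would match our time series.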