Re: CephFS - How to handle "loaded dup inode" errors

On Tue, Jul 10, 2018 at 3:14 PM Dennis Kramer (DBS) <dennis@xxxxxxxxx> wrote:
>
> Hi John,
>
> On Tue, 2018-07-10 at 10:11 +0100, John Spray wrote:
> > On Tue, Jul 10, 2018 at 12:43 AM Linh Vu <vul@xxxxxxxxxxxxxx> wrote:
> > >
> > >
> > > We're affected by something like this right now (the dup inode
> > > causing MDS to crash via assert(!p) with add_inode(CInode)
> > > function).
> > >
> > > In terms of behaviour, shouldn't the MDS simply skip to the next
> > > available free inode in the event of a dup, rather than crashing the
> > > entire FS because of one file? Probably I'm missing something, but
> > > that'd be a no-brainer picking between the two?
> > Historically (a few years ago) the MDS asserted out on any invalid
> > metadata.  Most of these cases have been picked up and converted into
> > explicit damage handling, but this one appears to have been missed --
> > so yes, it's a bug that the MDS asserts out.
>
> I have followed the disaster recovery procedure and now all the files
> and directories in CephFS which complained about duplicate inodes have
> disappeared from my FS. I see *some* data in "lost+found", but that's
> only a part of it. Is there any way to retrieve those missing files?

If you had multiple files trying to use the same inode number, then
the contents of the data pool would only have been storing the
contents of one of those files (or, worst case, some interspersed
mixture of both files).  So the chances are that if something wasn't
linked into lost+found, it is gone for good.
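
If it helps to check specific cases: CephFS data objects are named
after the file's inode number in hex, so for an inode mentioned in the
"loaded dup inode" errors you can at least see whether any objects
survive for it.  A rough sketch, assuming your data pool is named
"cephfs_data" (using the inode number from the log excerpt further
down this thread):

    # list any surviving data objects for inode 0x10000991921
    rados -p cephfs_data ls | grep '^10000991921\.'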

Now that your damaged filesystem is up and running again, if you have
the capacity then it's a good precaution to create a fresh filesystem,
copy the files over, and then restore anything missing from backups.
The multi-filesystem functionality is officially an experimental
feature (mainly because it gets little testing), but when you've gone
through a metadata damage episode it's the lesser of two evils.
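
A minimal sketch of what creating the second filesystem might look
like on Luminous (the pool names and PG counts here are only
placeholders):

    ceph fs flag set enable_multiple true --yes-i-really-mean-it
    ceph osd pool create cephfs_new_metadata 64
    ceph osd pool create cephfs_new_data 64
    ceph fs new cephfs_new cephfs_new_metadata cephfs_new_data

Clients can then be pointed at the new filesystem (e.g. ceph-fuse with
--client_mds_namespace=cephfs_new) to copy the data across.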

John

>
> > John
> >
> > >
> > > ________________________________
> > > From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of
> > > Wido den Hollander <wido@xxxxxxxx>
> > > Sent: Saturday, 7 July 2018 12:26:15 AM
> > > To: John Spray
> > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > Subject: Re:  CephFS - How to handle "loaded dup inode"
> > > errors
> > >
> > >
> > >
> > > On 07/06/2018 01:47 PM, John Spray wrote:
> > > >
> > > > > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander <wido@xxxxxxxx> wrote:
> > > > >
> > > > > On 07/05/2018 03:36 PM, John Spray wrote:
> > > > > >
> > > > > > On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS) <dennis@holmes.nl> wrote:
> > > > > > >
> > > > > > >
> > > > > > > Hi list,
> > > > > > >
> > > > > > > I have a serious problem now... I think.
> > > > > > >
> > > > > > > One of my users just informed me that a file he created
> > > > > > > (.doc file) has different content than before. It looks like
> > > > > > > the file's inode is completely wrong and points to the wrong
> > > > > > > object. I myself have found another file with the same
> > > > > > > symptoms. I'm afraid my (production) FS is corrupt now,
> > > > > > > unless there is a possibility to fix the inodes.
> > > > > > You can probably get back to a state with some valid
> > > > > > metadata, but it might not necessarily be the metadata the
> > > > > > user was expecting (e.g. if two files are claiming the same
> > > > > > inode number, one of them is probably going to get deleted).
> > > > > >
> > > > > > >
> > > > > > > Timeline of what happened:
> > > > > > >
> > > > > > > Last week I upgraded our Ceph Jewel to Luminous.
> > > > > > > This went without any problem.
> > > > > > >
> > > > > > > I already had 5 MDS available and went with the Multi-MDS
> > > > > > > feature and enabled it. This seemed to work okay, but after
> > > > > > > a while my MDS went berserk and went flapping (crashed ->
> > > > > > > replay -> rejoin -> crashed).
> > > > > > >
> > > > > > > The only way to fix this and get the FS back online was the
> > > > > > > disaster
> > > > > > > recovery procedure:
> > > > > > >
> > > > > > > cephfs-journal-tool event recover_dentries summary
> > > > > > > ceph fs set cephfs cluster_down true
> > > > > > > cephfs-table-tool all reset session
> > > > > > > cephfs-table-tool all reset inode
> > > > > > > cephfs-journal-tool --rank=cephfs:0 journal reset
> > > > > > > ceph mds fail 0
> > > > > > > ceph fs reset cephfs --yes-i-really-mean-it
> > > > > > My concern with this procedure is that the recover_dentries
> > > > > > and
> > > > > > journal reset only happened on rank 0, whereas the other 4
> > > > > > MDS ranks
> > > > > > would have retained lots of content in their journals.  I
> > > > > > wonder if we
> > > > > > should be adding some more multi-mds aware checks to these
> > > > > > tools, to
> > > > > > warn the user when they're only acting on particular ranks (a
> > > > > > reasonable person might assume that recover_dentries with no
> > > > > > args is operating on all ranks, not just 0).  Created
> > > > > > https://protect-au.mimecast.com/s/PZyQCJypvAfnP9VmfVwGUS?domain=tracker.ceph.com
> > > > > > to track improving the default behaviour.
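> > > > > >
> > > > > > For reference, acting on the other ranks explicitly would look
> > > > > > something like this (a sketch, assuming a filesystem named
> > > > > > "cephfs" with ranks 0-4):
> > > > > >
> > > > > >     cephfs-journal-tool --rank=cephfs:1 event recover_dentries summary
> > > > > >     cephfs-journal-tool --rank=cephfs:1 journal reset
> > > > > >
> > > > > > and likewise for ranks 2, 3 and 4.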
> > > > > >
> > > > > > >
> > > > > > > Restarted the MDS and I was back online. Shortly after I
> > > > > > > was getting a
> > > > > > > lot of "loaded dup inode". In the meanwhile the MDS kept
> > > > > > > crashing. It
> > > > > > > looks like it had trouble creating new inodes. Right before
> > > > > > > the crash
> > > > > > > it mostly complained something like:
> > > > > > >
> > > > > > >     -2> 2018-07-05 05:05:01.614290 7f8f8574b700  4 mds.0.server handle_client_request client_request(client.324932014:1434 create #0x10000360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0, caller_gid=0{}) v2
> > > > > > >     -1> 2018-07-05 05:05:01.614320 7f8f7e73d700  5 mds.0.log _submit_thread 24100753876035~1070 : EOpen [metablob 0x10000360346, 1 dirs], 1 open files
> > > > > > >      0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1 /build/ceph-12.2.5/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05 05:05:01.615123
> > > > > > > /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)
> > > > > > >
> > > > > > > I also tried to counter the create inode crash by doing the
> > > > > > > following:
> > > > > > >
> > > > > > > cephfs-journal-tool event recover_dentries
> > > > > > > cephfs-journal-tool journal reset
> > > > > > > cephfs-table-tool all reset session
> > > > > > > cephfs-table-tool all reset inode
> > > > > > > cephfs-table-tool all take_inos 100000
> > > > > > This procedure is recovering some metadata from the journal
> > > > > > into the
> > > > > > main tree, then resetting everything, but duplicate inodes
> > > > > > are
> > > > > > happening when the main tree has multiple dentries containing
> > > > > > inodes
> > > > > > using the same inode number.
> > > > > >
> > > > > > What you need is something that scans through all the
> > > > > > metadata,
> > > > > > notices which entries point to a duplicate, and snips out
> > > > > > those
> > > > > > dentries.  I'm not quite up to date on the latest CephFS
> > > > > > forward scrub
> > > > > > bits, so hopefully someone else can chime in to comment on
> > > > > > whether we
> > > > > > have the tooling for this already.
> > > > > But to prevent these crashes, setting take_inos to a higher
> > > > > number is a good choice, right? You'll lose inode numbers, but
> > > > > you will have it running without duplicates (for new inodes).
> > > > Yes -- that's the motivation for skipping inode numbers after
> > > > some damage (but it won't help if some duplication has already
> > > > happened).
> > > >
> > > > >
> > > > > Any idea how to figure out the highest inode currently in use
> > > > > in the FS?
> > > > If the metadata is damaged, you'd have to do a full scan of
> > > > objects in
> > > > the data pool.  Perhaps that could be added as a mode to
> > > > cephfs-data-scan.
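> > > >
> > > > As a rough sketch of such a scan (untested; assumes the data pool
> > > > is named "cephfs_data" and relies on data objects being named
> > > > <inode-hex>.<block-hex>):
> > > >
> > > >     rados -p cephfs_data ls | cut -d. -f1 | sort -u \
> > > >       | while read ino; do printf '%d\n' "0x$ino"; done \
> > > >       | sort -n | tail -1
> > > >
> > > > The result (plus a safety margin) would then be a candidate for
> > > > take_inos.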
> > > >
> > > Understood, but it seems there are two things going on here:
> > >
> > > - Files with wrong content
> > > - MDS crashing on duplicate inodes
> > >
> > > The latter is fixed with take_inos as we then bump the inode number
> > > to
> > > something very high.
> > >
> > > Wido
> > >
> > > >
> > > > BTW, in the long run I'd still really like to integrate all this
> > > > tooling into an overall FSCK.  Most of the individual commands
> > > > were added in the Jewel era with the intent that they would be
> > > > available for level 3 support, but eventually we should build a
> > > > tool that is safer for end users.  I'm interested in using
> > > > Kubernetes to orchestrate groups of worker processes to do
> > > > massively parallel cephfs-data-scan operations, so that this
> > > > isn't so prohibitively long-running.
> > > >
> > > > John
> > > >
> > > > >
> > > > >
> > > > > Wido
> > > > >
> > > > > >
> > > > > >
> > > > > > John
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > I'm worried that my FS is corrupt because files are not
> > > > > > > linked correctly and have different content than they
> > > > > > > should.
> > > > > > >
> > > > > > > Please help.
> > > > > > >
> > > > > > > On Thu, 2018-07-05 at 10:35 +0200, Dennis Kramer (DT)
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I'm getting a bunch of "loaded dup inode" errors in the
> > > > > > > > MDS logs.
> > > > > > > > How can this be fixed?
> > > > > > > >
> > > > > > > > logs:
> > > > > > > > 2018-07-05 10:20:05.591948 mds.mds05 [ERR] loaded dup inode 0x10000991921 [2,head] v160 at <file path>, but inode 0x10000991921.head v146 already exists at <another file path>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


