Re: CephFS - How to handle "loaded dup inode" errors

Hi John,

On Tue, 2018-07-10 at 10:11 +0100, John Spray wrote:
> On Tue, Jul 10, 2018 at 12:43 AM Linh Vu <vul@xxxxxxxxxxxxxx> wrote:
> > 
> > 
> > We're affected by something like this right now (the dup inode
> > causing the MDS to crash via assert(!p) in the add_inode(CInode*)
> > function).
> > 
> > In terms of behaviour, shouldn't the MDS simply skip to the next
> > available free inode in the event of a dup, rather than crashing
> > the entire FS because of one file? Perhaps I'm missing something,
> > but that seems like a no-brainer choice between the two.
> Historically (a few years ago) the MDS asserted out on any invalid
> metadata.  Most of these cases have been picked up and converted into
> explicit damage handling, but this one appears to have been missed --
> so yes, it's a bug that the MDS asserts out.

I have followed the disaster recovery procedure, and now all the files
and directories in CephFS that complained about duplicate inodes have
disappeared from my FS. I see *some* data in "lost+found", but that's
only part of it. Is there any way to retrieve the missing files?

> John
> 
> > 
> > ________________________________
> > From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of
> > Wido den Hollander <wido@xxxxxxxx>
> > Sent: Saturday, 7 July 2018 12:26:15 AM
> > To: John Spray
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re:  CephFS - How to handle "loaded dup inode"
> > errors
> > 
> > 
> > 
> > On 07/06/2018 01:47 PM, John Spray wrote:
> > > 
> > > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander <wido@xxxxxxxx> wrote:
> > > > 
> > > > 
> > > > 
> > > > 
> > > > On 07/05/2018 03:36 PM, John Spray wrote:
> > > > > 
> > > > > On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS) <dennis@holmes.nl> wrote:
> > > > > > 
> > > > > > 
> > > > > > Hi list,
> > > > > > 
> > > > > > I have a serious problem now... I think.
> > > > > > 
> > > > > > One of my users just informed me that a file he created (a
> > > > > > .doc file) has different content than before. It looks like
> > > > > > the file's inode is completely wrong and points to the
> > > > > > wrong object. I have found another file with the same
> > > > > > symptoms myself. I'm afraid my (production) FS is now
> > > > > > corrupt, unless there is a way to fix the inodes.
> > > > > You can probably get back to a state with some valid
> > > > > metadata, but it might not necessarily be the metadata the
> > > > > user was expecting (e.g. if two files are claiming the same
> > > > > inode number, one of them is probably going to get deleted).
> > > > > 
> > > > > > 
> > > > > > Timeline of what happened:
> > > > > > 
> > > > > > Last week I upgraded our Ceph cluster from Jewel to
> > > > > > Luminous. This went without any problems.
> > > > > > 
> > > > > > I already had 5 MDS daemons available, so I enabled the
> > > > > > multi-MDS feature. It seemed to work okay, but after a
> > > > > > while my MDS daemons went berserk and started flapping
> > > > > > (crashed -> replay -> rejoin -> crashed).
> > > > > > 
> > > > > > The only way to fix this and get the FS back online was
> > > > > > the disaster recovery procedure:
> > > > > > 
> > > > > > cephfs-journal-tool event recover_dentries summary
> > > > > > ceph fs set cephfs cluster_down true
> > > > > > cephfs-table-tool all reset session
> > > > > > cephfs-table-tool all reset inode
> > > > > > cephfs-journal-tool --rank=cephfs:0 journal reset
> > > > > > ceph mds fail 0
> > > > > > ceph fs reset cephfs --yes-i-really-mean-it
> > > > > My concern with this procedure is that the recover_dentries
> > > > > and journal reset only happened on rank 0, whereas the other
> > > > > 4 MDS ranks would have retained lots of content in their
> > > > > journals.  I wonder if we should be adding some more
> > > > > multi-MDS-aware checks to these tools, to warn the user when
> > > > > they're only acting on particular ranks (a reasonable person
> > > > > might assume that recover_dentries with no args is operating
> > > > > on all ranks, not just 0).  Created
> > > > > https://protect-au.mimecast.com/s/PZyQCJypvAfnP9VmfVwGUS?domain=tracker.ceph.com
> > > > > to track improving the default behaviour.
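[Editorial note: a per-rank recovery like the one John describes could
be sketched as a dry run that prints one journal-tool invocation per
rank, since cephfs-journal-tool defaults to rank 0. The FS name
"cephfs" and the rank range 0-4 are assumptions matching the 5-MDS
setup described in this thread, not a verified procedure:]

```shell
# Dry-run sketch: print the journal-tool invocations for every active
# rank instead of only rank 0.  "cephfs" and ranks 0-4 are assumptions
# matching the setup described in this thread -- adjust for your
# cluster before considering running anything.
FS=cephfs
for rank in 0 1 2 3 4; do
    echo "cephfs-journal-tool --rank=${FS}:${rank} event recover_dentries summary"
    echo "cephfs-journal-tool --rank=${FS}:${rank} journal reset"
done
# Drop the 'echo' prefix only once the printed command list looks right.
```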
> > > > > 
> > > > > > 
> > > > > > Restarted the MDS and I was back online. Shortly after, I
> > > > > > was getting a lot of "loaded dup inode" errors. In the
> > > > > > meantime the MDS kept crashing. It looked like it had
> > > > > > trouble creating new inodes. Right before the crash it
> > > > > > mostly complained with something like:
> > > > > > 
> > > > > >     -2> 2018-07-05 05:05:01.614290 7f8f8574b700  4 mds.0.server handle_client_request client_request(client.324932014:1434 create #0x10000360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0, caller_gid=0{}) v2
> > > > > >     -1> 2018-07-05 05:05:01.614320 7f8f7e73d700  5 mds.0.log _submit_thread 24100753876035~1070 : EOpen [metablob 0x10000360346, 1 dirs], 1 open files
> > > > > >      0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1 /build/ceph-12.2.5/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05 05:05:01.615123
> > > > > > /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)
> > > > > > 
> > > > > > I also tried to counter the create-inode crash by doing
> > > > > > the following:
> > > > > > 
> > > > > > cephfs-journal-tool event recover_dentries
> > > > > > cephfs-journal-tool journal reset
> > > > > > cephfs-table-tool all reset session
> > > > > > cephfs-table-tool all reset inode
> > > > > > cephfs-table-tool all take_inos 100000
> > > > > This procedure is recovering some metadata from the journal
> > > > > into the main tree, then resetting everything, but duplicate
> > > > > inodes happen when the main tree has multiple dentries
> > > > > containing inodes that use the same inode number.
> > > > > 
> > > > > What you need is something that scans through all the
> > > > > metadata, notices which entries point to a duplicate, and
> > > > > snips out those dentries.  I'm not quite up to date on the
> > > > > latest CephFS forward scrub bits, so hopefully someone else
> > > > > can chime in to comment on whether we have the tooling for
> > > > > this already.
> > > > But to prevent these crashes, setting take_inos to a higher
> > > > number is a good choice, right? You'll lose some inode
> > > > numbers, but you'll have the FS running without duplicates
> > > > among new inodes.
> > > Yes -- that's the motivation for skipping inode numbers after
> > > some damage (but it won't help if some duplication has already
> > > happened).
> > > 
> > > > 
> > > > Any idea how to figure out the highest inode currently in the FS?
> > > If the metadata is damaged, you'd have to do a full scan of the
> > > objects in the data pool.  Perhaps that could be added as a mode
> > > to cephfs-data-scan.
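[Editorial note: the full scan John describes could be sketched by hand.
CephFS data objects are named "<inode-hex>.<block-hex>", so listing the
data pool and taking the largest inode prefix gives an estimate of the
highest inode in use. This is a hypothetical sketch: the pool name
"cephfs_data" is an assumption, and a full listing is slow on large
pools:]

```shell
# Sketch: estimate the highest inode number in use by scanning
# data-pool object names, which have the form "<inode-hex>.<block-hex>".
# Assumption: the data pool is called "cephfs_data".
max_inode() {
    max=0
    while IFS= read -r obj; do
        ino=$(( 0x${obj%%.*} ))           # hex inode prefix -> integer
        [ "$ino" -gt "$max" ] && max=$ino
    done
    printf '0x%x\n' "$max"
}

# Against a live cluster (hypothetical pool name):
#   rados -p cephfs_data ls | max_inode
```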
> > > 
> > Understood, but it seems there are two things going on here:
> > 
> > - Files with wrong content
> > - MDS crashing on duplicate inodes
> > 
> > The latter is fixed with take_inos, as we then bump the inode
> > number to something very high.
> > 
> > Wido
> > 
> > > 
> > > BTW, in the long run I'd still really like to integrate all this
> > > tooling into an overall FSCK.  Most of the individual commands
> > > were added in the Jewel era with the intent that they would be
> > > available for level-3 support, but eventually we should build a
> > > tool that is safer for end users.  I'm interested in using
> > > Kubernetes to orchestrate groups of worker processes to do
> > > massively parallel cephfs-data-scan operations, so that this
> > > isn't so prohibitively long-running.
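[Editorial note: a basic form of this parallelism exists in
cephfs-data-scan itself, which can split a scan across M workers with
the --worker_n/--worker_m options. A dry-run sketch that prints one
command per worker; the pool name "cephfs_data" and the worker count
are assumptions:]

```shell
# Dry-run sketch: print the commands for running cephfs-data-scan
# across several parallel workers (--worker_n/--worker_m).  The pool
# name "cephfs_data" and worker count are assumptions; in practice
# each command would run on a separate host or in a separate container.
POOL=cephfs_data
WORKERS=4
n=0
while [ "$n" -lt "$WORKERS" ]; do
    echo "cephfs-data-scan scan_extents --worker_n $n --worker_m $WORKERS $POOL"
    n=$((n + 1))
done
# scan_inodes would then be run the same way, after all scan_extents
# workers have finished.
```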
> > > 
> > > John
> > > 
> > > > 
> > > > 
> > > > Wido
> > > > 
> > > > > 
> > > > > 
> > > > > John
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > I'm worried that my FS is corrupt, because files are not
> > > > > > linked correctly and have different content than they
> > > > > > should.
> > > > > > 
> > > > > > Please help.
> > > > > > 
> > > > > > On Thu, 2018-07-05 at 10:35 +0200, Dennis Kramer (DT)
> > > > > > wrote:
> > > > > > > 
> > > > > > > Hi,
> > > > > > > 
> > > > > > > I'm getting a bunch of "loaded dup inode" errors in the
> > > > > > > MDS logs.
> > > > > > > How can this be fixed?
> > > > > > > 
> > > > > > > logs:
> > > > > > > 2018-07-05 10:20:05.591948 mds.mds05 [ERR] loaded dup inode 0x10000991921 [2,head] v160 at <file path>, but inode 0x10000991921.head v146 already exists at <another file path>
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > _______________________________________________
> > > > > > > ceph-users mailing list
> > > > > > > ceph-users@xxxxxxxxxxxxxx
> > > > > > > https://protect-au.mimecast.com/s/wcSJCK1qwBS1076vsvO7Cm?domain=lists.ceph.com
> > > > > 
> > 


