Re: CephFS - How to handle "loaded dup inode" errors

On Wed, Jul 11, 2018 at 2:23 AM Linh Vu <vul@xxxxxxxxxxxxxx> wrote:
>
> Hi John,
>
>
> Thanks for the explanation; that command has a much bigger impact than I thought! I hope the rename of the "reset" verb makes it into the next version, because the current name is very easy to misunderstand.
>
> "The first question is why we're talking about running it at all.  What
> chain of reasoning led you to believe that your inotable needed
> erasing?"
>
> I thought the reset inode command was just like the reset session command: since you can pass an MDS rank to it as a parameter, I assumed it would only reset whatever that MDS was holding.
>
> "The most typical case is where the journal has been recovered/erased,
> and take_inos is used to skip forward to avoid re-using any inode
> numbers that had been claimed by journal entries that we threw away."
>
> We had the situation where our MDS was crashing at MDCache::add_inode(CInode*), as discussed earlier. take_inos should fix this, as you mentioned, but we thought we would also need to reset what the MDS was holding, just as with the session table.
>
> So with your clarification, I believe we only need to do these:
>
> journal backup
> recover dentries
> reset mds journal (it wasn't replaying anyway; the MDS kept crashing)
> reset session
> take_inos
> start mds up again
>
> Is that correct?

Probably... I have to be a bit hesitant because we don't know what
originally went wrong with your cluster.  You'd also need to add an
"fs reset" before starting up again if you had multiple active MDS
ranks to begin with.
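
For the record, the whole sequence would look something like this
(untested sketch; it assumes a single filesystem named "cephfs" with
rank 0, and the take_inos value is a placeholder, not one computed from
your cluster):

```shell
# 1. Back up the journal before touching anything
cephfs-journal-tool --rank=cephfs:0 journal export /root/mds0-journal-backup.bin

# 2. Recover whatever dentries are still salvageable from the journal
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary

# 3. Erase the unreplayable journal
cephfs-journal-tool --rank=cephfs:0 journal reset

# 4. Wipe the session table
cephfs-table-tool all reset session

# 5. Skip past inode numbers that may have been claimed by discarded
#    journal entries; pick a value safely above your highest in-use inode
cephfs-table-tool all take_inos 100000

# 6. Only if you had multiple active ranks: collapse back to a single rank
ceph fs reset cephfs --yes-i-really-mean-it

# 7. Start the MDS daemon again, however you normally manage it
systemctl start ceph-mds@<id>
```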

John

>
> Many thanks, I've learned a lot more about this process.
>
> Cheers,
> Linh
>
> ________________________________
> From: John Spray <jspray@xxxxxxxxxx>
> Sent: Tuesday, 10 July 2018 7:24 PM
> To: Linh Vu
> Cc: Wido den Hollander; ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  CephFS - How to handle "loaded dup inode" errors
>
> On Tue, Jul 10, 2018 at 2:49 AM Linh Vu <vul@xxxxxxxxxxxxxx> wrote:
> >
> > While we're on this topic, could someone please explain to me what `cephfs-table-tool all reset inode` does?
>
> The inode table stores an interval set of free inode numbers.  Active
> MDS daemons consume inode numbers as they create files.  Resetting the
> inode table means rewriting it to its original state (i.e. everything
> free).  Using the "take_inos" command consumes some range of inodes,
> to reflect that the inodes up to a certain point aren't really free,
> but in use by some files that already exist.
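>
> You can also inspect the tables non-destructively first; assuming your
> build's cephfs-table-tool supports the "show" verb, that would be:
>
> ```shell
> # Dump the inode table for all ranks as JSON, without modifying it
> cephfs-table-tool all show inode
>
> # Same for the session table
> cephfs-table-tool all show session
> ```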
>
> > Does it only reset what the MDS has in its cache, and after starting up again, the MDS will read in new inode range from the metadata pool?
>
> I'm repeating myself a bit, but for the benefit of anyone reading this
> thread in the future: no, it's nothing like that.  It effectively
> *erases the inode table* by overwriting it ("resetting") with a blank
> one.
>
>
> As with the journal tool (https://github.com/ceph/ceph/pull/22853),
> perhaps the verb "reset" is too prone to misunderstanding.
>
> > If so, does it mean *before* we run `cephfs-table-tool take_inos`, we must run `cephfs-table-tool all reset inode`?
>
> The first question is why we're talking about running it at all.  What
> chain of reasoning led you to believe that your inotable needed
> erasing?
>
> The most typical case is where the journal has been recovered/erased,
> and take_inos is used to skip forward to avoid re-using any inode
> numbers that had been claimed by journal entries that we threw away.
>
> John
>
> >
> > Cheers,
> >
> > Linh
> >
> > ________________________________
> > From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Wido den Hollander <wido@xxxxxxxx>
> > Sent: Saturday, 7 July 2018 12:26:15 AM
> > To: John Spray
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re:  CephFS - How to handle "loaded dup inode" errors
> >
> >
> >
> > On 07/06/2018 01:47 PM, John Spray wrote:
> > > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander <wido@xxxxxxxx> wrote:
> > >>
> > >>
> > >>
> > >> On 07/05/2018 03:36 PM, John Spray wrote:
> > >>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS) <dennis@xxxxxxxxx> wrote:
> > >>>>
> > >>>> Hi list,
> > >>>>
> > >>>> I have a serious problem now... I think.
> > >>>>
> > >>>> One of my users just informed me that a file he created (a .doc file) has
> > >>>> different content than before. It looks like the file's inode is
> > >>>> completely wrong and points to the wrong object. I myself have found
> > >>>> another file with the same symptoms. I'm afraid my (production) FS is
> > >>>> corrupt now, unless there is a possibility to fix the inodes.
> > >>>
> > >>> You can probably get back to a state with some valid metadata, but it
> > >>> might not necessarily be the metadata the user was expecting (e.g. if
> > >>> two files are claiming the same inode number, one of them is
> > >>> probably going to get deleted).
> > >>>
> > >>>> Timeline of what happened:
> > >>>>
> > >>>> Last week I upgraded our Ceph Jewel to Luminous.
> > >>>> This went without any problem.
> > >>>>
> > >>>> I already had 5 MDS available and went with the Multi-MDS feature and
> > >>>> enabled it. This seemed to work okay, but after a while my MDSes went
> > >>>> berserk and started flapping (crashed -> replay -> rejoin -> crashed).
> > >>>>
> > >>>> The only way to fix this and get the FS back online was the disaster
> > >>>> recovery procedure:
> > >>>>
> > >>>> cephfs-journal-tool event recover_dentries summary
> > >>>> ceph fs set cephfs cluster_down true
> > >>>> cephfs-table-tool all reset session
> > >>>> cephfs-table-tool all reset inode
> > >>>> cephfs-journal-tool --rank=cephfs:0 journal reset
> > >>>> ceph mds fail 0
> > >>>> ceph fs reset cephfs --yes-i-really-mean-it
> > >>>
> > >>> My concern with this procedure is that the recover_dentries and
> > >>> journal reset only happened on rank 0, whereas the other 4 MDS ranks
> > >>> would have retained lots of content in their journals.  I wonder if we
> > >>> should be adding some more multi-mds aware checks to these tools, to
> > >>> warn the user when they're only acting on particular ranks (a
> > >>> reasonable person might assume that recover_dentries with no args is
> > >>> operating on all ranks, not just 0).  Created
> > >>> https://protect-au.mimecast.com/s/Xu5bCXLKNwF95mlGi6D8C5?domain=tracker.ceph.com to track improving the default
> > >>> behaviour.
> > >>>
> > >>>> Restarted the MDS and I was back online. Shortly after I was getting a
> > >>>> lot of "loaded dup inode". In the meanwhile the MDS kept crashing. It
> > >>>> looks like it had trouble creating new inodes. Right before the crash
> > >>>> it mostly complained something like:
> > >>>>
> > >>>>     -2> 2018-07-05 05:05:01.614290 7f8f8574b700  4 mds.0.server
> > >>>> handle_client_request client_request(client.324932014:1434 create
> > >>>> #0x10000360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0,
> > >>>> caller_gid=0{}) v2
> > >>>>     -1> 2018-07-05 05:05:01.614320 7f8f7e73d700  5 mds.0.log
> > >>>> _submit_thread 24100753876035~1070 : EOpen [metablob 0x10000360346, 1
> > >>>> dirs], 1 open files
> > >>>>      0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1 /build/ceph-
> > >>>> 12.2.5/src/mds/MDCache.cc: In function 'void
> > >>>> MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05
> > >>>> 05:05:01.615123
> > >>>> /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)
> > >>>>
> > >>>> I also tried to counter the create inode crash by doing the following:
> > >>>>
> > >>>> cephfs-journal-tool event recover_dentries
> > >>>> cephfs-journal-tool journal reset
> > >>>> cephfs-table-tool all reset session
> > >>>> cephfs-table-tool all reset inode
> > >>>> cephfs-table-tool all take_inos 100000
> > >>>
> > >>> This procedure is recovering some metadata from the journal into the
> > >>> main tree, then resetting everything, but duplicate inodes are
> > >>> happening when the main tree has multiple dentries containing inodes
> > >>> using the same inode number.
> > >>>
> > >>> What you need is something that scans through all the metadata,
> > >>> notices which entries point to a duplicate, and snips out those
> > >>> dentries.  I'm not quite up to date on the latest CephFS forward scrub
> > >>> bits, so hopefully someone else can chime in to comment on whether we
> > >>> have the tooling for this already.
> > >>
> > >> But to prevent these crashes, setting take_inos to a higher number is a
> > >> good choice, right? You'll lose some inode numbers, but the MDS will
> > >> run without creating duplicate (new) inodes.
> > >
> > > Yes -- that's the motivation for skipping inode numbers after some
> > > damage (though it won't help if duplication has already happened).
> > >
> > >> Any idea to figure out the highest inode at the moment in the FS?
> > >
> > > If the metadata is damaged, you'd have to do a full scan of objects in
> > > the data pool.  Perhaps that could be added as a mode to
> > > cephfs-data-scan.
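> > >
> > > In the meantime, something along these lines could give a rough
> > > estimate, since data objects are named <inode-hex>.<block-hex>
> > > (untested sketch; assumes the data pool is called "cephfs_data",
> > > and it will be slow on a large pool):
> > >
> > > ```shell
> > > # Reduce each object name to its hex inode prefix, convert to
> > > # decimal, and print the highest value seen
> > > rados -p cephfs_data ls \
> > >   | cut -d. -f1 | sort -u \
> > >   | while read -r ino; do printf '%d\n' "0x$ino"; done \
> > >   | sort -n | tail -1
> > > ```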
> > >
> >
> > Understood, but it seems there are two things going on here:
> >
> > - Files with wrong content
> > - MDS crashing on duplicate inodes
> >
> > The latter is fixed with take_inos as we then bump the inode number to
> > something very high.
> >
> > Wido
> >
> > > BTW, in the long run I'd still really like to integrate all this
> > > tooling into an overall FSCK.  Most of the individual commands were
> > > added in Jewel era with the intent that they would be available for
> > > level 3 support, but eventually we should build a tool that is safer
> > > for end users.  I'm interested in using Kubernetes to orchestrate
> > > groups of worker processes to do massively parallel cephfs-data-scan
> > > operations, so that this isn't so prohibitively long running.
> > >
> > > John
> > >
> > >>
> > >> Wido
> > >>
> > >>>
> > >>> John
> > >>>
> > >>>>
> > >>>> I'm worried that my FS is corrupt, because files are not linked
> > >>>> correctly and have different content than they should.
> > >>>>
> > >>>> Please help.
> > >>>>
> > >>>> On Thu, 2018-07-05 at 10:35 +0200, Dennis Kramer (DT) wrote:
> > >>>>> Hi,
> > >>>>>
> > >>>>> I'm getting a bunch of "loaded dup inode" errors in the MDS logs.
> > >>>>> How can this be fixed?
> > >>>>>
> > >>>>> logs:
> > >>>>> 2018-07-05 10:20:05.591948 mds.mds05 [ERR] loaded dup inode 0x10000991921
> > >>>>> [2,head] v160 at <file path>, but inode 0x10000991921.head v146 already
> > >>>>> exists at <another file path>
> > >>>>>
> > >>>>>
> > >>>>>
> >
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


