Re: CephFS - How to handle "loaded dup inode" errors

On 07/11/2018 01:47 AM, Linh Vu wrote:
> Thanks John :) Has it - asserting out on a dup inode - been logged as a
> bug yet? I could file one if needed.
> 

Did you just comment out the assert? And indeed, my next question would
be: do we have an issue tracker entry for this?

Wido

> 
> Cheers,
> 
> Linh
> 
> 
> 
> ------------------------------------------------------------------------
> *From:* John Spray <jspray@xxxxxxxxxx>
> *Sent:* Tuesday, 10 July 2018 7:11 PM
> *To:* Linh Vu
> *Cc:* Wido den Hollander; ceph-users@xxxxxxxxxxxxxx
> *Subject:* Re:  CephFS - How to handle "loaded dup inode"
> errors
>  
> On Tue, Jul 10, 2018 at 12:43 AM Linh Vu <vul@xxxxxxxxxxxxxx> wrote:
>>
>> We're affected by something like this right now (a dup inode causing the MDS to crash via assert(!p) in the add_inode(CInode*) function).
>>
>> In terms of behaviour, shouldn't the MDS simply skip to the next available free inode in the event of a dup, rather than crashing the entire FS because of one file? I'm probably missing something, but that seems like a no-brainer choice between the two.
> 
> Historically (a few years ago) the MDS asserted out on any invalid
> metadata.  Most of these cases have been picked up and converted into
> explicit damage handling, but this one appears to have been missed --
> so yes, it's a bug that the MDS asserts out.
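> 
> For anyone who hits the damage-handling path rather than the assert, the
> entries the MDS records can be inspected and cleared with something like
> the commands below (a rough sketch; substitute your own MDS name, and the
> exact syntax can differ between releases):
> 
> # list metadata the MDS has flagged as damaged
> ceph tell mds.<name> damage ls
> # remove a specific damage table entry by ID once it has been repaired
> ceph tell mds.<name> damage rm <damage_id>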
> 
> John
> 
>> ________________________________
>> From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Wido den Hollander <wido@xxxxxxxx>
>> Sent: Saturday, 7 July 2018 12:26:15 AM
>> To: John Spray
>> Cc: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re:  CephFS - How to handle "loaded dup inode" errors
>>
>>
>>
>> On 07/06/2018 01:47 PM, John Spray wrote:
>> > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>> >>
>> >>
>> >>
>> >> On 07/05/2018 03:36 PM, John Spray wrote:
>> >>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS) <dennis@xxxxxxxxx> wrote:
>> >>>>
>> >>>> Hi list,
>> >>>>
>> >>>> I have a serious problem now... I think.
>> >>>>
>> >>>> One of my users just informed me that a file he created (a .doc file) has
>> >>>> different content than before. It looks like the file's inode is
>> >>>> completely wrong and points to the wrong object. I myself have found
>> >>>> another file with the same symptoms. I'm afraid my (production) FS is
>> >>>> corrupt now, unless there is a possibility to fix the inodes.
>> >>>
>> >>> You can probably get back to a state with some valid metadata, but it
>> >>> might not necessarily be the metadata the user was expecting (e.g. if
>> >>> two files are claiming the same inode number, one of them is
>> >>> probably going to get deleted).
>> >>>
>> >>>> Timeline of what happened:
>> >>>>
>> >>>> Last week I upgraded our Ceph cluster from Jewel to Luminous.
>> >>>> This went without any problem.
>> >>>>
>> >>>> I already had 5 MDS daemons available, so I went with the multi-MDS
>> >>>> feature and enabled it. This seemed to work okay, but after a while my
>> >>>> MDS daemons went berserk and started flapping (crashed -> replay ->
>> >>>> rejoin -> crashed).
>> >>>>
>> >>>> The only way to fix this and get the FS back online was the disaster
>> >>>> recovery procedure:
>> >>>>
>> >>>> cephfs-journal-tool event recover_dentries summary
>> >>>> ceph fs set cephfs cluster_down true
>> >>>> cephfs-table-tool all reset session
>> >>>> cephfs-table-tool all reset inode
>> >>>> cephfs-journal-tool --rank=cephfs:0 journal reset
>> >>>> ceph mds fail 0
>> >>>> ceph fs reset cephfs --yes-i-really-mean-it
>> >>>
>> >>> My concern with this procedure is that the recover_dentries and
>> >>> journal reset only happened on rank 0, whereas the other 4 MDS ranks
>> >>> would have retained lots of content in their journals.  I wonder if we
>> >>> should be adding some more multi-mds aware checks to these tools, to
>> >>> warn the user when they're only acting on particular ranks (a
>> >>> reasonable person might assume that recover_dentries with no args is
>> >>> operating on all ranks, not just 0).  Created
>> >>> http://tracker.ceph.com/issues/24780 to track improving the default
>> >>> behaviour.
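>> >>>
>> >>> For reference, the per-rank invocation looks roughly like this (a
>> >>> sketch, assuming an FS named "cephfs" with five ranks, 0 through 4):
>> >>>
>> >>> # recover dentries from each rank's journal, then reset that journal
>> >>> for rank in 0 1 2 3 4; do
>> >>>   cephfs-journal-tool --rank=cephfs:$rank event recover_dentries summary
>> >>>   cephfs-journal-tool --rank=cephfs:$rank journal reset
>> >>> done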
>> >>>
>> >>>> Restarted the MDS and I was back online. Shortly after, I was getting a
>> >>>> lot of "loaded dup inode" errors. In the meantime the MDS kept crashing. It
>> >>>> looks like it had trouble creating new inodes. Right before the crash
>> >>>> it mostly complained something like:
>> >>>>
>> >>>>     -2> 2018-07-05 05:05:01.614290 7f8f8574b700  4 mds.0.server
>> >>>> handle_client_request client_request(client.324932014:1434 create
>> >>>> #0x10000360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0,
>> >>>> caller_gid=0{}) v2
>> >>>>     -1> 2018-07-05 05:05:01.614320 7f8f7e73d700  5 mds.0.log
>> >>>> _submit_thread 24100753876035~1070 : EOpen [metablob 0x10000360346, 1
>> >>>> dirs], 1 open files
>> >>>>      0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1 /build/ceph-
>> >>>> 12.2.5/src/mds/MDCache.cc: In function 'void
>> >>>> MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05
>> >>>> 05:05:01.615123
>> >>>> /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)
>> >>>>
>> >>>> I also tried to counter the create inode crash by doing the following:
>> >>>>
>> >>>> cephfs-journal-tool event recover_dentries
>> >>>> cephfs-journal-tool journal reset
>> >>>> cephfs-table-tool all reset session
>> >>>> cephfs-table-tool all reset inode
>> >>>> cephfs-table-tool all take_inos 100000
>> >>>
>> >>> This procedure is recovering some metadata from the journal into the
>> >>> main tree, then resetting everything, but duplicate inodes are
>> >>> happening when the main tree has multiple dentries containing inodes
>> >>> using the same inode number.
>> >>>
>> >>> What you need is something that scans through all the metadata,
>> >>> notices which entries point to a duplicate, and snips out those
>> >>> dentries.  I'm not quite up to date on the latest CephFS forward scrub
>> >>> bits, so hopefully someone else can chime in to comment on whether we
>> >>> have the tooling for this already.
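>> >>>
>> >>> (If your release already ships the scan_links pass of cephfs-data-scan,
>> >>> that is the kind of tool intended for this -- roughly as below, with all
>> >>> MDS daemons stopped first; treat this as a sketch and check the docs for
>> >>> your exact version:)
>> >>>
>> >>> # walk the metadata pool and repair duplicate/dangling dentry links
>> >>> cephfs-data-scan scan_links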
>> >>
>> >> But to prevent these crashes, setting take_inos to a higher number is a
>> >> good choice, right? You'll lose some inode numbers, but you will have it
>> >> running without duplicates (for new inodes).
>> >
>> > Yes -- that's the motivation for skipping inode numbers after some
>> > damage (but it won't help if some duplication has already happened).
>> >
>> >> Any idea to figure out the highest inode at the moment in the FS?
>> >
>> > If the metadata is damaged, you'd have to do a full scan of objects in
>> > the data pool.  Perhaps that could be added as a mode to
>> > cephfs-data-scan.
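>> >
>> > (In the meantime, something like the line below can approximate the
>> > highest inode from the data pool object names, which are of the form
>> > <inode-hex>.<block-hex> -- a sketch assuming GNU awk and a data pool
>> > called "cephfs_data"; files that never had objects written won't show up:)
>> >
>> > # find the numerically largest inode referenced by data pool objects
>> > rados -p cephfs_data ls | awk -F. '{n=strtonum("0x"$1); if (n>max) max=n} END {printf "0x%x\n", max}'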
>> >
>>
>> Understood, but it seems there are two things going on here:
>>
>> - Files with wrong content
>> - MDS crashing on duplicate inodes
>>
>> The latter is fixed with take_inos as we then bump the inode number to
>> something very high.
>>
>> Wido
>>
>> > BTW, in the long run I'd still really like to integrate all this
>> > tooling into an overall FSCK.  Most of the individual commands were
>> > added in the Jewel era with the intent that they would be available for
>> > level 3 support, but eventually we should build a tool that is safer
>> > for end users.  I'm interested in using Kubernetes to orchestrate
>> > groups of worker processes to do massively parallel cephfs-data-scan
>> > operations, so that this isn't so prohibitively long running.
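>> >
>> > (Until then, the scan phases can already be run in parallel by starting
>> > several workers by hand -- roughly like the sketch below with four
>> > workers and a data pool called "cephfs_data"; check the worker_n/worker_m
>> > options against your version:)
>> >
>> > # run four scan_extents workers in parallel; repeat the same pattern
>> > # for scan_inodes before bringing the MDS back up
>> > for n in 0 1 2 3; do
>> >   cephfs-data-scan scan_extents --worker_n $n --worker_m 4 cephfs_data &
>> > done
>> > wait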
>> >
>> > John
>> >
>> >>
>> >> Wido
>> >>
>> >>>
>> >>> John
>> >>>
>> >>>>
>> >>>> I'm worried that my FS is corrupt because files are not linked
>> >>>> correctly and have different content than they should.
>> >>>>
>> >>>> Please help.
>> >>>>
>> >>>> On Thu, 2018-07-05 at 10:35 +0200, Dennis Kramer (DT) wrote:
>> >>>>> Hi,
>> >>>>>
>> >>>>> I'm getting a bunch of "loaded dup inode" errors in the MDS logs.
>> >>>>> How can this be fixed?
>> >>>>>
>> >>>>> logs:
>> >>>>> 2018-07-05 10:20:05.591948 mds.mds05 [ERR] loaded dup inode 0x10000991921
>> >>>>> [2,head] v160 at <file path>, but inode 0x10000991921.head v146 already
>> >>>>> exists at <another file path>
>> >>>>>
>> >>>>>
>> >>>>>
>>
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



