Re: CephFS - How to handle "loaded dup inode" errors

We're affected by something like this right now (a dup inode causing the MDS to crash via assert(!p) in the add_inode(CInode*) function).

In terms of behaviour, shouldn't the MDS simply skip to the next available free inode when it hits a dup, rather than crashing the entire FS because of one file? I'm probably missing something, but that seems like a no-brainer choice between the two.
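For reference, the last few log lines before the assert show which create operation triggered it; something like

grep -B 5 "FAILED assert(!p)" /var/log/ceph/ceph-mds.*.log

(assuming the default MDS log location; adjust the path for your setup) pulls out the preceding handle_client_request entry, which names the directory and file the MDS was trying to create when it hit the duplicate.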

From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Wido den Hollander <wido@xxxxxxxx>
Sent: Saturday, 7 July 2018 12:26:15 AM
To: John Spray
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: CephFS - How to handle "loaded dup inode" errors
 


On 07/06/2018 01:47 PM, John Spray wrote:
> On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>>
>>
>>
>> On 07/05/2018 03:36 PM, John Spray wrote:
>>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS) <dennis@xxxxxxxxx> wrote:
>>>>
>>>> Hi list,
>>>>
>>>> I have a serious problem now... I think.
>>>>
>>>> One of my users just informed me that a file he created (a .doc file) has
>>>> different content than before. It looks like the file's inode is
>>>> completely wrong and points to the wrong object. I have found another
>>>> file myself with the same symptoms. I'm afraid my (production) FS is now
>>>> corrupt, unless there is a way to fix the inodes.
>>>
>>> You can probably get back to a state with some valid metadata, but it
>>> might not necessarily be the metadata the user was expecting (e.g. if
>>> two files are claiming the same inode number, one of them is
>>> probably going to get deleted).
>>>
>>>> Timeline of what happened:
>>>>
>>>> Last week I upgraded our Ceph Jewel to Luminous.
>>>> This went without any problem.
>>>>
>>>> I already had 5 MDSes available, so I went ahead and enabled the multi-MDS
>>>> feature. It seemed to work okay, but after a while my MDSes went
>>>> berserk and started flapping (crashed -> replay -> rejoin -> crashed).
>>>>
>>>> The only way to fix this and get the FS back online was the disaster
>>>> recovery procedure:
>>>>
>>>> cephfs-journal-tool event recover_dentries summary
>>>> ceph fs set cephfs cluster_down true
>>>> cephfs-table-tool all reset session
>>>> cephfs-table-tool all reset inode
>>>> cephfs-journal-tool --rank=cephfs:0 journal reset
>>>> ceph mds fail 0
>>>> ceph fs reset cephfs --yes-i-really-mean-it
>>>
>>> My concern with this procedure is that the recover_dentries and
>>> journal reset only happened on rank 0, whereas the other 4 MDS ranks
>>> would have retained lots of content in their journals.  I wonder if we
>>> should be adding some more multi-mds aware checks to these tools, to
>>> warn the user when they're only acting on particular ranks (a
>>> reasonable person might assume that recover_dentries with no args is
>>> operating on all ranks, not just 0).  Created
>>> https://protect-au.mimecast.com/s/PZyQCJypvAfnP9VmfVwGUS?domain=tracker.ceph.com to track improving the default
>>> behaviour.
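>>>
>>> (For illustration only: on a multi-MDS filesystem each rank would need to be
>>> addressed explicitly, along the lines of
>>>
>>> cephfs-journal-tool --rank=cephfs:1 event recover_dentries summary
>>> cephfs-journal-tool --rank=cephfs:1 journal reset
>>>
>>> repeated for every active rank, with the filesystem name and rank number
>>> adjusted to your own setup.)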
>>>
>>>> Restarted the MDS and I was back online. Shortly after, I was getting a
>>>> lot of "loaded dup inode" errors. In the meantime the MDS kept crashing. It
>>>> looked like it had trouble creating new inodes. Right before the crash
>>>> it mostly complained with something like:
>>>>
>>>>     -2> 2018-07-05 05:05:01.614290 7f8f8574b700  4 mds.0.server
>>>> handle_client_request client_request(client.324932014:1434 create
>>>> #0x10000360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0,
>>>> caller_gid=0{}) v2
>>>>     -1> 2018-07-05 05:05:01.614320 7f8f7e73d700  5 mds.0.log
>>>> _submit_thread 24100753876035~1070 : EOpen [metablob 0x10000360346, 1
>>>> dirs], 1 open files
>>>>      0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1 /build/ceph-
>>>> 12.2.5/src/mds/MDCache.cc: In function 'void
>>>> MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05
>>>> 05:05:01.615123
>>>> /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)
>>>>
>>>> I also tried to counter the create inode crash by doing the following:
>>>>
>>>> cephfs-journal-tool event recover_dentries
>>>> cephfs-journal-tool journal reset
>>>> cephfs-table-tool all reset session
>>>> cephfs-table-tool all reset inode
>>>> cephfs-table-tool all take_inos 100000
>>>
>>> This procedure is recovering some metadata from the journal into the
>>> main tree, then resetting everything, but duplicate inodes are
>>> happening when the main tree has multiple dentries containing inodes
>>> using the same inode number.
>>>
>>> What you need is something that scans through all the metadata,
>>> notices which entries point to a duplicate, and snips out those
>>> dentries.  I'm not quite up to date on the latest CephFS forward scrub
>>> bits, so hopefully someone else can chime in to comment on whether we
>>> have the tooling for this already.
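>>>
>>> (One candidate to check, depending on the release: cephfs-data-scan has a
>>> linkage pass, roughly
>>>
>>> cephfs-data-scan scan_links
>>>
>>> which is intended to repair dentries that disagree about which inode they
>>> own. Whether your 12.2.x build has it, and whether it covers this exact
>>> case, is worth confirming against the disaster-recovery docs before
>>> running it.)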
>>
>> But to prevent these crashes, setting take_inos to a higher number is a
>> good choice, right? You'll lose some inode numbers, but you will have it
>> running without duplicates (among new inodes).
>
> Yes -- that's the motivation for skipping inode numbers after some
> damage (but it won't help if some duplication has already happened).
>
>> Any idea how to figure out the highest inode currently in use in the FS?
>
> If the metadata is damaged, you'd have to do a full scan of objects in
> the data pool.  Perhaps that could be added as a mode to
> cephfs-data-scan.
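>
> As a rough sketch (assuming the data pool is named "cephfs_data" and GNU awk
> is available; substitute your own pool name): data objects are named
> <inode-in-hex>.<block-in-hex>, so the highest inode that still has data
> objects can be approximated with
>
> rados -p cephfs_data ls | cut -d. -f1 | awk '{n = strtonum("0x" $1); if (n > max) max = n} END {printf "0x%x\n", max}'
>
> Keep in mind this misses empty files, and listing a large pool can take a
> very long time.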
>

Understood, but it seems there are two things going on here:

- Files with wrong content
- MDS crashing on duplicate inodes

The latter is fixed with take_inos, since that bumps the next inode number to
something very high.
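
For example (100000 is just the value Dennis used; in practice you would pick
something comfortably above the highest inode you can find):

cephfs-table-tool all take_inos 100000

This only makes future allocations start above that number; it does nothing
for duplicate entries that already exist in the metadata.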

Wido

> BTW, in the long run I'd still really like to integrate all this
> tooling into an overall FSCK.  Most of the individual commands were
> added in Jewel era with the intent that they would be available for
> level 3 support, but eventually we should build a tool that is safer
> for end users.  I'm interested in using Kubernetes to orchestrate
> groups of worker processes to do massively parallel cephfs-data-scan
> operations, so that this isn't so prohibitively long-running.
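>
> (The existing tool can already be fanned out by hand -- a sketch, assuming a
> data pool named "cephfs_data" and four workers:
>
> cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 cephfs_data
> cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 cephfs_data
>
> and so on for workers 2 and 3, each in its own process. The Kubernetes idea
> is essentially about launching and supervising those workers automatically.)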
>
> John
>
>>
>> Wido
>>
>>>
>>> John
>>>
>>>>
>>>> I'm worried that my FS is corrupt because files are not linked
>>>> correctly and have different content than they should.
>>>>
>>>> Please help.
>>>>
>>>> On Thu, 2018-07-05 at 10:35 +0200, Dennis Kramer (DT) wrote:
>>>>> Hi,
>>>>>
>>>>> I'm getting a bunch of "loaded dup inode" errors in the MDS logs.
>>>>> How can this be fixed?
>>>>>
>>>>> logs:
>>>>> 2018-07-05 10:20:05.591948 mds.mds05 [ERR] loaded dup inode 0x10000991921
>>>>> [2,head] v160 at <file path>, but inode 0x10000991921.head v146 already
>>>>> exists at <another file path>
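>>>>>
>>>>> (To get an overview of how many distinct inodes are affected, something like
>>>>>
>>>>> grep "loaded dup inode" ceph-mds.*.log | awk '{print $8}' | sort -u
>>>>>
>>>>> can help; the field number is a guess based on the message format above,
>>>>> so adjust it to match your log prefix.)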
>>>>>
>>>>>
>>>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

