We're affected by something like this right now (a duplicate inode causing the MDS to crash via the FAILED assert(!p) in MDCache::add_inode(CInode*)).
In terms of behaviour, shouldn't the MDS simply skip to the next available free inode when it hits a dup, rather than crashing the entire FS because of one file? I'm probably missing something, but that seems like a no-brainer choice between the two.
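For what it's worth, the closest manual equivalent of "skip to the next free inode" seems to be the take_inos step discussed further down in the quoted thread: bump the inode allocator well past the duplicated range so that new creates stop colliding. A rough sketch of what we're planning to try, assuming a filesystem named cephfs and an arbitrary margin of 100000 (both are assumptions for our cluster, not an official procedure):

# take the FS offline first, as in the recovery procedure quoted below
ceph fs set cephfs cluster_down true
ceph mds fail 0

# skip the inode allocator ahead of the duplicated range; 100000 is an
# arbitrary margin, not a calculated value
cephfs-table-tool all take_inos 100000

# bring the FS back online
ceph fs set cephfs cluster_down false

The downside, as noted below, is that a range of inode numbers is simply thrown away, and it does nothing for duplicates that already exist in the metadata.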
From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Wido den Hollander <wido@xxxxxxxx>
Sent: Saturday, 7 July 2018 12:26:15 AM
To: John Spray
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: CephFS - How to handle "loaded dup inode" errors

On 07/06/2018 01:47 PM, John Spray wrote:
> On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>>
>> On 07/05/2018 03:36 PM, John Spray wrote:
>>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS) <dennis@xxxxxxxxx> wrote:
>>>>
>>>> Hi list,
>>>>
>>>> I have a serious problem now... I think.
>>>>
>>>> One of my users just informed me that a file he created (.doc file) has
>>>> different content than before. It looks like the file's inode is
>>>> completely wrong and points to the wrong object. I myself have found
>>>> another file with the same symptoms. I'm afraid my (production) FS is
>>>> corrupt now, unless there is a possibility to fix the inodes.
>>>
>>> You can probably get back to a state with some valid metadata, but it
>>> might not necessarily be the metadata the user was expecting (e.g. if
>>> two files are claiming the same inode number, one of them is probably
>>> going to get deleted).
>>>
>>>> Timeline of what happened:
>>>>
>>>> Last week I upgraded our Ceph Jewel to Luminous.
>>>> This went without any problem.
>>>>
>>>> I already had 5 MDS available and went with the multi-MDS feature and
>>>> enabled it. That seemed to work okay, but after a while my MDS went
>>>> berserk and went flapping (crashed -> replay -> rejoin -> crashed).
>>>>
>>>> The only way to fix this and get the FS back online was the disaster
>>>> recovery procedure:
>>>>
>>>> cephfs-journal-tool event recover_dentries summary
>>>> ceph fs set cephfs cluster_down true
>>>> cephfs-table-tool all reset session
>>>> cephfs-table-tool all reset inode
>>>> cephfs-journal-tool --rank=cephfs:0 journal reset
>>>> ceph mds fail 0
>>>> ceph fs reset cephfs --yes-i-really-mean-it
>>>
>>> My concern with this procedure is that the recover_dentries and
>>> journal reset only happened on rank 0, whereas the other 4 MDS ranks
>>> would have retained lots of content in their journals. I wonder if we
>>> should be adding some more multi-MDS-aware checks to these tools, to
>>> warn the user when they're only acting on particular ranks (a
>>> reasonable person might assume that recover_dentries with no args is
>>> operating on all ranks, not just 0). Created
>>> https://protect-au.mimecast.com/s/PZyQCJypvAfnP9VmfVwGUS?domain=tracker.ceph.com
>>> to track improving the default behaviour.
>>>
>>>> Restarted the MDS and I was back online. Shortly after I was getting a
>>>> lot of "loaded dup inode" errors. In the meanwhile the MDS kept
>>>> crashing. It looks like it had trouble creating new inodes.
>>>> Right before the crash it mostly complained something like:
>>>>
>>>> -2> 2018-07-05 05:05:01.614290 7f8f8574b700 4 mds.0.server
>>>> handle_client_request client_request(client.324932014:1434 create
>>>> #0x10000360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0,
>>>> caller_gid=0{}) v2
>>>> -1> 2018-07-05 05:05:01.614320 7f8f7e73d700 5 mds.0.log
>>>> _submit_thread 24100753876035~1070 : EOpen [metablob 0x10000360346, 1
>>>> dirs], 1 open files
>>>> 0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1
>>>> /build/ceph-12.2.5/src/mds/MDCache.cc: In function 'void
>>>> MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05 05:05:01.615123
>>>> /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)
>>>>
>>>> I also tried to counter the create-inode crash by doing the following:
>>>>
>>>> cephfs-journal-tool event recover_dentries
>>>> cephfs-journal-tool journal reset
>>>> cephfs-table-tool all reset session
>>>> cephfs-table-tool all reset inode
>>>> cephfs-table-tool all take_inos 100000
>>>
>>> This procedure is recovering some metadata from the journal into the
>>> main tree, then resetting everything, but duplicate inodes are
>>> happening when the main tree has multiple dentries containing inodes
>>> using the same inode number.
>>>
>>> What you need is something that scans through all the metadata,
>>> notices which entries point to a duplicate, and snips out those
>>> dentries. I'm not quite up to date on the latest CephFS forward scrub
>>> bits, so hopefully someone else can chime in to comment on whether we
>>> have the tooling for this already.
>>
>> But to prevent these crashes, setting take_inos to a higher number is a
>> good choice, right? You'll lose inode numbers, but you will have it
>> running without duplicate (new) inodes.
>
> Yes -- that's the motivation for skipping inode numbers after some
> damage (but it won't help if some duplication has already happened).
>
>> Any idea how to figure out the highest inode at the moment in the FS?
>
> If the metadata is damaged, you'd have to do a full scan of objects in
> the data pool. Perhaps that could be added as a mode to
> cephfs-data-scan.
>

Understood, but it seems there are two things going on here:

- Files with wrong content
- MDS crashing on duplicate inodes

The latter is fixed with take_inos, as we then bump the inode number to
something very high.

Wido

> BTW, in the long run I'd still really like to integrate all this
> tooling into an overall FSCK. Most of the individual commands were
> added in the Jewel era with the intent that they would be available for
> level 3 support, but eventually we should build a tool that is safer
> for end users. I'm interested in using Kubernetes to orchestrate
> groups of worker processes to do massively parallel cephfs-data-scan
> operations, so that this isn't so prohibitively long-running.
>
> John
>
>>
>> Wido
>>
>>>
>>> John
>>>
>>>>
>>>> I'm worried that my FS is corrupt because files are not linked
>>>> correctly and have different content than they should.
>>>>
>>>> Please help.
>>>>
>>>> On Thu, 2018-07-05 at 10:35 +0200, Dennis Kramer (DT) wrote:
>>>>> Hi,
>>>>>
>>>>> I'm getting a bunch of "loaded dup inode" errors in the MDS logs.
>>>>> How can this be fixed?
>>>>>
>>>>> logs:
>>>>> 2018-07-05 10:20:05.591948 mds.mds05 [ERR] loaded dup inode 0x10000991921
>>>>> [2,head] v160 at <file path>, but inode 0x10000991921.head v146 already
>>>>> exists at <another file path>
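On John's point above about recover_dentries and the journal reset only acting on rank 0: with multiple active MDS daemons, a per-rank variant of the quoted recovery procedure would presumably be needed. A sketch, assuming five ranks (0-4) and a filesystem named cephfs; this is not a verified recipe:

ceph fs set cephfs cluster_down true

# run the journal recovery and reset against every rank, not just rank 0
for rank in 0 1 2 3 4; do
    cephfs-journal-tool --rank=cephfs:$rank event recover_dentries summary
    cephfs-journal-tool --rank=cephfs:$rank journal reset
done

cephfs-table-tool all reset session
cephfs-table-tool all reset inode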
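And on Wido's question about finding the highest inode currently in use: since data objects are named <inode-in-hex>.<block>, a rough upper bound can be pulled from the data pool's object listing. A sketch, assuming the data pool is called cephfs_data; listing every object is slow on a large pool, and it only sees files that actually have data objects:

# keep the inode part of each object name, convert from hex, take the largest
rados -p cephfs_data ls | cut -d. -f1 | sort -u |
  while read ino; do printf '%d\n' "0x$ino"; done | sort -n | tail -1

The result, plus a generous margin, would then be the value handed to take_inos.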
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com