Re: CephFS - How to handle "loaded dup inode" errors

Linh Vu <vul@xxxxxxxxxxxxxx> · Tue, 10 Jul 2018 01:49:27 +0000

While we're on this topic, could someone please explain to me what `cephfs-table-tool all reset inode` does? 

Does it only reset what the MDS has in its cache, and after starting up again, will the MDS read in a new inode range from the metadata pool?

If so, does that mean that *before* we run `cephfs-table-tool take_inos`, we must run `cephfs-table-tool all reset inode`?

Cheers,
Linh

From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Wido den Hollander <wido@xxxxxxxx>

Sent: Saturday, 7 July 2018 12:26:15 AM

To: John Spray

Cc: ceph-users@xxxxxxxxxxxxxx

Subject: Re: CephFS - How to handle "loaded dup inode" errors

On 07/06/2018 01:47 PM, John Spray wrote:

> On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander <wido@xxxxxxxx> wrote:

>>

>>

>>

>> On 07/05/2018 03:36 PM, John Spray wrote:

>>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS) <dennis@xxxxxxxxx> wrote:

>>>>

>>>> Hi list,

>>>>

>>>> I have a serious problem now... I think.

>>>>

>>>> One of my users just informed me that a file he created (.doc file) has

>>>> different content than before. It looks like the file's inode is

>>>> completely wrong and points to the wrong object. I myself have found

>>>> another file with the same symptoms. I'm afraid my (production) FS is

>>>> corrupt now, unless there is a possibility to fix the inodes.

>>>

>>> You can probably get back to a state with some valid metadata, but it

>>> might not necessarily be the metadata the user was expecting (e.g. if

>>> two files are claiming the same inode number, one of them is

>>> probably going to get deleted).

>>>

>>>> Timeline of what happened:

>>>>

>>>> Last week I upgraded our Ceph Jewel to Luminous.

>>>> This went without any problem.

>>>>

>>>> I already had 5 MDS available and went with the Multi-MDS feature and

>>>> enabled it. This seemed to work okay, but after a while my MDS went

>>>> berserk and started flapping (crashed -> replay -> rejoin -> crashed)

>>>>

>>>> The only way to fix this and get the FS back online was the disaster

>>>> recovery procedure:

>>>>

>>>> cephfs-journal-tool event recover_dentries summary

>>>> ceph fs set cephfs cluster_down true

>>>> cephfs-table-tool all reset session

>>>> cephfs-table-tool all reset inode

>>>> cephfs-journal-tool --rank=cephfs:0 journal reset

>>>> ceph mds fail 0

>>>> ceph fs reset cephfs --yes-i-really-mean-it

>>>

>>> My concern with this procedure is that the recover_dentries and

>>> journal reset only happened on rank 0, whereas the other 4 MDS ranks

>>> would have retained lots of content in their journals.  I wonder if we

>>> should be adding some more multi-mds aware checks to these tools, to

>>> warn the user when they're only acting on particular ranks (a

>>> reasonable person might assume that recover_dentries with no args is

>>> operating on all ranks, not just 0).  Created

>>> https://protect-au.mimecast.com/s/PZyQCJypvAfnP9VmfVwGUS?domain=tracker.ceph.com to track improving the default

>>> behaviour.

>>>

>>>> Restarted the MDS and I was back online. Shortly after I was getting a

>>>> lot of "loaded dup inode". In the meanwhile the MDS kept crashing. It

>>>> looks like it had trouble creating new inodes. Right before the crash

>>>> it mostly complained something like:

>>>>

>>>>     -2> 2018-07-05 05:05:01.614290 7f8f8574b700  4 mds.0.server handle_client_request client_request(client.324932014:1434 create #0x10000360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0, caller_gid=0{}) v2

>>>>     -1> 2018-07-05 05:05:01.614320 7f8f7e73d700  5 mds.0.log _submit_thread 24100753876035~1070 : EOpen [metablob 0x10000360346, 1 dirs], 1 open files

>>>>      0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1 /build/ceph-12.2.5/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05 05:05:01.615123

>>>> /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)

>>>>

>>>> I also tried to counter the create inode crash by doing the following:

>>>>

>>>> cephfs-journal-tool event recover_dentries

>>>> cephfs-journal-tool journal reset

>>>> cephfs-table-tool all reset session

>>>> cephfs-table-tool all reset inode

>>>> cephfs-table-tool all take_inos 100000

>>>

>>> This procedure is recovering some metadata from the journal into the

>>> main tree, then resetting everything, but duplicate inodes are

>>> happening when the main tree has multiple dentries containing inodes

>>> using the same inode number.

>>>

>>> What you need is something that scans through all the metadata,

>>> notices which entries point to a duplicate, and snips out those

>>> dentries.  I'm not quite up to date on the latest CephFS forward scrub

>>> bits, so hopefully someone else can chime in to comment on whether we

>>> have the tooling for this already.
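
To make the idea of "notice which entries point to a duplicate" concrete, here is a minimal sketch. It is not an existing Ceph tool and does not read the metadata pool; it assumes the dentries have already been extracted somehow into (path, inode number, version) tuples, and the paths below are made up. It only reports the conflicts and does not decide which dentry should be snipped.

```python
from collections import defaultdict


def find_duplicate_inodes(dentries):
    """Group dentries by inode number and return only the conflicting groups.

    `dentries` is a hypothetical input: an iterable of (path, ino, version)
    tuples that some other step has already pulled out of the metadata.
    """
    by_ino = defaultdict(list)
    for path, ino, version in dentries:
        by_ino[ino].append((path, version))
    # An inode number is a problem only if more than one dentry claims it.
    return {ino: refs for ino, refs in by_ino.items() if len(refs) > 1}


# Illustrative data only; the inode number is the one from the log in this thread.
dentries = [
    ("/a/report.doc", 0x10000991921, 160),
    ("/b/notes.txt", 0x10000991921, 146),
    ("/c/unique.bin", 0x10000991922, 12),
]

dups = find_duplicate_inodes(dentries)
for ino, refs in dups.items():
    print(hex(ino), refs)
```

Choosing which of the conflicting dentries to keep is the hard part and is deliberately left out here.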

>>

>> But to prevent these crashes setting take_inos to a higher number is a

>> good choice, right? You'll lose inode numbers, but you will have it

>> running without duplicate (new) inodes.

> 

> Yes -- that's the motivation for skipping inode numbers after some

> damage (but it won't help if some duplication has already happened).

> 

>> Any idea how to figure out the highest inode currently in the FS?

> 

> If the metadata is damaged, you'd have to do a full scan of objects in

> the data pool.  Perhaps that could be added as a mode to

> cephfs-data-scan.
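
A rough sketch of what such a scan could compute, assuming the usual CephFS data object naming of "<inode in hex>.<object index as 8 hex digits>" and assuming the object names have already been listed into a text stream (for example from `rados -p <data pool> ls`). The function name and sample names are illustrative, not part of any Ceph tool.

```python
import re

# Assumed naming of CephFS data objects: "<ino hex>.<object index, 8 hex digits>"
OBJECT_NAME = re.compile(r"^([0-9a-f]+)\.([0-9a-f]{8})$")


def highest_inode(object_names):
    """Return the largest inode number seen in the object names, or None."""
    highest = None
    for name in object_names:
        match = OBJECT_NAME.match(name.strip())
        if not match:
            continue  # skip anything that does not look like a file data object
        ino = int(match.group(1), 16)
        if highest is None or ino > highest:
            highest = ino
    return highest


# Illustrative listing; a real one would come from the data pool.
names = [
    "10000360346.00000000",
    "10000360346.00000001",
    "10000991921.00000000",
    "not-a-file-object",
]

top = highest_inode(names)
print(hex(top))
```

Anything that has no object in the data pool (directories, for instance) would not show up in such a listing, so the result is best treated as a lower bound with a generous margin added on top.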

> 

Understood, but it seems there are two things going on here:

- Files with wrong content

- MDS crashing on duplicate inodes

The latter is fixed with take_inos as we then bump the inode number to

something very high.
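
A toy model of why this helps, assuming that take_inos marks every inode number up to the given value as used. The class below is purely illustrative and is not Ceph's real inode table; the numbers are made up.

```python
class ToyInoTable:
    """Toy inode number allocator (illustration only, not Ceph's InoTable)."""

    def __init__(self, first_free):
        self.next_free = first_free

    def allocate(self):
        """Hand out the next free inode number."""
        ino = self.next_free
        self.next_free += 1
        return ino

    def take_inos(self, max_ino):
        """Treat every number up to and including max_ino as already used."""
        self.next_free = max(self.next_free, max_ino + 1)


# Inode numbers that files on disk already use (made-up values).
existing = {0x10000000000, 0x10000000001, 0x10000000002}

# After a table reset the allocator starts from the beginning again...
table = ToyInoTable(first_free=0x10000000000)
clash = table.allocate()
print(hex(clash), clash in existing)   # the new inode collides with an old one

# ...unless everything up to the highest number in use is skipped first.
table.take_inos(max(existing))
fresh = table.allocate()
print(hex(fresh), fresh in existing)
```

The skip only helps if the value is above the highest inode number actually in use. The inode numbers in the logs in this thread (for example 0x10000991921) are far larger than 100000.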

Wido

> BTW, in the long run I'd still really like to integrate all this

> tooling into an overall FSCK.  Most of the individual commands were

> added in Jewel era with the intent that they would be available for

> level 3 support, but eventually we should build a tool that is safer

> for end users.  I'm interested in using Kubernetes to orchestrate

> groups of worker processes to do massively parallel cephfs-data-scan

> operations, so that this isn't so prohibitively long running.

> 

> John

> 

>>

>> Wido

>>

>>>

>>> John

>>>

>>>>

>>>> I'm worried that my FS is corrupt because files are not linked

>>>> correctly and have different content than they should.

>>>>

>>>> Please help.

>>>>

>>>> On Thu, 2018-07-05 at 10:35 +0200, Dennis Kramer (DT) wrote:

>>>>> Hi,

>>>>>

>>>>> I'm getting a bunch of "loaded dup inode" errors in the MDS logs.

>>>>> How can this be fixed?

>>>>>

>>>>> logs:

>>>>> 2018-07-05 10:20:05.591948 mds.mds05 [ERR] loaded dup inode 0x10000991921 [2,head] v160 at <file path>, but inode 0x10000991921.head v146 already exists at <another file path>
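
To get an overview of which inodes and paths are affected, the messages can be pulled out of the MDS log. A small sketch, assuming each message sits on a single line in the format quoted above; the regular expression is derived only from that one example and the sample paths are made up.

```python
import re

# Pattern derived from the single "loaded dup inode" example in this thread.
DUP_INODE = re.compile(
    r"loaded dup inode (0x[0-9a-f]+) \[[^\]]*\] v(\d+) at (.+?), "
    r"but inode 0x[0-9a-f]+\S* v(\d+) already exists at (.+)$"
)


def parse_dup_inodes(lines):
    """Return one dict per 'loaded dup inode' message found in `lines`."""
    found = []
    for line in lines:
        match = DUP_INODE.search(line.strip())
        if match:
            found.append({
                "ino": int(match.group(1), 16),
                "new_version": int(match.group(2)),
                "new_path": match.group(3),
                "existing_version": int(match.group(4)),
                "existing_path": match.group(5),
            })
    return found


# Sample line with made-up paths in place of the redacted ones.
log_lines = [
    "2018-07-05 10:20:05.591948 mds.mds05 [ERR] loaded dup inode "
    "0x10000991921 [2,head] v160 at /a/report.doc, but inode "
    "0x10000991921.head v146 already exists at /b/notes.txt",
    "2018-07-05 10:20:06.000000 mds.mds05 [INF] some unrelated message",
]

results = parse_dup_inodes(log_lines)
for entry in results:
    print(hex(entry["ino"]), entry["new_path"], entry["existing_path"])
```

This only lists the conflicts the MDS has already noticed while loading directories; it does not repair anything.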

>>>>>

>>>>>

>>>>>

>>>>> _______________________________________________

>>>>> ceph-users mailing list

>>>>> ceph-users@xxxxxxxxxxxxxx

>>>>> https://protect-au.mimecast.com/s/wcSJCK1qwBS1076vsvO7Cm?domain=lists.ceph.com

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
