On 07/06/2018 01:47 PM, John Spray wrote:
> On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>>
>> On 07/05/2018 03:36 PM, John Spray wrote:
>>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS) <dennis@xxxxxxxxx> wrote:
>>>>
>>>> Hi list,
>>>>
>>>> I have a serious problem now... I think.
>>>>
>>>> One of my users just informed me that a file he created (.doc file) has
>>>> different content than before. It looks like the file's inode is
>>>> completely wrong and points to the wrong object. I myself have found
>>>> another file with the same symptoms. I'm afraid my (production) FS is
>>>> corrupt now, unless there is a possibility to fix the inodes.
>>>
>>> You can probably get back to a state with some valid metadata, but it
>>> might not necessarily be the metadata the user was expecting (e.g. if
>>> two files are claiming the same inode number, one of them is probably
>>> going to get deleted).
>>>
>>>> Timeline of what happened:
>>>>
>>>> Last week I upgraded our Ceph Jewel to Luminous.
>>>> This went without any problem.
>>>>
>>>> I already had 5 MDS daemons available and enabled the multi-MDS
>>>> feature. This seemed to work okay, but after a while my MDS went
>>>> berserk and started flapping (crashed -> replay -> rejoin -> crashed).
>>>>
>>>> The only way to fix this and get the FS back online was the disaster
>>>> recovery procedure:
>>>>
>>>> cephfs-journal-tool event recover_dentries summary
>>>> ceph fs set cephfs cluster_down true
>>>> cephfs-table-tool all reset session
>>>> cephfs-table-tool all reset inode
>>>> cephfs-journal-tool --rank=cephfs:0 journal reset
>>>> ceph mds fail 0
>>>> ceph fs reset cephfs --yes-i-really-mean-it
>>>
>>> My concern with this procedure is that the recover_dentries and
>>> journal reset only happened on rank 0, whereas the other 4 MDS ranks
>>> would have retained lots of content in their journals. I wonder if we
>>> should be adding some more multi-MDS-aware checks to these tools, to
>>> warn the user when they're only acting on particular ranks (a
>>> reasonable person might assume that recover_dentries with no args is
>>> operating on all ranks, not just 0). Created
>>> http://tracker.ceph.com/issues/24780 to track improving the default
>>> behaviour.
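
(For reference, a per-rank version of those journal commands might look
something like the sketch below; the rank count and filesystem name are
assumptions based on the setup described in this thread, so adjust to
your own cluster.)

    # Sketch only -- run the journal recovery/reset against every active
    # rank instead of implicitly touching rank 0 only.
    # Assumes 5 active ranks (0-4) and a filesystem named "cephfs".
    for rank in 0 1 2 3 4; do
        cephfs-journal-tool --rank=cephfs:$rank event recover_dentries summary
        cephfs-journal-tool --rank=cephfs:$rank journal reset
    done
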
>>>> Restarted the MDS and I was back online. Shortly after, I was getting
>>>> a lot of "loaded dup inode" errors. In the meantime the MDS kept
>>>> crashing. It looks like it had trouble creating new inodes. Right
>>>> before the crash it mostly complained something like:
>>>>
>>>> -2> 2018-07-05 05:05:01.614290 7f8f8574b700 4 mds.0.server
>>>> handle_client_request client_request(client.324932014:1434 create
>>>> #0x10000360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0,
>>>> caller_gid=0{}) v2
>>>> -1> 2018-07-05 05:05:01.614320 7f8f7e73d700 5 mds.0.log
>>>> _submit_thread 24100753876035~1070 : EOpen [metablob 0x10000360346, 1
>>>> dirs], 1 open files
>>>> 0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1
>>>> /build/ceph-12.2.5/src/mds/MDCache.cc: In function 'void
>>>> MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05
>>>> 05:05:01.615123
>>>> /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)
>>>>
>>>> I also tried to counter the create-inode crash by doing the following:
>>>>
>>>> cephfs-journal-tool event recover_dentries
>>>> cephfs-journal-tool journal reset
>>>> cephfs-table-tool all reset session
>>>> cephfs-table-tool all reset inode
>>>> cephfs-table-tool all take_inos 100000
>>>
>>> This procedure is recovering some metadata from the journal into the
>>> main tree, then resetting everything, but duplicate inodes happen when
>>> the main tree has multiple dentries containing inodes that use the
>>> same inode number.
>>>
>>> What you need is something that scans through all the metadata,
>>> notices which entries point to a duplicate, and snips out those
>>> dentries. I'm not quite up to date on the latest CephFS forward scrub
>>> bits, so hopefully someone else can chime in to comment on whether we
>>> have the tooling for this already.
>>
>> But to prevent these crashes, setting take_inos to a higher number is a
>> good choice, right? You'll lose inode numbers, but you will have it
>> running without duplicate (new) inodes.
>
> Yes -- that's the motivation for skipping inode numbers after some
> damage (but it won't help if some duplication has already happened).
>
>> Any idea how to figure out the highest inode at the moment in the FS?
>
> If the metadata is damaged, you'd have to do a full scan of objects in
> the data pool. Perhaps that could be added as a mode to
> cephfs-data-scan.
>

Understood, but it seems there are two things going on here:

- Files with wrong content
- MDS crashing on duplicate inodes

The latter is fixed with take_inos, as we then bump the inode number to
something very high.

Wido
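
(A rough way to establish the highest inode currently in use, assuming a
single data pool named "cephfs_data" that is used only by CephFS: every
data object is named <inode-hex>.<block-hex>, so listing the pool and
taking the largest inode prefix gives a lower bound to feed into
take_inos. This is a sketch only -- it can run for a very long time on a
large pool and misses files that never had a data object written.)

    # List every object, strip the ".<block>" suffix, and report the
    # largest inode number seen (object names are hexadecimal).
    rados -p cephfs_data ls | cut -d. -f1 |
        python -c 'import sys; m = max(int(l, 16) for l in sys.stdin); print("0x%x (%d)" % (m, m))'

    # Then jump well past it, e.g.:
    #   cephfs-table-tool all take_inos <value comfortably above that maximum>
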
> BTW, in the long run I'd still really like to integrate all this
> tooling into an overall FSCK. Most of the individual commands were
> added in the Jewel era with the intent that they would be available for
> level-3 support, but eventually we should build a tool that is safer
> for end users. I'm interested in using Kubernetes to orchestrate
> groups of worker processes to do massively parallel cephfs-data-scan
> operations, so that this isn't so prohibitively long-running.
>
> John
>
>>
>> Wido
>>
>>>
>>> John
>>>
>>>> I'm worried that my FS is corrupt because files are not linked
>>>> correctly and have different content than they should.
>>>>
>>>> Please help.
>>>>
>>>> On Thu, 2018-07-05 at 10:35 +0200, Dennis Kramer (DT) wrote:
>>>>> Hi,
>>>>>
>>>>> I'm getting a bunch of "loaded dup inode" errors in the MDS logs.
>>>>> How can this be fixed?
>>>>>
>>>>> logs:
>>>>> 2018-07-05 10:20:05.591948 mds.mds05 [ERR] loaded dup inode
>>>>> 0x10000991921 [2,head] v160 at <file path>, but inode
>>>>> 0x10000991921.head v146 already exists at <another file path>
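
(For anyone chasing a specific "loaded dup inode" message: the dentries
John mentions live as omap keys on directory objects in the metadata
pool, one object per directory fragment named <parent-inode-hex>.<frag>,
with keys of the form <name>_head. The pool name below and the directory
inode, reused from the crash log above, are only placeholders, and both
commands are read-only inspection, not a fix.)

    # Sketch: list the dentries of one directory fragment in the metadata
    # pool. Assumes the default pool name "cephfs_metadata" and an
    # unfragmented directory (frag 00000000).
    rados -p cephfs_metadata listomapkeys 10000360346.00000000

    # Dump a single dentry's value (binary; the inode number it points at
    # is embedded in it). "pyfiles.txt" is taken from the crash log above.
    rados -p cephfs_metadata getomapval 10000360346.00000000 pyfiles.txt_head

Actually snipping out an offending dentry would be a write to those same
objects, which is exactly the kind of operation an overall FSCK tool, as
John describes, should own rather than manual surgery.
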