We're affected by something like this right now (a duplicate inode causing the MDS to crash via the FAILED assert(!p) in MDCache::add_inode(CInode*)).
In terms of behaviour, shouldn't the MDS simply skip to the next available free inode when it hits a dup, rather than crashing the entire FS because of one file? I'm probably missing something, but that seems like a no-brainer choice between the two.
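For what it's worth, the closest manual equivalent of "skip to the next free inode" seems to be the take_inos step discussed further down in the quoted thread: bump the inode allocator well past the duplicated range so that new creates stop colliding. A rough sketch of what we're planning to try, assuming a filesystem named cephfs and an arbitrary margin of 100000 (both are assumptions for our cluster, not an official procedure):

# take the FS offline first, as in the recovery procedure quoted below
ceph fs set cephfs cluster_down true
ceph mds fail 0

# skip the inode allocator ahead of the duplicated range; 100000 is an
# arbitrary margin, not a calculated value
cephfs-table-tool all take_inos 100000

# bring the FS back online
ceph fs set cephfs cluster_down false

The downside, as noted below, is that a range of inode numbers is simply thrown away, and it does nothing for duplicates that already exist in the metadata.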
From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Wido den Hollander <wido@xxxxxxxx>
Sent: Saturday, 7 July 2018 12:26:15 AM
To: John Spray
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: CephFS - How to handle "loaded dup inode" errors

On 07/06/2018 01:47 PM, John Spray wrote:
> On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>>
>> On 07/05/2018 03:36 PM, John Spray wrote:
>>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS) <dennis@xxxxxxxxx> wrote:
>>>>
>>>> Hi list,
>>>>
>>>> I have a serious problem now... I think.
>>>>
>>>> One of my users just informed me that a file he created (.doc file) has
>>>> different content than before. It looks like the file's inode is
>>>> completely wrong and points to the wrong object. I myself have found
>>>> another file with the same symptoms. I'm afraid my (production) FS is
>>>> corrupt now, unless there is a possibility to fix the inodes.
>>>
>>> You can probably get back to a state with some valid metadata, but it
>>> might not necessarily be the metadata the user was expecting (e.g. if
>>> two files are claiming the same inode number, one of them is probably
>>> going to get deleted).
>>>
>>>> Timeline of what happened:
>>>>
>>>> Last week I upgraded our Ceph Jewel to Luminous.
>>>> This went without any problem.
>>>>
>>>> I already had 5 MDS available and went with the multi-MDS feature and
>>>> enabled it. That seemed to work okay, but after a while my MDS went
>>>> berserk and went flapping (crashed -> replay -> rejoin -> crashed).
>>>>
>>>> The only way to fix this and get the FS back online was the disaster
>>>> recovery procedure:
>>>>
>>>> cephfs-journal-tool event recover_dentries summary
>>>> ceph fs set cephfs cluster_down true
>>>> cephfs-table-tool all reset session
>>>> cephfs-table-tool all reset inode
>>>> cephfs-journal-tool --rank=cephfs:0 journal reset
>>>> ceph mds fail 0
>>>> ceph fs reset cephfs --yes-i-really-mean-it
>>>
>>> My concern with this procedure is that the recover_dentries and
>>> journal reset only happened on rank 0, whereas the other 4 MDS ranks
>>> would have retained lots of content in their journals. I wonder if we
>>> should be adding some more multi-MDS-aware checks to these tools, to
>>> warn the user when they're only acting on particular ranks (a
>>> reasonable person might assume that recover_dentries with no args is
>>> operating on all ranks, not just 0). Created
>>> https://protect-au.mimecast.com/s/PZyQCJypvAfnP9VmfVwGUS?domain=tracker.ceph.com
>>> to track improving the default behaviour.
>>>
>>>> Restarted the MDS and I was back online. Shortly after I was getting a
>>>> lot of "loaded dup inode" errors. In the meanwhile the MDS kept
>>>> crashing. It looks like it had trouble creating new inodes.
>>>> Right before the crash it mostly complained something like:
>>>>
>>>> -2> 2018-07-05 05:05:01.614290 7f8f8574b700 4 mds.0.server
>>>> handle_client_request client_request(client.324932014:1434 create
>>>> #0x10000360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0,
>>>> caller_gid=0{}) v2
>>>> -1> 2018-07-05 05:05:01.614320 7f8f7e73d700 5 mds.0.log
>>>> _submit_thread 24100753876035~1070 : EOpen [metablob 0x10000360346, 1
>>>> dirs], 1 open files
>>>> 0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1
>>>> /build/ceph-12.2.5/src/mds/MDCache.cc: In function 'void
>>>> MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05 05:05:01.615123
>>>> /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)
>>>>
>>>> I also tried to counter the create-inode crash by doing the following:
>>>>
>>>> cephfs-journal-tool event recover_dentries
>>>> cephfs-journal-tool journal reset
>>>> cephfs-table-tool all reset session
>>>> cephfs-table-tool all reset inode
>>>> cephfs-table-tool all take_inos 100000
>>>
>>> This procedure is recovering some metadata from the journal into the
>>> main tree, then resetting everything, but duplicate inodes are
>>> happening when the main tree has multiple dentries containing inodes
>>> using the same inode number.
>>>
>>> What you need is something that scans through all the metadata,
>>> notices which entries point to a duplicate, and snips out those
>>> dentries. I'm not quite up to date on the latest CephFS forward scrub
>>> bits, so hopefully someone else can chime in to comment on whether we
>>> have the tooling for this already.
>>
>> But to prevent these crashes, setting take_inos to a higher number is a
>> good choice, right? You'll lose inode numbers, but you will have it
>> running without duplicate (new) inodes.
>
> Yes -- that's the motivation for skipping inode numbers after some
> damage (but it won't help if some duplication has already happened).
>
>> Any idea how to figure out the highest inode at the moment in the FS?
>
> If the metadata is damaged, you'd have to do a full scan of objects in
> the data pool. Perhaps that could be added as a mode to
> cephfs-data-scan.
>

Understood, but it seems there are two things going on here:

- Files with wrong content
- MDS crashing on duplicate inodes

The latter is fixed with take_inos, as we then bump the inode number to
something very high.

Wido

> BTW, in the long run I'd still really like to integrate all this
> tooling into an overall FSCK. Most of the individual commands were
> added in the Jewel era with the intent that they would be available for
> level 3 support, but eventually we should build a tool that is safer
> for end users. I'm interested in using Kubernetes to orchestrate
> groups of worker processes to do massively parallel cephfs-data-scan
> operations, so that this isn't so prohibitively long-running.
>
> John
>
>>
>> Wido
>>
>>>
>>> John
>>>
>>>>
>>>> I'm worried that my FS is corrupt because files are not linked
>>>> correctly and have different content than they should.
>>>>
>>>> Please help.
>>>>
>>>> On Thu, 2018-07-05 at 10:35 +0200, Dennis Kramer (DT) wrote:
>>>>> Hi,
>>>>>
>>>>> I'm getting a bunch of "loaded dup inode" errors in the MDS logs.
>>>>> How can this be fixed?
>>>>>
>>>>> logs:
>>>>> 2018-07-05 10:20:05.591948 mds.mds05 [ERR] loaded dup inode 0x10000991921
>>>>> [2,head] v160 at <file path>, but inode 0x10000991921.head v146 already
>>>>> exists at <another file path>
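On John's point above about recover_dentries and the journal reset only acting on rank 0: with multiple active MDS daemons, a per-rank variant of the quoted recovery procedure would presumably be needed. A sketch, assuming five ranks (0-4) and a filesystem named cephfs; this is not a verified recipe:

ceph fs set cephfs cluster_down true

# run the journal recovery and reset against every rank, not just rank 0
for rank in 0 1 2 3 4; do
    cephfs-journal-tool --rank=cephfs:$rank event recover_dentries summary
    cephfs-journal-tool --rank=cephfs:$rank journal reset
done

cephfs-table-tool all reset session
cephfs-table-tool all reset inode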
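And on Wido's question about finding the highest inode currently in use: since data objects are named <inode-in-hex>.<block>, a rough upper bound can be pulled from the data pool's object listing. A sketch, assuming the data pool is called cephfs_data; listing every object is slow on a large pool, and it only sees files that actually have data objects:

# keep the inode part of each object name, convert from hex, take the largest
rados -p cephfs_data ls | cut -d. -f1 | sort -u |
  while read ino; do printf '%d\n' "0x$ino"; done | sort -n | tail -1

The result, plus a generous margin, would then be the value handed to take_inos.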
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com