Hi John,

On Tue, 2018-07-10 at 10:11 +0100, John Spray wrote:
> On Tue, Jul 10, 2018 at 12:43 AM Linh Vu <vul@xxxxxxxxxxxxxx> wrote:
> >
> > We're affected by something like this right now (the dup inode causing the MDS to crash via assert(!p) in the add_inode(CInode) function).
> >
> > In terms of behaviour, shouldn't the MDS simply skip to the next available free inode in the event of a dup, rather than crash the entire FS because of one file? I'm probably missing something, but that seems like a no-brainer when picking between the two.
>
> Historically (a few years ago) the MDS asserted out on any invalid metadata. Most of these cases have been picked up and converted into explicit damage handling, but this one appears to have been missed -- so yes, it's a bug that the MDS asserts out.

I have followed the disaster recovery procedure, and now all the files and directories in CephFS that complained about duplicate inodes have disappeared from my FS. I see *some* data in "lost+found", but that's only part of it. Is there any way to retrieve those missing files?

> John
>
> > ________________________________
> > From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Wido den Hollander <wido@xxxxxxxx>
> > Sent: Saturday, 7 July 2018 12:26:15 AM
> > To: John Spray
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: CephFS - How to handle "loaded dup inode" errors
> >
> > On 07/06/2018 01:47 PM, John Spray wrote:
> > > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander <wido@xxxxxxxx> wrote:
> > > >
> > > > On 07/05/2018 03:36 PM, John Spray wrote:
> > > > > On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS) <dennis@holmes.nl> wrote:
> > > > > >
> > > > > > Hi list,
> > > > > >
> > > > > > I have a serious problem now... I think.
> > > > > >
> > > > > > One of my users just informed me that a file he created (a .doc file) has different content than before. It looks like the file's inode is completely wrong and points to the wrong object. I myself have found another file with the same symptoms. I'm afraid my (production) FS is corrupt now, unless there is a possibility to fix the inodes.
> > > > >
> > > > > You can probably get back to a state with some valid metadata, but it might not necessarily be the metadata the user was expecting (e.g. if two files are claiming the same inode number, one of them is probably going to get deleted).
> > > > >
> > > > > > Timeline of what happened:
> > > > > >
> > > > > > Last week I upgraded our Ceph cluster from Jewel to Luminous. This went without any problem.
> > > > > >
> > > > > > I already had 5 MDS daemons available, so I went with the multi-MDS feature and enabled it. This seemed to work okay, but after a while my MDS went berserk and started flapping (crashed -> replay -> rejoin -> crashed).
> > > > > >
> > > > > > The only way to fix this and get the FS back online was the disaster recovery procedure:
> > > > > >
> > > > > > cephfs-journal-tool event recover_dentries summary
> > > > > > ceph fs set cephfs cluster_down true
> > > > > > cephfs-table-tool all reset session
> > > > > > cephfs-table-tool all reset inode
> > > > > > cephfs-journal-tool --rank=cephfs:0 journal reset
> > > > > > ceph mds fail 0
> > > > > > ceph fs reset cephfs --yes-i-really-mean-it
> > > > >
> > > > > My concern with this procedure is that the recover_dentries and journal reset only happened on rank 0, whereas the other 4 MDS ranks would have retained lots of content in their journals. I wonder if we should be adding some more multi-MDS-aware checks to these tools, to warn the user when they're only acting on particular ranks (a reasonable person might assume that recover_dentries with no args is operating on all ranks, not just 0). Created https://protect-au.mimecast.com/s/PZyQCJypvAfnP9VmfVwGUS?domain=tracker.ceph.com to track improving the default behaviour.
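(Note for anyone replaying this on a multi-MDS filesystem: following John's point above, the journal recovery would presumably have to be repeated for every active rank, not just rank 0. I haven't verified this end to end -- a rough sketch, assuming the --rank syntax used for "journal reset" above also applies to recover_dentries, and that the filesystem "cephfs" has active ranks 0-4:

    # repeat the journal steps for each non-zero rank as well
    cephfs-journal-tool --rank=cephfs:1 event recover_dentries summary
    cephfs-journal-tool --rank=cephfs:1 journal reset

and likewise for ranks 2, 3 and 4, in addition to the rank 0 steps listed above.)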
> > > > > > Restarted the MDS and I was back online. Shortly after, I was getting a lot of "loaded dup inode" errors. Meanwhile the MDS kept crashing. It looks like it had trouble creating new inodes. Right before the crash it mostly complained with something like:
> > > > > >
> > > > > > -2> 2018-07-05 05:05:01.614290 7f8f8574b700 4 mds.0.server handle_client_request client_request(client.324932014:1434 create #0x10000360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0, caller_gid=0{}) v2
> > > > > > -1> 2018-07-05 05:05:01.614320 7f8f7e73d700 5 mds.0.log _submit_thread 24100753876035~1070 : EOpen [metablob 0x10000360346, 1 dirs], 1 open files
> > > > > > 0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1 /build/ceph-12.2.5/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05 05:05:01.615123
> > > > > > /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)
> > > > > >
> > > > > > I also tried to counter the create-inode crash by doing the following:
> > > > > >
> > > > > > cephfs-journal-tool event recover_dentries
> > > > > > cephfs-journal-tool journal reset
> > > > > > cephfs-table-tool all reset session
> > > > > > cephfs-table-tool all reset inode
> > > > > > cephfs-table-tool all take_inos 100000
> > > > >
> > > > > This procedure is recovering some metadata from the journal into the main tree and then resetting everything, but duplicate inodes happen when the main tree has multiple dentries containing inodes that use the same inode number.
> > > > >
> > > > > What you need is something that scans through all the metadata, notices which entries point to a duplicate, and snips out those dentries. I'm not quite up to date on the latest CephFS forward scrub bits, so hopefully someone else can chime in to comment on whether we have the tooling for this already.
> > > >
> > > > But to prevent these crashes, setting take_inos to a higher number is a good choice, right? You'll lose inode numbers, but you will have it running without duplicate (new) inodes.
> > >
> > > Yes -- that's the motivation for skipping inode numbers after some damage (but it won't help if some duplication has already happened).
> > >
> > > > Any idea how to figure out the highest inode at the moment in the FS?
> > >
> > > If the metadata is damaged, you'd have to do a full scan of objects in the data pool. Perhaps that could be added as a mode to cephfs-data-scan.
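(Note: until something like that exists in cephfs-data-scan, a very rough way to estimate the highest inode number in use is to scan the object names in the data pool by hand. Objects in a CephFS data pool are named <inode-in-hex>.<block-in-hex>; the pool name "cephfs_data" below is just an example, and listing a large pool this way can take a very long time:

    # highest inode number that currently has any data object
    rados -p cephfs_data ls | cut -d. -f1 | sort -u | \
      python3 -c 'import sys; print(hex(max(int(l, 16) for l in sys.stdin)))'

take_inos would then be given a value comfortably above whatever this prints. It only sees files that actually have data objects, so it says nothing about empty files or directories.)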
> >
> > Understood, but it seems there are two things going on here:
> >
> > - Files with wrong content
> > - MDS crashing on duplicate inodes
> >
> > The latter is fixed with take_inos, as we then bump the inode number to something very high.
> >
> > Wido
> >
> > > BTW, in the long run I'd still really like to integrate all this tooling into an overall FSCK. Most of the individual commands were added in the Jewel era with the intent that they would be available for level 3 support, but eventually we should build a tool that is safer for end users. I'm interested in using Kubernetes to orchestrate groups of worker processes to do massively parallel cephfs-data-scan operations, so that this isn't so prohibitively long running.
> > >
> > > John
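(Note: cephfs-data-scan can already be split across multiple workers by hand, which is presumably what such an orchestrator would wrap. A sketch of a 4-way split using its worker_n/worker_m options -- the pool name "cephfs_data" is again just an example, and each worker would normally run on a different host:

    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 cephfs_data
    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 cephfs_data
    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 cephfs_data
    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 cephfs_data

All scan_extents workers have to finish before starting the equivalent scan_inodes split.)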
> > >
> > > > Wido
> > > >
> > > > > John
> > > > >
> > > > > > I'm worried that my FS is corrupt because files are not linked correctly and have different content than they should.
> > > > > >
> > > > > > Please help.
> > > > > >
> > > > > > On Thu, 2018-07-05 at 10:35 +0200, Dennis Kramer (DT) wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I'm getting a bunch of "loaded dup inode" errors in the MDS logs. How can this be fixed?
> > > > > > >
> > > > > > > logs:
> > > > > > > 2018-07-05 10:20:05.591948 mds.mds05 [ERR] loaded dup inode 0x10000991921 [2,head] v160 at <file path>, but inode 0x10000991921.head v146 already exists at <another file path>
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > ceph-users mailing list
> > > > > > > ceph-users@xxxxxxxxxxxxxx
> > > > > > > https://protect-au.mimecast.com/s/wcSJCK1qwBS1076vsvO7Cm?domain=lists.ceph.com
> > > > > >
> > > > > > ceph-users mailing list
> > > > > > ceph-users@xxxxxxxxxxxxxx
> > > > > > https://protect-au.mimecast.com/s/fiHCCL7rxDs35B7psPEAx8?domain=lists.ceph.com
> > > > >
> > > > > _______________________________________________
> > > > > ceph-users mailing list
> > > > > ceph-users@xxxxxxxxxxxxxx
> > > > > https://protect-au.mimecast.com/s/fiHCCL7rxDs35B7psPEAx8?domain=lists.ceph.com
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com