On 07/11/2018 01:47 AM, Linh Vu wrote:
> Thanks John :) Has it - asserting out on dupe inode - already been
> logged as a bug yet? I could put one in if needed.
> Did you just comment out the assert?

And indeed, my next question would be: do we have an issue tracker entry
for this?

Wido

>
> Cheers,
>
> Linh
>
> ------------------------------------------------------------------------
> *From:* John Spray <jspray@xxxxxxxxxx>
> *Sent:* Tuesday, 10 July 2018 7:11 PM
> *To:* Linh Vu
> *Cc:* Wido den Hollander; ceph-users@xxxxxxxxxxxxxx
> *Subject:* Re: CephFS - How to handle "loaded dup inode" errors
>
> On Tue, Jul 10, 2018 at 12:43 AM Linh Vu <vul@xxxxxxxxxxxxxx> wrote:
>>
>> We're affected by something like this right now (the dup inode causing
>> the MDS to crash via assert(!p) in the add_inode(CInode*) function).
>>
>> In terms of behaviour, shouldn't the MDS simply skip to the next
>> available free inode in the event of a dup, rather than crashing the
>> entire FS because of one file? Probably I'm missing something, but
>> that'd be a no-brainer picking between the two?
>
> Historically (a few years ago) the MDS asserted out on any invalid
> metadata. Most of these cases have been picked up and converted into
> explicit damage handling, but this one appears to have been missed --
> so yes, it's a bug that the MDS asserts out.
>
> John
>
>> ________________________________
>> From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Wido den Hollander <wido@xxxxxxxx>
>> Sent: Saturday, 7 July 2018 12:26:15 AM
>> To: John Spray
>> Cc: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: CephFS - How to handle "loaded dup inode" errors
>>
>> On 07/06/2018 01:47 PM, John Spray wrote:
>> > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>> >>
>> >> On 07/05/2018 03:36 PM, John Spray wrote:
>> >>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS) <dennis@xxxxxxxxx> wrote:
>> >>>>
>> >>>> Hi list,
>> >>>>
>> >>>> I have a serious problem now... I think.
>> >>>>
>> >>>> One of my users just informed me that a file he created (a .doc file)
>> >>>> has different content than before. It looks like the file's inode is
>> >>>> completely wrong and points to the wrong object. I myself have found
>> >>>> another file with the same symptoms. I'm afraid my (production) FS is
>> >>>> corrupt now, unless there is a possibility to fix the inodes.
>> >>>
>> >>> You can probably get back to a state with some valid metadata, but it
>> >>> might not necessarily be the metadata the user was expecting (e.g. if
>> >>> two files are claiming the same inode number, one of them is probably
>> >>> going to get deleted).
>> >>>
>> >>>> Timeline of what happened:
>> >>>>
>> >>>> Last week I upgraded our Ceph cluster from Jewel to Luminous.
>> >>>> This went without any problem.
>> >>>>
>> >>>> I already had 5 MDS daemons available, so I went with the Multi-MDS
>> >>>> feature and enabled it.
>> >>>> This seemed to work okay, but after a while my MDS went berserk and
>> >>>> went flapping (crashed -> replay -> rejoin -> crashed).
>> >>>>
>> >>>> The only way to fix this and get the FS back online was the disaster
>> >>>> recovery procedure:
>> >>>>
>> >>>> cephfs-journal-tool event recover_dentries summary
>> >>>> ceph fs set cephfs cluster_down true
>> >>>> cephfs-table-tool all reset session
>> >>>> cephfs-table-tool all reset inode
>> >>>> cephfs-journal-tool --rank=cephfs:0 journal reset
>> >>>> ceph mds fail 0
>> >>>> ceph fs reset cephfs --yes-i-really-mean-it
>> >>>
>> >>> My concern with this procedure is that the recover_dentries and
>> >>> journal reset only happened on rank 0, whereas the other 4 MDS ranks
>> >>> would have retained lots of content in their journals. I wonder if we
>> >>> should be adding some more multi-MDS-aware checks to these tools, to
>> >>> warn the user when they're only acting on particular ranks (a
>> >>> reasonable person might assume that recover_dentries with no args is
>> >>> operating on all ranks, not just 0). Created
>> >>> http://tracker.ceph.com/issues/24780 to track improving the default
>> >>> behaviour.
>> >>>
>> >>>> Restarted the MDS and I was back online. Shortly after, I was getting
>> >>>> a lot of "loaded dup inode" errors. In the meantime the MDS kept
>> >>>> crashing. It looks like it had trouble creating new inodes. Right
>> >>>> before the crash it mostly complained with something like:
>> >>>>
>> >>>> -2> 2018-07-05 05:05:01.614290 7f8f8574b700 4 mds.0.server
>> >>>> handle_client_request client_request(client.324932014:1434 create
>> >>>> #0x10000360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0,
>> >>>> caller_gid=0{}) v2
>> >>>> -1> 2018-07-05 05:05:01.614320 7f8f7e73d700 5 mds.0.log
>> >>>> _submit_thread 24100753876035~1070 : EOpen [metablob 0x10000360346, 1
>> >>>> dirs], 1 open files
>> >>>> 0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1 /build/ceph-
>> >>>> 12.2.5/src/mds/MDCache.cc: In function 'void
>> >>>> MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05
>> >>>> 05:05:01.615123
>> >>>> /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)
>> >>>>
>> >>>> I also tried to counter the create-inode crash by doing the following:
>> >>>>
>> >>>> cephfs-journal-tool event recover_dentries
>> >>>> cephfs-journal-tool journal reset
>> >>>> cephfs-table-tool all reset session
>> >>>> cephfs-table-tool all reset inode
>> >>>> cephfs-table-tool all take_inos 100000
>> >>>
>> >>> This procedure is recovering some metadata from the journal into the
>> >>> main tree, then resetting everything, but duplicate inodes happen
>> >>> when the main tree has multiple dentries containing inodes that use
>> >>> the same inode number.
>> >>>
>> >>> What you need is something that scans through all the metadata,
>> >>> notices which entries point to a duplicate, and snips out those
>> >>> dentries. I'm not quite up to date on the latest CephFS forward scrub
>> >>> bits, so hopefully someone else can chime in to comment on whether we
>> >>> have the tooling for this already.
>> >>
>> >> But to prevent these crashes, setting take_inos to a higher number is
>> >> a good choice, right? You'll lose inode numbers, but you will have it
>> >> running without duplicate (new) inodes.
>> >
>> > Yes -- that's the motivation for skipping inode numbers after some
>> > damage (but it won't help if some duplication has already happened).
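To make that concrete for anyone else running multiple active MDS ranks:
below is a rough sketch of what a per-rank version of the journal/table
steps above might look like. It assumes a filesystem named "cephfs" with
five active ranks (0-4); the rank list and the take_inos value are
illustrative placeholders taken from this thread, not tested or
recommended values.

# Take the filesystem offline first, as in the original procedure
ceph fs set cephfs cluster_down true

# Recover dentries from, and then reset, the journal of *every* active
# rank, not only rank 0
for r in 0 1 2 3 4; do
    cephfs-journal-tool --rank=cephfs:$r event recover_dentries summary
    cephfs-journal-tool --rank=cephfs:$r journal reset
done

# Reset the shared tables and skip the inode allocator well past any
# inode numbers that may already be in use in the metadata tree
cephfs-table-tool all reset session
cephfs-table-tool all reset inode
cephfs-table-tool all take_inos 100000
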
>> >> Any idea how to figure out the highest inode at the moment in the FS?
>> >
>> > If the metadata is damaged, you'd have to do a full scan of objects in
>> > the data pool. Perhaps that could be added as a mode to
>> > cephfs-data-scan.
>> >
>>
>> Understood, but it seems there are two things going on here:
>>
>> - Files with wrong content
>> - MDS crashing on duplicate inodes
>>
>> The latter is fixed with take_inos, as we then bump the inode number to
>> something very high.
>>
>> Wido
>>
>> > BTW, in the long run I'd still really like to integrate all this
>> > tooling into an overall FSCK. Most of the individual commands were
>> > added in the Jewel era with the intent that they would be available
>> > for level-3 support, but eventually we should build a tool that is
>> > safer for end users. I'm interested in using Kubernetes to orchestrate
>> > groups of worker processes to do massively parallel cephfs-data-scan
>> > operations, so that this isn't so prohibitively long-running.
>> >
>> > John
>> >
>> >> Wido
>> >>
>> >>> John
>> >>>
>> >>>> I'm worried that my FS is corrupt because files are not linked
>> >>>> correctly and have different content than they should.
>> >>>>
>> >>>> Please help.
>> >>>>
>> >>>> On Thu, 2018-07-05 at 10:35 +0200, Dennis Kramer (DT) wrote:
>> >>>>> Hi,
>> >>>>>
>> >>>>> I'm getting a bunch of "loaded dup inode" errors in the MDS logs.
>> >>>>> How can this be fixed?
>> >>>>>
>> >>>>> logs:
>> >>>>> 2018-07-05 10:20:05.591948 mds.mds05 [ERR] loaded dup inode 0x10000991921
>> >>>>> [2,head] v160 at <file path>, but inode 0x10000991921.head v146 already
>> >>>>> exists at <another file path>
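
A follow-up on finding the highest inode currently in use: below is a
crude sketch of the data-pool scan John describes, assuming the data
pool is named "cephfs_data" (a placeholder for the real pool name).
CephFS data objects are named "<inode number in hex>.<block index>", so
listing the pool and taking the largest hex prefix approximates the
highest allocated inode (directories and empty files have no data
objects, so it is only a lower bound), and on a large pool this will
take a long time.

# List all data objects, strip the ".<block>" suffix, convert the hex
# inode numbers to decimal and print the largest one
rados -p cephfs_data ls | cut -d. -f1 | sort -u \
  | while read ino; do printf '%d\n' "0x$ino"; done \
  | sort -n | tail -1

And on the massively parallel scanning mentioned above: cephfs-data-scan
already supports sharding its work across workers with the
--worker_n/--worker_m options, for example four extent-scan workers run
in parallel (again treating "cephfs_data" as a placeholder):

cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 cephfs_data
cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 cephfs_data
cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 cephfs_data
cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 cephfs_data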