Hi John,

I have been hitting that issue as well, although I have not seen any asserts in my MDS yet.

Could you please clarify your proposal about manually removing the omap info from strays a bit further? Should it be applied:

- to the problematic replicas of the stray object which triggered the inconsistent PG? Or...
- to all replicas of the stray object which triggered the inconsistent PG? Or...
- to all replicas of all stray objects?

In the latter case, how do we know how many stray objects exist?

Cheers
Goncalo

________________________________________
From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of John Spray [jspray@xxxxxxxxxx]
Sent: 09 December 2016 05:30
To: Sean Redmond
Cc: ceph-users
Subject: Re: CephFS FAILED assert(dn->get_linkage()->is_null())

On Thu, Dec 8, 2016 at 3:45 PM, Sean Redmond <sean.redmond1@xxxxxxxxx> wrote:
> Hi,
>
> We had no changes going on with the ceph pools or ceph servers at the time.
>
> We have, however, been hitting this in the last week, and it may be related:
>
> http://tracker.ceph.com/issues/17177

Oh, okay -- so you've presumably got corruption in your metadata pool as a result of hitting that issue.

I think in the past people have managed to get past this by taking their MDSs offline and manually removing the omap entries in their stray directory fragments (i.e. using the `rados` CLI on the objects whose names start with "600."), roughly along the lines of the sketch below.
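From memory, and untested -- this assumes a single active MDS (rank 0), whose ten stray directories should be the objects 600.00000000 through 609.00000000, and a metadata pool literally named "metadata" (substitute your own pool name):

  # stop the MDS daemons first, e.g.:
  systemctl stop ceph-mds.target

  # enumerate the stray directory fragment objects in the metadata pool
  rados -p metadata ls | grep -E '^60[0-9]\.'

  # list the dentries (omap keys) held in one stray fragment
  rados -p metadata listomapkeys 600.00000000

  # remove an offending dentry by key
  rados -p metadata rmomapkey 600.00000000 <key>

Do take a copy of anything before you remove it (e.g. `rados -p metadata listomapvals 600.00000000` dumps keys and values), since this is surgery on your metadata.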
John

> Thanks
>
> On Thu, Dec 8, 2016 at 3:34 PM, John Spray <jspray@xxxxxxxxxx> wrote:
>>
>> On Thu, Dec 8, 2016 at 3:11 PM, Sean Redmond <sean.redmond1@xxxxxxxxx> wrote:
>> > Hi,
>> >
>> > I have a CephFS cluster that is currently unable to start the MDS server,
>> > as it is hitting an assert; an extract from the MDS log is below. Any
>> > pointers are welcome:
>> >
>> > ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>> >
>> > 2016-12-08 14:50:18.577038 7f7d9faa3700  1 mds.0.47077 handle_mds_map state change up:rejoin --> up:active
>> > 2016-12-08 14:50:18.577048 7f7d9faa3700  1 mds.0.47077 recovery_done -- successful recovery!
>> > 2016-12-08 14:50:18.577166 7f7d9faa3700  1 mds.0.47077 active_start
>> > 2016-12-08 14:50:19.460208 7f7d9faa3700  1 mds.0.47077 cluster recovered.
>> > 2016-12-08 14:50:19.495685 7f7d9abfc700 -1 mds/CDir.cc: In function 'void CDir::try_remove_dentries_for_stray()' thread 7f7d9abfc700 time 2016-12-08 14:50:19.494508
>> > mds/CDir.cc: 699: FAILED assert(dn->get_linkage()->is_null())
>> >
>> > ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55f0f789def0]
>> >  2: (CDir::try_remove_dentries_for_stray()+0x1a0) [0x55f0f76666c0]
>> >  3: (StrayManager::__eval_stray(CDentry*, bool)+0x8c9) [0x55f0f75e7799]
>> >  4: (StrayManager::eval_stray(CDentry*, bool)+0x22) [0x55f0f75e7cf2]
>> >  5: (MDCache::scan_stray_dir(dirfrag_t)+0x16d) [0x55f0f753b30d]
>> >  6: (MDSInternalContextBase::complete(int)+0x18b) [0x55f0f76e93db]
>> >  7: (MDSRank::_advance_queues()+0x6a7) [0x55f0f749bf27]
>> >  8: (MDSRank::ProgressThread::entry()+0x4a) [0x55f0f749c45a]
>> >  9: (()+0x770a) [0x7f7da6bdc70a]
>> > 10: (clone()+0x6d) [0x7f7da509d82d]
>> > NOTE: a copy of the executable, or `objdump -rdS <executable>`, is needed to interpret this.
>>
>> Last time someone had this issue, they had tried to create a filesystem using pools that had another filesystem's old objects in them:
>> http://tracker.ceph.com/issues/16829
>>
>> What was going on on your system before you hit this?
>>
>> John
>>
>> > Thanks

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com