Re: CephFS FAILED assert(dn->get_linkage()->is_null())

Hey John,

Thanks for your response here.

We took the following actions on the journal to get past the mds assert we were initially hitting:

#cephfs-journal-tool journal export backup.bin (this command failed; we suspect due to the corruption)

#cephfs-journal-tool event recover_dentries summary (this ran successfully, based on exit status and output)

#cephfs-journal-tool journal reset (this ran successfully, based on exit status and output)

#cephfs-table-tool all reset session (this ran successfully, based on exit status and output)
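
As an aside for anyone else walking this path: it is probably worth asking the journal tool itself what it thinks of the journal before and after the export attempt (a minimal sketch; we run a single active mds, so rank 0 is implied):

#cephfs-journal-tool journal inspect
#cephfs-journal-tool header get

The inspect subcommand reports overall journal integrity, which would have given us a clearer signal than the failing export alone.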

After the above we were able to get past the mds assert loop. We then asked the mds to scrub using the command below, which produced the two damage reports shown:

"ceph --admin-daemon /var/run/ceph/ceph-mds.ceph-mds2.asok scrub_path / recursive repair"

[
   {
       "damage_type": "dir_frag",
       "frag": "*",
       "id": 1060573330,
       "ino": 1099683935045
   },
   {
       "damage_type": "dir_frag",
       "frag": "*",
       "id": 1086075136,
       "ino": 1099683935044
   }
]
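
For reference, the same two entries can also be read back from the damage table rather than the scrub output (a small sketch; rank 0 is our only active mds):

#ceph tell mds.0 damage ls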

It is interesting that "damage rm" doesn't fix anything; in our case, after running it the mds was still reporting 'damaged metadata' even though the 'damage ls' output no longer listed any damaged inodes.

Only after running "ceph mds repaired" did the "damaged metadata" warning disappear from the output of 'ceph -s'.
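
For completeness, a rough sketch of that step in command form, using the "id" values from the scrub output above and rank 0 as our only mds (as far as we can tell "damage rm" wants the id field rather than the ino). Per John's point below, this only clears the damage table entries; it does not repair anything:

#ceph tell mds.0 damage rm 1060573330
#ceph tell mds.0 damage rm 1086075136
#ceph mds repaired 0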

We then forced a deep scrub of all the PGs in the metadata pool and found two PGs with single-object scrub errors like those reported in bug #17177; we repaired these with 'ceph pg repair'.
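
For anyone needing to do the same, a sketch of how this can be driven (pool name and PG id are placeholders; 'rados list-inconsistent-obj' is optional but useful for eyeballing the inconsistent object before repairing):

#ceph pg ls-by-pool <metadata-pool>
#ceph pg deep-scrub <pgid>
#rados list-inconsistent-obj <pgid>
#ceph pg repair <pgid>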

We then ran the command below again, and it did not report any damage in the ceph status or the mds log.

"ceph --admin-daemon /var/run/ceph/ceph-mds.ceph-mds2.asok scrub_path / recursive repair"

We also mounted the file system and ran an `ls -R`, which likewise did not result in any damage being reported in the ceph status or the mds logs.
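
The walk itself is just a mount plus a recursive listing, along these lines (kernel client shown as one option, ceph-fuse works equally well; the monitor address and secretfile path are placeholders):

#mount -t ceph 10.0.0.1:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret
#ls -R /mnt/ceph > /dev/null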

One extra item to note: we run the mds cluster as (active + standby-replay), and when the issue first presented we noticed flapping that seems related to hitting a "beacon timeout" error along with very high CPU usage. I am only just starting to investigate this further, but thought it was worth sharing.
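
For context, the standby-replay side is the usual jewel-style per-daemon configuration, roughly as below (the section name matches our second mds; the beacon values are just the defaults we are now paying attention to, not something we have tuned):

[mds.ceph-mds2]
    mds standby replay = true
    mds standby for rank = 0

[mds]
    # defaults: the mon considers an mds laggy if no beacon arrives within the grace period
    mds beacon interval = 4
    mds beacon grace = 15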

Hopefully the above is useful to you. If you need more information I will do my best to provide it; you can also find me in #ceph (s3an2) if that is helpful.

Thanks

On Mon, Dec 12, 2016 at 12:17 PM, John Spray <jspray@xxxxxxxxxx> wrote:
On Sat, Dec 10, 2016 at 1:50 PM, Sean Redmond <sean.redmond1@xxxxxxxxx> wrote:
> Hi Goncalo,
>
> With the output from "ceph tell mds.0 damage ls" we tracked the inodes of
> two damaged directories using 'find /mnt/ceph/ -inum $inode', after
> reviewing the paths involved we confirmed a backup was available for this
> data so we ran "ceph tell mds.0 damage rm $inode" on the two inodes. We then
> marked the mds as repaired "ceph mds repaired 0".

You're going to see the damage pop back up again just as soon as you
touch the file in question.  "damage rm" doesn't fix anything, it just
removes the record of damage (i.e. it's how you tell the MDS "I fixed
this for you").

"mds repaired" is for when a rank is entirely offline due to
catastrophic damage (i.e. something too bad for the damage table to
report nicely from a live MDS) -- it will presumably have been a no-op
for you.

Can you say exactly what operations you have done and exactly what
damage is being reported?

How did you conclude that your journal was corrupt?

John

P.S. I went quiet at the end of last week because I was out of the
office, it's not that I don't care :-)
P.P.S. Any chance you guys could use your work mail addresses?  It's
not always obvious that a series of different people posting from
@gmail.com addresses are working on the same system.

> We have restarted the mds to confirm it is not hitting any asserts, we are
> now just enabling scrubs and running a "ls -R /mnt/ceph" to see if we hit
> any further problems.
>
> Thanks
>
> On Fri, Dec 9, 2016 at 11:37 PM, Chris Sarginson <csargiso@xxxxxxxxx> wrote:
>>
>> Hi Goncalo,
>>
>> In the end we ascertained that the assert was coming from reading corrupt
>> data in the mds journal.  We have followed the sections at the following
>> link (http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/) in order
>> down to (and including) MDS Table wipes (only wiping the "session" table in
>> the final step).  This resolved the problem we had with our mds asserting.
>>
>> We have also run a cephfs scrub to validate the data (ceph daemon mds.0
>> scrub_path / recursive repair), which has resulted in a "metadata damage
>> detected" health warning.  This seems to perform a read of all objects
>> involved in cephfs rados pools (anecdotal: performance of the scan against
>> the data pool was much faster to process than the metadata pool itself).
>>
>> We are now working with the output of "ceph tell mds.0 damage ls", and
>> looking at the following mailing list post as a starting point for
>> proceeding with that:
>> http://ceph-users.ceph.narkive.com/EfFTUPyP/how-to-fix-the-mds-damaged-issue
>>
>> Chris
>>
>> On Fri, 9 Dec 2016 at 19:26 Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx>
>> wrote:
>>>
>>> Hi Sean, Rob.
>>>
>>> I saw on the tracker that you were able to resolve the mds assert by
>>> manually cleaning the corrupted metadata. Since I am also hitting that issue
>>> and I suspect that I will face an mds assert of the same type sooner or
>>> later, can you please explain a bit further what operations did you do to
>>> clean the problem?
>>> Cheers
>>> Goncalo
>>> ________________________________________
>>> From: ceph-users [ceph-users-bounces@lists.ceph.com] on behalf of Rob
>>> Pickerill [r.pickerill@xxxxxxxxx]
>>> Sent: 09 December 2016 07:13
>>> To: Sean Redmond; John Spray
>>> Cc: ceph-users
>>> Subject: Re: CephFS FAILED
>>> assert(dn->get_linkage()->is_null())
>>>
>>> Hi John / All
>>>
>>> Thank you for the help so far.
>>>
>>> To add a further point to Sean's previous email, I see this log entry
>>> before the assertion failure:
>>>
>>>     -6> 2016-12-08 15:47:08.483700 7fb133dca700 12
>>> mds.0.cache.dir(1000a453344) remove_dentry [dentry
>>> #100/stray9/1000a453344/config [2,head] auth NULL (dver
>>> sion lock) v=540 inode=0 0x55e8664fede0]
>>>     -5> 2016-12-08 15:47:08.484882 7fb133dca700 -1 mds/CDir.cc: In
>>> function 'void CDir::try_remove_dentries_for_stray()' thread 7fb133dca700
>>> time 2016-12-08
>>> 15:47:08.483704
>>> mds/CDir.cc: 699: FAILED assert(dn->get_linkage()->is_null())
>>>
>>> And I can reference this with:
>>>
>>> root@ceph-mon1:~/1000a453344# rados -p ven-ceph-metadata-1 listomapkeys
>>> 1000a453344.00000000
>>> 1470734502_head
>>> config_head
>>>
>>> Would we also need to clean up this object, and if so is there a safe way we can
>>> do this?
>>>
>>> Rob
>>>
>>> On Thu, 8 Dec 2016 at 19:58 Sean Redmond
>>> <sean.redmond1@xxxxxxxxx> wrote:
>>> Hi John,
>>>
>>> Thanks for your pointers, I have extracted the omap_keys and
>>> omap_values for an object I found in the metadata pool called
>>> '600.00000000' and dropped them at the below location
>>>
>>> https://www.dropbox.com/sh/wg6irrjg7kie95p/AABk38IB4PXsn2yINpNa9Js5a?dl=0
>>>
>>> Could you explain how it is possible to identify stray directory
>>> fragments?
>>>
>>> Thanks
>>>
>>> On Thu, Dec 8, 2016 at 6:30 PM, John Spray
>>> <jspray@xxxxxxxxxx> wrote:
>>> On Thu, Dec 8, 2016 at 3:45 PM, Sean Redmond
>>> <sean.redmond1@xxxxxxxxx> wrote:
>>> > Hi,
>>> >
>>> > We had no changes going on with the ceph pools or ceph servers at the
>>> > time.
>>> >
>>> > We have however been hitting this in the last week and it may be
>>> > related:
>>> >
>>> > http://tracker.ceph.com/issues/17177
>>>
>>> Oh, okay -- so you've got corruption in your metadata pool as a result
>>> of hitting that issue, presumably.
>>>
>>> I think in the past people have managed to get past this by taking
>>> their MDSs offline and manually removing the omap entries in their
>>> stray directory fragments (i.e. using the `rados` cli on the objects
>>> starting "600.").
>>>
>>> John
>>>
>>>
>>>
>>> > Thanks
>>> >
>>> > On Thu, Dec 8, 2016 at 3:34 PM, John Spray
>>> > <jspray@xxxxxxxxxx> wrote:
>>> >>
>>> >> On Thu, Dec 8, 2016 at 3:11 PM, Sean Redmond
>>> >> <sean.redmond1@xxxxxxxxx>
>>> >> wrote:
>>> >> > Hi,
>>> >> >
>>> >> > I have a CephFS cluster that is currently unable to start the mds
>>> >> > server
>>> >> > as
>>> >> > it is hitting an assert, the extract from the mds log is below, any
>>> >> > pointers
>>> >> > are welcome:
>>> >> >
>>> >> > ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>>> >> >
>>> >> > 2016-12-08 14:50:18.577038 7f7d9faa3700  1 mds.0.47077
>>> >> > handle_mds_map
>>> >> > state
>>> >> > change up:rejoin --> up:active
>>> >> > 2016-12-08 14:50:18.577048 7f7d9faa3700  1 mds.0.47077 recovery_done
>>> >> > --
>>> >> > successful recovery!
>>> >> > 2016-12-08 14:50:18.577166 7f7d9faa3700  1 mds.0.47077 active_start
>>> >> > 2016-12-08 14:50:19.460208 7f7d9faa3700  1 mds.0.47077 cluster
>>> >> > recovered.
>>> >> > 2016-12-08 14:50:19.495685 7f7d9abfc700 -1 mds/CDir.cc: In function
>>> >> > 'void
>>> >> > CDir::try_remove_dentries_for_stray()' thread 7f7d9abfc700 time
>>> >> > 2016-12-08
>>> >> > 14:50:19
>>> >> > .494508
>>> >> > mds/CDir.cc: 699: FAILED assert(dn->get_linkage()->is_null())
>>> >> >
>>> >> >  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>>> >> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> >> > const*)+0x80) [0x55f0f789def0]
>>> >> >  2: (CDir::try_remove_dentries_for_stray()+0x1a0) [0x55f0f76666c0]
>>> >> >  3: (StrayManager::__eval_stray(CDentry*, bool)+0x8c9)
>>> >> > [0x55f0f75e7799]
>>> >> >  4: (StrayManager::eval_stray(CDentry*, bool)+0x22) [0x55f0f75e7cf2]
>>> >> >  5: (MDCache::scan_stray_dir(dirfrag_t)+0x16d) [0x55f0f753b30d]
>>> >> >  6: (MDSInternalContextBase::complete(int)+0x18b) [0x55f0f76e93db]
>>> >> >  7: (MDSRank::_advance_queues()+0x6a7) [0x55f0f749bf27]
>>> >> >  8: (MDSRank::ProgressThread::entry()+0x4a) [0x55f0f749c45a]
>>> >> >  9: (()+0x770a) [0x7f7da6bdc70a]
>>> >> >  10: (clone()+0x6d) [0x7f7da509d82d]
>>> >> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>> >> > needed to
>>> >> > interpret this.
>>> >>
>>> >> Last time someone had this issue they had tried to create a filesystem
>>> >> using pools that had another filesystem's old objects in:
>>> >> http://tracker.ceph.com/issues/16829
>>> >>
>>> >> What was going on on your system before you hit this?
>>> >>
>>> >> John
>>> >>
>>> >> > Thanks
>>> >> >
>>> >> >
>>> >
>>> >
>>>
>
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
