Re: CephFS FAILED assert(dn->get_linkage()->is_null())

Hi

I am working alongside Sean on this assertion issue. We see problems with the stray calls when the MDS starts: the last entry in the log before the assertion failure is a reference to try_remove_dentries_for_stray. I have provided a link to the ceph-collect logs, which were collected after speaking with 42on. I hope this provides a bit more insight. I also have the MDS debug and journal debug logs.
ceph-collect:
https://www.dropbox.com/s/7ntm1qqggd2y9xg/ceph-collect_20161208_160428.tar.gz?dl=0

Log snippet:
   -10> 2016-12-08 15:47:08.483684 7fb133dca700 10 mds.0.cache.strays eval_stray [dentry #100/stray9/1000a453344 [2,head] auth (dversion lock) v=84208144 inode=0x55e856640f10 | inodepin=1 dirty=1 0x55ec8b6c4610]
    -9> 2016-12-08 15:47:08.483686 7fb133dca700 10 mds.0.cache.strays  inode is [inode 1000a453344 [...2,head] ~mds0/stray9/1000a453344/ auth v84208144 dirtyparent f(v0 m2016-12-08 11:51:49.756918) n(v12 rc2016-12-08 11:51:49.759918 b-12285 -2=-3+1) (inest lock) (iversion lock) | dirtyscattered=0 lock=0 dirfrag=1 dirtyrstat=0 dirtyparent=1 dirty=1 authpin=0 0x55e856640f10]
    -8> 2016-12-08 15:47:08.483694 7fb133dca700 10 mds.0.cache.dir(1000a453344) try_remove_dentries_for_stray
    -7> 2016-12-08 15:47:08.483696 7fb133dca700 10 mds.0.cache.den(1000a453344 config)  mark_clean [dentry #100/stray9/1000a453344/config [2,head] auth NULL (dversion lock) v=540 inode=0 | dirty=1 0x55e8664fede0]
    -6> 2016-12-08 15:47:08.483700 7fb133dca700 12 mds.0.cache.dir(1000a453344) remove_dentry [dentry #100/stray9/1000a453344/config [2,head] auth NULL (dversion lock) v=540 inode=0 0x55e8664fede0]
    -5> 2016-12-08 15:47:08.484882 7fb133dca700 -1 mds/CDir.cc: In function 'void CDir::try_remove_dentries_for_stray()' thread 7fb133dca700 time 2016-12-08 15:47:08.483704
mds/CDir.cc: 699: FAILED assert(dn->get_linkage()->is_null())

 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55e71d788ef0]
 2: (CDir::try_remove_dentries_for_stray()+0x1a0) [0x55e71d5516c0]
 3: (StrayManager::__eval_stray(CDentry*, bool)+0x8c9) [0x55e71d4d2799]
 4: (StrayManager::eval_stray(CDentry*, bool)+0x22) [0x55e71d4d2cf2]
 5: (MDCache::scan_stray_dir(dirfrag_t)+0x16d) [0x55e71d42630d]
 6: (MDSInternalContextBase::complete(int)+0x18b) [0x55e71d5d43db]
 7: (MDSRank::_advance_queues()+0x6a7) [0x55e71d386f27]
 8: (MDSRank::ProgressThread::entry()+0x4a) [0x55e71d38745a]
 9: (()+0x770a) [0x7fb13e19670a]
 10: (clone()+0x6d) [0x7fb13c65782d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The debug logs come in at ~800MB gzipped (~11GB uncompressed). Would these be useful for review? If so, I will provide a link.
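
Since the full logs are large, a small filter can pull out only the lines leading up to the assertion, much like the snippet above. This is a minimal sketch, assuming the log is plain text (optionally gzipped) and that the failing line contains the string "FAILED assert"; the function names and the example path are hypothetical and not part of Ceph.

```python
import gzip
from collections import deque


def open_log(path):
    """Open a log file as text, transparently handling .gz files."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")


def tail_before_assert(lines, context=10, marker="FAILED assert"):
    """Return up to `context` lines preceding the first line that
    contains `marker`, followed by that line itself.

    Returns an empty list if the marker never appears.
    """
    window = deque(maxlen=context)  # keeps only the most recent lines
    for line in lines:
        if marker in line:
            return list(window) + [line]
        window.append(line)
    return []


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python extract.py /path/to/mds.log.gz
    if len(sys.argv) > 1:
        with open_log(sys.argv[1]) as fh:
            for entry in tail_before_assert(fh):
                sys.stdout.write(entry)
```

The file is read as a stream and only `context` lines are held in memory at a time, so this should cope with the ~11GB uncompressed log without loading it whole.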

Thanks

Rob

On Thu, 8 Dec 2016 at 15:45 Sean Redmond <sean.redmond1@xxxxxxxxx> wrote:
Hi,

We had no changes going on with the ceph pools or ceph servers at the time.

We have, however, been hitting this in the last week, and it may be related:


Thanks

On Thu, Dec 8, 2016 at 3:34 PM, John Spray <jspray@xxxxxxxxxx> wrote:
On Thu, Dec 8, 2016 at 3:11 PM, Sean Redmond <sean.redmond1@xxxxxxxxx> wrote:
> Hi,
>
> I have a CephFS cluster that is currently unable to start the mds server as
> it is hitting an assert, the extract from the mds log is below, any pointers
> are welcome:
>
> ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>
> 2016-12-08 14:50:18.577038 7f7d9faa3700  1 mds.0.47077 handle_mds_map state
> change up:rejoin --> up:active
> 2016-12-08 14:50:18.577048 7f7d9faa3700  1 mds.0.47077 recovery_done --
> successful recovery!
> 2016-12-08 14:50:18.577166 7f7d9faa3700  1 mds.0.47077 active_start
> 2016-12-08 14:50:19.460208 7f7d9faa3700  1 mds.0.47077 cluster recovered.
> 2016-12-08 14:50:19.495685 7f7d9abfc700 -1 mds/CDir.cc: In function 'void
> CDir::try_remove_dentries_for_stray()' thread 7f7d9abfc700 time 2016-12-08
> 14:50:19.494508
> mds/CDir.cc: 699: FAILED assert(dn->get_linkage()->is_null())
>
>  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x80) [0x55f0f789def0]
>  2: (CDir::try_remove_dentries_for_stray()+0x1a0) [0x55f0f76666c0]
>  3: (StrayManager::__eval_stray(CDentry*, bool)+0x8c9) [0x55f0f75e7799]
>  4: (StrayManager::eval_stray(CDentry*, bool)+0x22) [0x55f0f75e7cf2]
>  5: (MDCache::scan_stray_dir(dirfrag_t)+0x16d) [0x55f0f753b30d]
>  6: (MDSInternalContextBase::complete(int)+0x18b) [0x55f0f76e93db]
>  7: (MDSRank::_advance_queues()+0x6a7) [0x55f0f749bf27]
>  8: (MDSRank::ProgressThread::entry()+0x4a) [0x55f0f749c45a]
>  9: (()+0x770a) [0x7f7da6bdc70a]
>  10: (clone()+0x6d) [0x7f7da509d82d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.

Last time someone had this issue they had tried to create a filesystem
using pools that had another filesystem's old objects in:
http://tracker.ceph.com/issues/16829

What was going on on your system before you hit this?

John

> Thanks
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com