Hello,
We cannot start the MDS service after running some delete operations on large directories (100k+ files).
This is what the crash message looks like right after a start-up attempt:
-2> 2017-05-17 08:36:03.071272 7fcc87a61700 1 -- 10.103.213.182:6803/14366 <== osd.2 10.103.213.1:6811/3384506 1 ==== osd_op_reply(92 10007e5ca9f.00000000 [delete] v0'0 uv911507 ondisk = -2 ((2) No such file or directory)) v7 ==== 140+0+0 (1847967201 0 0) 0x55744151dc80 con 0x5574414e9d80
-1> 2017-05-17 08:36:03.071430 7fcc8765d700 1 -- 10.103.213.182:6803/14366 <== osd.21 10.103.213.5:6805/4030475 1 ==== osd_op_reply(90 10007e5cab8.00000000 [delete] v0'0 uv1270452 ondisk = -2 ((2) No such file or directory)) v7 ==== 140+0+0 (2193063204 0 0) 0x55744156a000 con 0x5574414e8700
0> 2017-05-17 08:36:03.081734 7fcc97235700 -1 mds/StrayManager.cc: In function 'void StrayManager::eval_remote_stray(CDentry*, CDentry*)' thread 7fcc97235700 time 2017-05-17 08:36:03.080128
mds/StrayManager.cc: 673: FAILED assert(stray_in->inode.nlink >= 1)
ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x557434e58adb]
2: (StrayManager::eval_remote_stray(CDentry*, CDentry*)+0x466) [0x557434bcfdf6]
3: (StrayManager::__eval_stray(CDentry*, bool)+0x4cd) [0x557434bd47ad]
4: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x557434bd509e]
5: (MDCache::scan_stray_dir(dirfrag_t)+0x14e) [0x557434b2bace]
6: (MDCache::populate_mydir()+0x807) [0x557434b994b7]
7: (MDCache::open_root()+0xdc) [0x557434b99e0c]
8: (MDSInternalContextBase::complete(int)+0x1db) [0x557434cc2acb]
9: (MDSRank::_advance_queues()+0x495) [0x557434a960c5]
10: (MDSRank::ProgressThread::entry()+0x4a) [0x557434a963ea]
11: (()+0x8182) [0x7fcca1536182]
12: (clone()+0x6d) [0x7fcc9fa8d47d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
0/ 0 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
1/ 5 kinetic
1/ 5 fuse
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph2-mds.ceph2-mds-2.log
--- end dump of recent events ---
2017-05-17 08:36:03.087895 7fcc97235700 -1 *** Caught signal (Aborted) **
in thread 7fcc97235700 thread_name:mds_rank_progr
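In case more verbose logs would help with diagnosis, this is a sketch of how we could raise MDS debug levels before the next start attempt (assuming a standard ceph.conf deployment; the daemon id `ceph2-mds-2` is ours):

```ini
# ceph.conf on the MDS host -- raise mds/messenger debug before restarting
[mds]
debug mds = 20
debug ms = 1
```

Or, if the daemon stays up long enough, the same can be injected at runtime with `ceph daemon mds.ceph2-mds-2 config set debug_mds 20`.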
I would appreciate any hints about how to approach a recovery attempt.
Thank you,
Simion Marius Rad
Sr.SysAdmin
PropertyShark.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com