On Wed, May 17, 2017 at 1:44 PM, Simion Marius Rad <simarad@xxxxxxxxx> wrote:
> Hello,
>
> We cannot start the mds service after running some delete commands on
> large folders (100k+ files).

You've posted previously about damage detected on your MDS, and about
corrupted XFS filesystems on your OSDs -- is this the same
cluster/filesystem, or a fresh one?

John

> This is what the crash message looks like right after a start-up attempt:
>
>     -2> 2017-05-17 08:36:03.071272 7fcc87a61700  1 -- 10.103.213.182:6803/14366 <== osd.2 10.103.213.1:6811/3384506 1 ==== osd_op_reply(92 10007e5ca9f.00000000 [delete] v0'0 uv911507 ondisk = -2 ((2) No such file or directory)) v7 ==== 140+0+0 (1847967201 0 0) 0x55744151dc80 con 0x5574414e9d80
>     -1> 2017-05-17 08:36:03.071430 7fcc8765d700  1 -- 10.103.213.182:6803/14366 <== osd.21 10.103.213.5:6805/4030475 1 ==== osd_op_reply(90 10007e5cab8.00000000 [delete] v0'0 uv1270452 ondisk = -2 ((2) No such file or directory)) v7 ==== 140+0+0 (2193063204 0 0) 0x55744156a000 con 0x5574414e8700
>      0> 2017-05-17 08:36:03.081734 7fcc97235700 -1 mds/StrayManager.cc: In function 'void StrayManager::eval_remote_stray(CDentry*, CDentry*)' thread 7fcc97235700 time 2017-05-17 08:36:03.080128
> mds/StrayManager.cc: 673: FAILED assert(stray_in->inode.nlink >= 1)
>
> ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x557434e58adb]
> 2: (StrayManager::eval_remote_stray(CDentry*, CDentry*)+0x466) [0x557434bcfdf6]
> 3: (StrayManager::__eval_stray(CDentry*, bool)+0x4cd) [0x557434bd47ad]
> 4: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x557434bd509e]
> 5: (MDCache::scan_stray_dir(dirfrag_t)+0x14e) [0x557434b2bace]
> 6: (MDCache::populate_mydir()+0x807) [0x557434b994b7]
> 7: (MDCache::open_root()+0xdc) [0x557434b99e0c]
> 8: (MDSInternalContextBase::complete(int)+0x1db) [0x557434cc2acb]
> 9: (MDSRank::_advance_queues()+0x495) [0x557434a960c5]
> 10: (MDSRank::ProgressThread::entry()+0x4a) [0x557434a963ea]
> 11: (()+0x8182) [0x7fcca1536182]
> 12: (clone()+0x6d) [0x7fcc9fa8d47d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    0/ 0 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_mirror
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 newstore
>    1/ 5 bluestore
>    1/ 5 bluefs
>    1/ 3 bdev
>    1/ 5 kstore
>    4/ 5 rocksdb
>    4/ 5 leveldb
>    1/ 5 kinetic
>    1/ 5 fuse
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent     10000
>   max_new         1000
>   log_file /var/log/ceph/ceph2-mds.ceph2-mds-2.log
> --- end dump of recent events ---
> 2017-05-17 08:36:03.087895 7fcc97235700 -1 *** Caught signal (Aborted) **
> in thread 7fcc97235700 thread_name:mds_rank_progr
>
> I would appreciate any hints about how to approach a recovery attempt.
>
> Thank you,
> Simion Marius Rad
> Sr. SysAdmin
> PropertyShark.com
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
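[For context: the generic CephFS disaster-recovery sequence documented for jewel-era (10.2.x) clusters looks roughly like the sketch below. Whether it is appropriate here depends on the answer to John's question about prior damage; the MDS id placeholder is an assumption, and these tools rewrite metadata, so take backups first and proceed only with guidance.]

```shell
# Hedged sketch of the documented jewel-era CephFS recovery steps.
# Run with the MDS stopped; <id> is a placeholder for your MDS daemon id.

# 1. Back up the MDS journal before modifying anything.
cephfs-journal-tool journal export backup.bin

# 2. Replay recoverable dentry updates from the journal into the metadata pool.
cephfs-journal-tool event recover_dentries summary

# 3. Truncate the journal so the MDS does not try to replay it again.
cephfs-journal-tool journal reset

# 4. Clear stale client sessions from the session table.
cephfs-table-tool all reset session

# 5. Restart the MDS and watch the log that crashed before.
systemctl start ceph-mds@<id>
```

Note that step 3 discards any journal events that step 2 could not recover, which is why the export in step 1 matters.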