On Thu, Jun 15, 2017 at 7:32 PM, Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
> Hi Yan,
>
> Many thanks for looking into this and providing a patch.
>
> I've downloaded ceph 12.0.3-1661-g3ddbfcd, applied your patch, rebuilt
> the rpms, and installed across my cluster.
>
> Unfortunately, the MDS are still crashing; any ideas welcome :)
>
> With "debug_mds = 10" the full log is 140MB; a truncated version of the
> log immediately preceding the crash follows:
>
> best,
>
> Jake
>
>     -5> 2017-06-15 12:21:14.084373 7f77fe590700 10 mds.0.journal EMetaBlob.replay added (full) [dentry #1/isilon/sc/users/spc/JessComb_AB_230115/JessB_TO_190115_F6_1/n0/JessB_TO_190115_F6_1.peaks_int [9f,head] auth NULL (dversion lock) v=3104 inode=0 state=1073741888|bottomlru 0x7f781a3f1860]
>     -4> 2017-06-15 12:21:14.084375 7f77fe590700 10 mds.0.journal EMetaBlob.replay added [inode 1000147f773 [9f,head] /isilon/sc/users/spc/JessComb_AB_230115/JessB_TO_190115_F6_1/n0/JessB_TO_190115_F6_1.peaks_int auth v3104 s=4 n(v0 b4 1=1+0) (iversion lock) cr={3554272=0-4194304@9e} 0x7f781a3f5800]
>     -3> 2017-06-15 12:21:14.084379 7f77fe590700 10 mds.0.journal EMetaBlob.replay added (full) [dentry #1/isilon/sc/users/spc/JessComb_AB_230115/JessB_TO_190115_F6_1/n0/JessB_TO_190115_F6_1.peaks_maxt [9f,head] auth NULL (dversion lock) v=3132 inode=0 state=1073741888|bottomlru 0x7f781a3f1d40]
>     -2> 2017-06-15 12:21:14.084381 7f77fe590700 10 mds.0.journal EMetaBlob.replay added [inode 1000147f775 [9f,head] /isilon/sc/users/spc/JessComb_AB_230115/JessB_TO_190115_F6_1/n0/JessB_TO_190115_F6_1.peaks_maxt auth v3132 s=4 n(v0 b4 1=1+0) (iversion lock) cr={3554272=0-4194304@9e} 0x7f781a3f5e00]
>     -1> 2017-06-15 12:21:14.084406 7f77fe590700  0 mds.0.journal EOpen.replay ino 1000147761b.9a not in metablob
>      0> 2017-06-15 12:21:14.085348 7f77fe590700 -1 /root/rpmbuild/BUILD/ceph-12.0.3-1661-g3ddbfcd/src/mds/journal.cc: In function 'virtual void EOpen::replay(MDSRank*)' thread 7f77fe590700 time 2017-06-15 12:21:14.084409
> /root/rpmbuild/BUILD/ceph-12.0.3-1661-g3ddbfcd/src/mds/journal.cc: 2207: FAILED assert(in)

The assertion should be removed by my patch. Maybe you didn't cleanly
apply the patch.

Regards
Yan, Zheng

>  ceph version 12.0.3-1661-g3ddbfcd (3ddbfcd4357ab3a3c2f17f86f88dc83172d4ce0d) luminous (dev)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f780d290500]
>  2: (EOpen::replay(MDSRank*)+0x3e5) [0x7f780d2397b5]
>  3: (MDLog::_replay_thread()+0x5f2) [0x7f780d1efd12]
>  4: (MDLog::ReplayThread::entry()+0xd) [0x7f780cf9b6ad]
>  5: (()+0x7dc5) [0x7f780adb4dc5]
>  6: (clone()+0x6d) [0x7f7809e9476d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
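The failing check is the FAILED assert(in) at journal.cc:2207 quoted above:
during journal replay, EOpen::replay looks up each inode it had journalled
as open, and here a snapshotted inode (1000147761b.9a) is missing from the
metablob, so the lookup finds nothing and the assert aborts the MDS. Zheng's
actual patch is not included in this thread, so the fragment below is only a
sketch of the general shape such a change could take (skip the missing inode
instead of asserting); the member and helper names (snap_inos, get_inode,
open_files, item_open_file) are assumptions about the surrounding journal.cc
code, not text taken from the thread.

// Sketch only -- not the actual patch discussed in this thread.
// Assumes EOpen keeps its journalled snapshotted-inode numbers in a
// container named `snap_inos` and that MDCache::get_inode() returns
// nullptr on a cache miss.
void EOpen::replay(MDSRank *mds)
{
  dout(10) << "EOpen.replay " << dendl;
  metablob.replay(mds, _segment);

  // remember which log segment each journalled-open inode belongs to
  for (const auto &vino : snap_inos) {
    CInode *in = mds->mdcache->get_inode(vino);
    if (!in) {
      // Previously this was a hard failure: assert(in). Tolerating the
      // miss lets replay continue when a snapshotted inode was journalled
      // as open but is absent from the metablob, as in the log above.
      dout(0) << "EOpen.replay ino " << vino << " not in metablob" << dendl;
      continue;
    }
    _segment->open_files.push_back(&in->item_open_file);
  }
  // (the equivalent loop over the non-snapshot inode list would be
  //  relaxed in the same way)
}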
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>   10/10 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_mirror
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    1/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 bluestore
>    1/ 5 bluefs
>    1/ 3 bdev
>    1/ 5 kstore
>    4/ 5 rocksdb
>    4/ 5 leveldb
>    4/ 5 memdb
>    1/ 5 kinetic
>    1/ 5 fuse
>    1/ 5 mgr
>    1/ 5 mgrc
>    1/ 5 dpdk
>    1/ 5 eventtrace
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent 10000
>   max_new 1000
>   log_file /var/log/ceph/ceph-mds.cephfs1.log
> --- end dump of recent events ---
>
> 2017-06-15 12:21:14.101761 7f77fe590700 -1 *** Caught signal (Aborted) **
>  in thread 7f77fe590700 thread_name:md_log_replay
>
>  ceph version 12.0.3-1661-g3ddbfcd (3ddbfcd4357ab3a3c2f17f86f88dc83172d4ce0d) luminous (dev)
>  1: (()+0x57d7ff) [0x7f780d2507ff]
>  2: (()+0xf370) [0x7f780adbc370]
>  3: (gsignal()+0x37) [0x7f7809dd21d7]
>  4: (abort()+0x148) [0x7f7809dd38c8]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x7f780d290674]
>  6: (EOpen::replay(MDSRank*)+0x3e5) [0x7f780d2397b5]
>  7: (MDLog::_replay_thread()+0x5f2) [0x7f780d1efd12]
>  8: (MDLog::ReplayThread::entry()+0xd) [0x7f780cf9b6ad]
>  9: (()+0x7dc5) [0x7f780adb4dc5]
>  10: (clone()+0x6d) [0x7f7809e9476d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- begin dump of recent events ---
>      0> 2017-06-15 12:21:14.101761 7f77fe590700 -1 *** Caught signal (Aborted) **
>  in thread 7f77fe590700 thread_name:md_log_replay
>
>  ceph version 12.0.3-1661-g3ddbfcd (3ddbfcd4357ab3a3c2f17f86f88dc83172d4ce0d) luminous (dev)
>  1: (()+0x57d7ff) [0x7f780d2507ff]
>  2: (()+0xf370) [0x7f780adbc370]
>  3: (gsignal()+0x37) [0x7f7809dd21d7]
>  4: (abort()+0x148) [0x7f7809dd38c8]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x7f780d290674]
>  6: (EOpen::replay(MDSRank*)+0x3e5) [0x7f780d2397b5]
>  7: (MDLog::_replay_thread()+0x5f2) [0x7f780d1efd12]
>  8: (MDLog::ReplayThread::entry()+0xd) [0x7f780cf9b6ad]
>  9: (()+0x7dc5) [0x7f780adb4dc5]
>  10: (clone()+0x6d) [0x7f7809e9476d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>   10/10 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_mirror
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    1/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 bluestore
>    1/ 5 bluefs
>    1/ 3 bdev
>    1/ 5 kstore
>    4/ 5 rocksdb
>    4/ 5 leveldb
>    4/ 5 memdb
>    1/ 5 kinetic
>    1/ 5 fuse
>    1/ 5 mgr
>    1/ 5 mgrc
>    1/ 5 dpdk
>    1/ 5 eventtrace
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent 10000
>   max_new 1000
>   log_file /var/log/ceph/ceph-mds.cephfs1.log
> --- end dump of recent events ---
>
>
> On 15/06/17 08:10, Yan, Zheng wrote:
>> On Wed, Jun 14, 2017 at 11:49 PM, Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
>>> Dear All,
>>>
>>> Sorry, but I need to add +1 to the mds crash reports with ceph
>>> 12.0.3-1507-g52f0deb
>>>
>>> This happened to me after updating from 12.0.2.
>>> All was fairly OK for a few hours, I/O around 500MB/s; then both MDS
>>> servers crashed, and have not worked since.
>>>
>>> The two MDS servers are active:standby; both now crash immediately
>>> after being started.
>>>
>>> This cluster has been upgraded from Kraken, through several Luminous
>>> versions, so I did a clean install of SL7.3 on one MDS server, and still
>>> have crashes on this machine.
>>>
>>> The cluster has 40 x 8TB drives (EC 4+1), with dual replicated NVMe
>>> providing a hot pool to drive the CephFS layer. df -h /cephfs is/was
>>> 200TB. All OSDs are BlueStore, and were created on Luminous.
>>>
>>> I enabled snapshots a few days ago, and keep 144 snapshots (one taken
>>> every 10 minutes, each kept for 24 hours only). About 30TB is copied
>>> into the fs each day. If snapshots caused the crash, I can regenerate
>>> the data, but they are very useful.
>>>
>>> One MDS gave this log...
>>>
>>> <http://www.mrc-lmb.cam.ac.uk/jog/ceph-mds.cephfs1.log>
>>
>> It is a snapshot-related bug. The attached patch should prevent the mds
>> from crashing.
>> Next time you restart the mds, please set debug_mds=10 and upload the log.
>>
>> Regards
>> Yan, Zheng
>>
>>>
>>> Many thanks for any suggestions, and it's great to see the experimental
>>> flag removed from bluestore!
>>>
>>> Jake
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
> --
> Dr Jake Grimmett
> Head Of Scientific Computing
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
> Phone 01223 267019
> Mobile 0776 9886539

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
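Regarding Zheng's request, quoted above, to restart the mds with
debug_mds=10: the crash happens during journal replay at startup, so the
higher log level needs to be in place before the daemon starts rather than
injected into a running process. A minimal way to do that (an example only,
assuming the usual /etc/ceph/ceph.conf on the MDS host; the daemon name
cephfs1 is taken from the log_file path above, not stated elsewhere in the
thread) is:

# example ceph.conf fragment: raise MDS debug logging before the restart
[mds]
    debug mds = 10

Then restart the daemon (for instance with "systemctl restart
ceph-mds@cephfs1" on a systemd-managed host) so the setting is active while
the journal is replayed, and the resulting /var/log/ceph/ceph-mds.cephfs1.log
will contain the debug-10 replay trace Zheng asked for.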