On Tue, Feb 2, 2016 at 12:52 PM, Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> wrote:
> Hi...
>
> Seems very similar to
>
> http://tracker.ceph.com/issues/14144
>
> Can you confirm it is the same issue?

should be the same. it's fixed by https://github.com/ceph/ceph/pull/7132

Regards
Yan, Zheng

>
> Cheers
> G.
>
>
>
> ________________________________
> From: Goncalo Borges
> Sent: 02 February 2016 15:30
> To: ceph-users@xxxxxxxx
> Cc: rcteam@xxxxxxxxxxxx
> Subject: CEPHFS: standby-replay mds crash
>
> Hi CephFS experts.
>
> 1./ We are using Ceph and CephFS 9.2.0 with an active mds and a
> standby-replay mds (standard config)
>
> # ceph -s
>     cluster <CLUSTERID>
>      health HEALTH_OK
>      monmap e1: 3 mons at
> {mon1=<MON1_IP>:6789/0,mon2=<MON2_IP>:6789/0,mon3=<MON3_IP>:6789/0}
>             election epoch 98, quorum 0,1,2 mon1,mon3,mon2
>      mdsmap e102: 1/1/1 up {0=mds2=up:active}, 1 up:standby-replay
>      osdmap e689: 64 osds: 64 up, 64 in
>             flags sortbitwise
>       pgmap v2006627: 3072 pgs, 3 pools, 106 GB data, 85605 objects
>             323 GB used, 174 TB / 174 TB avail
>                 3072 active+clean
>   client io 1191 B/s rd, 2 op/s
>
>
>
> 2./ Today, the standby-replay mds crashed but the active mds continued ok.
> The logs (following this email) show a problem creating a thread.
>
> 3./ Our ganglia monitoring shows:
>
> - a tremendous increase of load in the system
> - a tremendous peak of network connectivity for inbound traffic
> - No excessive memory usage nor excessive number of processes running.
>
> 4./ For now, we just restarted the standby-replay mds, which seems to be
> happy again.
>
>
> Have any of you hit this issue before?
>
> TIA
> Goncalo
> .
>
> # cat /var/log/ceph/ceph-mds.mds.log
>
> (... snap...)
>
> 2016-02-02 02:53:28.608130 7f047679d700 1 mds.0.0 standby_replay_restart
> (as standby)
> 2016-02-02 02:53:28.614498 7f0474799700 1 mds.0.0 replay_done (as standby)
> 2016-02-02 02:53:29.614593 7f047679d700 1 mds.0.0 standby_replay_restart
> (as standby)
> 2016-02-02 02:53:29.620953 7f0474799700 1 mds.0.0 replay_done (as standby)
> 2016-02-02 02:53:30.621036 7f047679d700 1 mds.0.0 standby_replay_restart
> (as standby)
> 2016-02-02 02:53:30.640483 7f0474799700 -1 common/Thread.cc: In function
> 'void Thread::create(size_t)' thread 7f0474799700 time 2016-02-02
> 02:53:30.626626
> common/Thread.cc: 154: FAILED assert(ret == 0)
>
> ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x85) [0x7f047ee2e105]
> 2: (Thread::create(unsigned long)+0x8a) [0x7f047ee19d1a]
> 3: (MDLog::replay(MDSInternalContextBase*)+0xe8) [0x7f047ecacd78]
> 4: (MDSRank::boot_start(MDSRank::BootStep, int)+0xe2e) [0x7f047ea88d3e]
> 5: (MDSRank::_standby_replay_restart_finish(int, unsigned long)+0x96)
> [0x7f047ea898b6]
> 6: (MDSIOContextBase::complete(int)+0xa4) [0x7f047eca3cc4]
> 7: (Finisher::finisher_thread_entry()+0x168) [0x7f047ed63fb8]
> 8: (()+0x7dc5) [0x7f047dc7adc5]
> 9: (clone()+0x6d) [0x7f047cb6521d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
> --- begin dump of recent events ---
>
> (... snap...)
>
> -10> 2016-02-02 02:53:30.621036 7f047679d700 1 mds.0.0
> standby_replay_restart (as standby)
> -9> 2016-02-02 02:53:30.621091 7f047679d700 1 --
> <BACKUP_MDS_IP>:6801/31961 --> <OSD_SERVER8_IP>:6808/23967 --
> osd_op(mds.245029.0:7272223 200.00000000 [read 0~0] 5.844f3494
> ack+read+known_if_redirected+full_force e689) v6 -- ?+0 0x7f0492e67600 con
> 0x7f04931c5b80
> -8> 2016-02-02 02:53:30.624919 7f0466c66700 1 --
> <BACKUP_MDS_IP>:6801/31961 <== osd.62 <OSD_SERVER8_IP>:6808/23967 1556070
> ==== osd_op_reply(7272223 200.00000000 [read 0~90] v0'0 uv8348 ondisk = 0)
> v6 ==== 179+0+90 (526420493 0 2009462618) 0x7f04b53fc000 con 0x7f04931c5b80
> -7> 2016-02-02 02:53:30.625029 7f0474799700 1 mds.245029.journaler(ro)
> probing for end of the log
> -6> 2016-02-02 02:53:30.625094 7f0474799700 1 --
> <BACKUP_MDS_IP>:6801/31961 --> <OSD_SERVER4_IP>:6814/11760 --
> osd_op(mds.245029.0:7272224 200.00000537 [stat] 5.a003dca
> ack+read+rwordered+known_if_redirected+full_force e689) v6 -- ?+0
> 0x7f04bd8c6680 con 0x7f04af654aa0
> -5> 2016-02-02 02:53:30.625168 7f0474799700 1 --
> <BACKUP_MDS_IP>:6801/31961 --> <OSD_SERVER1_IP>:6814/11663 --
> osd_op(mds.245029.0:7272225 200.00000538 [stat] 5.aa28907c
> ack+read+rwordered+known_if_redirected+full_force e689) v6 -- ?+0
> 0x7f04a72d1440 con 0x7f04b4b08d60
> -4> 2016-02-02 02:53:30.626365 7f00e3e1c700 1 --
> <BACKUP_MDS_IP>:6801/31961 <== osd.1 <OSD_SERVER1_IP>:6814/11663 92601 ====
> osd_op_reply(7272225 200.00000538 [stat] v0'0 uv0 ack = -2 ((2) No such file
> or directory)) v6 ==== 179+0+0 (3023689365 0 0) 0x7f04913899c0 con
> 0x7f04b4b08d60
> -3> 2016-02-02 02:53:30.626433 7f0374d69700 1 --
> <BACKUP_MDS_IP>:6801/31961 <== osd.24 <OSD_SERVER4_IP>:6814/11760 93044 ====
> osd_op_reply(7272224 200.00000537 [stat] v0'0 uv1909 ondisk = 0) v6
> ==== 179+0+16 (736135707 0 822294467) 0x7f0482b3cc00 con 0x7f04af654aa0
> -2> 2016-02-02 02:53:30.626500 7f0474799700 1 mds.245029.journaler(ro)
> _finish_reprobe new_end = 5600997154
> (header had 5600996718).
> -1> 2016-02-02 02:53:30.626525 7f0474799700 2 mds.0.0 boot_start 2:
> replaying mds log
> 0> 2016-02-02 02:53:30.640483 7f0474799700 -1 common/Thread.cc: In
> function 'void Thread::create(size_t)' thread 7f0474799700 time 2016-02-02
> 02:53:30.626626
> common/Thread.cc: 154: FAILED assert(ret == 0)
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 keyvaluestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 newstore
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent 10000
>   max_new 1000
>   log_file /var/log/ceph/ceph-mds.mds.log
> --- end dump of recent events ---
>
> ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
> 1: (()+0x4b6fa2) [0x7f047ed40fa2]
> 2: (()+0xf100) [0x7f047dc82100]
> 3: (gsignal()+0x37) [0x7f047caa45f7]
> 4: (abort()+0x148) [0x7f047caa5ce8]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f047d3a89d5]
> 6: (()+0x5e946) [0x7f047d3a6946]
> 7: (()+0x5e973) [0x7f047d3a6973]
> 8: (()+0x5eb93) [0x7f047d3a6b93]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x27a) [0x7f047ee2e2fa]
> 10: (Thread::create(unsigned long)+0x8a) [0x7f047ee19d1a]
> 11: (MDLog::replay(MDSInternalContextBase*)+0xe8) [0x7f047ecacd78]
> 12: (MDSRank::boot_start(MDSRank::BootStep, int)+0xe2e) [0x7f047ea88d3e]
> 13: (MDSRank::_standby_replay_restart_finish(int, unsigned long)+0x96)
> [0x7f047ea898b6]
> 14: (MDSIOContextBase::complete(int)+0xa4) [0x7f047eca3cc4]
> 15: (Finisher::finisher_thread_entry()+0x168) [0x7f047ed63fb8]
> 16: (()+0x7dc5) [0x7f047dc7adc5]
> 17: (clone()+0x6d) [0x7f047cb6521d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
> --- begin dump of recent events ---
> -7> 2016-02-02 02:53:30.698174 7f033eaed700 2 --
> <BACKUP_MDS_IP>:6801/31961 >> <OSD_SERVER2_IP>:6810/8474 pipe(0x7f04a30c9000
> sd=77 :38035 s=2 pgs=35114 cs=1 l=1 c=0x7f04a2dac100).reader couldn't read
> tag, (0) Success
> -6> 2016-02-02 02:53:30.698222 7f033eaed700 2 --
> <BACKUP_MDS_IP>:6801/31961 >> <OSD_SERVER2_IP>:6810/8474 pipe(0x7f04a30c9000
> sd=77 :38035 s=2 pgs=35114 cs=1 l=1 c=0x7f04a2dac100).fault (0) Success
> -5> 2016-02-02 02:53:30.698275 7f04790a3700 1 mds.245029.objecter
> ms_handle_reset on osd.12
> -4> 2016-02-02 02:53:30.698286 7f04790a3700 1 --
> <BACKUP_MDS_IP>:6801/31961 mark_down 0x7f04a2dac100 -- pipe dne
> -3> 2016-02-02 02:53:30.700816 7f0475f9c700 10 monclient:
> _send_mon_message to mon.mon3 at <MON3_IP>:6789/0
> -2> 2016-02-02 02:53:30.700829 7f0475f9c700 1 --
> <BACKUP_MDS_IP>:6801/31961 --> <MON3_IP>:6789/0 -- mdsbeacon(245029/mds
> up:standby-replay seq 606971 v99) v4 -- ?+0 0x7f04beedaa00 con
> 0x7f0482aac2c0
> -1> 2016-02-02 02:53:30.702635 7f04790a3700 1 --
> <BACKUP_MDS_IP>:6801/31961 <== mon.1 <MON3_IP>:6789/0 625316 ====
> mdsbeacon(245029/mds up:standby-replay seq 606971 v99) v4 ==== 121+0+0
> (538935308 0 0) 0x7f04beeb5200 con 0x7f0482aac2c0
> 0> 2016-02-02 02:53:30.751289 7f0474799700 -1 *** Caught signal
> (Aborted) **
> in thread 7f0474799700
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 keyvaluestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 newstore
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent 10000
>   max_new 1000
>   log_file /var/log/ceph/ceph-mds.mds.log
> --- end dump of recent events ---
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
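
The failed assert in the log sits in Thread::create, called from MDLog::replay, so the standby-replay daemon apparently could not start a new thread. One possible cause (an assumption here, not something the log itself confirms) is the daemon hitting a thread limit; threads are not counted as separate processes by most process-count graphs, so this can go unnoticed in monitoring. Below is a minimal Python sketch for checking this on a Linux host with /proc. The helper names (parse_thread_count, parse_max_processes, report) are made up for illustration and are not part of Ceph.

```python
from pathlib import Path
from typing import Optional


def parse_thread_count(status_text: str) -> int:
    """Return the 'Threads:' value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Threads:' field found")


def parse_max_processes(limits_text: str) -> Optional[int]:
    """Return the soft 'Max processes' limit from /proc/<pid>/limits.

    Returns None when the limit is 'unlimited'. Note that this limit
    (RLIMIT_NPROC) applies to all processes and threads of the owning
    user, not just to this one process.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max processes"):
            soft = line.split()[2]  # fields: Max, processes, soft, hard, units
            return None if soft == "unlimited" else int(soft)
    raise ValueError("no 'Max processes' line found")


def report(pid) -> str:
    """Summarise thread usage of one process against the relevant limits.

    Assumes a Linux /proc filesystem and permission to read the
    process's status and limits files.
    """
    proc = Path("/proc") / str(pid)
    threads = parse_thread_count((proc / "status").read_text())
    soft = parse_max_processes((proc / "limits").read_text())
    system_max = int(Path("/proc/sys/kernel/threads-max").read_text())
    soft_text = "unlimited" if soft is None else str(soft)
    return (f"pid {pid}: {threads} threads "
            f"(soft 'Max processes' limit {soft_text}, "
            f"kernel threads-max {system_max})")
```

Calling print(report(pid)) with the PID of the standby-replay ceph-mds process (for example taken from `pidof ceph-mds`) every few minutes would show whether the thread count keeps growing between restarts.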