Hi CephFS experts.
1./ We are using Ceph and CephFS 9.2.0 with an active MDS and a standby-replay MDS (standard config).

# ceph -s

2./ Today, the standby-replay MDS crashed, but the active MDS continued to run fine. The logs (following this email) show a problem creating a thread; my rough reading of what that assertion means is sketched after the log excerpt below.

3./ Our Ganglia monitoring shows:
- a tremendous increase of load on the system

4./ For now, we have simply restarted the standby-replay MDS, which seems to be happy again.

Have any of you hit this issue before?

TIA,
Goncalo

# cat /var/log/ceph/ceph-mds.mds.log
(... snap...)
2016-02-02 02:53:28.608130 7f047679d700 1 mds.0.0 standby_replay_restart (as standby)
2016-02-02 02:53:28.614498 7f0474799700 1 mds.0.0 replay_done (as standby)
2016-02-02 02:53:29.614593 7f047679d700 1 mds.0.0 standby_replay_restart (as standby)
2016-02-02 02:53:29.620953 7f0474799700 1 mds.0.0 replay_done (as standby)
2016-02-02 02:53:30.621036 7f047679d700 1 mds.0.0 standby_replay_restart (as standby)
2016-02-02 02:53:30.640483 7f0474799700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f0474799700 time 2016-02-02 02:53:30.626626
common/Thread.cc: 154: FAILED assert(ret == 0)

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f047ee2e105]
 2: (Thread::create(unsigned long)+0x8a) [0x7f047ee19d1a]
 3: (MDLog::replay(MDSInternalContextBase*)+0xe8) [0x7f047ecacd78]
 4: (MDSRank::boot_start(MDSRank::BootStep, int)+0xe2e) [0x7f047ea88d3e]
 5: (MDSRank::_standby_replay_restart_finish(int, unsigned long)+0x96) [0x7f047ea898b6]
 6: (MDSIOContextBase::complete(int)+0xa4) [0x7f047eca3cc4]
 7: (Finisher::finisher_thread_entry()+0x168) [0x7f047ed63fb8]
 8: (()+0x7dc5) [0x7f047dc7adc5]
 9: (clone()+0x6d) [0x7f047cb6521d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
(... snap...)
   -10> 2016-02-02 02:53:30.621036 7f047679d700  1 mds.0.0 standby_replay_restart (as standby)
    -9> 2016-02-02 02:53:30.621091 7f047679d700  1 -- <BACKUP_MDS_IP>:6801/31961 --> <OSD_SERVER8_IP>:6808/23967 -- osd_op(mds.245029.0:7272223 200.00000000 [read 0~0] 5.844f3494 ack+read+known_if_redirected+full_force e689) v6 -- ?+0 0x7f0492e67600 con 0x7f04931c5b80
    -8> 2016-02-02 02:53:30.624919 7f0466c66700  1 -- <BACKUP_MDS_IP>:6801/31961 <== osd.62 <OSD_SERVER8_IP>:6808/23967 1556070 ==== osd_op_reply(7272223 200.00000000 [read 0~90] v0'0 uv8348 ondisk = 0) v6 ==== 179+0+90 (526420493 0 2009462618) 0x7f04b53fc000 con 0x7f04931c5b80
    -7> 2016-02-02 02:53:30.625029 7f0474799700  1 mds.245029.journaler(ro) probing for end of the log
    -6> 2016-02-02 02:53:30.625094 7f0474799700  1 -- <BACKUP_MDS_IP>:6801/31961 --> <OSD_SERVER4_IP>:6814/11760 -- osd_op(mds.245029.0:7272224 200.00000537 [stat] 5.a003dca ack+read+rwordered+known_if_redirected+full_force e689) v6 -- ?+0 0x7f04bd8c6680 con 0x7f04af654aa0
    -5> 2016-02-02 02:53:30.625168 7f0474799700  1 -- <BACKUP_MDS_IP>:6801/31961 --> <OSD_SERVER1_IP>:6814/11663 -- osd_op(mds.245029.0:7272225 200.00000538 [stat] 5.aa28907c ack+read+rwordered+known_if_redirected+full_force e689) v6 -- ?+0 0x7f04a72d1440 con 0x7f04b4b08d60
    -4> 2016-02-02 02:53:30.626365 7f00e3e1c700  1 -- <BACKUP_MDS_IP>:6801/31961 <== osd.1 <OSD_SERVER1_IP>:6814/11663 92601 ==== osd_op_reply(7272225 200.00000538 [stat] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 ==== 179+0+0 (3023689365 0 0) 0x7f04913899c0 con 0x7f04b4b08d60
    -3> 2016-02-02 02:53:30.626433 7f0374d69700  1 -- <BACKUP_MDS_IP>:6801/31961 <== osd.24 <OSD_SERVER4_IP>:6814/11760 93044 ==== osd_op_reply(7272224 200.00000537 [stat] v0'0 uv1909 ondisk = 0) v6 ==== 179+0+16 (736135707 0 822294467) 0x7f0482b3cc00 con 0x7f04af654aa0
    -2> 2016-02-02 02:53:30.626500 7f0474799700  1 mds.245029.journaler(ro) _finish_reprobe new_end = 5600997154 (header had 5600996718).
    -1> 2016-02-02 02:53:30.626525 7f0474799700  2 mds.0.0 boot_start 2: replaying mds log
     0> 2016-02-02 02:53:30.640483 7f0474799700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f0474799700 time 2016-02-02 02:53:30.626626
common/Thread.cc: 154: FAILED assert(ret == 0)

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/ceph-mds.mds.log
--- end dump of recent events ---

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (()+0x4b6fa2) [0x7f047ed40fa2]
 2: (()+0xf100) [0x7f047dc82100]
 3: (gsignal()+0x37) [0x7f047caa45f7]
 4: (abort()+0x148) [0x7f047caa5ce8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f047d3a89d5]
 6: (()+0x5e946) [0x7f047d3a6946]
 7: (()+0x5e973) [0x7f047d3a6973]
 8: (()+0x5eb93) [0x7f047d3a6b93]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0x7f047ee2e2fa]
 10: (Thread::create(unsigned long)+0x8a) [0x7f047ee19d1a]
 11: (MDLog::replay(MDSInternalContextBase*)+0xe8) [0x7f047ecacd78]
 12: (MDSRank::boot_start(MDSRank::BootStep, int)+0xe2e) [0x7f047ea88d3e]
 13: (MDSRank::_standby_replay_restart_finish(int, unsigned long)+0x96) [0x7f047ea898b6]
 14: (MDSIOContextBase::complete(int)+0xa4) [0x7f047eca3cc4]
 15: (Finisher::finisher_thread_entry()+0x168) [0x7f047ed63fb8]
 16: (()+0x7dc5) [0x7f047dc7adc5]
 17: (clone()+0x6d) [0x7f047cb6521d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
    -7> 2016-02-02 02:53:30.698174 7f033eaed700  2 -- <BACKUP_MDS_IP>:6801/31961 >> <OSD_SERVER2_IP>:6810/8474 pipe(0x7f04a30c9000 sd=77 :38035 s=2 pgs=35114 cs=1 l=1 c=0x7f04a2dac100).reader couldn't read tag, (0) Success
    -6> 2016-02-02 02:53:30.698222 7f033eaed700  2 -- <BACKUP_MDS_IP>:6801/31961 >> <OSD_SERVER2_IP>:6810/8474 pipe(0x7f04a30c9000 sd=77 :38035 s=2 pgs=35114 cs=1 l=1 c=0x7f04a2dac100).fault (0) Success
    -5> 2016-02-02 02:53:30.698275 7f04790a3700  1 mds.245029.objecter ms_handle_reset on osd.12
    -4> 2016-02-02 02:53:30.698286 7f04790a3700  1 -- <BACKUP_MDS_IP>:6801/31961 mark_down 0x7f04a2dac100 -- pipe dne
    -3> 2016-02-02 02:53:30.700816 7f0475f9c700 10 monclient: _send_mon_message to mon.mon3 at <MON3_IP>:6789/0
    -2> 2016-02-02 02:53:30.700829 7f0475f9c700  1 -- <BACKUP_MDS_IP>:6801/31961 --> <MON3_IP>:6789/0 -- mdsbeacon(245029/mds up:standby-replay seq 606971 v99) v4 -- ?+0 0x7f04beedaa00 con 0x7f0482aac2c0
    -1> 2016-02-02 02:53:30.702635 7f04790a3700  1 -- <BACKUP_MDS_IP>:6801/31961 <== mon.1 <MON3_IP>:6789/0 625316 ==== mdsbeacon(245029/mds up:standby-replay seq 606971 v99) v4 ==== 121+0+0 (538935308 0 0) 0x7f04beeb5200 con 0x7f0482aac2c0
     0> 2016-02-02 02:53:30.751289 7f0474799700 -1 *** Caught signal (Aborted) ** in thread 7f0474799700

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/ceph-mds.mds.log
--- end dump of recent events ---
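For reference, my rough reading of the assertion (corrections welcome): "FAILED assert(ret == 0)" in common/Thread.cc means pthread_create() handed a non-zero error code back to Thread::create(), and Ceph aborts instead of returning an error. Given the load spike we saw, the obvious candidate would be EAGAIN, i.e. the MDS process hitting a thread/process limit (RLIMIT_NPROC, kernel.threads-max, kernel.pid_max) or failing to allocate a thread stack. Below is a minimal, self-contained sketch of that failure mode in plain C; it is not Ceph code, and the worker function and error handling are purely illustrative:

#include <pthread.h>
#include <stdio.h>
#include <string.h>

static void *worker(void *arg)
{
    (void)arg;              /* the thread body is irrelevant for the illustration */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    /* pthread_create() returns the error number directly (e.g. EAGAIN
     * when no more threads can be created), not via errno.            */
    int ret = pthread_create(&tid, NULL, worker, NULL);

    if (ret != 0) {
        /* At this point Ceph's Thread::create() hits
         * "FAILED assert(ret == 0)" and the daemon aborts.             */
        fprintf(stderr, "pthread_create failed: %s (%d)\n", strerror(ret), ret);
        return 1;
    }

    pthread_join(tid, NULL);
    return 0;
}

If that reading is right, checking ulimit -u for the ceph user, the kernel.threads-max and kernel.pid_max sysctls, and the MDS thread count during the next load spike should tell us whether the standby-replay MDS was simply unable to spawn another thread.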