CEPHFS: standby-replay mds crash

Hi CephFS experts.

1./ We are running Ceph and CephFS 9.2.0 with one active mds and one standby-replay mds (standard config).

# ceph -s
    cluster <CLUSTERID>
     health HEALTH_OK
     monmap e1: 3 mons at {mon1=<MON1_IP>:6789/0,mon2=<MON2_IP>:6789/0,mon3=<MON3_IP>:6789/0}
            election epoch 98, quorum 0,1,2 mon1,mon3,mon2
     mdsmap e102: 1/1/1 up {0=mds2=up:active}, 1 up:standby-replay
     osdmap e689: 64 osds: 64 up, 64 in
            flags sortbitwise
      pgmap v2006627: 3072 pgs, 3 pools, 106 GB data, 85605 objects
            323 GB used, 174 TB / 174 TB avail
                3072 active+clean
  client io 1191 B/s rd, 2 op/s


2./ Today the standby-replay mds crashed, while the active mds kept running normally. The logs (appended below) show a failure to create a thread.

3./ Our ganglia monitoring shows:
- a sharp increase in system load
- a sharp spike in inbound network traffic
- no excessive memory usage and no excessive number of running processes.

4./ For now, we just restarted the standby-replay mds, which seems to be happy again.
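
The assertion in the log below fires in Thread::create, which suggests that the underlying thread creation call returned an error. One possible cause (an assumption, nothing in the log confirms it) is that the daemon reached a thread or process limit, which process-count graphs may not reveal if they count processes rather than threads. A minimal Python sketch for comparing a running daemon's thread count with its limits on Linux follows; the function names are hypothetical, only the /proc paths are standard:

```python
def parse_thread_count(status_text):
    """Return the 'Threads:' value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError("no Threads: field found")


def parse_max_processes(limits_text):
    """Return the soft 'Max processes' limit from /proc/<pid>/limits text.

    Returns None when the limit is 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max processes"):
            soft = line.split()[2]  # fields: Max, processes, soft, hard, units
            return None if soft == "unlimited" else int(soft)
    raise ValueError("no 'Max processes' line found")


def read_int(path):
    """Read a single integer from a /proc/sys style file."""
    with open(path) as f:
        return int(f.read().split()[0])


def thread_report(pid):
    """Collect the thread count and related limits for one process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        threads = parse_thread_count(f.read())
    with open(f"/proc/{pid}/limits") as f:
        max_processes = parse_max_processes(f.read())
    return {
        "threads": threads,
        "max_processes_soft": max_processes,
        "threads_max": read_int("/proc/sys/kernel/threads-max"),
        "pid_max": read_int("/proc/sys/kernel/pid_max"),
    }
```

Calling thread_report() with the pid of the ceph-mds process on the standby host, repeated over time, would show whether the thread count keeps growing towards one of these limits. Note that the "Max processes" limit is accounted per user rather than per process, so the per-process figure is only a lower bound.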


Have any of you hit this issue before?

TIA
Goncalo

# cat /var/log/ceph/ceph-mds.mds.log

(... snip ...)

2016-02-02 02:53:28.608130 7f047679d700  1 mds.0.0 standby_replay_restart (as standby)
2016-02-02 02:53:28.614498 7f0474799700  1 mds.0.0 replay_done (as standby)
2016-02-02 02:53:29.614593 7f047679d700  1 mds.0.0 standby_replay_restart (as standby)
2016-02-02 02:53:29.620953 7f0474799700  1 mds.0.0 replay_done (as standby)
2016-02-02 02:53:30.621036 7f047679d700  1 mds.0.0 standby_replay_restart (as standby)
2016-02-02 02:53:30.640483 7f0474799700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f0474799700 time 2016-02-02 02:53:30.626626
common/Thread.cc: 154: FAILED assert(ret == 0)

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f047ee2e105]
 2: (Thread::create(unsigned long)+0x8a) [0x7f047ee19d1a]
 3: (MDLog::replay(MDSInternalContextBase*)+0xe8) [0x7f047ecacd78]
 4: (MDSRank::boot_start(MDSRank::BootStep, int)+0xe2e) [0x7f047ea88d3e]
 5: (MDSRank::_standby_replay_restart_finish(int, unsigned long)+0x96) [0x7f047ea898b6]
 6: (MDSIOContextBase::complete(int)+0xa4) [0x7f047eca3cc4]
 7: (Finisher::finisher_thread_entry()+0x168) [0x7f047ed63fb8]
 8: (()+0x7dc5) [0x7f047dc7adc5]
 9: (clone()+0x6d) [0x7f047cb6521d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---

(... snip ...)

   -10> 2016-02-02 02:53:30.621036 7f047679d700  1 mds.0.0 standby_replay_restart (as standby)
    -9> 2016-02-02 02:53:30.621091 7f047679d700  1 -- <BACKUP_MDS_IP>:6801/31961 --> <OSD_SERVER8_IP>:6808/23967 -- osd_op(mds.245029.0:7272223 200.00000000 [read 0~0] 5.844f3494 ack+read+known_if_redirected+full_force e689) v6 -- ?+0 0x7f0492e67600 con 0x7f04931c5b80
    -8> 2016-02-02 02:53:30.624919 7f0466c66700  1 -- <BACKUP_MDS_IP>:6801/31961 <== osd.62 <OSD_SERVER8_IP>:6808/23967 1556070 ==== osd_op_reply(7272223 200.00000000 [read 0~90] v0'0 uv8348 _ondisk_ = 0) v6 ==== 179+0+90 (526420493 0 2009462618) 0x7f04b53fc000 con 0x7f04931c5b80
    -7> 2016-02-02 02:53:30.625029 7f0474799700  1 mds.245029.journaler(ro) probing for end of the log
    -6> 2016-02-02 02:53:30.625094 7f0474799700  1 -- <BACKUP_MDS_IP>:6801/31961 --> <OSD_SERVER4_IP>:6814/11760 -- osd_op(mds.245029.0:7272224 200.00000537 [stat] 5.a003dca ack+read+rwordered+known_if_redirected+full_force e689) v6 -- ?+0 0x7f04bd8c6680 con 0x7f04af654aa0
    -5> 2016-02-02 02:53:30.625168 7f0474799700  1 -- <BACKUP_MDS_IP>:6801/31961 --> <OSD_SERVER1_IP>:6814/11663 -- osd_op(mds.245029.0:7272225 200.00000538 [stat] 5.aa28907c ack+read+rwordered+known_if_redirected+full_force e689) v6 -- ?+0 0x7f04a72d1440 con 0x7f04b4b08d60
    -4> 2016-02-02 02:53:30.626365 7f00e3e1c700  1 -- <BACKUP_MDS_IP>:6801/31961 <== osd.1 <OSD_SERVER1_IP>:6814/11663 92601 ==== osd_op_reply(7272225 200.00000538 [stat] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 ==== 179+0+0 (3023689365 0 0) 0x7f04913899c0 con 0x7f04b4b08d60
    -3> 2016-02-02 02:53:30.626433 7f0374d69700  1 -- <BACKUP_MDS_IP>:6801/31961 <== osd.24 <OSD_SERVER4_IP>:6814/11760 93044 ==== osd_op_reply(7272224 200.00000537 [stat] v0'0 uv1909 _ondisk_ = 0) v6 ==== 179+0+16 (736135707 0 822294467) 0x7f0482b3cc00 con 0x7f04af654aa0
    -2> 2016-02-02 02:53:30.626500 7f0474799700  1 mds.245029.journaler(ro) _finish_reprobe new_end = 5600997154 (header had 5600996718).
    -1> 2016-02-02 02:53:30.626525 7f0474799700  2 mds.0.0 boot_start 2: replaying mds log
     0> 2016-02-02 02:53:30.640483 7f0474799700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f0474799700 time 2016-02-02 02:53:30.626626
common/Thread.cc: 154: FAILED assert(ret == 0)

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mds.mds.log
--- end dump of recent events ---

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (()+0x4b6fa2) [0x7f047ed40fa2]
 2: (()+0xf100) [0x7f047dc82100]
 3: (gsignal()+0x37) [0x7f047caa45f7]
 4: (abort()+0x148) [0x7f047caa5ce8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f047d3a89d5]
 6: (()+0x5e946) [0x7f047d3a6946]
 7: (()+0x5e973) [0x7f047d3a6973]
 8: (()+0x5eb93) [0x7f047d3a6b93]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0x7f047ee2e2fa]
 10: (Thread::create(unsigned long)+0x8a) [0x7f047ee19d1a]
 11: (MDLog::replay(MDSInternalContextBase*)+0xe8) [0x7f047ecacd78]
 12: (MDSRank::boot_start(MDSRank::BootStep, int)+0xe2e) [0x7f047ea88d3e]
 13: (MDSRank::_standby_replay_restart_finish(int, unsigned long)+0x96) [0x7f047ea898b6]
 14: (MDSIOContextBase::complete(int)+0xa4) [0x7f047eca3cc4]
 15: (Finisher::finisher_thread_entry()+0x168) [0x7f047ed63fb8]
 16: (()+0x7dc5) [0x7f047dc7adc5]
 17: (clone()+0x6d) [0x7f047cb6521d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
    -7> 2016-02-02 02:53:30.698174 7f033eaed700  2 -- <BACKUP_MDS_IP>:6801/31961 >> <OSD_SERVER2_IP>:6810/8474 pipe(0x7f04a30c9000 sd=77 :38035 s=2 pgs=35114 cs=1 l=1 c=0x7f04a2dac100).reader couldn't read tag, (0) Success
    -6> 2016-02-02 02:53:30.698222 7f033eaed700  2 -- <BACKUP_MDS_IP>:6801/31961 >> <OSD_SERVER2_IP>:6810/8474 pipe(0x7f04a30c9000 sd=77 :38035 s=2 pgs=35114 cs=1 l=1 c=0x7f04a2dac100).fault (0) Success
    -5> 2016-02-02 02:53:30.698275 7f04790a3700  1 mds.245029.objecter ms_handle_reset on osd.12
    -4> 2016-02-02 02:53:30.698286 7f04790a3700  1 -- <BACKUP_MDS_IP>:6801/31961 mark_down 0x7f04a2dac100 -- pipe dne
    -3> 2016-02-02 02:53:30.700816 7f0475f9c700 10 monclient: _send_mon_message to mon.mon3 at <MON3_IP>:6789/0
    -2> 2016-02-02 02:53:30.700829 7f0475f9c700  1 -- <BACKUP_MDS_IP>:6801/31961 --> <MON3_IP>:6789/0 -- mdsbeacon(245029/mds up:standby-replay seq 606971 v99) v4 -- ?+0 0x7f04beedaa00 con 0x7f0482aac2c0
    -1> 2016-02-02 02:53:30.702635 7f04790a3700  1 -- <BACKUP_MDS_IP>:6801/31961 <== mon.1 <MON3_IP>:6789/0 625316 ==== mdsbeacon(245029/mds up:standby-replay seq 606971 v99) v4 ==== 121+0+0 (538935308 0 0) 0x7f04beeb5200 con 0x7f0482aac2c0
     0> 2016-02-02 02:53:30.751289 7f0474799700 -1 *** Caught signal (Aborted) **
 in thread 7f0474799700

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mds.mds.log
--- end dump of recent events ---
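
To make the backtraces above easier to read, the frames can be reduced to a call path. The following is a small Python sketch, assuming the frame format shown in this log (" N: (symbol+0xoffset) [0xaddress]"); the helper names are hypothetical and frames without a resolved symbol are skipped:

```python
import re

# Matches frames such as " 2: (Thread::create(unsigned long)+0x8a) [0x7f047ee19d1a]"
FRAME_RE = re.compile(r"^\s*(\d+): \((.*)\+0x[0-9a-f]+\) \[0x[0-9a-f]+\]\s*$")


def parse_frames(lines):
    """Return (frame_number, symbol) pairs for every backtrace frame found."""
    frames = []
    for line in lines:
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m.group(1)), m.group(2)))
    return frames


def call_path(frames):
    """Return function names, outermost caller first, dropping unresolved frames."""
    names = [sym.split("(")[0] for _, sym in frames if not sym.startswith("(")]
    return list(reversed(names))


sample = """\
 2: (Thread::create(unsigned long)+0x8a) [0x7f047ee19d1a]
 3: (MDLog::replay(MDSInternalContextBase*)+0xe8) [0x7f047ecacd78]
 4: (MDSRank::boot_start(MDSRank::BootStep, int)+0xe2e) [0x7f047ea88d3e]
 8: (()+0x7dc5) [0x7f047dc7adc5]
""".splitlines()

print(" -> ".join(call_path(parse_frames(sample))))
```

Applied to the first backtrace, this gives a path from Finisher::finisher_thread_entry through MDSRank::_standby_replay_restart_finish, MDSRank::boot_start and MDLog::replay down to Thread::create, i.e. the failing thread creation happens inside a standby-replay restart cycle.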




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
