Re: CEPHFS: standby-replay mds crash

On Tue, Feb 2, 2016 at 12:52 PM, Goncalo Borges
<goncalo.borges@xxxxxxxxxxxxx> wrote:
> Hi...
>
> Seems very similar to
>
> http://tracker.ceph.com/issues/14144
>
> Can you confirm it is the same issue?

It should be the same issue; it's fixed by https://github.com/ceph/ceph/pull/7132
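
For context on the crash signature: the backtrace below aborts inside
Thread::create() at common/Thread.cc:154, which asserts that thread creation
succeeded. A minimal sketch of that pattern (a simplified illustration only,
not the actual Ceph source) shows why a failing pthread_create(), e.g. EAGAIN
once the process runs out of threads or hits a resource limit, turns into
this assert:

    #include <cassert>
    #include <cstddef>
    #include <pthread.h>

    // Simplified stand-in for the pattern behind
    // "common/Thread.cc: 154: FAILED assert(ret == 0)".
    class Thread {
    public:
      void create(size_t stacksize = 0) {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        if (stacksize)
          pthread_attr_setstacksize(&attr, stacksize);
        int ret = pthread_create(&tid_, &attr, entry_wrapper, this);
        pthread_attr_destroy(&attr);
        assert(ret == 0);  // ret == EAGAIN (thread/resource exhaustion) aborts here
      }
    private:
      static void* entry_wrapper(void*) { return nullptr; }  // placeholder body
      pthread_t tid_{};
    };

The repeated standby_replay_restart messages right before the crash suggest
replay was being restarted over and over; if each restart spawns a new replay
thread that is never reclaimed, the daemon would eventually fail exactly like
this.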

Regards
Yan, Zheng

>
> Cheers
> G.
>
>
>
> ________________________________
> From: Goncalo Borges
> Sent: 02 February 2016 15:30
> To: ceph-users@xxxxxxxx
> Cc: rcteam@xxxxxxxxxxxx
> Subject: CEPHFS: standby-replay mds crash
>
> Hi CephFS experts.
>
> 1./ We are using Ceph and CephFS 9.2.0 with an active mds and a
> standby-replay mds (standard config; a minimal config sketch follows the
> status output below):
>
> # ceph -s
>     cluster <CLUSTERID>
>      health HEALTH_OK
>      monmap e1: 3 mons at
> {mon1=<MON1_IP>:6789/0,mon2=<MON2_IP>:6789/0,mon3=<MON3_IP>:6789/0}
>             election epoch 98, quorum 0,1,2 mon1,mon3,mon2
>      mdsmap e102: 1/1/1 up {0=mds2=up:active}, 1 up:standby-replay
>      osdmap e689: 64 osds: 64 up, 64 in
>             flags sortbitwise
>       pgmap v2006627: 3072 pgs, 3 pools, 106 GB data, 85605 objects
>             323 GB used, 174 TB / 174 TB avail
>                 3072 active+clean
>   client io 1191 B/s rd, 2 op/s
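>
> For reference, a minimal ceph.conf sketch for this kind of layout (option
> names as documented for this Ceph generation; the section name uses the
> standby daemon's id, which appears to be "mds" here judging from the log
> file name, so adjust to your own setup):
>
>     [mds.mds]
>         mds standby replay = true
>         mds standby for rank = 0
>
> With this, the named daemon follows rank 0 in standby-replay mode while the
> other mds holds the active rank.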
>
>
>
> 2./ Today, the standby-replay mds crashed, while the active mds kept
> running fine. The logs (following this email) show a failure to create a
> thread.
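>
> In case it is useful for triage: since the assert fires when thread creation
> fails, the obvious things to check on the mds host are the per-process and
> per-user thread limits. Generic commands below; we did not capture these at
> crash time, so this is only how one would look:
>
>     # threads currently in use by the mds daemon
>     ps -o nlwp= -p $(pidof ceph-mds)
>     # per-user process/thread limit and the system-wide cap
>     ulimit -u
>     cat /proc/sys/kernel/threads-max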
>
> 3./ Our ganglia monitoring shows:
>
> - a tremendous increase in system load
> - a tremendous spike in inbound network traffic
> - no excessive memory usage and no excessive number of running processes.
>
> 4./ For now, we just restarted the standby-replay mds, which seems to be
> happy again.
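>
> (For completeness: the restart was just the standard service restart on the
> standby host. On a systemd-based host that is something like the command
> below, assuming the daemon id is "mds" as the log file name suggests;
> sysvinit setups would use the ceph init script instead.)
>
>     systemctl restart ceph-mds@mds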
>
>
> Have any of you hit this issue before?
>
> TIA
> Goncalo
>
> # cat /var/log/ceph/ceph-mds.mds.log
>
> (... snap...)
>
> 2016-02-02 02:53:28.608130 7f047679d700  1 mds.0.0 standby_replay_restart
> (as standby)
> 2016-02-02 02:53:28.614498 7f0474799700  1 mds.0.0 replay_done (as standby)
> 2016-02-02 02:53:29.614593 7f047679d700  1 mds.0.0 standby_replay_restart
> (as standby)
> 2016-02-02 02:53:29.620953 7f0474799700  1 mds.0.0 replay_done (as standby)
> 2016-02-02 02:53:30.621036 7f047679d700  1 mds.0.0 standby_replay_restart
> (as standby)
> 2016-02-02 02:53:30.640483 7f0474799700 -1 common/Thread.cc: In function
> 'void Thread::create(size_t)' thread 7f0474799700 time 2016-02-02
> 02:53:30.626626
> common/Thread.cc: 154: FAILED assert(ret == 0)
>
>  ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x85) [0x7f047ee2e105]
>  2: (Thread::create(unsigned long)+0x8a) [0x7f047ee19d1a]
>  3: (MDLog::replay(MDSInternalContextBase*)+0xe8) [0x7f047ecacd78]
>  4: (MDSRank::boot_start(MDSRank::BootStep, int)+0xe2e) [0x7f047ea88d3e]
>  5: (MDSRank::_standby_replay_restart_finish(int, unsigned long)+0x96)
> [0x7f047ea898b6]
>  6: (MDSIOContextBase::complete(int)+0xa4) [0x7f047eca3cc4]
>  7: (Finisher::finisher_thread_entry()+0x168) [0x7f047ed63fb8]
>  8: (()+0x7dc5) [0x7f047dc7adc5]
>  9: (clone()+0x6d) [0x7f047cb6521d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
> --- begin dump of recent events ---
>
> (... snap...)
>
>    -10> 2016-02-02 02:53:30.621036 7f047679d700  1 mds.0.0
> standby_replay_restart (as standby)
>     -9> 2016-02-02 02:53:30.621091 7f047679d700  1 --
> <BACKUP_MDS_IP>:6801/31961 --> <OSD_SERVER8_IP>:6808/23967 --
> osd_op(mds.245029.0:7272223 200.00000000 [read 0~0] 5.844f3494
> ack+read+known_if_redirected+full_force e689) v6 -- ?+0 0x7f0492e67600 con
> 0x7f04931c5b80
>     -8> 2016-02-02 02:53:30.624919 7f0466c66700  1 --
> <BACKUP_MDS_IP>:6801/31961 <== osd.62 <OSD_SERVER8_IP>:6808/23967 1556070
> ==== osd_op_reply(7272223 200.00000000 [read 0~90] v0'0 uv8348 ondisk = 0)
> v6 ==== 179+0+90 (526420493 0 2009462618) 0x7f04b53fc000 con 0x7f04931c5b80
>     -7> 2016-02-02 02:53:30.625029 7f0474799700  1 mds.245029.journaler(ro)
> probing for end of the log
>     -6> 2016-02-02 02:53:30.625094 7f0474799700  1 --
> <BACKUP_MDS_IP>:6801/31961 --> <OSD_SERVER4_IP>:6814/11760 --
> osd_op(mds.245029.0:7272224 200.00000537 [stat] 5.a003dca
> ack+read+rwordered+known_if_redirected+full_force e689) v6 -- ?+0
> 0x7f04bd8c6680 con 0x7f04af654aa0
>     -5> 2016-02-02 02:53:30.625168 7f0474799700  1 --
> <BACKUP_MDS_IP>:6801/31961 --> <OSD_SERVER1_IP>:6814/11663 --
> osd_op(mds.245029.0:7272225 200.00000538 [stat] 5.aa28907c
> ack+read+rwordered+known_if_redirected+full_force e689) v6 -- ?+0
> 0x7f04a72d1440 con 0x7f04b4b08d60
>     -4> 2016-02-02 02:53:30.626365 7f00e3e1c700  1 --
> <BACKUP_MDS_IP>:6801/31961 <== osd.1 <OSD_SERVER1_IP>:6814/11663 92601 ====
> osd_op_reply(7272225 200.00000538 [stat] v0'0 uv0 ack = -2 ((2) No such file
> or directory)) v6 ==== 179+0+0 (3023689365 0 0) 0x7f04913899c0 con
> 0x7f04b4b08d60
>     -3> 2016-02-02 02:53:30.626433 7f0374d69700  1 --
> <BACKUP_MDS_IP>:6801/31961 <== osd.24 <OSD_SERVER4_IP>:6814/11760 93044 ====
> osd_op_reply(7272224 200.00000537 [stat] v0'0 uv1909 ondisk = 0) v6
> ==== 179+0+16 (736135707 0 822294467) 0x7f0482b3cc00 con 0x7f04af654aa0
>     -2> 2016-02-02 02:53:30.626500 7f0474799700  1 mds.245029.journaler(ro)
> _finish_reprobe new_end = 5600997154 (header had 5600996718).
>     -1> 2016-02-02 02:53:30.626525 7f0474799700  2 mds.0.0 boot_start 2:
> replaying mds log
>      0> 2016-02-02 02:53:30.640483 7f0474799700 -1 common/Thread.cc: In
> function 'void Thread::create(size_t)' thread 7f0474799700 time 2016-02-02
> 02:53:30.626626
> common/Thread.cc: 154: FAILED assert(ret == 0)
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 keyvaluestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 newstore
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent     10000
>   max_new         1000
>   log_file /var/log/ceph/ceph-mds.mds.log
> --- end dump of recent events ---
>
>  ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>  1: (()+0x4b6fa2) [0x7f047ed40fa2]
>  2: (()+0xf100) [0x7f047dc82100]
>  3: (gsignal()+0x37) [0x7f047caa45f7]
>  4: (abort()+0x148) [0x7f047caa5ce8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f047d3a89d5]
>  6: (()+0x5e946) [0x7f047d3a6946]
>  7: (()+0x5e973) [0x7f047d3a6973]
>  8: (()+0x5eb93) [0x7f047d3a6b93]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x27a) [0x7f047ee2e2fa]
>  10: (Thread::create(unsigned long)+0x8a) [0x7f047ee19d1a]
>  11: (MDLog::replay(MDSInternalContextBase*)+0xe8) [0x7f047ecacd78]
>  12: (MDSRank::boot_start(MDSRank::BootStep, int)+0xe2e) [0x7f047ea88d3e]
>  13: (MDSRank::_standby_replay_restart_finish(int, unsigned long)+0x96)
> [0x7f047ea898b6]
>  14: (MDSIOContextBase::complete(int)+0xa4) [0x7f047eca3cc4]
>  15: (Finisher::finisher_thread_entry()+0x168) [0x7f047ed63fb8]
>  16: (()+0x7dc5) [0x7f047dc7adc5]
>  17: (clone()+0x6d) [0x7f047cb6521d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
> --- begin dump of recent events ---
>     -7> 2016-02-02 02:53:30.698174 7f033eaed700  2 --
> <BACKUP_MDS_IP>:6801/31961 >> <OSD_SERVER2_IP>:6810/8474 pipe(0x7f04a30c9000
> sd=77 :38035 s=2 pgs=35114 cs=1 l=1 c=0x7f04a2dac100).reader couldn't read
> tag, (0) Success
>     -6> 2016-02-02 02:53:30.698222 7f033eaed700  2 --
> <BACKUP_MDS_IP>:6801/31961 >> <OSD_SERVER2_IP>:6810/8474 pipe(0x7f04a30c9000
> sd=77 :38035 s=2 pgs=35114 cs=1 l=1 c=0x7f04a2dac100).fault (0) Success
>     -5> 2016-02-02 02:53:30.698275 7f04790a3700  1 mds.245029.objecter
> ms_handle_reset on osd.12
>     -4> 2016-02-02 02:53:30.698286 7f04790a3700  1 --
> <BACKUP_MDS_IP>:6801/31961 mark_down 0x7f04a2dac100 -- pipe dne
>     -3> 2016-02-02 02:53:30.700816 7f0475f9c700 10 monclient:
> _send_mon_message to mon.mon3 at <MON3_IP>:6789/0
>     -2> 2016-02-02 02:53:30.700829 7f0475f9c700  1 --
> <BACKUP_MDS_IP>:6801/31961 --> <MON3_IP>:6789/0 -- mdsbeacon(245029/mds
> up:standby-replay seq 606971 v99) v4 -- ?+0 0x7f04beedaa00 con
> 0x7f0482aac2c0
>     -1> 2016-02-02 02:53:30.702635 7f04790a3700  1 --
> <BACKUP_MDS_IP>:6801/31961 <== mon.1 <MON3_IP>:6789/0 625316 ====
> mdsbeacon(245029/mds up:standby-replay seq 606971 v99) v4 ==== 121+0+0
> (538935308 0 0) 0x7f04beeb5200 con 0x7f0482aac2c0
>      0> 2016-02-02 02:53:30.751289 7f0474799700 -1 *** Caught signal
> (Aborted) **
>  in thread 7f0474799700
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 keyvaluestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 newstore
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent     10000
>   max_new         1000
>   log_file /var/log/ceph/ceph-mds.mds.log
> --- end dump of recent events ---
>
>
>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


