When CephFS is mounted on a client and when client decides to go to sleep, MDS segfaults. Has anyone seen this? Below is a part of MDS log. This happened in emperor and
recent 0.77 release. I am running Debian Wheezy with testing kernels 3.13. What can I do to not crash the whole system if a client goes to sleep (and looks like disconnect may do the same)? Let me know if you need any more info.
-43> 2014-03-20 20:08:42.463357 7fee3f0cf700 1 -- 192.168.1.20:6801/17079 --> 192.168.1.20:6789/0 -- mdsbeacon(6798/MDS1.2 up:active seq 21120 v6970) v2 -- ?+0 0x1ee9f080 con 0x2e56580
-42> 2014-03-20 20:08:42.463787 7fee411d4700 1 -- 192.168.1.20:6801/17079 <== mon.0 192.168.1.20:6789/0 21764 ==== mdsbeacon(6798/MDS1.2 up:active seq 21120 v6970) v2
==== 108+0+0 (266728949 0 0) 0x1ee88dc0 con 0x2e56580
-41> 2014-03-20 20:08:43.373099 7fee3f0cf700 2 mds.0.cache check_memory_usage total 665384, rss 503156, heap 24656, malloc 463874 mmap 0, baseline 16464, buffers 0, max
1048576, 0 / 62380 inodes have caps, 0 caps, 0 caps per inode
-40> 2014-03-20 20:08:44.494963 7fee3d7c4700 1 -- 192.168.1.20:6801/17079 >> :/0 pipe(0x3f03b80 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e2160).accept sd=18 192.168.1.101:52026/0
-39> 2014-03-20 20:08:44.495033 7fee3d7c4700 0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3f03b80 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e2160).accept
peer addr is really 192.168.1.101:0/2113152127 (socket is 192.168.1.101:52026/0)
-38> 2014-03-20 20:08:44.495565 7fee3d7c4700 0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3f03b80 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e2160).accept
we reset (peer sent cseq 2), sending RESETSESSION
-37> 2014-03-20 20:08:44.496015 7fee3d7c4700 2 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3f03b80 sd=18 :6801 s=4 pgs=0 cs=0 l=0 c=0x1f0e2160).fault
0: Success
-36> 2014-03-20 20:08:44.496099 7fee411d4700 5 mds.0.35 ms_handle_reset on 192.168.1.101:0/2113152127
-35> 2014-03-20 20:08:44.496120 7fee411d4700 3 mds.0.35 ms_handle_reset closing connection for session client.6019 192.168.1.101:0/2113152127
-34> 2014-03-20 20:08:44.496207 7fee411d4700 1 -- 192.168.1.20:6801/17079 mark_down 0x1f0e2160 -- pipe dne
-33> 2014-03-20 20:08:44.653628 7fee3d7c4700 1 -- 192.168.1.20:6801/17079 >> :/0 pipe(0x3d8e000 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e22c0).accept sd=18 192.168.1.101:52027/0
-32> 2014-03-20 20:08:44.653677 7fee3d7c4700 0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3d8e000 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e22c0).accept
peer addr is really 192.168.1.101:0/2113152127 (socket is 192.168.1.101:52027/0)
-31> 2014-03-20 20:08:44.925618 7fee411d4700 1 -- 192.168.1.20:6801/17079 <== client.6019 192.168.1.101:0/2113152127 1 ==== client_reconnect(77349 caps) v2 ==== 0+0+11032578
(0 0 3293767716) 0x2e92780 con 0x1f0e22c0
-30> 2014-03-20 20:08:44.925682 7fee411d4700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
-29> 2014-03-20 20:08:44.925735 7fee411d4700 0 log [INF] : denied reconnect attempt (mds is up:active) from client.6019 192.168.1.101:0/2113152127 after 2014-03-20 20:08:44.925679
(allowed interval 45)
-28> 2014-03-20 20:08:44.925748 7fee411d4700 1 -- 192.168.1.20:6801/17079 --> 192.168.1.101:0/2113152127 -- client_session(close) v1 -- ?+0 0x3ea6540 con 0x1f0e22c0
-27> 2014-03-20 20:08:44.927727 7fee3d7c4700 2 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3d8e000 sd=18 :6801 s=2 pgs=135 cs=1 l=0 c=0x1f0e22c0).reader
couldn't read tag, Success
-26> 2014-03-20 20:08:44.927797 7fee3d7c4700 2 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3d8e000 sd=18 :6801 s=2 pgs=135 cs=1 l=0 c=0x1f0e22c0).fault
0: Success
-25> 2014-03-20 20:08:44.927849 7fee3d7c4700 0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3d8e000 sd=18 :6801 s=2 pgs=135 cs=1 l=0 c=0x1f0e22c0).fault,
server, going to standby
-24> 2014-03-20 20:08:46.372279 7fee401d2700 10 monclient: tick
-23> 2014-03-20 20:08:46.372339 7fee401d2700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2014-03-20 20:08:16.372333)
-22> 2014-03-20 20:08:46.372373 7fee401d2700 10 monclient: renew subs? (now: 2014-03-20 20:08:46.372372; renew after: 2014-03-20 20:09:56.370811) -- no
-21> 2014-03-20 20:08:46.372403 7fee401d2700 10 log_queue is 1 last_log 2 sent 1 num 1 unsent 1 sending 1
-20> 2014-03-20 20:08:46.372421 7fee401d2700 10 will send 2014-03-20 20:08:44.925741 mds.0 192.168.1.20:6801/17079 2 : [INF] denied reconnect attempt (mds is up:active) from client.6019 192.168.1.101:0/2113152127
after 2014-03-20 20:08:44.925679 (allowed interval 45)
-19> 2014-03-20 20:08:46.372466 7fee401d2700 10 monclient: _send_mon_message to mon.MDS1 at 192.168.1.20:6789/0
-18> 2014-03-20 20:08:46.372483 7fee401d2700 1 -- 192.168.1.20:6801/17079 --> 192.168.1.20:6789/0 -- log(1 entries) v1 -- ?+0 0x71d8d80 con 0x2e56580
-17> 2014-03-20 20:08:46.463496 7fee3f0cf700 10 monclient: _send_mon_message to mon.MDS1 at 192.168.1.20:6789/0
-16> 2014-03-20 20:08:46.463524 7fee3f0cf700 1 -- 192.168.1.20:6801/17079 --> 192.168.1.20:6789/0 -- mdsbeacon(6798/MDS1.2 up:active seq 21121 v6970) v2 -- ?+0 0x1ee91340 con 0x2e56580
-15> 2014-03-20 20:08:46.499688 7fee411d4700 1 -- 192.168.1.20:6801/17079 <== mon.0 192.168.1.20:6789/0 21765 ==== log(last 2) v1 ==== 24+0+0 (1174756693 0 0) 0x3ea7a40 con 0x2e56580
-14> 2014-03-20 20:08:46.499751 7fee411d4700 10 handle_log_ack log(last 2) v1
-13> 2014-03-20 20:08:46.499757 7fee411d4700 10 logged 2014-03-20 20:08:44.925741 mds.0 192.168.1.20:6801/17079 2 : [INF] denied reconnect attempt (mds is up:active) from client.6019 192.168.1.101:0/2113152127
after 2014-03-20 20:08:44.925679 (allowed interval 45)
-12> 2014-03-20 20:08:46.499825 7fee411d4700 1 -- 192.168.1.20:6801/17079 <== mon.0 192.168.1.20:6789/0 21766 ==== mdsbeacon(6798/MDS1.2 up:active seq 21121 v6970) v2 ==== 108+0+0 (51773011 0
0) 0x1ee25600 con 0x2e56580
-11> 2014-03-20 20:08:47.303323 7fee3ddca700 1 -- 192.168.1.20:6801/17079 >> :/0 pipe(0x3e92280 sd=21 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e3080).accept sd=21 192.168.1.101:52029/0
-10> 2014-03-20 20:08:47.308830 7fee3ddca700 0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3e92280 sd=21 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e3080).accept peer addr is really
192.168.1.101:0/2113152127 (socket is 192.168.1.101:52029/0)
-9> 2014-03-20 20:08:47.309221 7fee3ddca700 0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3e92280 sd=21 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e3080).accept connect_seq 0 vs existing
1 state standby
-8> 2014-03-20 20:08:47.309251 7fee3ddca700 0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3e92280 sd=21 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e3080).accept peer reset, then tried
to connect to us, replacing
-7> 2014-03-20 20:08:47.309337 7fee411d4700 5 mds.0.35 ms_handle_remote_reset on 192.168.1.101:0/2113152127
-6> 2014-03-20 20:08:47.309357 7fee411d4700 3 mds.0.35 ms_handle_remote_reset closing connection for session client.6019 192.168.1.101:0/2113152127
-5> 2014-03-20 20:08:47.309385 7fee411d4700 1 -- 192.168.1.20:6801/17079 mark_down 0x1f0e22c0 -- 0x3e92280
-4> 2014-03-20 20:08:47.309478 7fee411d4700 5 mds.0.35 ms_handle_reset on 192.168.1.101:0/2113152127
-3> 2014-03-20 20:08:47.309490 7fee411d4700 3 mds.0.35 ms_handle_reset closing connection for session client.6019 192.168.1.101:0/2113152127
-2> 2014-03-20 20:08:47.309501 7fee411d4700 1 -- 192.168.1.20:6801/17079 mark_down 0x1f0e3080 -- 0x3e92280
-1> 2014-03-20 20:08:47.309485 7fee3ddca700 2 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3e92280 sd=21 :6801 s=4 pgs=136 cs=1 l=0 c=0x1f0e22c0).accept read error on newly_acked_seq
0> 2014-03-20 20:08:47.310724 7fee411d4700 -1 *** Caught signal (Segmentation fault) **
ceph version 0.77 (1bca9c5c412b3af722d5250f07fd562a23cf35ff)
1: /usr/bin/ceph-mds() [0x8baf22]
2: (()+0xf880) [0x7fee458ef880]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
log_file /var/log/ceph/ceph-mds.MDS1.2.log
--- end dump of recent events ---