ceph 9.2.0 mds cluster went down and now constantly crashes with Floating point exception

Hi,

We are running ceph 9.2.0.
Overnight, our ceph state went to 'mds mds03 is laggy'. When I checked the logs, I saw that this mds had crashed with a stack trace. I checked the other MDSs and saw the same there. When I try to start the mds again, I get the stack trace again and it won't come up:

-12> 2016-02-04 10:23:46.837131 7ff9ea570700 1 -- 10.141.16.2:6800/193767 <== osd.146 10.141.16.25:6800/7036 1 ==== osd_op_reply(207 100005ef982.00000000 [stat] v0'0 uv22184 ondisk = 0) v6 ==== 187+0+16 (1132261152 0 506978568) 0x7ffa171ae940 con 0x7ffa189cc3c0
-11> 2016-02-04 10:23:46.837317 7ff9ed6a1700 1 -- 10.141.16.2:6800/193767 <== osd.136 10.141.16.24:6800/6764 6 ==== osd_op_reply(209 1000048aaac.00000000 [delete] v0'0 uv23797 ondisk = -2 ((2) No such file or directory)) v6 ==== 187+0+0 (64699207 0 0) 0x7ffa171acb00 con 0x7ffa014fd9c0
-10> 2016-02-04 10:23:46.837406 7ff9ec994700 1 -- 10.141.16.2:6800/193767 <== osd.36 10.141.16.14:6800/5395 5 ==== osd_op_reply(175 100005f631f.00000000 [stat] v0'0 uv22466 ondisk = 0) v6 ==== 187+0+16 (103761047 0 2527067705) 0x7ffa08363700 con 0x7ffa189ca580
-9> 2016-02-04 10:23:46.837463 7ff9eba85700 1 -- 10.141.16.2:6800/193767 <== osd.47 10.141.16.15:6802/7128 2 ==== osd_op_reply(211 1000048aac8.00000000 [delete] v0'0 uv22990 ondisk = -2 ((2) No such file or directory)) v6 ==== 187+0+0 (1138385695 0 0) 0x7ffa01cd0dc0 con 0x7ffa189cadc0
-8> 2016-02-04 10:23:46.837468 7ff9eb27d700 1 -- 10.141.16.2:6800/193767 <== osd.16 10.141.16.12:6800/5739 2 ==== osd_op_reply(212 1000048aacd.00000000 [delete] v0'0 uv23991 ondisk = -2 ((2) No such file or directory)) v6 ==== 187+0+0 (1675093742 0 0) 0x7ffa171ac840 con 0x7ffa189cb760
-7> 2016-02-04 10:23:46.837477 7ff9eab76700 1 -- 10.141.16.2:6800/193767 <== osd.66 10.141.16.17:6800/6353 2 ==== osd_op_reply(210 1000048aab9.00000000 [delete] v0'0 uv24583 ondisk = -2 ((2) No such file or directory)) v6 ==== 187+0+0 (603192739 0 0) 0x7ffa19054680 con 0x7ffa189cbce0
-6> 2016-02-04 10:23:46.838140 7ff9f0bcf700 1 -- 10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 43 ==== osd_op_reply(121 200.00009d96 [write 1459360~980] v943'4092 uv4092 ondisk = 0) v6 ==== 179+0+0 (3939130488 0 0) 0x7ffa01590100 con 0x7ffa014fab00
-5> 2016-02-04 10:23:46.838342 7ff9f0bcf700 1 -- 10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 44 ==== osd_op_reply(124 200.00009d96 [write 1460340~956] v943'4093 uv4093 ondisk = 0) v6 ==== 179+0+0 (1434265886 0 0) 0x7ffa01590100 con 0x7ffa014fab00
-4> 2016-02-04 10:23:46.838531 7ff9f0bcf700 1 -- 10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 45 ==== osd_op_reply(126 200.00009d96 [write 1461296~954] v943'4094 uv4094 ondisk = 0) v6 ==== 179+0+0 (25292940 0 0) 0x7ffa01590100 con 0x7ffa014fab00
-3> 2016-02-04 10:23:46.838700 7ff9ecd98700 1 -- 10.141.16.2:6800/193767 <== osd.57 10.141.16.16:6802/7067 3 ==== osd_op_reply(199 100005ef976.00000000 [stat] v0'0 uv22557 ondisk = 0) v6 ==== 187+0+16 (354652996 0 2244692791) 0x7ffa171ade40 con 0x7ffa189ca160
-2> 2016-02-04 10:23:46.839301 7ff9ed8a3700 1 -- 10.141.16.2:6800/193767 <== osd.107 10.141.16.21:6802/7468 3 ==== osd_op_reply(115 10000625476.00000000 [stat] v0'0 uv22587 ondisk = 0) v6 ==== 187+0+16 (664308076 0 998461731) 0x7ffa08363c80 con 0x7ffa014fdb20
-1> 2016-02-04 10:23:46.839322 7ff9f0bcf700 1 -- 10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 46 ==== osd_op_reply(128 200.00009d96 [write 1462250~954] v943'4095 uv4095 ondisk = 0) v6 ==== 179+0+0 (1379768629 0 0) 0x7ffa01590100 con 0x7ffa014fab00
0> 2016-02-04 10:23:46.839379 7ff9f30d8700 -1 *** Caught signal (Floating point exception) **
  in thread 7ff9f30d8700

  ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
  1: (()+0x4b6fa2) [0x7ff9fd091fa2]
  2: (()+0xf100) [0x7ff9fbfd3100]
  3: (StrayManager::_calculate_ops_required(CInode*, bool)+0xa2) [0x7ff9fcf0adc2]
  4: (StrayManager::enqueue(CDentry*, bool)+0x169) [0x7ff9fcf10459]
  5: (StrayManager::__eval_stray(CDentry*, bool)+0xa49) [0x7ff9fcf111c9]
  6: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x7ff9fcf113ce]
  7: (MDCache::scan_stray_dir(dirfrag_t)+0x13d) [0x7ff9fce6741d]
  8: (MDSInternalContextBase::complete(int)+0x1e3) [0x7ff9fcff4993]
  9: (MDSRank::_advance_queues()+0x382) [0x7ff9fcdd4652]
  10: (MDSRank::ProgressThread::entry()+0x4a) [0x7ff9fcdd4aca]
  11: (()+0x7dc5) [0x7ff9fbfcbdc5]
  12: (clone()+0x6d) [0x7ff9faeb621d]
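
For context (this is only my guess): the "Floating point exception" is SIGFPE, which on Linux is usually raised by an integer division by zero rather than by actual floating-point math, so the crash in StrayManager::_calculate_ops_required looks like it is dividing by some layout value that happens to be zero for one of the stray inodes. A minimal sketch of how that produces exactly this signal (not Ceph code, the variable names are made up):

  #include <cstdint>
  #include <cstdio>

  int main() {
      // Hypothetical values, not taken from Ceph: a divisor such as an
      // object size or stripe count that ends up as zero.
      volatile std::uint64_t object_size = 0;
      volatile std::uint64_t file_size   = 4096;

      // Integer division by zero: the CPU traps and the kernel delivers
      // SIGFPE, which a signal handler then reports as
      // "Caught signal (Floating point exception)".
      std::uint64_t ops = file_size / object_size;

      std::printf("%llu\n", static_cast<unsigned long long>(ops));
      return 0;
  }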

Does anyone have an idea? We can't use our filesystem right now.

I've included the full log of an mds start as an attachment.

Thanks!!

K

Attachment: mds02.tar.bz2
Description: application/bzip

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
