Re: ceph 9.2.0 mds cluster went down and now constantly crashes with Floating point exception

John Spray <jspray@xxxxxxxxxx> · Fri, 5 Feb 2016 10:43:07 +0000

On Fri, Feb 5, 2016 at 9:36 AM, Kenneth Waegeman
<kenneth.waegeman@xxxxxxxx> wrote:
>
>
> On 04/02/16 16:17, Gregory Farnum wrote:
>>
>> On Thu, Feb 4, 2016 at 1:42 AM, Kenneth Waegeman
>> <kenneth.waegeman@xxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> Hi, we are running ceph 9.2.0.
>>> Overnight, our ceph state went to 'mds mds03 is laggy' . When I checked
>>> the
>>> logs, I saw this mds crashed with a stacktrace. I checked the other mdss,
>>> and I saw the same there.
>>> When I try to start the mds again, I get again a stacktrace and it won't
>>> come up:
>>>
>>>       -12> 2016-02-04 10:23:46.837131 7ff9ea570700  1 --
>>> 10.141.16.2:6800/193767 <== osd.146 10.141.16.25:6800/7036 1 ====
>>> osd_op_reply(207 100005ef982.00000000 [stat] v0'0 uv22184 ondisk = 0) v6
>>> ==== 187+0+16 (113
>>> 2261152 0 506978568) 0x7ffa171ae940 con 0x7ffa189cc3c0
>>>      -11> 2016-02-04 10:23:46.837317 7ff9ed6a1700  1 --
>>> 10.141.16.2:6800/193767 <== osd.136 10.141.16.24:6800/6764 6 ====
>>> osd_op_reply(209 1000048aaac.00000000 [delete] v0'0 uv23797 ondisk = -2
>>> ((2)
>>> No such file o
>>> r directory)) v6 ==== 187+0+0 (64699207 0 0) 0x7ffa171acb00 con
>>> 0x7ffa014fd9c0
>>>      -10> 2016-02-04 10:23:46.837406 7ff9ec994700  1 --
>>> 10.141.16.2:6800/193767 <== osd.36 10.141.16.14:6800/5395 5 ====
>>> osd_op_reply(175 100005f631f.00000000 [stat] v0'0 uv22466 ondisk = 0) v6
>>> ==== 187+0+16 (1037
>>> 61047 0 2527067705) 0x7ffa08363700 con 0x7ffa189ca580
>>>       -9> 2016-02-04 10:23:46.837463 7ff9eba85700  1 --
>>> 10.141.16.2:6800/193767 <== osd.47 10.141.16.15:6802/7128 2 ====
>>> osd_op_reply(211 1000048aac8.00000000 [delete] v0'0 uv22990 ondisk = -2
>>> ((2)
>>> No such file or
>>>    directory)) v6 ==== 187+0+0 (1138385695 0 0) 0x7ffa01cd0dc0 con
>>> 0x7ffa189cadc0
>>>       -8> 2016-02-04 10:23:46.837468 7ff9eb27d700  1 --
>>> 10.141.16.2:6800/193767 <== osd.16 10.141.16.12:6800/5739 2 ====
>>> osd_op_reply(212 1000048aacd.00000000 [delete] v0'0 uv23991 ondisk = -2
>>> ((2)
>>> No such file or
>>>    directory)) v6 ==== 187+0+0 (1675093742 0 0) 0x7ffa171ac840 con
>>> 0x7ffa189cb760
>>>       -7> 2016-02-04 10:23:46.837477 7ff9eab76700  1 --
>>> 10.141.16.2:6800/193767 <== osd.66 10.141.16.17:6800/6353 2 ====
>>> osd_op_reply(210 1000048aab9.00000000 [delete] v0'0 uv24583 ondisk = -2
>>> ((2)
>>> No such file or
>>>    directory)) v6 ==== 187+0+0 (603192739 0 0) 0x7ffa19054680 con
>>> 0x7ffa189cbce0
>>>       -6> 2016-02-04 10:23:46.838140 7ff9f0bcf700  1 --
>>> 10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 43 ====
>>> osd_op_reply(121 200.00009d96 [write 1459360~980] v943'4092 uv4092 ondisk
>>> =
>>> 0) v6 ==== 179+0+0 (3939130488 0 0) 0x7ffa01590100 con 0x7ffa014fab00
>>>       -5> 2016-02-04 10:23:46.838342 7ff9f0bcf700  1 --
>>> 10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 44 ====
>>> osd_op_reply(124 200.00009d96 [write 1460340~956] v943'4093 uv4093 ondisk
>>> =
>>> 0) v6 ==== 179+0+0 (1434265886 0 0) 0x7ffa01590100 con 0x7ffa014fab00
>>>       -4> 2016-02-04 10:23:46.838531 7ff9f0bcf700  1 --
>>> 10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 45 ====
>>> osd_op_reply(126 200.00009d96 [write 1461296~954] v943'4094 uv4094 ondisk
>>> =
>>> 0) v6 ==== 179+0+0 (25292940 0 0) 0x7ffa01590100 con 0x7ffa014fab00
>>>       -3> 2016-02-04 10:23:46.838700 7ff9ecd98700  1 --
>>> 10.141.16.2:6800/193767 <== osd.57 10.141.16.16:6802/7067 3 ====
>>> osd_op_reply(199 100005ef976.00000000 [stat] v0'0 uv22557 ondisk = 0) v6
>>> ==== 187+0+16 (354652996 0 2244692791) 0x7ffa171ade40 con 0x7ffa189ca160
>>>       -2> 2016-02-04 10:23:46.839301 7ff9ed8a3700  1 --
>>> 10.141.16.2:6800/193767 <== osd.107 10.141.16.21:6802/7468 3 ====
>>> osd_op_reply(115 10000625476.00000000 [stat] v0'0 uv22587 ondisk = 0) v6
>>> ==== 187+0+16 (664308076 0 998461731) 0x7ffa08363c80 con 0x7ffa014fdb20
>>>       -1> 2016-02-04 10:23:46.839322 7ff9f0bcf700  1 --
>>> 10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 46 ====
>>> osd_op_reply(128 200.00009d96 [write 1462250~954] v943'4095 uv4095 ondisk
>>> =
>>> 0) v6 ==== 179+0+0 (1379768629 0 0) 0x7ffa01590100 con 0x7ffa014fab00
>>>        0> 2016-02-04 10:23:46.839379 7ff9f30d8700 -1 *** Caught signal
>>> (Floating point exception) **
>>>    in thread 7ff9f30d8700
>>>
>>>    ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>>>    1: (()+0x4b6fa2) [0x7ff9fd091fa2]
>>>    2: (()+0xf100) [0x7ff9fbfd3100]
>>>    3: (StrayManager::_calculate_ops_required(CInode*, bool)+0xa2)
>>> [0x7ff9fcf0adc2]
>>>    4: (StrayManager::enqueue(CDentry*, bool)+0x169) [0x7ff9fcf10459]
>>>    5: (StrayManager::__eval_stray(CDentry*, bool)+0xa49) [0x7ff9fcf111c9]
>>>    6: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x7ff9fcf113ce]
>>>    7: (MDCache::scan_stray_dir(dirfrag_t)+0x13d) [0x7ff9fce6741d]
>>>    8: (MDSInternalContextBase::complete(int)+0x1e3) [0x7ff9fcff4993]
>>>    9: (MDSRank::_advance_queues()+0x382) [0x7ff9fcdd4652]
>>>    10: (MDSRank::ProgressThread::entry()+0x4a) [0x7ff9fcdd4aca]
>>>    11: (()+0x7dc5) [0x7ff9fbfcbdc5]
>>>    12: (clone()+0x6d) [0x7ff9faeb621d]
>>>
>>> Does someone has an idea? We can't use our fs right now..
>>
>> Hey, fun! Just looking for FPE opportunities in that function, it
>> looks like someone managed to set either the object size or stripe
>> count to 0 on some of your files. Is that possible?
>
> I am the only user on the system :) What was happening, was our production
> storage on gpfs being rsynced to the cephfs. so if something happened like
> that, rsync should have triggered it, but that would be very strange in my
> opinion. Can such things be prevented somehow ?
>>
>> As for repair chances...hmm. I think you'll need to:
>> 1) identify what the broken inodes are. Hopefully you can get this by
>> setting "debug mds = 20" in your conf, starting up the MDS and seeing
>> what the last referenced inode is in your crashing thread...
>> 2) Manually remove the entry for that inode in your stray directory.
>> The stray directory should be inode 0x601, I think, object
>> 601.00000000 in the metadata pool. Use the rados omap commands to
>> remove the entries for the bad inode.
>> 3) Delete all the objects in the data pool for that file.
>> 4) You may also need to remove references to that inode from your
>> journal using the cephfs-journal-tool, if it doesn't start up cleanly
>> following the prior steps. I think the help text will give you what
>> you need; if not John can.
>
> Our sync was only running for a day, so I can restart from scratch if this
> wouldn't work.. But my main concern is how to avoid this in the future.
> Because I suppose if I restart the backup, this can happen again?

OK, in a way that's good news because it means this is hopefully
something that you can readily reproduce.

Before going any further, in case we need to inspect it later, please
could you grab a dump of your journal with "cephfs-journal-tool
journal export /tmp/somewhere.bin" -- if this proves to be hard to
reproduce we might need to go to the trouble of trying to dissect
evidence from this particular occurence.

The surprising thing is that a file is ending up with a bogus layout
to begin with: there are checks that are applied during layout-setting
operations that should ensure this situation can't arise.  Especially
if you're just rsyncing from another filesystem, nothing should even
be trying to touch the ceph-specific layout settings.

I guess if this is straightforward to reproduce, could you crank
"debug mds" up to 20, run through the whole business again, and then
try to snip out all the places in the log where the relevant inode was
modified/handled for us?  If your workload includes a very large
number of files this might produce an unmanageably large log, but it's
worth a go.

Thanks,
John

> Thanks very much!
> Kenneth
>
>> Does that sound right to you, John?
>>
>> I've created tickets to prevent those nonsensical values from getting
>> set, and to prevent the StrayManager from crashing when it encounters
>> them. http://tracker.ceph.com/issues/14641,
>> http://tracker.ceph.com/issues/14642
>> -Greg
>>
>>> I included the full log of an mds start in attachment
>>>
>>> Thanks!!
>>>
>>> K
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com