Re: ceph 9.2.0 mds cluster went down and now constantly crashes with Floating point exception

Kenneth Waegeman <kenneth.waegeman@xxxxxxxx> · Fri, 5 Feb 2016 16:48:11 +0100

Hi,

In my attempt to retry, I ran 'ceph mds newfs' because removing the fs 
was not working (because the mdss could not be stopped).
With the new fs, I could again start syncing. After 10-15min it all 
crashed again. The log now shows some other stacktrace.

     -9> 2016-02-05 15:26:29.015197 7f177de2f700 10 mds.0.locker got 
rdlock on (ipolicy sync r=1) [inode 1000000138e [...2,head] 
/backups/vulpixhome/g
ent/ auth v1328 ap=1+1 f(v0 m2016-02-05 15:11:49.275650 4=0+4) n(v53 
rc2016-02-05 15:12:41.886233 b276810712 5559=3918+1641) (isnap sync r=1) 
(inest l
ock dirty) (ipolicy sync r=1) (iversion lock) 
caps={67042=pAsLsXsFs/-@0,67045=pAsLsXs/-@0,67578=pAsLsXsFs/-@0,67587=pAsLsXsFs/-@0} 
| dirtyscattered=1
lock=2 dirfrag=1 caps=1 dirty=1 authpin=1 0x7f178c539660]
    -8> 2016-02-05 15:26:29.015219 7f177de2f700  7 mds.0.locker 
rdlock_start  on (isnap sync) on [inode 1000000138f [...2,head] 
/backups/vulpixhome/ge
nt/vsc408/ auth v1509 ap=1+1 f(v0 m2016-02-05 15:09:13.624725 10=0+10) 
n(v51 rc2016-02-05 15:12:41.538927 b168677648 3672=2522+1150)/n(v50 
rc2016-02-0
5 15:12:41.538927 b168321213 3672=2522+1150) (inest lock dirty) 
(iversion lock) caps={67045=pAsLsXsFs/-@0,67587=pAsLsXsFs/-@0} | 
dirtyscattered=1 dirf
rag=1 caps=1 dirtyrstat=1 dirty=1 authpin=1 0x7f178c539bf8]
    -7> 2016-02-05 15:26:29.015237 7f177de2f700 10 mds.0.locker got 
rdlock on (isnap sync r=1) [inode 1000000138f [...2,head] 
/backups/vulpixhome/gen
t/vsc408/ auth v1509 ap=1+1 f(v0 m2016-02-05 15:09:13.624725 10=0+10) 
n(v51 rc2016-02-05 15:12:41.538927 b168677648 3672=2522+1150)/n(v50 
rc2016-02-05
 15:12:41.538927 b168321213 3672=2522+1150) (isnap sync r=1) (inest 
lock dirty) (iversion lock) 
caps={67045=pAsLsXsFs/-@0,67587=pAsLsXsFs/-@0} | dirty
scattered=1 lock=1 dirfrag=1 caps=1 dirtyrstat=1 dirty=1 authpin=1 
0x7f178c539bf8]
    -6> 2016-02-05 15:26:29.015256 7f177de2f700  7 mds.0.locker 
rdlock_start  on (ipolicy sync) on [inode 1000000138f [...2,head] 
/backups/vulpixhome/
gent/vsc408/ auth v1509 ap=1+1 f(v0 m2016-02-05 15:09:13.624725 10=0+10) 
n(v51 rc2016-02-05 15:12:41.538927 b168677648 3672=2522+1150)/n(v50 
rc2016-02
-05 15:12:41.538927 b168321213 3672=2522+1150) (isnap sync r=1) (inest 
lock dirty) (iversion lock) 
caps={67045=pAsLsXsFs/-@0,67587=pAsLsXsFs/-@0} | di
rtyscattered=1 lock=1 dirfrag=1 caps=1 dirtyrstat=1 dirty=1 authpin=1 
0x7f178c539bf8]
    -5> 2016-02-05 15:26:29.015274 7f177de2f700 10 mds.0.locker got 
rdlock on (ipolicy sync r=1) [inode 1000000138f [...2,head] 
/backups/vulpixhome/g
ent/vsc408/ auth v1509 ap=1+1 f(v0 m2016-02-05 15:09:13.624725 10=0+10) 
n(v51 rc2016-02-05 15:12:41.538927 b168677648 3672=2522+1150)/n(v50 
rc2016-02-
05 15:12:41.538927 b168321213 3672=2522+1150) (isnap sync r=1) (inest 
lock dirty) (ipolicy sync r=1) (iversion lock) 
caps={67045=pAsLsXsFs/-@0,67587=p
AsLsXsFs/-@0} | dirtyscattered=1 lock=2 dirfrag=1 caps=1 dirtyrstat=1 
dirty=1 authpin=1 0x7f178c539bf8]
    -4> 2016-02-05 15:26:29.015301 7f177de2f700  5 -- op tracker -- 
seq: 2, time: 2016-02-05 15:26:29.015301, event: acquired locks, op: 
client_request(client.67587:2334 create #10000000d3a/.input.log.us42rN 
2016-02-05 15:12:42.328932 RETRY=7 REPLAY)
    -3> 2016-02-05 15:26:29.015321 7f177de2f700 20 mds.0.sessionmap 
mark_projected s=0x7f178c4f3b80 name=client.67587 pv=13127 -> 13128
    -2> 2016-02-05 15:26:29.015326 7f177de2f700 10 mds.0.server 
prepare_new_inode used_prealloc 1000000380c 
([1000000380d~a7,10000004a51~1f5], 668 left)
    -1> 2016-02-05 15:26:29.015330 7f177de2f700 10 mds.0.server dir 
mode 040775 new mode 0100600
     0> 2016-02-05 15:26:29.018013 7f177de2f700 -1 mds/MDCache.cc: In 
function 'void MDCache::add_inode(CInode*)' thread 7f177de2f700 time 
2016-02-05 15:26:29.015333
mds/MDCache.cc: 269: FAILED assert(inode_map.count(in->vino()) == 0)

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x85) [0x7f1787ed6105]
 2: (MDCache::add_inode(CInode*)+0x246) [0x7f1787bd14b6]
 3: (Server::prepare_new_inode(std::shared_ptr<MDRequestImpl>&, CDir*, 
inodeno_t, unsigned int, ceph_file_layout*)+0xf30) [0x7f1787b71ce0]
 4: 
(Server::handle_client_openc(std::shared_ptr<MDRequestImpl>&)+0xd83) 
[0x7f1787b75963]
 5: 
(Server::dispatch_client_request(std::shared_ptr<MDRequestImpl>&)+0xa48) 
[0x7f1787b94678]
 6: (MDCache::dispatch_request(std::shared_ptr<MDRequestImpl>&)+0x4c) 
[0x7f1787c0d2dc]
 7: (MDSInternalContextBase::complete(int)+0x1e3) [0x7f1787d4b993]
 8: (MDSRank::_advance_queues()+0x382) [0x7f1787b2b652]
 9: (MDSRank::ProgressThread::entry()+0x4a) [0x7f1787b2baca]
 10: (()+0x7dc5) [0x7f1786d22dc5]
 11: (clone()+0x6d) [0x7f1785c0d21d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

I don't know if it is related, or the 'new fs' has broken it..?
I added the full log  with debug level 20 in an issue ticket: 
http://tracker.ceph.com/issues/14672

On 05/02/16 11:43, John Spray wrote:
On Fri, Feb 5, 2016 at 9:36 AM, Kenneth Waegeman
<kenneth.waegeman@xxxxxxxx> wrote:

On 04/02/16 16:17, Gregory Farnum wrote:
On Thu, Feb 4, 2016 at 1:42 AM, Kenneth Waegeman
<kenneth.waegeman@xxxxxxxx> wrote:
Hi,

Hi, we are running ceph 9.2.0.
Overnight, our ceph state went to 'mds mds03 is laggy' . When I checked
the
logs, I saw this mds crashed with a stacktrace. I checked the other mdss,
and I saw the same there.
When I try to start the mds again, I get again a stacktrace and it won't
come up:

       -12> 2016-02-04 10:23:46.837131 7ff9ea570700  1 --
10.141.16.2:6800/193767 <== osd.146 10.141.16.25:6800/7036 1 ====
osd_op_reply(207 100005ef982.00000000 [stat] v0'0 uv22184 ondisk = 0) v6
==== 187+0+16 (113
2261152 0 506978568) 0x7ffa171ae940 con 0x7ffa189cc3c0
      -11> 2016-02-04 10:23:46.837317 7ff9ed6a1700  1 --
10.141.16.2:6800/193767 <== osd.136 10.141.16.24:6800/6764 6 ====
osd_op_reply(209 1000048aaac.00000000 [delete] v0'0 uv23797 ondisk = -2
((2)
No such file o
r directory)) v6 ==== 187+0+0 (64699207 0 0) 0x7ffa171acb00 con
0x7ffa014fd9c0
      -10> 2016-02-04 10:23:46.837406 7ff9ec994700  1 --
10.141.16.2:6800/193767 <== osd.36 10.141.16.14:6800/5395 5 ====
osd_op_reply(175 100005f631f.00000000 [stat] v0'0 uv22466 ondisk = 0) v6
==== 187+0+16 (1037
61047 0 2527067705) 0x7ffa08363700 con 0x7ffa189ca580
       -9> 2016-02-04 10:23:46.837463 7ff9eba85700  1 --
10.141.16.2:6800/193767 <== osd.47 10.141.16.15:6802/7128 2 ====
osd_op_reply(211 1000048aac8.00000000 [delete] v0'0 uv22990 ondisk = -2
((2)
No such file or
    directory)) v6 ==== 187+0+0 (1138385695 0 0) 0x7ffa01cd0dc0 con
0x7ffa189cadc0
       -8> 2016-02-04 10:23:46.837468 7ff9eb27d700  1 --
10.141.16.2:6800/193767 <== osd.16 10.141.16.12:6800/5739 2 ====
osd_op_reply(212 1000048aacd.00000000 [delete] v0'0 uv23991 ondisk = -2
((2)
No such file or
    directory)) v6 ==== 187+0+0 (1675093742 0 0) 0x7ffa171ac840 con
0x7ffa189cb760
       -7> 2016-02-04 10:23:46.837477 7ff9eab76700  1 --
10.141.16.2:6800/193767 <== osd.66 10.141.16.17:6800/6353 2 ====
osd_op_reply(210 1000048aab9.00000000 [delete] v0'0 uv24583 ondisk = -2
((2)
No such file or
    directory)) v6 ==== 187+0+0 (603192739 0 0) 0x7ffa19054680 con
0x7ffa189cbce0
       -6> 2016-02-04 10:23:46.838140 7ff9f0bcf700  1 --
10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 43 ====
osd_op_reply(121 200.00009d96 [write 1459360~980] v943'4092 uv4092 ondisk
=
0) v6 ==== 179+0+0 (3939130488 0 0) 0x7ffa01590100 con 0x7ffa014fab00
       -5> 2016-02-04 10:23:46.838342 7ff9f0bcf700  1 --
10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 44 ====
osd_op_reply(124 200.00009d96 [write 1460340~956] v943'4093 uv4093 ondisk
=
0) v6 ==== 179+0+0 (1434265886 0 0) 0x7ffa01590100 con 0x7ffa014fab00
       -4> 2016-02-04 10:23:46.838531 7ff9f0bcf700  1 --
10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 45 ====
osd_op_reply(126 200.00009d96 [write 1461296~954] v943'4094 uv4094 ondisk
=
0) v6 ==== 179+0+0 (25292940 0 0) 0x7ffa01590100 con 0x7ffa014fab00
       -3> 2016-02-04 10:23:46.838700 7ff9ecd98700  1 --
10.141.16.2:6800/193767 <== osd.57 10.141.16.16:6802/7067 3 ====
osd_op_reply(199 100005ef976.00000000 [stat] v0'0 uv22557 ondisk = 0) v6
==== 187+0+16 (354652996 0 2244692791) 0x7ffa171ade40 con 0x7ffa189ca160
       -2> 2016-02-04 10:23:46.839301 7ff9ed8a3700  1 --
10.141.16.2:6800/193767 <== osd.107 10.141.16.21:6802/7468 3 ====
osd_op_reply(115 10000625476.00000000 [stat] v0'0 uv22587 ondisk = 0) v6
==== 187+0+16 (664308076 0 998461731) 0x7ffa08363c80 con 0x7ffa014fdb20
       -1> 2016-02-04 10:23:46.839322 7ff9f0bcf700  1 --
10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 46 ====
osd_op_reply(128 200.00009d96 [write 1462250~954] v943'4095 uv4095 ondisk
=
0) v6 ==== 179+0+0 (1379768629 0 0) 0x7ffa01590100 con 0x7ffa014fab00
        0> 2016-02-04 10:23:46.839379 7ff9f30d8700 -1 *** Caught signal
(Floating point exception) **
    in thread 7ff9f30d8700

    ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
    1: (()+0x4b6fa2) [0x7ff9fd091fa2]
    2: (()+0xf100) [0x7ff9fbfd3100]
    3: (StrayManager::_calculate_ops_required(CInode*, bool)+0xa2)
[0x7ff9fcf0adc2]
    4: (StrayManager::enqueue(CDentry*, bool)+0x169) [0x7ff9fcf10459]
    5: (StrayManager::__eval_stray(CDentry*, bool)+0xa49) [0x7ff9fcf111c9]
    6: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x7ff9fcf113ce]
    7: (MDCache::scan_stray_dir(dirfrag_t)+0x13d) [0x7ff9fce6741d]
    8: (MDSInternalContextBase::complete(int)+0x1e3) [0x7ff9fcff4993]
    9: (MDSRank::_advance_queues()+0x382) [0x7ff9fcdd4652]
    10: (MDSRank::ProgressThread::entry()+0x4a) [0x7ff9fcdd4aca]
    11: (()+0x7dc5) [0x7ff9fbfcbdc5]
    12: (clone()+0x6d) [0x7ff9faeb621d]

Does someone has an idea? We can't use our fs right now..
Hey, fun! Just looking for FPE opportunities in that function, it
looks like someone managed to set either the object size or stripe
count to 0 on some of your files. Is that possible?
I am the only user on the system :) What was happening, was our production
storage on gpfs being rsynced to the cephfs. so if something happened like
that, rsync should have triggered it, but that would be very strange in my
opinion. Can such things be prevented somehow ?
As for repair chances...hmm. I think you'll need to:
1) identify what the broken inodes are. Hopefully you can get this by
setting "debug mds = 20" in your conf, starting up the MDS and seeing
what the last referenced inode is in your crashing thread...
2) Manually remove the entry for that inode in your stray directory.
The stray directory should be inode 0x601, I think, object
601.00000000 in the metadata pool. Use the rados omap commands to
remove the entries for the bad inode.
3) Delete all the objects in the data pool for that file.
4) You may also need to remove references to that inode from your
journal using the cephfs-journal-tool, if it doesn't start up cleanly
following the prior steps. I think the help text will give you what
you need; if not John can.
Our sync was only running for a day, so I can restart from scratch if this
wouldn't work.. But my main concern is how to avoid this in the future.
Because I suppose if I restart the backup, this can happen again?
OK, in a way that's good news because it means this is hopefully
something that you can readily reproduce.

Before going any further, in case we need to inspect it later, please
could you grab a dump of your journal with "cephfs-journal-tool
journal export /tmp/somewhere.bin" -- if this proves to be hard to
reproduce we might need to go to the trouble of trying to dissect
evidence from this particular occurence.

The surprising thing is that a file is ending up with a bogus layout
to begin with: there are checks that are applied during layout-setting
operations that should ensure this situation can't arise.  Especially
if you're just rsyncing from another filesystem, nothing should even
be trying to touch the ceph-specific layout settings.

I guess if this is straightforward to reproduce, could you crank
"debug mds" up to 20, run through the whole business again, and then
try to snip out all the places in the log where the relevant inode was
modified/handled for us?  If your workload includes a very large
number of files this might produce an unmanageably large log, but it's
worth a go.

Thanks,
John

Thanks very much!
Kenneth

Does that sound right to you, John?

I've created tickets to prevent those nonsensical values from getting
set, and to prevent the StrayManager from crashing when it encounters
them. http://tracker.ceph.com/issues/14641,
http://tracker.ceph.com/issues/14642
-Greg

I included the full log of an mds start in attachment

Thanks!!

K

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com