I just had a minor CephFS meltdown caused by underprovisioned RAM on the
MDS servers. This is a CephFS with two ranks; I manually failed over the
first rank, and the new MDS ran out of RAM during the rejoin phase
(ceph-mds didn't get OOM-killed, but I suspect swapping slowed things
down enough that something timed out). This happened four times, with
the rank bouncing between two MDS servers, until I brought up an MDS on
a bigger machine.
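In hindsight, the knob I should have checked before failing anything over
is mds_cache_memory_limit; as far as I understand it's not a hard cap
during rejoin, so it needs to sit well below physical RAM. Roughly what I
mean (commands from memory, and the 8 GiB value is just a placeholder,
not a recommendation):

ceph config set mds mds_cache_memory_limit 8589934592  # cap the MDS cache at ~8 GiB, well under physical RAM
ceph daemon mds.<name> cache status                    # check actual cache usage (<name> = active MDS daemon)
ceph mds fail 0                                        # then do the manual failover of rank 0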
The new MDS managed to become active, but then crashed with an assert:
2019-08-13 16:03:37.346 7fd4578b2700 1 mds.0.1164 clientreplay_done
2019-08-13 16:03:37.502 7fd45e2a7700 1 mds.mon02 Updating MDS map to version 1239 from mon.1
2019-08-13 16:03:37.502 7fd45e2a7700 1 mds.0.1164 handle_mds_map i am now mds.0.1164
2019-08-13 16:03:37.502 7fd45e2a7700 1 mds.0.1164 handle_mds_map state change up:clientreplay --> up:active
2019-08-13 16:03:37.502 7fd45e2a7700 1 mds.0.1164 active_start
2019-08-13 16:03:37.690 7fd45e2a7700 1 mds.0.1164 cluster recovered.
2019-08-13 16:03:45.130 7fd45e2a7700 1 mds.mon02 Updating MDS map to version 1240 from mon.1
2019-08-13 16:03:46.162 7fd45e2a7700 1 mds.mon02 Updating MDS map to version 1241 from mon.1
2019-08-13 16:03:50.286 7fd4578b2700 -1 /build/ceph-13.2.6/src/mds/MDCache.cc: In function 'void MDCache::remove_inode(CInode*)' thread 7fd4578b2700 time 2019-08-13 16:03:50.279463
/build/ceph-13.2.6/src/mds/MDCache.cc: 361: FAILED assert(o->get_num_ref() == 0)
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7fd46650eb5e]
2: (()+0x2c4cb7) [0x7fd46650ecb7]
3: (MDCache::remove_inode(CInode*)+0x59d) [0x55f423d6992d]
4: (StrayManager::_purge_stray_logged(CDentry*, unsigned long, LogSegment*)+0x1f2) [0x55f423dc7192]
5: (MDSIOContextBase::complete(int)+0x11d) [0x55f423ed42bd]
6: (MDSLogContextBase::complete(int)+0x40) [0x55f423ed4430]
7: (Finisher::finisher_thread_entry()+0x135) [0x7fd46650d0a5]
8: (()+0x76db) [0x7fd465dc26db]
9: (clone()+0x3f) [0x7fd464fa888f]
Thankfully the assert didn't trigger again on a subsequent attempt, and
I got the filesystem healthy again.
At that point, 3 of the 4 kernel clients actively using the filesystem
had gone into a strange state (I couldn't SSH in, and they were only
providing partial service). Here is a kernel log from one of those hosts
(the other two were similar):
https://mrcn.st/p/ezrhr1qR
After playing some service failover games and hard rebooting the three
affected client boxes, everything seems to be fine again. The remaining
FS client box had no kernel errors (other than blocked-task warnings and
cephfs messages about reconnections and such) and appears to be
unaffected.
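One thing I wish I'd done before the hard reboots is look at what the
stuck clients were actually waiting on. As far as I know the kernel
client exposes its in-flight requests through debugfs, something like:

cat /sys/kernel/debug/ceph/*/mdsc   # requests stuck waiting on the MDS
cat /sys/kernel/debug/ceph/*/osdc   # requests stuck waiting on OSDs

(assuming debugfs is mounted; I didn't capture this before rebooting, so
that's just what I'd try next time).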
I can't find these errors anywhere, so I'm guessing they're not known bugs?
--
Hector Martin (hector@xxxxxxxxxxxxxx)
Public Key: https://mrcn.st/pub