I just had a minor CephFS meltdown caused by underprovisioned RAM on the
MDS servers. This is a CephFS with two ranks; I manually failed over the
first rank, and the new MDS ran out of RAM during the rejoin phase
(ceph-mds didn't get OOM-killed, but I suspect swapping slowed things
down enough that something timed out). This happened four times, with
the rank bouncing between two MDS servers, until I brought up an MDS on
a bigger machine.
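In hindsight, the knob I should have checked before failing anything over
is mds_cache_memory_limit; as far as I understand it's not a hard cap
during rejoin, so it needs to sit well below physical RAM. Roughly what I
mean (commands from memory, and the 8 GiB value is just a placeholder,
not a recommendation):

ceph config set mds mds_cache_memory_limit 8589934592  # cap the MDS cache at ~8 GiB, well under physical RAM
ceph daemon mds.<name> cache status                    # check actual cache usage (<name> = active MDS daemon)
ceph mds fail 0                                        # then do the manual failover of rank 0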
The new MDS managed to become active, but then crashed with an assert:
2019-08-13 16:03:37.346 7fd4578b2700 1 mds.0.1164 clientreplay_done
2019-08-13 16:03:37.502 7fd45e2a7700 1 mds.mon02 Updating MDS map to version 1239 from mon.1
2019-08-13 16:03:37.502 7fd45e2a7700 1 mds.0.1164 handle_mds_map i am now mds.0.1164
2019-08-13 16:03:37.502 7fd45e2a7700 1 mds.0.1164 handle_mds_map state change up:clientreplay --> up:active
2019-08-13 16:03:37.502 7fd45e2a7700 1 mds.0.1164 active_start
2019-08-13 16:03:37.690 7fd45e2a7700 1 mds.0.1164 cluster recovered.
2019-08-13 16:03:45.130 7fd45e2a7700 1 mds.mon02 Updating MDS map to version 1240 from mon.1
2019-08-13 16:03:46.162 7fd45e2a7700 1 mds.mon02 Updating MDS map to version 1241 from mon.1
2019-08-13 16:03:50.286 7fd4578b2700 -1 /build/ceph-13.2.6/src/mds/MDCache.cc: In function 'void MDCache::remove_inode(CInode*)' thread 7fd4578b2700 time 2019-08-13 16:03:50.279463
/build/ceph-13.2.6/src/mds/MDCache.cc: 361: FAILED assert(o->get_num_ref() == 0)
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7fd46650eb5e]
2: (()+0x2c4cb7) [0x7fd46650ecb7]
3: (MDCache::remove_inode(CInode*)+0x59d) [0x55f423d6992d]
4: (StrayManager::_purge_stray_logged(CDentry*, unsigned long, LogSegment*)+0x1f2) [0x55f423dc7192]
5: (MDSIOContextBase::complete(int)+0x11d) [0x55f423ed42bd]
6: (MDSLogContextBase::complete(int)+0x40) [0x55f423ed4430]
7: (Finisher::finisher_thread_entry()+0x135) [0x7fd46650d0a5]
8: (()+0x76db) [0x7fd465dc26db]
9: (clone()+0x3f) [0x7fd464fa888f]
Thankfully the assert didn't trigger again on a subsequent attempt, and
I got the filesystem healthy again.
At that point, 3 of the 4 kernel clients actively using the filesystem
had gone into a strange state (I couldn't SSH in, and they were only
providing partial service). Here is a kernel log from one of those hosts
(the other two were similar):
https://mrcn.st/p/ezrhr1qR
After playing some service failover games and hard rebooting the three
affected client boxes, everything seems to be fine again. The remaining
FS client box had no kernel errors (other than blocked-task warnings and
cephfs messages about reconnections and such) and appears to be
unaffected.
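One thing I wish I'd done before the hard reboots is look at what the
stuck clients were actually waiting on. As far as I know the kernel
client exposes its in-flight requests through debugfs, something like:

cat /sys/kernel/debug/ceph/*/mdsc   # requests stuck waiting on the MDS
cat /sys/kernel/debug/ceph/*/osdc   # requests stuck waiting on OSDs

(assuming debugfs is mounted; I didn't capture this before rebooting, so
that's just what I'd try next time).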
I can't find these errors anywhere, so I'm guessing they're not known bugs?
--
Hector Martin (hector@xxxxxxxxxxxxxx)
Public Key: https://mrcn.st/pub