CephFS meltdown fallout: mds assert failure, kernel oopses

I just had a minor CephFS meltdown caused by underprovisioned RAM on the MDS servers. This is a CephFS with two ranks; I manually failed over the first rank and the new MDS server ran out of RAM in the rejoin phase (ceph-mds didn't get OOM-killed, but I think swapping slowed things down enough that something timed out). This happened four times, with the rank bouncing between two MDS servers, until I brought up an MDS on a bigger machine.
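
For reference, both the failover and the MDS cache ceiling are plain CLI operations; roughly the following is what's involved (the memory limit value below is just an example, not what I had configured):

  # fail rank 0 so a standby takes over, then watch it walk through
  # replay -> rejoin -> clientreplay
  ceph mds fail 0
  ceph fs status
  # cap the MDS cache so rejoin has less room to balloon
  # (8 GiB here is only an illustrative value)
  ceph config set mds mds_cache_memory_limit 8589934592

As far as I understand, mds_cache_memory_limit is a target rather than a hard cap, so the MDS can still overshoot it during rejoin, which would be consistent with the swapping I saw.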

The new MDS managed to become active, but then crashed with an assert:

2019-08-13 16:03:37.346 7fd4578b2700  1 mds.0.1164 clientreplay_done
2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.mon02 Updating MDS map to version 1239 from mon.1
2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 handle_mds_map i am now mds.0.1164
2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 handle_mds_map state change up:clientreplay --> up:active
2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 active_start
2019-08-13 16:03:37.690 7fd45e2a7700  1 mds.0.1164 cluster recovered.
2019-08-13 16:03:45.130 7fd45e2a7700  1 mds.mon02 Updating MDS map to version 1240 from mon.1
2019-08-13 16:03:46.162 7fd45e2a7700  1 mds.mon02 Updating MDS map to version 1241 from mon.1
2019-08-13 16:03:50.286 7fd4578b2700 -1 /build/ceph-13.2.6/src/mds/MDCache.cc: In function 'void MDCache::remove_inode(CInode*)' thread 7fd4578b2700 time 2019-08-13 16:03:50.279463
/build/ceph-13.2.6/src/mds/MDCache.cc: 361: FAILED assert(o->get_num_ref() == 0)

ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7fd46650eb5e]
 2: (()+0x2c4cb7) [0x7fd46650ecb7]
 3: (MDCache::remove_inode(CInode*)+0x59d) [0x55f423d6992d]
 4: (StrayManager::_purge_stray_logged(CDentry*, unsigned long, LogSegment*)+0x1f2) [0x55f423dc7192]
 5: (MDSIOContextBase::complete(int)+0x11d) [0x55f423ed42bd]
 6: (MDSLogContextBase::complete(int)+0x40) [0x55f423ed4430]
 7: (Finisher::finisher_thread_entry()+0x135) [0x7fd46650d0a5]
 8: (()+0x76db) [0x7fd465dc26db]
 9: (clone()+0x3f) [0x7fd464fa888f]

Thankfully the assert didn't recur on a subsequent attempt, and I got the filesystem happy again.

At this point, of the 4 kernel clients actively using the filesystem, 3 had gone into a strange state (couldn't SSH in, services only partially working). Here is the kernel log from one of the hosts (the other two were similar):
https://mrcn.st/p/ezrhr1qR

After playing some service failover games and hard rebooting the three affected client boxes, everything seems to be back to normal. The remaining FS client box had no kernel errors (other than blocked-task warnings and cephfs messages about reconnections and such) and appears to be fine.
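
If this happens again I'll probably try inspecting and evicting the stuck sessions from the MDS side before resorting to hard reboots; something like the following should work with the stock tooling (the client id is a placeholder, and I didn't actually verify this during the incident):

  # list sessions known to rank 0, to spot clients stuck reconnecting
  ceph tell mds.0 client ls
  # as a last resort, evict a wedged client by id
  # (note that evicted clients get blacklisted by default)
  ceph tell mds.0 client evict id=<client id>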

I can't find these errors anywhere, so I'm guessing they're not known bugs?

--
Hector Martin (hector@xxxxxxxxxxxxxx)
Public Key: https://mrcn.st/pub


