On Tue, Aug 13, 2019 at 1:06 PM Hector Martin <hector@xxxxxxxxxxxxxx> wrote:
>
> I just had a minor CephFS meltdown caused by underprovisioned RAM on the
> MDS servers. This is a CephFS with two ranks; I manually failed over the
> first rank and the new MDS server ran out of RAM in the rejoin phase
> (ceph-mds didn't get OOM-killed, but I think things slowed down enough
> due to swapping out that something timed out). This happened 4 times,
> with the rank bouncing between two MDS servers, until I brought up an
> MDS on a bigger machine.
>
> The new MDS managed to become active, but then crashed with an assert:
>
> 2019-08-13 16:03:37.346 7fd4578b2700  1 mds.0.1164 clientreplay_done
> 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> version 1239 from mon.1
> 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 handle_mds_map i am
> now mds.0.1164
> 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 handle_mds_map state
> change up:clientreplay --> up:active
> 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 active_start
> 2019-08-13 16:03:37.690 7fd45e2a7700  1 mds.0.1164 cluster recovered.
> 2019-08-13 16:03:45.130 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> version 1240 from mon.1
> 2019-08-13 16:03:46.162 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> version 1241 from mon.1
> 2019-08-13 16:03:50.286 7fd4578b2700 -1
> /build/ceph-13.2.6/src/mds/MDCache.cc: In function 'void
> MDCache::remove_inode(CInode*)' thread 7fd4578b2700 time 2019-08-13
> 16:03:50.279463
> /build/ceph-13.2.6/src/mds/MDCache.cc: 361: FAILED
> assert(o->get_num_ref() == 0)
>
>  ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x14e) [0x7fd46650eb5e]
>  2: (()+0x2c4cb7) [0x7fd46650ecb7]
>  3: (MDCache::remove_inode(CInode*)+0x59d) [0x55f423d6992d]
>  4: (StrayManager::_purge_stray_logged(CDentry*, unsigned long,
> LogSegment*)+0x1f2) [0x55f423dc7192]
>  5: (MDSIOContextBase::complete(int)+0x11d) [0x55f423ed42bd]
>  6: (MDSLogContextBase::complete(int)+0x40) [0x55f423ed4430]
>  7: (Finisher::finisher_thread_entry()+0x135) [0x7fd46650d0a5]
>  8: (()+0x76db) [0x7fd465dc26db]
>  9: (clone()+0x3f) [0x7fd464fa888f]
>
> Thankfully this didn't happen on a subsequent attempt, and I got the
> filesystem happy again.
>
> At this point, of the 4 kernel clients actively using the filesystem, 3
> had gone into a strange state (can't SSH in, partial service). Here is a
> kernel log from one of the hosts (the other two were similar):
> https://mrcn.st/p/ezrhr1qR
>
> After playing some service failover games and hard rebooting the three
> affected client boxes, everything seems to be fine. The remaining FS
> client box had no kernel errors (other than blocked task warnings and
> cephfs messages about reconnections and such) and seems to be fine.
>
> I can't find these errors anywhere, so I'm guessing they're not known bugs?

Jeff, the oops seems to be a NULL dereference in ceph_lock_message().
Please take a look.

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
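
[For context, a minimal sketch of the invariant behind the failed assert. This is not the actual Ceph source; the Cache and CachedInode types below are hypothetical stand-ins. The real MDCache::remove_inode() asserts o->get_num_ref() == 0, i.e. an inode may only be dropped from the cache once nothing pins it, so the reported crash means the stray-purge completion found the inode still referenced.]

    // Hypothetical sketch, not Ceph code: illustrates the "refcount must be
    // zero before removal" invariant that the reported assert enforces.
    #include <cassert>
    #include <map>

    struct CachedInode {
      int ref_count = 0;                     // pins held by other subsystems
      int get_num_ref() const { return ref_count; }
    };

    struct Cache {
      std::map<long, CachedInode*> inodes;   // keyed by inode number

      void remove_inode(long ino) {
        CachedInode *o = inodes.at(ino);
        // Removing an inode that something still references would leave a
        // dangling pointer, hence the hard assert rather than a silent drop.
        assert(o->get_num_ref() == 0);
        inodes.erase(ino);
        delete o;
      }
    };

[The design choice is that callers are expected to release all pins before purging; the assert turns a would-be use-after-free into an immediate, diagnosable crash like the one in the log above.]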