On Wed, 2019-08-14 at 19:29 +0200, Ilya Dryomov wrote:
> On Tue, Aug 13, 2019 at 1:06 PM Hector Martin <hector@xxxxxxxxxxxxxx> wrote:
> > I just had a minor CephFS meltdown caused by underprovisioned RAM on the
> > MDS servers. This is a CephFS with two ranks; I manually failed over the
> > first rank and the new MDS server ran out of RAM in the rejoin phase
> > (ceph-mds didn't get OOM-killed, but I think things slowed down enough
> > due to swapping out that something timed out). This happened 4 times,
> > with the rank bouncing between two MDS servers, until I brought up an
> > MDS on a bigger machine.
> >
> > The new MDS managed to become active, but then crashed with an assert:
> >
> > 2019-08-13 16:03:37.346 7fd4578b2700 1 mds.0.1164 clientreplay_done
> > 2019-08-13 16:03:37.502 7fd45e2a7700 1 mds.mon02 Updating MDS map to version 1239 from mon.1
> > 2019-08-13 16:03:37.502 7fd45e2a7700 1 mds.0.1164 handle_mds_map i am now mds.0.1164
> > 2019-08-13 16:03:37.502 7fd45e2a7700 1 mds.0.1164 handle_mds_map state change up:clientreplay --> up:active
> > 2019-08-13 16:03:37.502 7fd45e2a7700 1 mds.0.1164 active_start
> > 2019-08-13 16:03:37.690 7fd45e2a7700 1 mds.0.1164 cluster recovered.
> > 2019-08-13 16:03:45.130 7fd45e2a7700 1 mds.mon02 Updating MDS map to version 1240 from mon.1
> > 2019-08-13 16:03:46.162 7fd45e2a7700 1 mds.mon02 Updating MDS map to version 1241 from mon.1
> > 2019-08-13 16:03:50.286 7fd4578b2700 -1 /build/ceph-13.2.6/src/mds/MDCache.cc: In function 'void MDCache::remove_inode(CInode*)' thread 7fd4578b2700 time 2019-08-13 16:03:50.279463
> > /build/ceph-13.2.6/src/mds/MDCache.cc: 361: FAILED assert(o->get_num_ref() == 0)
> >
> > ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7fd46650eb5e]
> > 2: (()+0x2c4cb7) [0x7fd46650ecb7]
> > 3: (MDCache::remove_inode(CInode*)+0x59d) [0x55f423d6992d]
> > 4: (StrayManager::_purge_stray_logged(CDentry*, unsigned long, LogSegment*)+0x1f2) [0x55f423dc7192]
> > 5: (MDSIOContextBase::complete(int)+0x11d) [0x55f423ed42bd]
> > 6: (MDSLogContextBase::complete(int)+0x40) [0x55f423ed4430]
> > 7: (Finisher::finisher_thread_entry()+0x135) [0x7fd46650d0a5]
> > 8: (()+0x76db) [0x7fd465dc26db]
> > 9: (clone()+0x3f) [0x7fd464fa888f]
> >
> > Thankfully this didn't happen on a subsequent attempt, and I got the
> > filesystem happy again.
> >
> > At this point, of the 4 kernel clients actively using the filesystem, 3
> > had gone into a strange state (can't SSH in, partial service). Here is a
> > kernel log from one of the hosts (the other two were similar):
> > https://mrcn.st/p/ezrhr1qR
> >
> > After playing some service failover games and hard rebooting the three
> > affected client boxes everything seems to be fine. The remaining FS
> > client box had no kernel errors (other than blocked task warnings and
> > cephfs talking about reconnections and such) and seems to be fine.
> >
> > I can't find these errors anywhere, so I'm guessing they're not known bugs?
>
> Jeff, the oops seems to be a NULL dereference in ceph_lock_message().
> Please take a look.

(sorry for duplicate mail -- the other one ended up in moderation)

Thanks Ilya,

That function is pretty straightforward. We don't do a whole lot of
pointer chasing in there, so I'm a little unclear on where this would
have crashed.

Right offhand, that kernel is probably missing 1b52931ca9b5b87 (ceph:
remove duplicated filelock ref increase), but that seems unlikely to
result in an oops.
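One quick way to tell whether a given kernel build already contains that
commit is to check it against a clone of the mainline kernel tree. This
is only an illustrative sketch; v4.19 below is a placeholder, not the
version the affected clients are actually running:

    # Placeholder tag: substitute the tag/branch the client kernel was built from.
    git merge-base --is-ancestor 1b52931ca9b5b87 v4.19 \
        && echo "commit present" \
        || echo "commit missing"

    # Or list the release tags that contain it:
    git tag --contains 1b52931ca9b5b87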
Hector, if you have the debuginfo for this kernel installed on one of
these machines, could you run gdb against the ceph.ko module and then
do:

    gdb> list *(ceph_lock_message+0x212)

That may give me a better hint as to what went wrong.

Thanks,
-- 
Jeff Layton <jlayton@xxxxxxxxxxxxxxx>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
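As a concrete illustration of the gdb step Jeff asks for above, here is a
minimal sketch, assuming a distribution that installs kernel debug symbols
under /usr/lib/debug; the exact path to the debug build of ceph.ko is an
assumption and varies by distro and kernel package:

    # Point gdb at the ceph.ko that carries debug info (path is illustrative).
    gdb /usr/lib/debug/lib/modules/$(uname -r)/kernel/fs/ceph/ceph.ko

    # Then resolve the faulting offset from the oops to a source line:
    (gdb) list *(ceph_lock_message+0x212)

The output shows the source lines around the instruction at
ceph_lock_message+0x212, which is the hint Jeff is asking for.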