On Tue, Aug 13, 2019 at 1:06 PM Hector Martin <hector@xxxxxxxxxxxxxx> wrote:
>
> I just had a minor CephFS meltdown caused by underprovisioned RAM on the
> MDS servers. This is a CephFS with two ranks; I manually failed over the
> first rank and the new MDS server ran out of RAM in the rejoin phase
> (ceph-mds didn't get OOM-killed, but I think things slowed down enough
> due to swapping out that something timed out). This happened 4 times,
> with the rank bouncing between two MDS servers, until I brought up an
> MDS on a bigger machine.
>
> The new MDS managed to become active, but then crashed with an assert:
>
> 2019-08-13 16:03:37.346 7fd4578b2700  1 mds.0.1164 clientreplay_done
> 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> version 1239 from mon.1
> 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 handle_mds_map i am
> now mds.0.1164
> 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 handle_mds_map state
> change up:clientreplay --> up:active
> 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 active_start
> 2019-08-13 16:03:37.690 7fd45e2a7700  1 mds.0.1164 cluster recovered.
> 2019-08-13 16:03:45.130 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> version 1240 from mon.1
> 2019-08-13 16:03:46.162 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> version 1241 from mon.1
> 2019-08-13 16:03:50.286 7fd4578b2700 -1
> /build/ceph-13.2.6/src/mds/MDCache.cc: In function 'void
> MDCache::remove_inode(CInode*)' thread 7fd4578b2700 time 2019-08-13
> 16:03:50.279463
> /build/ceph-13.2.6/src/mds/MDCache.cc: 361: FAILED
> assert(o->get_num_ref() == 0)
>
>  ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x14e) [0x7fd46650eb5e]
>  2: (()+0x2c4cb7) [0x7fd46650ecb7]
>  3: (MDCache::remove_inode(CInode*)+0x59d) [0x55f423d6992d]
>  4: (StrayManager::_purge_stray_logged(CDentry*, unsigned long,
> LogSegment*)+0x1f2) [0x55f423dc7192]
>  5: (MDSIOContextBase::complete(int)+0x11d) [0x55f423ed42bd]
>  6: (MDSLogContextBase::complete(int)+0x40) [0x55f423ed4430]
>  7: (Finisher::finisher_thread_entry()+0x135) [0x7fd46650d0a5]
>  8: (()+0x76db) [0x7fd465dc26db]
>  9: (clone()+0x3f) [0x7fd464fa888f]
>
> Thankfully this didn't happen on a subsequent attempt, and I got the
> filesystem happy again.
>
> At this point, of the 4 kernel clients actively using the filesystem, 3
> had gone into a strange state (can't SSH in, partial service). Here is a
> kernel log from one of the hosts (the other two were similar):
> https://mrcn.st/p/ezrhr1qR
>
> After playing some service failover games and hard rebooting the three
> affected client boxes, everything seems to be fine. The remaining FS
> client box had no kernel errors (other than blocked task warnings and
> cephfs messages about reconnections and such) and seems to be fine.
>
> I can't find these errors anywhere, so I'm guessing they're not known bugs?

Jeff, the oops seems to be a NULL dereference in ceph_lock_message().
Please take a look.

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
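
[For context, a minimal sketch of the invariant behind the failed assert. This is not the actual Ceph source; the Cache and CachedInode types below are hypothetical stand-ins. The real MDCache::remove_inode() asserts o->get_num_ref() == 0, i.e. an inode may only be dropped from the cache once nothing pins it, so the reported crash means the stray-purge completion found the inode still referenced.]

    // Hypothetical sketch, not Ceph code: illustrates the "refcount must be
    // zero before removal" invariant that the reported assert enforces.
    #include <cassert>
    #include <map>

    struct CachedInode {
      int ref_count = 0;                     // pins held by other subsystems
      int get_num_ref() const { return ref_count; }
    };

    struct Cache {
      std::map<long, CachedInode*> inodes;   // keyed by inode number

      void remove_inode(long ino) {
        CachedInode *o = inodes.at(ino);
        // Removing an inode that something still references would leave a
        // dangling pointer, hence the hard assert rather than a silent drop.
        assert(o->get_num_ref() == 0);
        inodes.erase(ino);
        delete o;
      }
    };

[The design choice is that callers are expected to release all pins before purging; the assert turns a would-be use-after-free into an immediate, diagnosable crash like the one in the log above.]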