Hello Stefan,

On Thu, Feb 13, 2020 at 9:19 AM Stefan Kooman <stefan@xxxxxx> wrote:
>
> Hi,
>
> We hit the following assert:
>
> -10001> 2020-02-13 17:42:35.543 7f11b5669700 -1 /build/ceph-13.2.8/src/mds/MDCache.cc: In function 'MDRequestRef MDCache::request_get(metareqid_t)' thread 7f11b5669700 time 2020-02-13 17:42:35.545815
> /build/ceph-13.2.8/src/mds/MDCache.cc: 9523: FAILED assert(p != active_requests.end())
>
>  ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7f11bd8e69de]
>  2: (()+0x287b67) [0x7f11bd8e6b67]
>  3: (MDCache::request_get(metareqid_t)+0x94) [0x560cde8bb214]
>  4: (Server::journal_close_session(Session*, int, Context*)+0x9dd) [0x560cde829d1d]
>  5: (Server::handle_client_session(MClientSession*)+0x1071) [0x560cde82b0f1]
>  6: (Server::dispatch(Message*)+0x30b) [0x560cde86f87b]
>  7: (MDSRank::handle_deferrable_message(Message*)+0x434) [0x560cde7e1664]
>  8: (MDSRank::_dispatch(Message*, bool)+0x89b) [0x560cde7f8c7b]
>  9: (MDSRankDispatcher::ms_dispatch(Message*)+0xa3) [0x560cde7f92e3]
>  10: (MDSDaemon::ms_dispatch(Message*)+0xd3) [0x560cde7d92b3]
>  11: (DispatchQueue::entry()+0xb92) [0x7f11bd9a9e52]
>  12: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f11bda46e2d]
>  13: (()+0x76db) [0x7f11bd1d76db]
>  14: (clone()+0x3f) [0x7f11bc3bd88f]
>
> Before we hit this assert, a few kernel clients (5.3.0-26/28) were
> not playing nicely:
>
> 16:32 < bitrot> mds.mds1 [WRN] client.61994841 isn't responding to mclientcaps(revoke), ino 0x1003846ddc5 pending pAsLsXsFscr issued pAsLsXsFscr, sent 62.342791 seconds ago
> 16:32 < bitrot> mon.mon1 [WRN] Health check failed: 1 clients failing to respond to capability release (MDS_CLIENT_LATE_RELEASE)
>
> We rebooted both clients. After that, one of them again had some slow
> requests. We unmounted the file system, and shortly after that the
> MDS hit the assert. Failover went fine this time.
>
> This looks like issue https://tracker.ceph.com/issues/23059 ... but
> that should already have been resolved. Is this the same issue, or a
> regression?
>
> We run 13.2.8.

Thanks for the information. It looks like this bug:
https://tracker.ceph.com/issues/42467#note-7

Do you have logs you can share? You can use ceph-post-file [1] to
share; see the example sketch at the end of this message.

[1] https://docs.ceph.com/docs/master/man/8/ceph-post-file/

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
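
[Example referenced above] A minimal sketch of a ceph-post-file
invocation, assuming the default MDS log location and a daemon named
"mds1" (both hypothetical here; adjust for your deployment). The -d
flag attaches a description, per the ceph-post-file man page, and the
upload goes to a drop point readable only by Ceph developers:

    # Hypothetical log path and daemon name; -d adds a description.
    ceph-post-file -d "13.2.8 MDS assert in MDCache::request_get (tracker #42467)" \
        /var/log/ceph/ceph-mds.mds1.log

If the existing log turns out to be too sparse to be useful, raising
debug_mds on the active MDS before reproducing should capture more
detail.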