MDS crash

Hello,

We have recently had some failures with our MDS processes. We are running Jewel 10.2.1. The two MDS services are on dedicated hosts running in active/standby on Ubuntu 14.04.3 with kernel 3.19.0-56-generic. I have searched the mailing list and open tickets without much luck so far.

The first indication of a problem is:

mds/Locker.cc: In function 'bool Locker::check_inode_max_size(CInode*, bool, bool, uint64_t, bool, uint64_t, utime_t)' thread 7fc305b83700 time 2016-08-09 18:51:50.626630
mds/Locker.cc: 2190: FAILED assert(in->is_file())

 ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x563d1e0a2d3b]
 2: (Locker::check_inode_max_size(CInode*, bool, bool, unsigned long, bool, unsigned long, utime_t)+0x15e3) [0x563d1de506a3]
 3: (Server::handle_client_open(std::shared_ptr<MDRequestImpl>&)+0x1061) [0x563d1dd386a1]
 4: (Server::dispatch_client_request(std::shared_ptr<MDRequestImpl>&)+0xa0b) [0x563d1dd5709b]
 5: (Server::handle_client_request(MClientRequest*)+0x47f) [0x563d1dd5768f]
 6: (Server::dispatch(Message*)+0x3bb) [0x563d1dd5b8db]
 7: (MDSRank::handle_deferrable_message(Message*)+0x80c) [0x563d1dce1f8c]
 8: (MDSRank::_dispatch(Message*, bool)+0x1e1) [0x563d1dceb081]
 9: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x563d1dcec1d5]
 10: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x563d1dcd3f83]
 11: (DispatchQueue::entry()+0x78b) [0x563d1e1996cb]
 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x563d1e08862d]
 13: (()+0x8184) [0x7fc30bd7c184]
 14: (clone()+0x6d) [0x7fc30a2d337d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

...

I snipped the dump of recent events, but can certainly include them if it would help in debugging.

...
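
My reading of the backtrace, sketched below as a minimal stand-alone example (this is not the actual Ceph source; the simplified CInode/Locker/Server types are hypothetical stand-ins for illustration only): Server::handle_client_open() reaches Locker::check_inode_max_size(), which asserts that the inode it was handed is a regular file before doing any max-size bookkeeping, so the abort suggests an open request somehow arrived at that path with a non-file inode.

// Rough, hypothetical sketch of the code path in the backtrace -- not the
// real mds/Locker.cc or mds/Server.cc, just the precondition that failed.
#include <algorithm>
#include <cassert>
#include <cstdint>

struct CInode {
    bool regular_file = true;      // stand-in for the real inode type check
    uint64_t max_size = 0;
    bool is_file() const { return regular_file; }
};

struct Locker {
    // Analogue of Locker.cc:2190, which fires FAILED assert(in->is_file())
    // before any max-size/client-range bookkeeping is attempted.
    bool check_inode_max_size(CInode* in, uint64_t new_max_size) {
        assert(in->is_file());     // the same kind of assert that aborted our MDS
        in->max_size = std::max(in->max_size, new_max_size);
        return true;
    }
};

struct Server {
    Locker locker;
    // handle_client_open() is the caller shown in frame 3 of the backtrace.
    void handle_client_open(CInode* in) {
        locker.check_inode_max_size(in, 4096);
    }
};

int main() {
    Server server;
    CInode in;
    in.regular_file = false;       // anything that is not a plain file
    server.handle_client_open(&in); // aborts on the assert, as in the log above
}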

Upstart then attempts to restart the process; the logs from those attempts are here: https://gist.github.com/anonymous/256bd6e886421840d151890e0205766d

It looks to me like the MDS goes through replay -> reconnect -> rejoin successfully and then crashes with the same assertion immediately after becoming active. Upstart keeps restarting it until it hits the maximum number of attempts, at which point the standby takes over and goes through the same loop. Restarting manually produced the same failure on both hosts. This continued for several cycles until I rebooted the physical host running the active MDS, after which it started without issue. Once the standby host was rebooted as well, it too was able to start successfully.

Looking at metrics for the MDS host and the Ceph cluster in general, nothing appears out of place or abnormal: CPU, memory, network, and disk were all within normal bounds. Other than the MDS processes failing, the cluster was healthy, with no slow requests or failed OSDs.

Any thoughts on what might be causing this issue? Is there any further information I can provide to help debug this?

Thanks in advance. 


