Hi all, I am a new user on this list. I have a legacy production system running ceph version 0.94.7.
Ceph itself appears to be functioning well; ceph -s reports good health. I am connecting to the filesystem via an HDFS client. Upon connection I see the client receiving messages like the following (snipped, since this goes on for a while):

hadoop fs -ls /
2019-03-06 17:30:02.444021 7fc381153700  0 -- XX.XX.XX.XX:0/3968222009 >> YY.YY.YY.YY:6801/1597 pipe(0x7fc3a8e74820 sd=191 :47936 s=2 pgs=25 cs=1 l=0 c=0x7fc3a8e78ac0).fault, initiating reconnect
2019-03-06 17:30:02.444224 7fc38c1b3700  0 -- XX.XX.XX.XX:0/3968222009 >> YY.YY.YY.YY:6801/1597 pipe(0x7fc3a8e74820 sd=191 :47936 s=1 pgs=25 cs=2 l=0 c=0x7fc3a8e78ac0).fault
2019-03-06 17:34:54.283031 7fc38c1b3700  0 -- XX.XX.XX.XX:0/3968222009 >> YY.YY.YY.YY:6800/2651 pipe(0x7fc3a8125f30 sd=191 :51405 s=1 pgs=18 cs=2 l=0 c=0x7fc3a812a1d0).connect got RESETSESSION
2019-03-06 17:34:54.283053 7fc383157700  0 client.2155885101 ms_handle_remote_reset on YY.YY.YY.YY:6800/2651
2019-03-06 17:34:54.412070 7fc381153700  0 -- XX.XX.XX.XX:0/3968222009 >> YY.YY.YY.YY:6800/2651 pipe(0x7fc3a8a83780 sd=192 :51406 s=2 pgs=19 cs=1 l=0 c=0x7fc3a8e747c0).fault, initiating reconnect
2019-03-06 17:34:54.412363 7fc381052700  0 -- XX.XX.XX.XX:0/3968222009 >> YY.YY.YY.YY:6800/2651 pipe(0x7fc3a8a83780 sd=191 :51406 s=1 pgs=19 cs=2 l=0 c=0x7fc3a8e747c0).fault
ls: Connection timed out

Which makes sense, because the MDS crashes like this (it goes from active to reconnect state, which I guess explains the change in PIDs that the client is seeing):

*** Caught signal (Segmentation fault) **
 in thread 7f04b6b12700
 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: ceph_mds() [0x89982a]
 2: (()+0x10350) [0x7f04baecc350]
 3: (CInode::get_caps_allowed_for_client(client_t) const+0x130) [0x7a19f0]
 4: (CInode::encode_inodestat(ceph::buffer::list&, Session*, SnapRealm*, snapid_t, unsigned int, int)+0x132d) [0x7b383d]
 5: (Server::set_trace_dist(Session*, MClientReply*, CInode*, CDentry*, snapid_t, int, std::tr1::shared_ptr<MDRequestImpl>&)+0x471) [0x5f26e1]
 6: (Server::reply_client_request(std::tr1::shared_ptr<MDRequestImpl>&, MClientReply*)+0x846) [0x611056]
 7: (Server::respond_to_request(std::tr1::shared_ptr<MDRequestImpl>&, int)+0x4d9) [0x611759]
 8: (Server::handle_client_getattr(std::tr1::shared_ptr<MDRequestImpl>&, bool)+0x47b) [0x613eab]
 9: (Server::dispatch_client_request(std::tr1::shared_ptr<MDRequestImpl>&)+0xa38) [0x633da8]
 10: (Server::handle_client_request(MClientRequest*)+0x3df) [0x63435f]
 11: (Server::dispatch(Message*)+0x3f3) [0x63b8b3]
 12: (MDS::handle_deferrable_message(Message*)+0x847) [0x5b6c27]
 13: (MDS::_dispatch(Message*)+0x6d) [0x5d2bed]
 14: (MDS::ms_dispatch(Message*)+0xa2) [0x5d3f72]
 15: (DispatchQueue::entry()+0x63a) [0xa7482a]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x97403d]
 17: (()+0x8192) [0x7f04baec4192]
 18: (clone()+0x6d) [0x7f04ba3d126d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
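In case it helps anyone reading the trace: per the NOTE at the end, the raw addresses can be resolved against the binary with objdump. A sketch, assuming a default package install path for ceph-mds (adjust for your system):

```shell
# Dump disassembly with source and relocation info so the bracketed
# addresses in the backtrace (e.g. [0x7a19f0]) can be mapped to code.
# The binary path here is an assumption for a default install.
objdump -rdS /usr/bin/ceph-mds > ceph-mds.objdump
```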
So far I have gone through the whole path here: http://docs.ceph.com/docs/hammer/cephfs/disaster-recovery/ I've reset the journal, session and fs, and everything looks good (journal export core-dumps, but all other status checks report healthy). I'm hoping for a suggestion on what else could be causing this, or what else I can try resetting. The next step for me would be to remove the filesystem, so I'm willing to try any suggestion.
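For clarity, the reset sequence I mean is roughly the one from that Hammer disaster-recovery page (paraphrased from memory of the doc, so double-check against it; the filesystem name is a placeholder):

```shell
# Back up the journal before touching anything (this is the step
# that core-dumps for me)
cephfs-journal-tool journal export backup.bin

# Recover what dentries can be salvaged, then wipe the journal
cephfs-journal-tool event recover_dentries summary
cephfs-journal-tool journal reset

# Clear the session table
cephfs-table-tool all reset session

# Discard the MDS map state for the filesystem (name is a placeholder)
ceph fs reset <fs name> --yes-i-really-mean-it
```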
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com