Hi all, I am a new user on this list. I have a legacy production system running ceph version 0.94.7.
Ceph itself appears to be functioning well; ceph -s reports good health. I am connecting to the filesystem via an HDFS client. Upon connection I see the client receiving messages like the following (snipped, since this goes on for a while):

hadoop fs -ls /
2019-03-06 17:30:02.444021 7fc381153700  0 -- XX.XX.XX.XX:0/3968222009 >> YY.YY.YY.YY:6801/1597 pipe(0x7fc3a8e74820 sd=191 :47936 s=2 pgs=25 cs=1 l=0 c=0x7fc3a8e78ac0).fault, initiating reconnect
2019-03-06 17:30:02.444224 7fc38c1b3700  0 -- XX.XX.XX.XX:0/3968222009 >> YY.YY.YY.YY:6801/1597 pipe(0x7fc3a8e74820 sd=191 :47936 s=1 pgs=25 cs=2 l=0 c=0x7fc3a8e78ac0).fault
2019-03-06 17:34:54.283031 7fc38c1b3700  0 -- XX.XX.XX.XX:0/3968222009 >> YY.YY.YY.YY:6800/2651 pipe(0x7fc3a8125f30 sd=191 :51405 s=1 pgs=18 cs=2 l=0 c=0x7fc3a812a1d0).connect got RESETSESSION
2019-03-06 17:34:54.283053 7fc383157700  0 client.2155885101 ms_handle_remote_reset on YY.YY.YY.YY:6800/2651
2019-03-06 17:34:54.412070 7fc381153700  0 -- XX.XX.XX.XX:0/3968222009 >> YY.YY.YY.YY:6800/2651 pipe(0x7fc3a8a83780 sd=192 :51406 s=2 pgs=19 cs=1 l=0 c=0x7fc3a8e747c0).fault, initiating reconnect
2019-03-06 17:34:54.412363 7fc381052700  0 -- XX.XX.XX.XX:0/3968222009 >> YY.YY.YY.YY:6800/2651 pipe(0x7fc3a8a83780 sd=191 :51406 s=1 pgs=19 cs=2 l=0 c=0x7fc3a8e747c0).fault
ls: Connection timed out

Which makes sense, because the MDS crashes like this (it goes from active to reconnect state, which I guess explains the change in PIDs that the client is seeing):

*** Caught signal (Segmentation fault) **
 in thread 7f04b6b12700
 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: ceph_mds() [0x89982a]
 2: (()+0x10350) [0x7f04baecc350]
 3: (CInode::get_caps_allowed_for_client(client_t) const+0x130) [0x7a19f0]
 4: (CInode::encode_inodestat(ceph::buffer::list&, Session*, SnapRealm*, snapid_t, unsigned int, int)+0x132d) [0x7b383d]
 5: (Server::set_trace_dist(Session*, MClientReply*, CInode*, CDentry*, snapid_t, int, std::tr1::shared_ptr<MDRequestImpl>&)+0x471) [0x5f26e1]
 6: (Server::reply_client_request(std::tr1::shared_ptr<MDRequestImpl>&, MClientReply*)+0x846) [0x611056]
 7: (Server::respond_to_request(std::tr1::shared_ptr<MDRequestImpl>&, int)+0x4d9) [0x611759]
 8: (Server::handle_client_getattr(std::tr1::shared_ptr<MDRequestImpl>&, bool)+0x47b) [0x613eab]
 9: (Server::dispatch_client_request(std::tr1::shared_ptr<MDRequestImpl>&)+0xa38) [0x633da8]
 10: (Server::handle_client_request(MClientRequest*)+0x3df) [0x63435f]
 11: (Server::dispatch(Message*)+0x3f3) [0x63b8b3]
 12: (MDS::handle_deferrable_message(Message*)+0x847) [0x5b6c27]
 13: (MDS::_dispatch(Message*)+0x6d) [0x5d2bed]
 14: (MDS::ms_dispatch(Message*)+0xa2) [0x5d3f72]
 15: (DispatchQueue::entry()+0x63a) [0xa7482a]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x97403d]
 17: (()+0x8192) [0x7f04baec4192]
 18: (clone()+0x6d) [0x7f04ba3d126d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
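In case it helps anyone reading the trace: per the NOTE at the end, the raw addresses can be resolved against the binary with objdump. A sketch, assuming a default package install path for ceph-mds (adjust for your system):

```shell
# Dump disassembly with source and relocation info so the bracketed
# addresses in the backtrace (e.g. [0x7a19f0]) can be mapped to code.
# The binary path here is an assumption for a default install.
objdump -rdS /usr/bin/ceph-mds > ceph-mds.objdump
```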
So far I have gone through the whole path here: http://docs.ceph.com/docs/hammer/cephfs/disaster-recovery/ I've reset the journal, session and fs, and everything looks good (journal export core-dumps, but all other status checks report healthy). I'm hoping for a suggestion on what else could be causing this, or what else I can try resetting. The next step for me would be to remove the filesystem, so I'm willing to try any suggestion.
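For clarity, the reset sequence I mean is roughly the one from that Hammer disaster-recovery page (paraphrased from memory of the doc, so double-check against it; the filesystem name is a placeholder):

```shell
# Back up the journal before touching anything (this is the step
# that core-dumps for me)
cephfs-journal-tool journal export backup.bin

# Recover what dentries can be salvaged, then wipe the journal
cephfs-journal-tool event recover_dentries summary
cephfs-journal-tool journal reset

# Clear the session table
cephfs-table-tool all reset session

# Discard the MDS map state for the filesystem (name is a placeholder)
ceph fs reset <fs name> --yes-i-really-mean-it
```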
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com