Bad crc causing osd hang and blocking all requests

Hi lists,
 
Ceph version: Luminous 12.2.2
 
The cluster was running a write throughput test when this problem happened.
The cluster health went to error:
Health check update: 27 stuck requests are blocked > 4096 sec (REQUEST_STUCK)
Clients couldn't write any data into the cluster.
osd22 and osd40 are the OSDs responsible for the problem.
osd22's log shows the message below, repeating continuously:
2018-01-07 20:44:52.202322 b56db8e0  0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x96aa9400 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969602 vs existing csq=969601 existing_state=STATE_STANDBY
2018-01-07 20:44:52.250600 b56db8e0  0 bad crc in data 3751247614 != exp 3467727689
2018-01-07 20:44:52.252470 b5edb8e0  0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x95c04000 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969604 vs existing csq=969603 existing_state=STATE_STANDBY
2018-01-07 20:44:52.300354 b5edb8e0  0 bad crc in data 3751247614 != exp 3467727689
2018-01-07 20:44:52.302788 b56db8e0  0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x978e7a00 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969606 vs existing csq=969605 existing_state=STATE_STANDBY
2018-01-07 20:44:52.350987 b56db8e0  0 bad crc in data 3751247614 != exp 3467727689
2018-01-07 20:44:52.352953 b5edb8e0  0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x97420e00 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969608 vs existing csq=969607 existing_state=STATE_STANDBY
2018-01-07 20:44:52.400959 b5edb8e0  0 bad crc in data 3751247614 != exp 3467727689
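
As far as I can tell (this is only a rough sketch of my understanding, not Ceph's actual code), the messenger computes a CRC32C over each received message's payload and compares it with the value the sender put in the message footer; on a mismatch it rejects the message and faults the connection, which is what produces the "bad crc in data ... != exp ..." lines above. Roughly:

#include <cstdint>
#include <cstdio>
#include <vector>

// Bitwise CRC32-C (Castagnoli polynomial, reflected form 0x82F63B78), the
// checksum family Ceph uses for message data. Slow but self-contained.
static uint32_t crc32c(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Hypothetical receive-side check: the sender ships the CRC of the payload,
// the receiver recomputes it and rejects the message on mismatch.
static bool verify_data_crc(const std::vector<uint8_t> &payload,
                            uint32_t expected_crc)
{
    uint32_t computed = crc32c(payload.data(), payload.size());
    if (computed != expected_crc) {
        // analogous to the "bad crc in data <computed> != exp <expected>" line
        std::printf("bad crc in data %u != exp %u\n", computed, expected_crc);
        return false;   // caller would drop the message and fault the session
    }
    return true;
}

int main()
{
    std::vector<uint8_t> payload = {'h', 'e', 'l', 'l', 'o'};
    uint32_t good = crc32c(payload.data(), payload.size());
    verify_data_crc(payload, good);   // passes silently
    payload[0] ^= 0x01;               // simulate a bit flipped in transit
    verify_data_crc(payload, good);   // prints the bad-crc message
}

If that understanding is right, the check covers the payload in flight between the two OSD hosts, not data read back from disk.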
 
osd40's log shows the message below, repeating continuously:
2018-01-07 20:44:52.200709 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484865 cs=969601 l=0).fault initiating reconnect
2018-01-07 20:44:52.251423 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484866 cs=969603 l=0).fault initiating reconnect
2018-01-07 20:44:52.301166 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484867 cs=969605 l=0).fault initiating reconnect
2018-01-07 20:44:52.351810 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484868 cs=969607 l=0).fault initiating reconnect
2018-01-07 20:44:52.401782 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484869 cs=969609 l=0).fault initiating reconnect
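
My reading of this loop (again only a toy model, not the real messenger logic): osd40 keeps resending the same unacknowledged message, each rejection on osd22's side faults the connection, and every reconnect bumps connect_seq, which is why cs climbs in lockstep in both logs:

#include <cstdint>
#include <cstdio>

// Toy model of the sender side visible in osd40's log: the in-flight message
// is resent after every fault, and connect_seq grows on each reconnect.
// (The real connect_seq arithmetic and handshake are more involved.)
struct Connection {
    uint64_t connect_seq = 969601;   // starting value taken from the log
    bool     message_acked = false;
};

// Stand-in for osd22: the data CRC check always fails here, so the
// message is never acknowledged.
static bool peer_accepts_message(const Connection &) { return false; }

int main()
{
    Connection conn;
    for (int attempt = 0; attempt < 5 && !conn.message_acked; ++attempt) {
        if (peer_accepts_message(conn)) {
            conn.message_acked = true;
        } else {
            // "fault initiating reconnect": reopen the session and retry
            // with the same payload, which fails the CRC check again.
            ++conn.connect_seq;
            std::printf("fault, reconnecting with cs=%llu\n",
                        (unsigned long long)conn.connect_seq);
        }
    }
    // In the real cluster this never converges, so the requests stay
    // blocked (REQUEST_STUCK) until one of the OSDs is restarted.
    return 0;
}

As far as I know these OSD-to-OSD connections (l=0 in the log, i.e. not lossy) retry forever, so a persistent corruption source turns into an endless reconnect loop.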
 
The NIC of osd22's host kept sending data to osd40's host at about 50 MB/s while this was happening.
 
After rebooting osd22, the cluster went back to normal.
This happened twice during my write tests, with the same OSDs (osd22 and osd40) both times.
 
What could cause this problem? Is it caused by a faulty HDD?
Which data's CRC didn't match?
 
 
2018-01-09

lin.yunfan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
