Bad crc causing osd hang and blocking all requests

Hi lists,
 
Ceph version: Luminous 12.2.2
 
The cluster was running a write throughput test when this problem happened.
The cluster health went to error:
Health check update: 27 stuck requests are blocked > 4096 sec (REQUEST_STUCK)
Clients couldn't write any data into the cluster.
osd22 and osd40 are the OSDs responsible for the problem.
osd22's log shows the message below, repeating continuously:
2018-01-07 20:44:52.202322 b56db8e0  0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x96aa9400 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969602 vs existing csq=969601 existing_state=STATE_STANDBY
2018-01-07 20:44:52.250600 b56db8e0  0 bad crc in data 3751247614 != exp 3467727689
2018-01-07 20:44:52.252470 b5edb8e0  0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x95c04000 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969604 vs existing csq=969603 existing_state=STATE_STANDBY
2018-01-07 20:44:52.300354 b5edb8e0  0 bad crc in data 3751247614 != exp 3467727689
2018-01-07 20:44:52.302788 b56db8e0  0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x978e7a00 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969606 vs existing csq=969605 existing_state=STATE_STANDBY
2018-01-07 20:44:52.350987 b56db8e0  0 bad crc in data 3751247614 != exp 3467727689
2018-01-07 20:44:52.352953 b5edb8e0  0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x97420e00 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969608 vs existing csq=969607 existing_state=STATE_STANDBY
2018-01-07 20:44:52.400959 b5edb8e0  0 bad crc in data 3751247614 != exp 3467727689
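
As far as I can tell (this is only a rough sketch of my understanding, not Ceph's actual code), the messenger computes a CRC32C over each received message's payload and compares it with the value the sender put in the message footer; on a mismatch it rejects the message and faults the connection, which is what produces the "bad crc in data ... != exp ..." lines above. Roughly:

#include <cstdint>
#include <cstdio>
#include <vector>

// Bitwise CRC32-C (Castagnoli polynomial, reflected form 0x82F63B78), the
// checksum family Ceph uses for message data. Slow but self-contained.
static uint32_t crc32c(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Hypothetical receive-side check: the sender ships the CRC of the payload,
// the receiver recomputes it and rejects the message on mismatch.
static bool verify_data_crc(const std::vector<uint8_t> &payload,
                            uint32_t expected_crc)
{
    uint32_t computed = crc32c(payload.data(), payload.size());
    if (computed != expected_crc) {
        // analogous to the "bad crc in data <computed> != exp <expected>" line
        std::printf("bad crc in data %u != exp %u\n", computed, expected_crc);
        return false;   // caller would drop the message and fault the session
    }
    return true;
}

int main()
{
    std::vector<uint8_t> payload = {'h', 'e', 'l', 'l', 'o'};
    uint32_t good = crc32c(payload.data(), payload.size());
    verify_data_crc(payload, good);   // passes silently
    payload[0] ^= 0x01;               // simulate a bit flipped in transit
    verify_data_crc(payload, good);   // prints the bad-crc message
}

If that understanding is right, the check covers the payload in flight between the two OSD hosts, not data read back from disk.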
 
osd40's log shows the message below, repeating continuously:
2018-01-07 20:44:52.200709 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484865 cs=969601 l=0).fault initiating reconnect
2018-01-07 20:44:52.251423 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484866 cs=969603 l=0).fault initiating reconnect
2018-01-07 20:44:52.301166 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484867 cs=969605 l=0).fault initiating reconnect
2018-01-07 20:44:52.351810 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484868 cs=969607 l=0).fault initiating reconnect
2018-01-07 20:44:52.401782 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484869 cs=969609 l=0).fault initiating reconnect
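
My reading of this loop (again only a toy model, not the real messenger logic): osd40 keeps resending the same unacknowledged message, each rejection on osd22's side faults the connection, and every reconnect bumps connect_seq, which is why cs climbs in lockstep in both logs:

#include <cstdint>
#include <cstdio>

// Toy model of the sender side visible in osd40's log: the in-flight message
// is resent after every fault, and connect_seq grows on each reconnect.
// (The real connect_seq arithmetic and handshake are more involved.)
struct Connection {
    uint64_t connect_seq = 969601;   // starting value taken from the log
    bool     message_acked = false;
};

// Stand-in for osd22: the data CRC check always fails here, so the
// message is never acknowledged.
static bool peer_accepts_message(const Connection &) { return false; }

int main()
{
    Connection conn;
    for (int attempt = 0; attempt < 5 && !conn.message_acked; ++attempt) {
        if (peer_accepts_message(conn)) {
            conn.message_acked = true;
        } else {
            // "fault initiating reconnect": reopen the session and retry
            // with the same payload, which fails the CRC check again.
            ++conn.connect_seq;
            std::printf("fault, reconnecting with cs=%llu\n",
                        (unsigned long long)conn.connect_seq);
        }
    }
    // In the real cluster this never converges, so the requests stay
    // blocked (REQUEST_STUCK) until one of the OSDs is restarted.
    return 0;
}

As far as I know these OSD-to-OSD connections (l=0 in the log, i.e. not lossy) retry forever, so a persistent corruption source turns into an endless reconnect loop.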
 
The NIC of osd22's host kept sending data to osd40's host at about 50 MB/s while this was happening.
 
After rebooting osd22, the cluster went back to normal.
This happened twice during my write tests, with the same OSDs (osd22 and osd40) both times.
 
What could cause this problem? Is it caused by a faulty HDD?
Which data's CRC didn't match?
 
 
2018-01-09

lin.yunfan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
