Hi lists,
Ceph version: Luminous 12.2.2
The cluster was running a write throughput test when this problem happened.
The cluster health went to error:
Health check update: 27 stuck requests are blocked > 4096 sec (REQUEST_STUCK)
Clients couldn't write any data into the cluster.
osd22 and osd40 are the OSDs responsible for the problem.
osd22's log shows the message below, repeating continuously:
2018-01-07 20:44:52.202322 b56db8e0 0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x96aa9400 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969602 vs existing csq=969601 existing_state=STATE_STANDBY
2018-01-07 20:44:52.250600 b56db8e0 0 bad crc in data 3751247614 != exp 3467727689
2018-01-07 20:44:52.252470 b5edb8e0 0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x95c04000 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969604 vs existing csq=969603 existing_state=STATE_STANDBY
2018-01-07 20:44:52.300354 b5edb8e0 0 bad crc in data 3751247614 != exp 3467727689
2018-01-07 20:44:52.302788 b56db8e0 0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x978e7a00 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969606 vs existing csq=969605 existing_state=STATE_STANDBY
2018-01-07 20:44:52.350987 b56db8e0 0 bad crc in data 3751247614 != exp 3467727689
2018-01-07 20:44:52.352953 b5edb8e0 0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x97420e00 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969608 vs existing csq=969607 existing_state=STATE_STANDBY
2018-01-07 20:44:52.400959 b5edb8e0 0 bad crc in data 3751247614 != exp 3467727689
osd40's log shows the message below, repeating continuously:
2018-01-07 20:44:52.200709 b4e9e8e0 0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484865 cs=969601 l=0).fault initiating reconnect
2018-01-07 20:44:52.251423 b4e9e8e0 0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484866 cs=969603 l=0).fault initiating reconnect
2018-01-07 20:44:52.301166 b4e9e8e0 0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484867 cs=969605 l=0).fault initiating reconnect
2018-01-07 20:44:52.351810 b4e9e8e0 0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484868 cs=969607 l=0).fault initiating reconnect
2018-01-07 20:44:52.401782 b4e9e8e0 0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484869 cs=969609 l=0).fault initiating reconnect
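
The two logs move in lockstep: each reconnect from osd40 (cs=969601, 969603, ...) is followed by an accept and another "bad crc in data" on osd22, roughly every 50 ms. In case it helps, here is a quick sketch I used to summarize the full logs (the file names are just placeholders for the excerpts above, this is my own script, not anything from Ceph):

import re
from collections import Counter

def summarize(path):
    """Count the repeating messenger events and collect connect_seq values."""
    counts, seqs = Counter(), []
    with open(path) as f:
        for line in f:
            if "bad crc in data" in line:
                counts["bad crc"] += 1
            if "initiating reconnect" in line:
                counts["reconnect"] += 1
            m = re.search(r"accept connect_seq (\d+)", line) or re.search(r"\bcs=(\d+)\b", line)
            if m:
                seqs.append(int(m.group(1)))
    return counts, seqs

for path in ("osd22.log", "osd40.log"):  # placeholder file names
    counts, seqs = summarize(path)
    span = "%d..%d" % (min(seqs), max(seqs)) if seqs else "n/a"
    print(path, dict(counts), "connect_seq span:", span)
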
The NIC on osd22's host kept sending data to osd40's host at about 50 MB/s while this was happening.
After rebooting osd22, the cluster went back to normal.
This has happened twice in my write tests, with the same OSDs (osd22 and osd40) both times.
What could cause this problem? Could it be caused by a faulty HDD?
Which data's CRC didn't match?
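
To make that last question concrete: as far as I understand, the messenger recomputes a crc32c over the data segment of each incoming message and compares it with the checksum the sender put in the message footer. A rough illustration of that kind of check (my own sketch, not Ceph's code; the payload and the corruption are made up):

def crc32c(buf, crc=0):
    """Bitwise CRC-32C (Castagnoli), the checksum family Ceph uses on messages."""
    crc ^= 0xFFFFFFFF
    for byte in buf:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Illustration only: the sender records the checksum of the data segment in the
# message footer; the receiver recomputes it over the bytes that actually arrived.
payload = b"example object data"                      # made-up payload
footer_data_crc = crc32c(payload)                     # what the sender would have sent
received = bytes([payload[0] ^ 0x01]) + payload[1:]   # simulate corruption in transit
if crc32c(received) != footer_data_crc:
    print("bad crc in data %d != exp %d" % (crc32c(received), footer_data_crc))

If that reading is right, I'd like to understand where the mismatch is being introduced.
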
2018-01-09
lin.yunfan