Hi list,

I have encountered this problem on both a jewel cluster and a luminous cluster. The symptom is that some requests get blocked forever and the whole cluster is no longer able to receive any data. Further investigation shows the blocked requests happened on 2 OSDs (the pool size is 2, so I guess it affects all OSDs in a PG's acting set). The OSD logs have many repeated error messages like the ones below.

error message in jewel cluster

first affected osd:

2019-01-21 10:08:18.836864 941f9b10 0 bad crc in data 1155321283 != exp 1909584237
2019-01-21 10:08:18.837013 941f9b10 0 -- 10.0.2.39:6800/23795 >> 10.0.2.40:6804/28471 pipe(0x7c15c000 sd=149 :53242 s=2 pgs=5 cs=1 l=0 c=0x7a697e60).fault, initiating reconnect
2019-01-21 10:08:18.839328 7cbb4b10 0 -- 10.0.2.39:6800/23795 >> 10.0.2.40:6804/28471 pipe(0x6b782b00 sd=235 :6800 s=0 pgs=0 cs=0 l=0 c=0x8b2f5440).accept connect_seq 2 vs existing 2 state connecting
2019-01-21 10:08:18.850772 7cbb4b10 0 bad crc in data 1155321283 != exp 1909584237
2019-01-21 10:08:18.850910 7cbb4b10 0 -- 10.0.2.39:6800/23795 >> 10.0.2.40:6804/28471 pipe(0x6b782b00 sd=235 :6800 s=2 pgs=58 cs=3 l=0 c=0x7a697e60).fault with nothing to send, going to standby

second affected osd:

2019-01-21 10:06:12.282115 9513cb10 0 bad crc in data 1035875608 != exp 3787091679
2019-01-21 10:06:12.290395 abdcdb10 0 -- 10.0.1.40:6804/28471 submit_message osd_op_reply(1031289084 rbd_data.28ae2238e1f29.0000000000a7df16 [set-alloc-hint object_size 4194304 write_size 4194304,write 65536~524288] v5503'1224666 uv1224666 ondisk = 0) v7 remote, 10.0.1.121:0/3226500701, failed lossy con, dropping message 0x70df6800
2019-01-21 10:06:12.297356 9eb3cb10 0 -- 10.0.1.40:6804/28471 submit_message osd_op_reply(1031289067 rbd_data.28ae2238e1f29.0000000000a7defd [set-alloc-hint object_size 4194304 write_size 4194304,write 3211264~524288] v5503'1236405 uv1236405 ondisk = 0) v7 remote, 10.0.1.121:0/3226500701, failed lossy con, dropping message 0x716a0e00
2019-01-21 10:06:12.303597 abdcdb10 0 -- 10.0.1.40:6804/28471 submit_message osd_op_reply(1031289085 rbd_data.28ae2238e1f29.0000000000a7df16 [set-alloc-hint object_size 4194304 write_size 4194304,write 589824~524288] v5503'1224667 uv1224667 ondisk = 0) v7 remote, 10.0.1.121:0/3226500701, failed lossy con, dropping message 0x6e537000
2019-01-21 10:06:12.310642 9c33cb10 0 -- 10.0.1.40:6804/28471 submit_message osd_op_reply(1031289069 rbd_data.28ae2238e1f29.0000000000a7defd [set-alloc-hint object_size 4194304 write_size 4194304,write 3735552~458752] v5503'1236406 uv1236406 ondisk = 0) v7 remote, 10.0.1.121:0/3226500701, failed lossy con, dropping message 0x71655000
2019-01-21 10:08:18.837438 94b3cb10 0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x888acd00 sd=129 :6804 s=2 pgs=3916 cs=1 l=0 c=0x9202a7e0).fault, initiating reconnect
2019-01-21 10:08:18.839301 9323cb10 0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x702a6000 sd=80 :6804 s=0 pgs=0 cs=0 l=0 c=0x8d086480).accept connect_seq 2 vs existing 2 state connecting
2019-01-21 10:08:18.851839 94b3cb10 0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x888acd00 sd=129 :42636 s=2 pgs=3930 cs=3 l=0 c=0x9202a7e0).fault, initiating reconnect
2019-01-21 10:08:18.860245 7eaf5b10 0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x888acd00 sd=129 :42636 s=1 pgs=3930 cs=4 l=0 c=0x9202a7e0).fault
2019-01-21 10:08:18.877537 94b3cb10 0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x888acd00 sd=80 :42638 s=2 pgs=3931 cs=5 l=0 c=0x9202a7e0).fault, initiating reconnect

error message in luminous cluster
first affected osd:

2018-12-11 23:14:43.034926 b560c8e0 0 -- 10.0.2.2:6802/15865 >> 10.0.2.37:6800/13016 conn(0x7648e00 :-1 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=8870304 cs=8856031 l=0).handle_connect_msg: challenging authorizer
2018-12-11 23:14:43.042915 b560c8e0 0 bad crc in front 1566330326 != exp 3283985696
2018-12-11 23:14:43.044587 b4e0c8e0 0 -- 10.0.2.2:6802/15865 >> 10.0.2.37:6800/13016 conn(0x919e700 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: challenging authorizer
2018-12-11 23:14:43.045153 b4e0c8e0 0 -- 10.0.2.2:6802/15865 >> 10.0.2.37:6800/13016 conn(0x919e700 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 8856034 vs existing csq=8856033 existing_state=STATE_STANDBY

second affected osd:

2018-12-11 23:15:23.693508 b56158e0 0 -- 10.0.2.37:6800/13016 >> 10.0.2.2:6802/15865 conn(0x2f984e00 :6800 s=STATE_OPEN pgs=4450977 cs=8863341 l=0).fault initiating reconnect
2018-12-11 23:15:23.704284 b56158e0 0 -- 10.0.2.37:6800/13016 >> 10.0.2.2:6802/15865 conn(0x2f984e00 :6800 s=STATE_OPEN pgs=4450978 cs=8863343 l=0).fault initiating reconnect
2018-12-11 23:15:23.714925 b56158e0 0 -- 10.0.2.37:6800/13016 >> 10.0.2.2:6802/15865 conn(0x2f984e00 :6800 s=STATE_OPEN pgs=4450979 cs=8863345 l=0).fault initiating reconnect
2018-12-11 23:15:23.725507 b56158e0 0 -- 10.0.2.37:6800/13016 >> 10.0.2.2:6802/15865 conn(0x2f984e00 :6800 s=STATE_OPEN pgs=4450980 cs=8863347 l=0).fault initiating reconnect

Is this a bug? What could cause this? I don't think it is right for a few faulty OSDs to make the whole cluster stop working.
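
In case it helps, this is roughly how the stuck requests and the acting set can be inspected; osd.<id>, <pgid> and <pool> below are placeholders, not the real ids from my clusters:

# cluster-wide summary of slow/blocked requests and the OSDs they implicate
ceph health detail

# ops currently stuck on one of the affected OSDs (run on that OSD's host, uses the admin socket)
ceph daemon osd.<id> dump_ops_in_flight

# map one of the objects from the logs above to its PG and acting set
ceph osd map <pool> rbd_data.28ae2238e1f29.0000000000a7df16

# peering state and acting set details for that PG
ceph pg <pgid> query

The two OSDs that show up as holding the blocked requests are the ones whose logs are quoted above.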