Hi list,

I have encountered this problem on both a jewel cluster and a luminous cluster. The symptom is that some requests get blocked forever and the whole cluster is no longer able to receive any data. Further investigation shows the blocked requests happened on 2 OSDs (the pool size is 2, so I guess it affects all OSDs in a PG's acting set). The OSD logs have many repeated error messages like the ones below.

error message in jewel cluster

first affected osd:

2019-01-21 10:08:18.836864 941f9b10 0 bad crc in data 1155321283 != exp 1909584237
2019-01-21 10:08:18.837013 941f9b10 0 -- 10.0.2.39:6800/23795 >> 10.0.2.40:6804/28471 pipe(0x7c15c000 sd=149 :53242 s=2 pgs=5 cs=1 l=0 c=0x7a697e60).fault, initiating reconnect
2019-01-21 10:08:18.839328 7cbb4b10 0 -- 10.0.2.39:6800/23795 >> 10.0.2.40:6804/28471 pipe(0x6b782b00 sd=235 :6800 s=0 pgs=0 cs=0 l=0 c=0x8b2f5440).accept connect_seq 2 vs existing 2 state connecting
2019-01-21 10:08:18.850772 7cbb4b10 0 bad crc in data 1155321283 != exp 1909584237
2019-01-21 10:08:18.850910 7cbb4b10 0 -- 10.0.2.39:6800/23795 >> 10.0.2.40:6804/28471 pipe(0x6b782b00 sd=235 :6800 s=2 pgs=58 cs=3 l=0 c=0x7a697e60).fault with nothing to send, going to standby

second affected osd:

2019-01-21 10:06:12.282115 9513cb10 0 bad crc in data 1035875608 != exp 3787091679
2019-01-21 10:06:12.290395 abdcdb10 0 -- 10.0.1.40:6804/28471 submit_message osd_op_reply(1031289084 rbd_data.28ae2238e1f29.0000000000a7df16 [set-alloc-hint object_size 4194304 write_size 4194304,write 65536~524288] v5503'1224666 uv1224666 ondisk = 0) v7 remote, 10.0.1.121:0/3226500701, failed lossy con, dropping message 0x70df6800
2019-01-21 10:06:12.297356 9eb3cb10 0 -- 10.0.1.40:6804/28471 submit_message osd_op_reply(1031289067 rbd_data.28ae2238e1f29.0000000000a7defd [set-alloc-hint object_size 4194304 write_size 4194304,write 3211264~524288] v5503'1236405 uv1236405 ondisk = 0) v7 remote, 10.0.1.121:0/3226500701, failed lossy con, dropping message 0x716a0e00
2019-01-21 10:06:12.303597 abdcdb10 0 -- 10.0.1.40:6804/28471 submit_message osd_op_reply(1031289085 rbd_data.28ae2238e1f29.0000000000a7df16 [set-alloc-hint object_size 4194304 write_size 4194304,write 589824~524288] v5503'1224667 uv1224667 ondisk = 0) v7 remote, 10.0.1.121:0/3226500701, failed lossy con, dropping message 0x6e537000
2019-01-21 10:06:12.310642 9c33cb10 0 -- 10.0.1.40:6804/28471 submit_message osd_op_reply(1031289069 rbd_data.28ae2238e1f29.0000000000a7defd [set-alloc-hint object_size 4194304 write_size 4194304,write 3735552~458752] v5503'1236406 uv1236406 ondisk = 0) v7 remote, 10.0.1.121:0/3226500701, failed lossy con, dropping message 0x71655000
2019-01-21 10:08:18.837438 94b3cb10 0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x888acd00 sd=129 :6804 s=2 pgs=3916 cs=1 l=0 c=0x9202a7e0).fault, initiating reconnect
2019-01-21 10:08:18.839301 9323cb10 0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x702a6000 sd=80 :6804 s=0 pgs=0 cs=0 l=0 c=0x8d086480).accept connect_seq 2 vs existing 2 state connecting
2019-01-21 10:08:18.851839 94b3cb10 0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x888acd00 sd=129 :42636 s=2 pgs=3930 cs=3 l=0 c=0x9202a7e0).fault, initiating reconnect
2019-01-21 10:08:18.860245 7eaf5b10 0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x888acd00 sd=129 :42636 s=1 pgs=3930 cs=4 l=0 c=0x9202a7e0).fault
2019-01-21 10:08:18.877537 94b3cb10 0 -- 10.0.2.40:6804/28471 >> 10.0.2.39:6800/23795 pipe(0x888acd00 sd=80 :42638 s=2 pgs=3931 cs=5 l=0 c=0x9202a7e0).fault, initiating reconnect

error message in luminous cluster
first affected osd:

2018-12-11 23:14:43.034926 b560c8e0 0 -- 10.0.2.2:6802/15865 >> 10.0.2.37:6800/13016 conn(0x7648e00 :-1 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=8870304 cs=8856031 l=0).handle_connect_msg: challenging authorizer
2018-12-11 23:14:43.042915 b560c8e0 0 bad crc in front 1566330326 != exp 3283985696
2018-12-11 23:14:43.044587 b4e0c8e0 0 -- 10.0.2.2:6802/15865 >> 10.0.2.37:6800/13016 conn(0x919e700 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: challenging authorizer
2018-12-11 23:14:43.045153 b4e0c8e0 0 -- 10.0.2.2:6802/15865 >> 10.0.2.37:6800/13016 conn(0x919e700 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 8856034 vs existing csq=8856033 existing_state=STATE_STANDBY

second affected osd:

2018-12-11 23:15:23.693508 b56158e0 0 -- 10.0.2.37:6800/13016 >> 10.0.2.2:6802/15865 conn(0x2f984e00 :6800 s=STATE_OPEN pgs=4450977 cs=8863341 l=0).fault initiating reconnect
2018-12-11 23:15:23.704284 b56158e0 0 -- 10.0.2.37:6800/13016 >> 10.0.2.2:6802/15865 conn(0x2f984e00 :6800 s=STATE_OPEN pgs=4450978 cs=8863343 l=0).fault initiating reconnect
2018-12-11 23:15:23.714925 b56158e0 0 -- 10.0.2.37:6800/13016 >> 10.0.2.2:6802/15865 conn(0x2f984e00 :6800 s=STATE_OPEN pgs=4450979 cs=8863345 l=0).fault initiating reconnect
2018-12-11 23:15:23.725507 b56158e0 0 -- 10.0.2.37:6800/13016 >> 10.0.2.2:6802/15865 conn(0x2f984e00 :6800 s=STATE_OPEN pgs=4450980 cs=8863347 l=0).fault initiating reconnect

Is this a bug? What could cause this? I don't think it is right for a few faulty OSDs to make the whole cluster stop working.
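
In case it helps, this is roughly how the stuck requests and the acting set can be inspected; osd.<id>, <pgid> and <pool> below are placeholders, not the real ids from my clusters:

# cluster-wide summary of slow/blocked requests and the OSDs they implicate
ceph health detail

# ops currently stuck on one of the affected OSDs (run on that OSD's host, uses the admin socket)
ceph daemon osd.<id> dump_ops_in_flight

# map one of the objects from the logs above to its PG and acting set
ceph osd map <pool> rbd_data.28ae2238e1f29.0000000000a7df16

# peering state and acting set details for that PG
ceph pg <pgid> query

The two OSDs that show up as holding the blocked requests are the ones whose logs are quoted above.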