On 04/10/2017 08:16 PM, Alex Gorbachev wrote:
I am trying to understand the cause of a problem we started encountering a few weeks ago. There are 30 or so messages per hour on OSD nodes of this type:

ceph-osd.33.log:2017-04-10 13:42:39.935422 7fd7076d8700 0 bad crc in data 2227614508 != exp 2469058201

and

2017-04-10 13:42:39.939284 7fd722c42700 0 -- 10.80.3.25:6826/5752 submit_message osd_op_reply(1826606251 rbd_data.922d95238e1f29.00000000000101bf [set-alloc-hint object_size 16777216 write_size 16777216,write 6328320~12288] v103574'18626765 uv18626765 ondisk = 0) v6 remote, 10.80.3.216:0/1934733503, failed lossy con, dropping message 0x3b55600

[..]
Is that happening on the entire cluster, or just on specific OSDs? This is a clear indication of data corruption: in the example above, osd.33 calculated a crc for a received data block and found that it doesn't match the crc precomputed by the sending side. Try gathering more examples of such crc errors and isolate the OSD/host that is sending the malformed data, then run the usual diagnostics, such as a memory test, on that machine.
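One way to narrow this down is to tally the crc errors and the remote addresses in the "failed lossy con" messages across your OSD logs. Below is a minimal sketch using the message format from the sample above; it runs against a hypothetical log file written to /tmp for illustration, so substitute your real /var/log/ceph/ceph-osd.*.log paths:

```shell
# Hypothetical sample log in the format shown above (illustration only).
cat > /tmp/ceph-osd.33.log <<'EOF'
2017-04-10 13:42:39.935422 7fd7076d8700 0 bad crc in data 2227614508 != exp 2469058201
2017-04-10 13:42:39.939284 7fd722c42700 0 -- 10.80.3.25:6826/5752 submit_message osd_op_reply(...) v6 remote, 10.80.3.216:0/1934733503, failed lossy con, dropping message 0x3b55600
2017-04-10 13:45:01.000000 7fd722c42700 0 -- 10.80.3.25:6826/5752 submit_message osd_op_reply(...) v6 remote, 10.80.3.216:0/1934733503, failed lossy con, dropping message 0x3b55700
EOF

# How many crc mismatches does this OSD report?
grep -c "bad crc in data" /tmp/ceph-osd.33.log

# Tally dropped messages per remote peer address; a single address
# dominating the list points at the host to run memory tests on.
grep "failed lossy con" /tmp/ceph-osd.33.log \
  | sed -n 's/.* remote, \([0-9.]*\):.*/\1/p' \
  | sort | uniq -c | sort -rn
```

If one remote address (here 10.80.3.216) accounts for most of the drops, that client/host is the prime suspect rather than the OSD logging the error.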
--
Piotr Dałek
piotr.dalek@xxxxxxxxxxxx
https://www.ovh.com/us/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com