On 04/10/2017 08:16 PM, Alex Gorbachev wrote:
I am trying to understand the cause of a problem we started encountering a few weeks ago. There are 30 or so messages per hour on OSD nodes of this type:

ceph-osd.33.log:2017-04-10 13:42:39.935422 7fd7076d8700 0 bad crc in data 2227614508 != exp 2469058201

and

2017-04-10 13:42:39.939284 7fd722c42700 0 -- 10.80.3.25:6826/5752 submit_message osd_op_reply(1826606251 rbd_data.922d95238e1f29.00000000000101bf [set-alloc-hint object_size 16777216 write_size 16777216,write 6328320~12288] v103574'18626765 uv18626765 ondisk = 0) v6 remote, 10.80.3.216:0/1934733503, failed lossy con, dropping message 0x3b55600

[..]
Is that happening on the entire cluster, or just on specific OSDs? This is a clear indication of data corruption: in the example above, osd.33 calculated a crc for a received data block and found that it doesn't match the crc precomputed by the sending side. Try gathering more examples of such crc errors and isolate the OSD/host that is sending the malformed data, then run the usual diagnostics, such as a memory test, on that machine.
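One way to narrow this down is to tally the crc errors and the remote addresses in the "failed lossy con" messages across your OSD logs. Below is a minimal sketch using the message format from the sample above; it runs against a hypothetical log file written to /tmp for illustration, so substitute your real /var/log/ceph/ceph-osd.*.log paths:

```shell
# Hypothetical sample log in the format shown above (illustration only).
cat > /tmp/ceph-osd.33.log <<'EOF'
2017-04-10 13:42:39.935422 7fd7076d8700 0 bad crc in data 2227614508 != exp 2469058201
2017-04-10 13:42:39.939284 7fd722c42700 0 -- 10.80.3.25:6826/5752 submit_message osd_op_reply(...) v6 remote, 10.80.3.216:0/1934733503, failed lossy con, dropping message 0x3b55600
2017-04-10 13:45:01.000000 7fd722c42700 0 -- 10.80.3.25:6826/5752 submit_message osd_op_reply(...) v6 remote, 10.80.3.216:0/1934733503, failed lossy con, dropping message 0x3b55700
EOF

# How many crc mismatches does this OSD report?
grep -c "bad crc in data" /tmp/ceph-osd.33.log

# Tally dropped messages per remote peer address; a single address
# dominating the list points at the host to run memory tests on.
grep "failed lossy con" /tmp/ceph-osd.33.log \
  | sed -n 's/.* remote, \([0-9.]*\):.*/\1/p' \
  | sort | uniq -c | sort -rn
```

If one remote address (here 10.80.3.216) accounts for most of the drops, that client/host is the prime suspect rather than the OSD logging the error.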
--
Piotr Dałek
piotr.dalek@xxxxxxxxxxxx
https://www.ovh.com/us/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com