Hi,
On 11/25/2015 06:41 PM, Robert LeBlanc wrote:
> Since the one that is different is not your primary for the pg, then pg
> repair is safe.

Ok, that's clear, thanks.

I think we managed to identify the root cause of the scrubbing errors, which occurred even though the files are identical. It seems to be a hardware issue (a faulty RAM module), which is really hard to detect, even if you have an ECC-capable module. The glitch happens here:

node2:~# while true; do sha1sum /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1; sleep 0.1; done
acd62deb72530e22b7ebdce3e2e47e0480af533b /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
...
acd62deb72530e22b7ebdce3e2e47e0480af533b /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
acd62deb72530e22b7ebdce3e2e47e0480af533b /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
acd62deb72530e22b7ebdce3e2e47e0480af533b /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
acd62deb72530e22b7ebdce3e2e47e0480af533b /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
4163ca9a76ed7b0b9f0e69ab5a1793cd1cf7d1c4 /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
....

So, sha1sum sometimes calculates a different value for the same file.
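The same check can be scripted so that the digests are tallied instead of eyeballed. The following is only a minimal sketch in Python, not the procedure actually used on node2; the function names `repeated_sha1` and `is_stable` and the `rounds` and `chunk_size` parameters are made up for illustration. On a healthy machine the tally should contain exactly one digest.

```python
import hashlib
from collections import Counter


def repeated_sha1(path, rounds=100, chunk_size=1 << 20):
    """Hash the same file `rounds` times and count how often each digest appears.

    Hypothetical helper: it mirrors the `while true; do sha1sum ...` loop,
    but stops after a fixed number of rounds.
    """
    digests = Counter()
    for _ in range(rounds):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            # Read in chunks so large objects do not need to fit in memory.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        digests[h.hexdigest()] += 1
    return digests


def is_stable(digests):
    """True if every round produced the same digest."""
    return len(digests) == 1
```

Calling `repeated_sha1(path)` on the object file and printing the returned counter shows every digest together with how often it was seen. Reads after the first are usually served from the page cache, so more than one digest for an unmodified file would point at memory or CPU rather than at the disk, although that is an inference and not something this sketch can prove.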
We copied this file several times until two of the copies differed:

# diff 48.bin 49.bin
40095c40095
< hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtPvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC
---
> hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtTvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC

So, the copies differ in a single bit: 'P' (0x50) vs 'T' (0x54).

I think this presentation about silent data corruption could be very useful:
https://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf

We will test all of our RAM modules now (it should have happened before, of course...), but it seems you have to be very careful with cheap commodity hardware.

Regards,
Csaba
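P.S. For anyone who wants to double-check the single-bit claim, here is a minimal sketch in Python. The helper names `differing_bits` and `bit_count` are made up for illustration, and the two byte strings are just the six characters around the changed position, taken from the diff output above.

```python
def differing_bits(a: bytes, b: bytes):
    """Return (byte_offset, xor_mask) for every byte position that differs."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return [(i, x ^ y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def bit_count(mask: int) -> int:
    """Number of set bits in an XOR mask, i.e. how many bits flipped."""
    return bin(mask).count("1")


# Fragments of line 40095 as shown by diff (48.bin vs 49.bin).
first = b"jtPvjT"
second = b"jtTvjT"

diffs = differing_bits(first, second)
print(diffs)                                   # [(2, 4)]
print([bit_count(mask) for _, mask in diffs])  # [1]
```

The XOR of 0x50 and 0x54 is 0x04, which has exactly one bit set, so the two copies differ by one flipped bit in that byte.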
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com