Re: Scrubbing question

Hi,

On 11/25/2015 06:41 PM, Robert LeBlanc wrote:
> Since the one that is different is not your primary for the pg, then
> pg repair is safe.

OK, that's clear, thanks.
I think we have identified the root cause of the scrubbing errors, which appeared even though the files on disk are identical.
It seems to be a hardware issue (a faulty RAM module), which is really hard to detect, even with ECC-capable modules.

The glitch happens here:
node2:~# while true; do sha1sum /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1; sleep 0.1; done
acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
...
acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
4163ca9a76ed7b0b9f0e69ab5a1793cd1cf7d1c4  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
....
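
The loop above can also be wrapped in a small function that stops at the first digest that differs, so the glitch does not have to be spotted by eye. This is only a sketch: the function name hash_until_mismatch and its default of 1000 reads are made up for illustration and are not part of any Ceph tooling.

```shell
#!/bin/sh
# Hypothetical helper (not from the original mail): hash FILE up to MAX times
# and stop at the first digest that differs from the first one seen.
hash_until_mismatch() {
    local file max first current i
    file=$1
    max=${2:-1000}        # assumed default; pick whatever suits the test
    first=""
    i=1
    while [ "$i" -le "$max" ]; do
        # Reading from stdin makes sha1sum print "<digest>  -";
        # keep only the digest.
        current=$(sha1sum < "$file" | cut -d' ' -f1)
        if [ -z "$first" ]; then
            first=$current
        elif [ "$current" != "$first" ]; then
            echo "mismatch on read $i: $first vs $current"
            return 1
        fi
        i=$((i + 1))
    done
    echo "stable after $max reads: $first"
}
```

It would be called as hash_until_mismatch FILE 10000, with FILE being the object file from the loop above. A non-zero exit status means that two reads of the same unchanged file produced different digests.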

So the checksum of the same, unchanged file occasionally comes out different. We copied the file repeatedly until two copies differed, then compared them:
# diff 48.bin 49.bin
40095c40095
< hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtPvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC
---
> hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtTvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC
So the two copies differ in a single bit: 'P' (0x50) in one copy versus 'T' (0x54) in the other.
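
The single-bit claim can be checked with a little shell arithmetic: XOR the two byte values and count the bits that are set. The helper name bit_diff is hypothetical and only serves this illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the XOR of two byte values and the number of
# bits in which they differ.
bit_diff() {
    local x n
    x=$(( $1 ^ $2 ))      # bits that differ are set in the XOR
    n=0
    printf '0x%02x ' "$x"
    while [ "$x" -ne 0 ]; do
        n=$(( n + (x & 1) ))   # count the lowest bit, then shift it out
        x=$(( x >> 1 ))
    done
    echo "$n"
}

# bit_diff 0x50 0x54   prints "0x04 1": only bit 2 differs
```

To locate the differing byte in the copies without diff, cmp -l 48.bin 49.bin should list each differing byte offset (decimal) together with the two byte values (octal).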

I think this presentation on silent data corruption could be very useful:
https://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf

We will test all of our RAM modules now (this should have been done earlier, of course...), but it seems you have to be very careful with cheap commodity hardware.

Regards,
Csaba

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
