Re: Scrubbing question

Hi,
I have also seen inconsistent PGs even though the md5 was the same on all
objects. However, all my hardware uses ECC RAM, which, as I understand it,
should prevent this type of error. To be clear: in your case, were you
using ECC or non-ECC modules?

--
Tomasz Kuzemko
tomasz.kuzemko@xxxxxxx

On 26.11.2015 at 15:23, Major Csaba wrote:
> Hi,
> 
> On 11/25/2015 06:41 PM, Robert LeBlanc wrote:
>> Since the one that is different is not your primary for the pg, then
>> pg repair is safe.
> Ok, that's clear thanks.
> I think we managed to identify the root cause of the scrubbing errors
> that occur even though the files are identical.
> It seems to be a hardware issue (a faulty RAM module), which is really
> hard to detect, even if you have an ECC-capable module.
> 
> The glitch happens here:
> node2:~# while true; do sha1sum /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1; sleep 0.1; done
> acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
> ...
> acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
> acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
> acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
> acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
> 4163ca9a76ed7b0b9f0e69ab5a1793cd1cf7d1c4  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
> ....
> 
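The loop quoted above can be turned into a pass/fail check. This is only a rough sketch, assuming a POSIX shell and coreutils' sha1sum; the function name count_distinct_hashes and the default of 50 runs are made up for illustration. Repeated reads are likely served from the page cache, so this probably exercises RAM rather than the disk.

```shell
# count_distinct_hashes FILE [RUNS]
# Hash FILE repeatedly and print how many distinct SHA-1 digests were seen.
# On healthy hardware an unchanged file should always give 1.
# (Hypothetical helper; name and default run count are assumptions.)
count_distinct_hashes() {
    file=$1
    runs=${2:-50}
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Keep only the digest column of the sha1sum output.
        sha1sum "$file" | cut -d' ' -f1
        i=$((i + 1))
    done | sort -u | wc -l | tr -d ' '
}
```

Something like `count_distinct_hashes /path/to/object 100` printing anything other than 1 would point at the same kind of glitch, provided nothing is writing to the file in the meantime.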
> So sha1sum sometimes calculates a different value. We copied the file
> several times to find the difference:
> # diff 48.bin 49.bin
> 40095c40095
> < hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtPvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC
> ---
> > hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtTvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC
> So the two copies differ by a single bit ('P' = 0x50 vs. 'T' = 0x54).
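The single-bit claim is easy to double-check by XORing the two byte values and counting the set bits. A minimal sketch in POSIX shell arithmetic follows; bit_diff is a made-up helper name, not an existing tool. To locate the differing bytes in the first place, `cmp -l 48.bin 49.bin` should list each differing offset together with the two byte values in octal.

```shell
# bit_diff A B
# Print the number of bit positions in which two byte values differ.
# (Hypothetical helper for illustration.)
bit_diff() {
    x=$(( $1 ^ $2 ))      # XOR leaves a 1 in every differing bit position
    n=0
    while [ "$x" -ne 0 ]; do
        n=$(( n + (x & 1) ))
        x=$(( x >> 1 ))
    done
    echo "$n"
}

bit_diff 0x50 0x54   # 'P' vs 'T' from the diff quoted above; prints 1
```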
> 
> I think this presentation on silent data corruption could be very
> useful:
> https://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf
> 
> We will test all of our RAM modules now (that should have happened
> earlier, of course...), but it seems you have to be very careful with
> cheap commodity hardware.
> 
> Regards,
> Csaba
> 
> 
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
