Hi,

I have also seen inconsistent PGs even though the md5 was the same on all objects. However, all my hardware uses ECC RAM, which, as I understand it, should prevent this type of error. To be clear: in your case, were you using ECC or non-ECC modules?

-- 
Tomasz Kuzemko
tomasz.kuzemko@xxxxxxx

On 26.11.2015 at 15:23, Major Csaba wrote:
> Hi,
>
> On 11/25/2015 06:41 PM, Robert LeBlanc wrote:
>> Since the one that is different is not your primary for the pg, then
>> pg repair is safe.
> Ok, that's clear thanks.
> I think we managed to identify the root cause of the scrubbing errors
> even if the files are identical.
> It seems to be a hardware issue (faulty RAM module), which is really
> hard to detect, even if you have an ECC capable module.
>
> The glitch happens here:
> node2:~# while true; do sha1sum /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1; sleep 0.1; done
> acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
> ...
> acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
> acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
> acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
> acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
> 4163ca9a76ed7b0b9f0e69ab5a1793cd1cf7d1c4  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
> ....
>
> So, sometimes it calculates different values. We managed to copy this
> file several times to find the difference:
> # diff 48.bin 49.bin
> 40095c40095
> < hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtPvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC
> ---
> > hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtTvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC
> So, it has a single bit difference (0x50 vs 0x54)
>
> I think this presentation could be very useful about the silent
> corruption of data:
> https://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf
>
> We will test all of our RAM modules now (it should have happened before,
> of course...), but it seems you have to be very careful with the cheap
> commodity hardware.
>
> Regards,
> Csaba
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
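
The two checks described in the quoted message (hashing the same file repeatedly, then comparing two copies to locate the flipped bit) could be sketched in Python roughly as follows. This is only an illustration, not what was actually run on node2: the function names `hash_repeatedly` and `bit_differences` are made up for the example, and the sample bytes are just the few characters around the differing position in the diff output.

```python
import hashlib
from collections import Counter
from pathlib import Path


def hash_repeatedly(path, rounds=50):
    """Read and SHA-1 the same file `rounds` times; count each distinct digest.

    An unchanged file should yield exactly one digest. More than one
    suggests the data is being altered somewhere on the read path.
    """
    digests = Counter()
    for _ in range(rounds):
        digests[hashlib.sha1(Path(path).read_bytes()).hexdigest()] += 1
    return digests


def bit_differences(a: bytes, b: bytes):
    """Return (offset, byte_a, byte_b, flipped_bits) for each differing byte."""
    if len(a) != len(b):
        raise ValueError("inputs differ in length; compare equal-sized copies")
    return [
        (i, x, y, bin(x ^ y).count("1"))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


if __name__ == "__main__":
    # Hypothetical sample: 'P' (0x50) vs 'T' (0x54), the pair from the diff.
    for offset, x, y, bits in bit_differences(b"jtPvjT", b"jtTvjT"):
        print(f"offset {offset}: 0x{x:02x} vs 0x{y:02x}, {bits} bit(s) flipped")
```

For real files, something like `bit_differences(Path("48.bin").read_bytes(), Path("49.bin").read_bytes())` would list every differing byte. A single entry with one flipped bit matches the 0x50 vs 0x54 finding, since 0x50 XOR 0x54 = 0x04.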