Another corruption detection/correction question - exposure between 'event' and 'repair'?

"Chris Murray" <chrismurray84@xxxxxxxxx> · Wed, 23 Dec 2015 09:13:03 -0000

After messing up some of my data in the past (my own doing, playing with
BTRFS in old kernels), I've been extra cautious and now run a ZFS mirror
across multiple RBD images. It's led me to believe that I have a faulty
SSD in one of my hosts:

sdb without a journal - fine (but slow)
sdc without a journal - fine (but slow)
sdd without a journal - fine (but slow)

sdb with sda4 as journal - checksum errors appear in ZFS
sdc with sda5 as journal - checksum errors appear in ZFS
sdd with sda6 as journal - checksum errors appear in ZFS
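
The elimination in the two lists above can be sketched in a few lines of
Python. This is only an illustration of the reasoning: the dictionary
layout and the function names (parent_disk, devices_used,
suspect_devices) are made up for this sketch and are not part of Ceph or
ZFS.

```python
# Observations from the tests above:
# (data disk, journal partition or None) -> were checksum errors seen in ZFS?
observations = {
    ("sdb", None): False,
    ("sdc", None): False,
    ("sdd", None): False,
    ("sdb", "sda4"): True,
    ("sdc", "sda5"): True,
    ("sdd", "sda6"): True,
}


def parent_disk(partition):
    """Strip the trailing partition number, e.g. 'sda4' -> 'sda'."""
    return partition.rstrip("0123456789")


def devices_used(data_disk, journal):
    """Return the set of physical devices involved in one test."""
    used = {data_disk}
    if journal is not None:
        used.add(parent_disk(journal))
    return used


def suspect_devices(obs):
    """Devices present in every failing test and absent from every passing one."""
    failing = [devices_used(d, j) for (d, j), bad in obs.items() if bad]
    passing = [devices_used(d, j) for (d, j), bad in obs.items() if not bad]
    if not failing:
        return set()
    common = set.intersection(*failing)
    cleared = set().union(*passing)
    return common - cleared


print(suspect_devices(observations))  # prints {'sda'}
```

This is elimination only: it points at sda as the common factor, but says
nothing about how or why it misbehaves.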

So, I believe the SSD in sda is in some way defective, but my question
is about the detection and correction of this 'corruption'.

The cluster currently reports "nodeep-scrub flag(s) set", because of the
performance impact of deep scrubbing. When I do let deep scrubs run, they
seem to find problems, which I can then repair. However ... is this a
safe repair, using a good copy of each object? Will it be with NewStore?
I still regularly get errors bubbling their way up into ZFS, but I can't
reliably tell whether they result from corruption that happened *before*
the next Ceph deep scrub (so the data is exposed during that window
anyway?), or whether they appear *after* a repair.
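
For reference, the scrub-then-repair cycle described above could be
scripted roughly as follows. This is a minimal sketch, not a recommended
procedure: it assumes `ceph health detail` reports inconsistent placement
groups on lines of the form "pg <pgid> is <states>, acting [...]", the
SAMPLE text is hand-written for illustration rather than captured from a
real cluster, and the function names are hypothetical. It only prints
`ceph pg repair` commands instead of running them, and it does not
answer whether repair picks a good copy.

```python
import re

# Assumed line shape: "pg 0.6 is active+clean+inconsistent, acting [0,1,2]"
PG_LINE = re.compile(r"^pg\s+(\S+)\s+is\s+(\S+),\s+acting")


def inconsistent_pgs(health_detail):
    """Return the ids of PGs whose state list includes 'inconsistent'."""
    pgs = []
    for line in health_detail.splitlines():
        match = PG_LINE.match(line.strip())
        if match and "inconsistent" in match.group(2).split("+"):
            pgs.append(match.group(1))
    return pgs


def repair_commands(pgs):
    """Build (but do not run) one 'ceph pg repair' command per PG id."""
    return ["ceph pg repair " + pg for pg in pgs]


# Hand-written sample, for illustration only.
SAMPLE = """\
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
pg 0.9 is active+clean, acting [2,0,1]
2 scrub errors
"""

for command in repair_commands(inconsistent_pgs(SAMPLE)):
    print(command)  # prints: ceph pg repair 0.6
```

On a live cluster the input would come from `ceph health detail` after
running `ceph osd unset nodeep-scrub` and waiting for (or requesting, via
`ceph pg deep-scrub <pgid>`) a deep scrub of the PGs in question.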

I'm obviously hoping for an eventual scenario where this is all
transparent to the ZFS layer and it stops detecting checksum errors :)

Thanks,
Chris

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com