On 7/20/18 3:20 AM, Filippo Giunchedi wrote:
> On Tue, Jul 17, 2018 at 11:26 AM Carlos Maiolino <cmaiolino@xxxxxxxxxx> wrote:
>>>> Ouch, indeed we've seen this problem on multiple nodes, said hosts
>>>> belong to the same and latest shipment from the OEM. We'll run
>>>> hardware diagnostics on these hosts and others we've received at
>>>> another datacenter (which haven't shown issues so far but don't serve
>>>> reads either).
>>>
>>> Update on this: we've ran hw diagnostics and couldn't find anything
>>> wrong, xfs_repair does fix the issue so we'll be going ahead with
>>> that. Is there anything we can do to help debugging in case this
>>> happens again?
>>>
>>
>> There is a patch being discussed on list to help catch these bit corruptions
>> before they reach the disk, but, bear in mind we can only improve the validation
>> of our metadata. Nothing actually forbids these bit flips are occurring on your
>> data, and you are actually writing corrupted data into your files.
>
> We've found no other cases of bit flips or corruption in metadata or
> the data itself though.
> To recap what we've seen, hardware bit flipping is extremely unlikely:
> the same type of sb_fdblocks corruption has appeared on four different
> hosts affecting at most one third of xfs filesystems per host. Also
> the corruption looks always the same, namely the 33rd bit flipped
> which also seems suspicious.

Running a debug kernel with memory poisoning, KASAN, or something
similar might help catch it if it's a stray memory write of some sort...

-Eric
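
For anyone comparing a bad sb_fdblocks value against the expected one, here is a minimal sketch of how to check that the two differ by exactly one flipped bit and which bit it is. The function name and the sample values are hypothetical, and it assumes "33rd bit" means counting from 1 at the least significant bit (so bit index 32); nothing here comes from the xfs tools themselves.

```python
def single_bit_flip(expected: int, observed: int):
    """Return the 1-based position (LSB = 1) of the flipped bit if the two
    values differ in exactly one bit, otherwise None."""
    diff = expected ^ observed          # bits that differ between the values
    # A power of two has exactly one bit set: diff & (diff - 1) clears it.
    if diff != 0 and diff & (diff - 1) == 0:
        return diff.bit_length()        # 1-based position of the set bit
    return None


# Hypothetical free-block counts, not taken from the affected hosts.
expected = 1_000_000
observed = expected ^ (1 << 32)         # flip bit index 32, i.e. the 33rd bit

print(single_bit_flip(expected, observed))   # 33
print(single_bit_flip(expected, expected))   # None (no difference)
```

If the same position comes back on every affected filesystem, that would be consistent with the pattern described above, though it does not by itself distinguish a stray memory write from other causes.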