On Mon, Jul 16, 2018 at 11:29:51AM +0200, Filippo Giunchedi wrote:
> On Wed, Jul 11, 2018 at 10:31 AM Filippo Giunchedi
> <fgiunchedi@xxxxxxxxxxxxx> wrote:
> > > that sb_fdblocks really is ~17T which indicates the problem
> > > really is on disk.
> > >
> > > 4461713825
> > > 100001001111100000101100110100001
> > > 166746529
> > > 1001111100000101100110100001
> > >
> > > you have a bit flipped in the problematic value... but you're running
> > > with CRCs so it seems unlikely to have been some sort of bit-rot (that,
> > > and the fact that you're hitting the same problem on multiple nodes).
> >
> > Ouch, indeed we've seen this problem on multiple nodes, said hosts
> > belong to the same and latest shipment from the OEM. We'll run
> > hardware diagnostics on these hosts and others we've received at
> > another datacenter (which haven't shown issues so far but don't serve
> > reads either).
>
> Update on this: we've ran hw diagnostics and couldn't find anything
> wrong, xfs_repair does fix the issue so we'll be going ahead with
> that. Is there anything we can do to help debugging in case this
> happens again?
>

There is a patch being discussed on the list to help catch these bit
corruptions before they reach the disk, but bear in mind that we can only
improve the validation of our metadata. Nothing prevents these bit flips
from also occurring in your data, in which case you would be writing
corrupted data into your files.

Cheers

> thanks a lot!
> Filippo

-- 
Carlos
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
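
For reference, the bit-flip observation quoted in the thread can be checked with a few lines of Python. This is only a minimal sketch: the helper name flipped_bits is made up for illustration and is not part of any XFS tool. The two values are the ones quoted in the thread; calling them "bad" and "good" is my own labelling. XORing them leaves set only the bits that differ.

```python
def flipped_bits(a, b):
    """Return the positions (0 = least significant) of the bits that differ."""
    diff = a ^ b  # XOR keeps only the bits where a and b disagree
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


# Values quoted in the thread; the variable names are illustrative assumptions.
bad = 4461713825   # the problematic sb_fdblocks value
good = 166746529   # the same value without the extra high bit

print(flipped_bits(bad, good))  # [32]
```

The two values differ by exactly 2**32, which matches the single extra leading bit visible in the binary representations quoted in the thread.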