On Thu, Mar 22, 2018 at 02:03:28PM -0400, Brian Foster wrote:
> On Fri, Mar 23, 2018 at 02:02:26AM +1100, Chris Dunlop wrote:
> > Eyeballing the corrupted blocks and matching good blocks doesn't show
> > any obvious pattern. The files themselves contain compressed data so
> > it's all highly random at the block level, and the corruptions
> > themselves similarly look like random bytes.
> >
> > The corrupt blocks are not a copy of other data in the file within the
> > surrounding 256k of the corrupt block.
> >
>
> So you obviously have a fairly large/complex storage configuration. I
> think you have to assume that this corruption could be introduced pretty
> much anywhere in the stack (network, mm, fs, block layer, md) until it
> can be narrowed down.
>
> > ----------------------------------------------------------------------
> > System configuration
> > ----------------------------------------------------------------------
> >
> > linux-4.9.76
> > xfsprogs 4.10
> > CPU: 2 x E5620 (16 cores total)
> > 192G RAM
> >
> > # grep bigfs /etc/mtab
> > /dev/mapper/vg00-bigfs /bigfs xfs rw,noatime,attr2,inode64,logbsize=256k,sunit=1024,swidth=9216,noquota 0 0
> > # xfs_info /bigfs
> > meta-data=/dev/mapper/vg00-bigfs isize=512    agcount=246, agsize=268435328 blks
> >          =                       sectsz=4096  attr=2, projid32bit=1
> >          =                       crc=1        finobt=1 spinodes=0 rmapbt=0
> >          =                       reflink=0
> > data     =                       bsize=4096   blocks=65929101312, imaxpct=5
> >          =                       sunit=128    swidth=1152 blks
> > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > log      =internal               bsize=4096   blocks=521728, version=2
> >          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> >
> > XFS on LVM on 6 x PVs, each PV is md raid-6, each with 11 x hdd.

Are these all on the one raid controller? i.e. what's the physical
layout of all these disks?

> > The raids all check clean.
> >
> > The XFS has been expanded a number of times.
> >
> > ----------------------------------------------------------------------
> > Explicit example...
> > ----------------------------------------------------------------------
> >
> > 2018-03-04 21:40:44 data + md5 files written
> > 2018-03-04 22:43:33 checksum mismatch detected
>
> Seems like the corruption is detected fairly soon after creation. How
> often are these files explicitly checked/read? I also assume the files
> aren't ever modified..?
>
> FWIW, the patterns that you have shown so far do seem to suggest
> something higher level than a physical storage problem. Otherwise, I'd
> expect these instances wouldn't always necessarily land in file data.
> Have you run 'xfs_repair -n' on the fs to confirm there aren't any other
> problems?
>
> OTOH, a 256b corruption seems quite unusual for a filesystem with 4k
> blocks. I suppose that could suggest some kind of memory/cache
> corruption as opposed to a bad page/extent state or something of that
> nature.

Especially with the data write mechanisms being used - e.g. NFS won't be
doing partial sector reads and writes for data transfer - it'll all be
done in blocks much larger than the filesystem block size (e.g. 1MB IOs).

> Hmm, I guess the only productive thing I can think of right now is to
> see if you can try and detect the problem as soon as possible. For e.g.,
> it sounds like this is a closed system. If so, could you follow up every
> file creation with an immediate md5 verification (perhaps followed by an
> fadvise(DONTNEED) and another md5 check to try and catch an inconsistent
> pagecache)? Perhaps others might have further ideas..
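That immediate verify / fadvise(DONTNEED) / re-verify check is trivial to
script, too. A rough, untested sketch (Python 3; the 1MB read size and the
command line handling are arbitrary):

#!/usr/bin/env python3
# Rough sketch: md5 the file through the page cache, drop the cached
# pages with posix_fadvise(DONTNEED), then md5 it again so the second
# pass has to re-read from the storage. A mismatch between the two
# points at an inconsistent page cache (or corruption below it).
import hashlib
import os
import sys

def md5_of_fd(fd):
    h = hashlib.md5()
    os.lseek(fd, 0, os.SEEK_SET)
    while True:
        buf = os.read(fd, 1 << 20)          # 1MB reads, size is arbitrary
        if not buf:
            break
        h.update(buf)
    return h.hexdigest()

def verify(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        cached = md5_of_fd(fd)
        # Throw away the cached pages so the next pass hits the disks.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        uncached = md5_of_fd(fd)
    finally:
        os.close(fd)
    status = "OK" if cached == uncached else "MISMATCH"
    print("%s %s cached=%s uncached=%s" % (status, path, cached, uncached))
    return cached == uncached

if __name__ == "__main__":
    results = [verify(p) for p in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)

If the md5 file written alongside the data is available at that point,
checking against it as well would tell you whether the data was already
bad before it ever hit the page cache.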
Basically, the only steps now are a methodical, layer by layer checking
of the IO path to isolate where the corruption is being introduced.
First you need a somewhat reliable reproducer that can be used for
debugging.

Write patterned files (e.g. encode a file id, file offset and 16 bit
cksum in every 8 byte chunk) and then verify them. When you get a
corruption, the corrupted data will tell you where the corruption came
from. It'll either be silent bit flips, some other files' data, or it
will be stale data.

See if the corruption pattern is consistent. See if the locations
correlate to a single disk, a single raid controller, a single
backplane, etc. i.e. try to find some pattern to the corruption.

Unfortunately, I can't find the repository for the data checking tools
that were developed years ago for doing exactly this sort of testing
(genstream+checkstream) online anymore - they seem to have disappeared
from the internet. (*) Shouldn't be too hard to write a quick tool to do
this, though (a rough sketch is appended below the sig).

Also worth testing is whether the same corruption occurs when you use
direct IO to write and read the files. That would rule out a large chunk
of the filesystem and OS code as the cause of the corruption.

(*) Google is completely useless for searching for historic things,
mailing lists and/or code these days. Searching google now reminds me of
the bad old days of AltaVista - "never finds what I'm looking for"....

> > file size: 31232491008 bytes
> >
> > The file is moved to "badfile", and the file regenerated from source
> > data as "goodfile".

What does "regenerated from source" mean? Does that mean a new file is
created, compressed and then copied across? Or is it just the original
file being copied again?

> > From extent 16, the actual corrupt sector offset within the lv device
> > underneath xfs is:
> >
> >   289315926016 + (53906431 - 45826040) == 289324006407
> >
> > Then we can look at the devices underneath the lv:
> >
> > # lvs --units s -o lv_name,seg_start,seg_size,devices
> >   LV    Start         SSize         Devices
> >   bigfs            0S 105486999552S /dev/md0(0)
> >   bigfs 105486999552S 105487007744S /dev/md4(0)
> >   bigfs 210974007296S 105487007744S /dev/md9(0)
> >   bigfs 316461015040S  35160866816S /dev/md1(0)
> >   bigfs 351621881856S 105487007744S /dev/md5(0)
> >   bigfs 457108889600S  70323920896S /dev/md3(0)
> >
> > Comparing our corrupt sector lv offset with the start sector of each md
> > device, we can see the corrupt sector is within /dev/md9 and not at a
> > boundary. The corrupt sector offset within the lv data on md9 is given
> > by:

Does the problem always occur on /dev/md9? If so, does the location
correlate to a single disk in /dev/md9?

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
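As mentioned above, here's a rough, untested sketch of the sort of
patterned-file tool I mean. It's Python 3 for brevity - a real tool would
want to be in C for speed - and the chunk layout (16 bit file id, 32 bit
chunk number in 8-byte units, 16 bit checksum of the other six bytes) is
just one way of doing it; 32 bit chunk numbers limit it to 32GiB files,
and using a different file id for each file/run means stale data from an
earlier run also shows up as a wrong id.

#!/usr/bin/env python3
# Rough sketch of a patterned-file generator/checker along the lines of
# genstream/checkstream: every 8-byte chunk carries a 16 bit file id, a
# 32 bit chunk number and a 16 bit checksum of the other six bytes.
# When a verify fails, decoding the bad chunk tells you whether you're
# looking at bit flips (checksum wrong), another file's or run's data
# (wrong file id), or misplaced/stale data from elsewhere in the file
# (right file id, wrong chunk number).
import struct
import sys

CHUNK = 8
IO_SIZE = 1 << 20                   # build/verify in 1MB buffers

def make_chunk(file_id, chunk_no):
    hdr = struct.pack('<HI', file_id & 0xffff, chunk_no & 0xffffffff)
    return hdr + struct.pack('<H', sum(hdr) & 0xffff)

def write_file(path, file_id, size_bytes):
    with open(path, 'wb') as f:
        buf = bytearray()
        for n in range(size_bytes // CHUNK):
            buf += make_chunk(file_id, n)
            if len(buf) >= IO_SIZE:
                f.write(buf)
                buf.clear()
        f.write(buf)

def verify_file(path, file_id):
    bad = 0
    chunk_no = 0
    with open(path, 'rb') as f:
        while True:
            buf = f.read(IO_SIZE)
            if not buf:
                break
            for off in range(0, len(buf), CHUNK):
                chunk = buf[off:off + CHUNK]
                if chunk != make_chunk(file_id, chunk_no):
                    fid, cno, ck = struct.unpack('<HIH', chunk)
                    print("bad chunk at byte offset %d: file_id=%d "
                          "chunk_no=%d cksum=0x%04x (expected file_id=%d "
                          "chunk_no=%d)" % (chunk_no * CHUNK, fid, cno, ck,
                                            file_id, chunk_no))
                    bad += 1
                chunk_no += 1
    return bad

if __name__ == "__main__":
    # usage: patfile.py write <path> <file_id> <size_bytes>
    #        patfile.py verify <path> <file_id>
    if sys.argv[1] == "write":
        write_file(sys.argv[2], int(sys.argv[3]), int(sys.argv[4]))
    else:
        sys.exit(1 if verify_file(sys.argv[2], int(sys.argv[3])) else 0)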
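And for checking whether the corruptions keep landing on /dev/md9, the
lv-offset-to-md-segment step in your example is easy enough to repeat
with something like the sketch below (segment table hard coded from the
lvs output quoted above; going from the md offset down to an individual
member disk additionally needs the PV data offset and the raid chunk
size/layout, which aren't in the output quoted here):

#!/usr/bin/env python3
# Map an LV sector offset to the underlying PV segment, using the
# segment table from 'lvs --units s -o lv_name,seg_start,seg_size,devices'
# quoted above. Regenerate the table if the LV layout changes.
SEGMENTS = [
    # (seg_start, seg_size, device), in sectors as reported by lvs --units s
    (           0, 105486999552, "/dev/md0"),
    (105486999552, 105487007744, "/dev/md4"),
    (210974007296, 105487007744, "/dev/md9"),
    (316461015040,  35160866816, "/dev/md1"),
    (351621881856, 105487007744, "/dev/md5"),
    (457108889600,  70323920896, "/dev/md3"),
]

def lv_sector_to_segment(lv_sector):
    for start, size, dev in SEGMENTS:
        if start <= lv_sector < start + size:
            return dev, lv_sector - start
    raise ValueError("sector %d is beyond the end of the LV" % lv_sector)

if __name__ == "__main__":
    # The corrupt sector from the example above lands in /dev/md9:
    print(lv_sector_to_segment(289324006407))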