Re: file corruptions, 2nd half of 512b block

On Thu, Mar 22, 2018 at 02:03:28PM -0400, Brian Foster wrote:
On Fri, Mar 23, 2018 at 02:02:26AM +1100, Chris Dunlop wrote:
Hi,

I'm experiencing 256-byte corruptions in files on XFS on 4.9.76.

System configuration details below.

For those cases where the corrupt file can be regenerated from other
data and the new file compared to the corrupt file (15 files in all),
the corruptions are invariably in the 2nd 256b half of a 512b sector,
part way through the file. That's pretty odd! Perhaps some kind of
buffer tail problem?

Are there any known issues that might cause this?

Nothing that I can think of. A quick look through the writeback changes
shows this[1] commit, but I'd expect any corruption in that case to
manifest as page size (4k) rather than at 256b granularity.

[1] 40214d128e ("xfs: trim writepage mapping to within eof")

Looks like that issue can occur if the file is closed, then reopened and appended to. That's possible with the files written via ftp (the ftp upload allows for continuation of partial files), but not the files written via NFS - if they're incomplete they're removed and started from scratch.

So you obviously have a fairly large/complex storage configuration. I
think you have to assume that this corruption could be introduced pretty
much anywhere in the stack (network, mm, fs, block layer, md) until it
can be narrowed down.

Yup.

Per below I'm seeing a good checksum a bit after arrival and bad checksum some time later, so looks like it's not network.

2018-03-04 21:40:44 data + md5 files written
2018-03-04 22:43:33 checksum mismatch detected

Seems like the corruption is detected fairly soon after creation. How
often are these files explicitly checked/read? I also assume the files
aren't ever modified..?

Correct, the files aren't ever (deliberately) modified.

The files are generally checked once, some time (minutes to hours) after landing. After the first check I've been (perhaps foolishly) relying on raid6 scrubs to keep the data intact.

The files may be read a few times more over the course of a month, then they're either removed or just sit there quietly for months to years.

FWIW, the patterns that you have shown so far do seem to suggest
something higher level than a physical storage problem. Otherwise, I'd
expect these instances wouldn't always necessarily land in file data.
Have you run 'xfs_repair -n' on the fs to confirm there aren't any other
problems?

I haven't tried xfs_repair yet. With 181T used and a high (but at this point unknown) number of dirs and files, I imagine it will take quite a while, and the filesystem shouldn't really be unavailable for more than a few hours. I can use an LVM snapshot to do the 'xfs_repair -n', but I need to add enough spare capacity to hold the amount of data that arrives (at 0.5-1TB/day) during the life of the check / snapshot. That might take a bit of fiddling because the system is getting short on drive bays.

Is it possible to work out approximately how long the check might take?

OTOH, a 256b corruption seems quite unusual for a filesystem with 4k
blocks. I suppose that could suggest some kind of memory/cache
corruption as opposed to a bad page/extent state or something of that
nature.

I should have mentioned in the system summary: it's ECC RAM, with no EDAC errors coming up. So it shouldn't be memory corruption due to a bad stick or whatever. But, yes, there can be other causes.

Hmm, I guess the only productive thing I can think of right now is to
see if you can try and detect the problem as soon as possible. For e.g.,
it sounds like this is a closed system. If so, could you follow up every
file creation with an immediate md5 verification (perhaps followed by an
fadvise(DONTNEED) and another md5 check to try and catch an inconsistent
pagecache)? Perhaps others might have further ideas..

The check runs "soon" after file arrival (usually minutes), but not immediately. I could potentially alter the ftp receiver to calculate the md5 as the file data is received and cross check with the md5 file at the end, but doing the same with the files that arrive via NFS would be difficult.
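For illustration, computing the md5 while the data is still arriving could look roughly like the sketch below. It is not the actual ftp receiver: `receive_and_hash`, `chunks` and `out` are hypothetical names, with `chunks` standing in for whatever the receiver reads off the socket.

```python
import hashlib


def receive_and_hash(chunks, out):
    """Write each received chunk to `out` and feed it to md5 on the way through.

    `chunks` is any iterable of bytes objects (hypothetical stand-in for the
    data read from the network); `out` is a binary file-like object.
    Returns the hex digest of everything that was written.
    """
    digest = hashlib.md5()
    for chunk in chunks:
        out.write(chunk)
        digest.update(chunk)
    return digest.hexdigest()
```

The returned digest reflects the bytes as they came off the wire, so comparing it with the value in the md5 file would separate "arrived bad" from "went bad after landing".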

The great majority of the corruptions have been in the files arriving via NFS - possibly because those files tend to be much larger so random corruptions simply hit them more, but also I guess possibly because NFS is more susceptible to whatever is causing the problem.

I have a number of instances where it definitely looks like the file has made it to the filesystem (but not necessarily disk) and checked ok, only to later fail the md5 check, e.g.:

2018-03-12 07:36:56 created
2018-03-12 07:50:05 check ok
2018-03-26 19:02:14 check bad

2018-03-13 08:13:10 created
2018-03-13 08:36:56 check ok
2018-03-26 14:58:39 check bad

2018-03-13 21:06:34 created
2018-03-13 21:11:18 check ok
2018-03-26 19:24:24 check bad

I've now (subsequent to those instances above) updated to your suggestion: do the check first (without DONTNEED), then if the file had pages in the vm before the first check (seen using 'vmtouch' Resident Pages), use DONTNEED (via 'vmtouch -e') and do the check again.

I haven't yet seen any corrupt files with this new scheme (it's now been in place for only 24 hours).

I've not played with vmtouch before so I'm not sure what's normal, but there seems to be some odd behaviour. Most of the time, 'vmtouch -e' clears the file from buffers immediately, but sometimes it leaves a single page resident, even in the face of repeated calls. I understand that fadvise(DONTNEED) is advisory (and of course there's always a chance something else can bring file pages back into vm), so I had it in a loop:

check_pages_buffered
checksum
if pages_were_buffered
 fadvise(DONTNEED)
 whilst pages_buffered
   fadvise(DONTNEED)
   sleep 2
 done
 checksum
fi

I had a case where that loop was running for 2.5 hours before self terminating, in the absence of anything else touching the file (that I could find), and another case where it continued for 1.5 hours before I killed it. It seems a single page can persist in memory (I don't know if it's the same page) for *hours* even with many, many fadvise(DONTNEED) calls. In testing, I was finally able to clear that file from vm using:

 echo 3 > /proc/sys/vm/drop_caches

...but that's a wee bit heavy to use to clear single pages so I'm now breaking the loop if pages_buffered <= 1.
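A rough Python rendering of the loop above, including the "give up at one page" cut-off, might look like this. It is a sketch under assumptions, not the script actually in use: `pages_resident` is a hypothetical callback standing in for the vmtouch "Resident Pages" probe, and `max_tries` is an added bound so the loop cannot spin for hours.

```python
import hashlib
import os
import time


def md5_of(path, chunk_size=1 << 20):
    """Return the md5 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def drop_cache(path):
    """Ask the kernel to drop cached pages for the whole file (advisory only).

    Per posix_fadvise(2), pages not yet written out are unaffected.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


def double_check(path, expected, pages_resident, max_tries=10, delay=2.0):
    """Checksum, then (if the file was cached) evict and checksum again.

    `pages_resident(path)` is a hypothetical callback returning the number
    of pages of the file currently in the page cache.
    Returns (first_ok, second_ok); second_ok is None if no re-check was done.
    """
    was_cached = pages_resident(path) > 0
    first_ok = md5_of(path) == expected
    if not was_cached:
        return first_ok, None
    drop_cache(path)
    tries = 0
    # Stop once at most one page remains, or after max_tries attempts.
    while pages_resident(path) > 1 and tries < max_tries:
        drop_cache(path)
        time.sleep(delay)
        tries += 1
    return first_ok, md5_of(path) == expected
```

A result of (True, False) would point at the page cache and the on-disk data disagreeing, which is the case the second check is meant to catch.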

Any idea what that impressively persistent page is about?

"cmp -l badfile goodfile" shows there are 256 bytes differing, in the
2nd half of (512b) block 53906431.

FWIW, that's the last (512b) sector of the associated (4k) page. Does
that happen to be consistent across whatever other instances you have a
record of?

Huh, I should have noticed that! Yes, all corruptions are the last 256b of a 4k page. And in fact all are the last 256b in the first 4k page of an 8k block. That's odd as well!

FYI, these are the 256b offsets I'm now working with (there have been a few more since I started):

310799
876559
1400335
1676815
3516271
4243471
4919311
6267919
10212879
11520527
11842175
16179215
18018367
22609935
45314111
51365903
60588047
69212175
82352143
107812863
165136351
227067839
527947775
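For anyone wanting to check the alignment against the full list, a small sketch follows. It assumes the offsets above are counted in 256-byte units from the start of the file and that the 4k page and 8k block boundaries are aligned to the same origin; neither is spelled out above. That reading is at least consistent with the 512b block 53906431 mentioned earlier, whose second half is unit 2*53906431+1 = 107812863, which appears in the list. The name `locate` is made up for this example.

```python
from collections import Counter

UNIT = 256    # bytes per offset unit (assumed)
PAGE = 4096   # bytes per page
BLOCK = 8192  # bytes per 8k block

OFFSETS = [
    310799, 876559, 1400335, 1676815, 3516271, 4243471, 4919311, 6267919,
    10212879, 11520527, 11842175, 16179215, 18018367, 22609935, 45314111,
    51365903, 60588047, 69212175, 82352143, 107812863, 165136351,
    227067839, 527947775,
]


def locate(unit):
    """Return (256b slot within its 4k page, 4k page within its 8k block)."""
    byte = unit * UNIT
    slot_in_page = (byte % PAGE) // UNIT    # 0..15, 15 = last 256b of the page
    page_in_block = (byte % BLOCK) // PAGE  # 0 = first page, 1 = second page
    return slot_in_page, page_in_block


slots = Counter(locate(u)[0] for u in OFFSETS)
pages = Counter(locate(u)[1] for u in OFFSETS)
print("slot within 4k page:", dict(slots))
print("page within 8k block:", dict(pages))
```

The two tallies give the position within the page and within the 8k block separately, so newly found offsets can be appended to the list and re-checked.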

Thanks for your time!

Chris