I'll just bump this once before letting it slip into the ether. For anyone who wants to try reproducing this, a rough sketch of the test loop is at the bottom of this mail.

Matt Pallissard

On 2020-02-27T08:28:43, Pallissard, Matthew wrote:
>
> Forgive me if this is the wrong list.
>
> Ok, I have this super infrequent data corruption on write that seems to be limited to NFS v3 async mounts. I have not tested NFS v4 yet. I _think_ I've narrowed it down to kernels X in the range 5.1.4 <= X < 5.5.0 (the lower bound may be earlier). I had some users report random data corruption, and a bit of testing shows that it's reproducible and the corruption is nearly identical every time.
>
> I'd like to get to the bottom of this so I can guarantee that a kernel upgrade will resolve the issue.
>
> What winds up happening is that every several hundred GiB[ish] we wind up with the first half of a 64-bit segment corrupted. Here is some example output from a test. My test writes a few GiB, alternating between 64 bits of `0`'s and 64 bits of `1`'s, then reads the file back in and checks the contents. Re-reading the file shows that it's corrupted on write, not read.
>
> > 2020-02-14 11:04:34 crit found mis-match on word segment 11911168 / 33554432!
> > 2020-02-14 11:04:34 crit found mis-match on byte 7, 188 != 255
> > 2020-02-14 11:04:34 crit found mis-match on byte 6, 0 != 255
> > 2020-02-14 11:04:34 crit found mis-match on byte 5, 16 != 255
> > 2020-02-14 11:04:34 crit found mis-match on byte 4, 128 != 255
> > 2020-02-14 11:04:34 crit 1011110000000000000100001000000011111111111111111111111111111111
>
> > 2020-02-14 13:38:11 crit found mis-match on word segment 1982464 / 33554432!
> > 2020-02-14 13:38:11 crit found mis-match on byte 7, 188 != 255
> > 2020-02-14 13:38:11 crit found mis-match on byte 6, 0 != 255
> > 2020-02-14 13:38:11 crit found mis-match on byte 5, 16 != 255
> > 2020-02-14 13:38:11 crit found mis-match on byte 4, 128 != 255
> > 2020-02-14 13:38:11 crit 1011110000000000000100001000000011111111111111111111111111111111
>
> Knowns:
>
> * does not appear to happen on a CentOS/EL 3.10 series kernel
>
> * does not appear to happen on a 5.5 series kernel
>   * I'm re-running all my tests now to confirm this.
>
> * not hardware dependent
>
> * not processor dependent
>   * I tested 3 different Intel processors
>
> * appears to only happen on NFS v3 async mounts
>   * local disk and `-o sync` NFS v3 mounts have been tested
>
> * it happens on random 64-bit segments
>
> * it's *always* the same 4 bytes that are corrupted
>
> * while often identical, the corrupted bytes are not always identical
>   * the identical corruption pattern can appear on separate computers
>
> * it's *always* on words that are written with `1`'s <- this is the part I find most interesting
>
> * whether or not I explicitly call `fflush` and `sync` has no effect on the results
>
> * usually takes ~80-2000 GiB of writes to reproduce, sometimes higher or lower, but it's infrequent
>   * I've been writing 2 GiB files
>   * sometimes I never hit the corruption case
>
> * I've yet to see more than one corrupted segment in a file
>
>
> A little bit about the build/run environments and the hardware:
>
> CentOS 7
> CentOS glibc 2.17
> clang 9 / lld
> Dell PowerEdge R620
> Dell PowerEdge C6320
> Dell PowerEdge C6420
> Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
> Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>
> * I did compile locally on every box. I also tested every compiled binary on every box. It didn't seem to affect the results.
> * I don't have a tcpdump of this yet. I'm hoping to get that started before the end of the week.
> * I read and write to the same file every time, unlinking it before writing again
> * I have not tried dropping the cache between any of the steps.
> * I have engaged our storage vendor to see what they have to say. They're pretty good at getting useful metrics and insight, so if there is anything I should have them gather server-side, please let me know.
>
>
> If anyone has any insight or additional testing I can perform, I would *greatly* appreciate it. I would be thrilled if this turned out to be some dumb configuration option or other operational thing performed incorrectly.
>
>
> Thank you for your time.
>
> Matt Pallissard
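
In case anyone wants to poke at this in the meantime, the test boils down to something like the sketch below. This is not the actual test program, just a minimal illustration of the write/verify cycle described above; the file name, file size, and messages are placeholders.

/* Sketch of the write/verify test: write a file of alternating 64-bit
 * words of all-0s and all-1s, then read it back and report the first
 * mismatching word and its bytes (byte 7 = most significant here). */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "pattern.dat";              /* placeholder name */
    const size_t words = 2ULL * 1024 * 1024 * 1024 / sizeof(uint64_t);  /* 2 GiB file */
    const uint64_t pattern[2] = { 0ULL, ~0ULL };                        /* 64 bits of 0s / 1s */

    unlink(path);                          /* fresh file each run, as in the report */
    FILE *f = fopen(path, "w");
    if (!f) { perror("fopen(write)"); return 1; }
    for (size_t i = 0; i < words; i++)
        if (fwrite(&pattern[i & 1], sizeof(uint64_t), 1, f) != 1) {
            perror("fwrite"); return 1;
        }
    fflush(f);                             /* makes no difference to the corruption */
    fclose(f);

    f = fopen(path, "r");
    if (!f) { perror("fopen(read)"); return 1; }
    for (size_t i = 0; i < words; i++) {
        uint64_t word;
        if (fread(&word, sizeof(uint64_t), 1, f) != 1) {
            perror("fread"); return 1;
        }
        if (word != pattern[i & 1]) {
            fprintf(stderr, "found mis-match on word segment %zu / %zu!\n", i, words);
            for (int b = 7; b >= 0; b--) {
                unsigned got  = (unsigned)((word >> (b * 8)) & 0xff);
                unsigned want = (unsigned)((pattern[i & 1] >> (b * 8)) & 0xff);
                if (got != want)
                    fprintf(stderr, "found mis-match on byte %d, %u != %u\n", b, got, want);
            }
            fclose(f);
            return 2;
        }
    }
    fclose(f);
    return 0;
}

Run it in a loop against the async NFS v3 mount until it exits non-zero; as noted above, it usually takes somewhere around 80-2000 GiB of writes before a bad word shows up.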