First: I'm not certain whether this is samba, the Linux cifs driver, or something else entirely.

During testing, one of my QA guys was running an in-house program that generates pseudo-random, but fully reproducible, data and writes it to a file. The file is named with what is essentially the seed of the pseudo-random stream, so, given just a filename, the program can read the file back and verify that every byte is correct. (A rough, illustrative sketch of the idea is appended at the end of this mail; it is not our actual tool.)

The file he created was on a CentOS 5.5 machine mounting a cifs share from another CentOS 5.5 host running samba. After 150K individual files, ranging from 35 bytes to 9 GB, he created a 9 GB file that failed validation. He ran the test again with the same seed and it succeeded. He ran it a third time and it failed again. At that point he got me involved.

I found no useful messages (cifs, IO, kernel memory, network, or samba) in any logs on the client or the server anywhere near the times of the file creations.

I cmp'd the files, then used "od -A x -t a" with offsets and diffed the 3 files. Each of the 2 failed files has a single block of 56K (57344 bytes) of NULs. The 2 failed files have these blocks at different points in the files. Each 56K NUL block starts at an offset x where x % 57344 == 0.

first file:

    >>> 519995392 / 57344.
    9068.0        # matching 56K blocks before the one all-NUL 56K block

The NUL block in the second file is certainly on a 1K boundary, but I mislaid the diff data for it, and cmp is taking forever to re-find the offset so I can verify that it too is on a 56K boundary. I'll follow up to this email tomorrow with the result of the cmp. (A small scanner sketch that looks for long NUL runs and checks their alignment is also appended below; it should be much faster than cmp for this.)

So, I searched the kernel source, expecting to find 56K in the sata driver code. Instead, the only place I found it that seemed relevant was:

    ./fs/cifs/README:    wsize    default write size (default 57344)

I have since used cp to copy the file 4 times with tcpdump running at both ends. All 4 copies came across correctly. I don't know whether that is because tcpdump is slowing things down or because our test app is at fault. Our test app talks to the local file system, and not with a 56K block size, so I don't think it is our app.

Unfortunately, the tcpdumps at both ends are reporting the kernel dropping about 50% of the packets, so even if I can get it to fail under capture, I'm still unsure whether the problem is on the client or on the samba server; and "client" would still leave me choosing between our app and fs/cifs.

The only other thing I can think of is the ethernet devices, but since each 56K write is spread across 30+ ethernet frames, and TCP carries a payload checksum, I can't see the network layers being the culprit. Just in case, here is the hardware:

client w/ fs/cifs:
    04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)

samba server:
    01:01.0 Ethernet controller: Intel Corporation 82547GI Gigabit Ethernet Controller
    03:02.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller

A few questions:

0. Does anyone already know of a bug in fs/cifs or samba that has this symptom?
1. Does anyone know how to keep the kernel from dropping the packets during capture?
2. Any other ideas on what I can do to gather more data to differentiate between bad app, fs/cifs, samba, or some other element in the chain?

Thank you for all the work you guys do!

--
Wayne Walker
wwalker@xxxxxxxxxxxxxxxxxxxx
(512) 633-8076

Senior Consultant
Solid Constructs, LLC

> A: Because it messes up the order in which people normally read text.
>> Q: Why is top-posting such a bad thing?
>>> A: Top-posting.
>>>> Q: What is the most annoying thing in e-mail?
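
P.S. For anyone curious how the test data is generated and checked: the sketch below is NOT our in-house tool, just a minimal Python illustration of the same idea. The naming convention "<seed>_<length>.dat" and the use of Python's random module are made up for the example; the real tool uses its own PRNG and naming.

    # seeded_verify.py - illustrative sketch only, not our actual test tool.
    # The file name carries the PRNG seed and the expected length, so the
    # expected byte stream can be regenerated on demand and compared
    # against what actually landed on disk.

    import os
    import random
    import sys

    CHUNK = 65536  # compare in 64K chunks

    def expected_stream(seed, length):
        """Yield the pseudo-random bytes the file *should* contain."""
        rng = random.Random(seed)
        remaining = length
        while remaining > 0:
            n = min(CHUNK, remaining)
            yield bytes(bytearray(rng.getrandbits(8) for _ in range(n)))
            remaining -= n

    def verify(path):
        # hypothetical naming convention: "<seed>_<length>.dat"
        name = os.path.basename(path)
        seed_s, length_s = os.path.splitext(name)[0].split("_")
        seed, length = int(seed_s), int(length_s)
        offset = 0
        with open(path, "rb") as f:
            for want in expected_stream(seed, length):
                got = f.read(len(want))
                if got != want:
                    print("mismatch at offset %d" % offset)
                    return False
                offset += len(want)
            if f.read(1):
                print("file longer than expected")
                return False
        return True

    if __name__ == "__main__":
        sys.exit(0 if verify(sys.argv[1]) else 1)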
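
And the scanner I mentioned above: again just a sketch, written for this mail rather than taken from anything we run. It walks a file in 1K steps, reports every long run of NUL bytes with its start offset and length, and shows the start offset modulo 57344 (the default cifs wsize) so the alignment can be checked without waiting for cmp. With pseudo-random test data even a 1K run of zeros is effectively impossible, so any hit marks the corruption. Start offsets are reported at 1K granularity.

    # nulscan.py - sketch: locate long NUL runs and check 56K alignment.

    import sys

    STEP = 1024                   # scan granularity
    WSIZE = 57344                 # default cifs write size per fs/cifs/README
    ZERO = b"\x00" * STEP

    def report(start, length):
        print("NUL run: start %d (start %% %d = %d), length %d"
              % (start, WSIZE, start % WSIZE, length))

    def scan(path):
        run_start = None
        offset = 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(STEP)
                if not chunk:
                    break
                if chunk == ZERO[:len(chunk)]:
                    # inside (or starting) a run of NUL bytes
                    if run_start is None:
                        run_start = offset
                else:
                    # run (if any) just ended; report it
                    if run_start is not None:
                        report(run_start, offset - run_start)
                        run_start = None
                offset += len(chunk)
        if run_start is not None:
            report(run_start, offset - run_start)

    if __name__ == "__main__":
        for name in sys.argv[1:]:
            print("== %s" % name)
            scan(name)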