[Bug 219300] ext4 corrupts data on a specific pendrive

bugzilla-daemon@xxxxxxxxxx · Mon, 23 Sep 2024 18:53:02 +0000

https://bugzilla.kernel.org/show_bug.cgi?id=219300

--- Comment #9 from Theodore Tso (tytso@xxxxxxx) ---
It's not at all surprising that flaky hardware might have issues that are only
exposed by particular file systems.  Different file systems might have very
different I/O patterns, both spatially (which blocks get used) and temporally
(how many I/O requests are issued in parallel, and how quickly), as well as in
I/O request type (e.g., how many, if any, CACHE FLUSH requests, and how many,
if any, Force Unit Access (FUA) writes).

One quick thing I'd suggest that you try is to experiment with file systems
other than ext4 and ntfs.  For example, what happens if you use xfs or btrfs or
f2fs with your test programs?    If the hardware fails with xfs or btrfs, then
that would very likely put the finger of blame on the hardware being cr*p.

The other thing that you can try is to run tests on the raw hardware.  For
example, use something like this [1] to write random data to the disk and then
verify it.  The block device must be able to handle having random data
written at high speeds, and when you read the data back, you must get the same
data that was written.  Unreasonable, I know, but if the storage device fails
with random writes without a file system in the mix, it's going to be hopeless
once you add a file system.

[1] https://github.com/axboe/fio/blob/master/examples/basic-verify.fio
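
To make the idea of a raw write-and-verify test concrete, here is a minimal
Python sketch.  It is not what fio does internally and is no substitute for
it; the function names, the 4 KiB block size, and the seed-derived payload
scheme are all assumptions made for illustration.  It also goes through the
page cache, so on a real device the read-back may be served from memory
unless the device is unplugged and re-plugged (or caches are dropped)
between the two phases.

```python
import hashlib
import os
import random
from typing import List

BLOCK_SIZE = 4096  # assumed I/O size in bytes; chosen for illustration only


def block_payload(seed: int, index: int) -> bytes:
    """Deterministically derive the expected contents of block `index`."""
    return hashlib.shake_128(f"{seed}:{index}".encode()).digest(BLOCK_SIZE)


def write_random(path: str, num_blocks: int, seed: int) -> None:
    """Write every block once, in shuffled order, then flush to the device."""
    order = list(range(num_blocks))
    random.Random(seed).shuffle(order)
    fd = os.open(path, os.O_WRONLY)
    try:
        for index in order:
            written = os.pwrite(fd, block_payload(seed, index), index * BLOCK_SIZE)
            if written != BLOCK_SIZE:
                raise OSError(f"short write at block {index}")
        os.fsync(fd)
    finally:
        os.close(fd)


def verify(path: str, num_blocks: int, seed: int) -> List[int]:
    """Return the indices of blocks whose contents differ from what was written."""
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for index in range(num_blocks):
            data = os.pread(fd, BLOCK_SIZE, index * BLOCK_SIZE)
            if data != block_payload(seed, index):
                bad.append(index)
    finally:
        os.close(fd)
    return bad
```

Pointed at a block device (a hypothetical /dev/sdX, whose contents it will
destroy), write_random() followed by verify() with the same seed should
return an empty list on healthy hardware.  Any I/O error raised along the
way, or any block index returned, points at the device rather than at a
file system.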

I will note that large companies that buy millions of dollars of hardware,
whether it's for data center use at hyperscaler cloud companies like Amazon or
Microsoft, or for flash devices used in mobile devices such as Samsung,
Motorola, Google Pixel devices, etc., will spend an awful lot of time
qualifying the hardware to make sure it is high quality before they buy it.
And they do this using raw tests against the block device, since this
eliminates the excuse from the hardware company that "oh, this must be a file
system bug".  If there are failures found when using storage tests against the
raw block device, there is no place for the hardware vendor to hide.

But in general, as Artem said, if there are any I/O failures at all, that's a
huge red flag.  That essentially *proves* that the hardware is dodgy.  You
can have dodgy hardware without I/O errors, but if there are I/O errors reading
or writing to a valid block/sector number, then by definition the hardware is
the problem.  And in your case, the errors are "USB disconnect" and "unit is
off-line".  That should never, ever happen, and if it does, then there is a
hardware problem.  It could be a cabling problem; it could be a problem with
the SCSI/SATA/NVMe/USB controller, etc., but the file system folks will tell
you that if there are *any* such problems, resolve the hardware problem before
asking the file system people to debug the problem.  It's much like asking a
civil engineer why a building might have design issues when it's built on top
of quicksand.  Buildings assume that they are built on stable ground.  If the
ground is not stable, then choose a different building site or fix the ground
first.
