---- On Tue, 10 Sep 2024 10:30:18 +0100 Robert Beckett wrote ---
>
> ---- On Mon, 09 Sep 2024 21:31:41 +0100 Keith Busch wrote ---
> > On Mon, Sep 09, 2024 at 02:29:14PM -0600, Keith Busch wrote:
> > > As a test, could you try kernel parameter "nvme.io_queue_depth_set=2"?
> >
> > Err, I mean "nvme.io_queue_depth=2".
>
> Thanks, I'll give it a try along with your other questions and report back.
>
> For clarity, the repro steps dropped a step. They should have included the make command:
>
> $ dd if=/dev/urandom of=test_file bs=1M count=10240
> $ desync make test_file.caibx test_file
> $ sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
> $ desync verify-index test_file.caibx test_file

CONFIG_SLUB_DEBUG_ON showed no debug output.

nvme.io_queue_depth=2 appears to fix it. Could you explain the implications of this? I assume it limits the device to 2 concurrently outstanding requests. Does it suggest an issue with the specific device's firmware? I assume it would also mean that nothing is actually wrong with the dmapool itself, and that it was just exposing a device/firmware issue. Any advice for handling this and/or investigating further?

My initial speculation was that the disk firmware might be signalling completion of an access before the data had actually finished making its way to RAM. However, I checked the code and saw that the dmapool appears to be used for storing the buffer page addresses, so I imagine it is never written by the disk at all, which would rule out my assumption.

I'd appreciate any insight you could give on the usage of the dmapools in the driver, and whether you would expect them to be significant in this issue or just to be making a device/firmware bug more observable.

Thanks

Bob

p.s. Here is a transcript of the issue seen in testing. To my knowledge, if everything is working as it should, nothing should be able to produce this output: the verify fails and the md5sum changes after dropping caches, yet dropping caches again and re-priming the page cache via a linear read fixes things.

$ dd if=/dev/urandom of=test_file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 111.609 s, 96.2 MB/s
$ desync make test_file.caibx test_file
Chunking [=======================================================================================================================================] 100.00% 18s
$ sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
$ desync verify-index test_file.caibx test_file
[=============>-----------------------------------------------------------------------------------------------------------------------------------] 9.00% 4s
Error: seed index for test_file doesn't match its data
$ md5sum test_file
ce4f1cca0b3dfd63ea2adfd745e4bfc1 test_file
$ sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
$ md5sum test_file
1edb3eaf5ae57b6187cc0be843ed2e5c test_file
$ desync verify-index test_file.caibx test_file
[=================================================================================================================================================] 100.00% 5s
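
p.p.s. For reference, this is my rough mental model of how the driver uses the dmapool for PRP lists, from my reading of drivers/nvme/host/pci.c. It is a simplified sketch, not the actual driver code; the helper names (prp_pool_init(), prp_list_build(), prp_list_put()) are mine, though the "prp list page" pool itself does exist in the driver. Please correct me if I have this wrong:

/*
 * Simplified sketch of my understanding of the PRP list handling, not
 * the real pci.c code. The pool hands back page-sized blocks together
 * with their DMA addresses; the CPU fills each block with the DMA
 * addresses of the data pages for a request.
 */
#include <linux/dmapool.h>
#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/errno.h>

/* One PRP list page holds PAGE_SIZE / 8 data-page addresses. */
#define PRPS_PER_PAGE	(PAGE_SIZE / sizeof(__le64))

static struct dma_pool *prp_page_pool;

static int prp_pool_init(struct device *dev)
{
	/* Page-sized, page-aligned blocks, as the PRP list format requires. */
	prp_page_pool = dma_pool_create("prp list page", dev,
					PAGE_SIZE, PAGE_SIZE, 0);
	return prp_page_pool ? 0 : -ENOMEM;
}

/*
 * Build one PRP list page from the DMA addresses of a request's data
 * pages. On success, *prp_dma is the bus address the command would
 * reference so the device can fetch the list.
 */
static __le64 *prp_list_build(const dma_addr_t *pages, unsigned int npages,
			      dma_addr_t *prp_dma)
{
	__le64 *prp_list;
	unsigned int i;

	if (npages > PRPS_PER_PAGE)
		return NULL;	/* the real driver chains extra list pages */

	prp_list = dma_pool_alloc(prp_page_pool, GFP_ATOMIC, prp_dma);
	if (!prp_list)
		return NULL;

	for (i = 0; i < npages; i++)
		prp_list[i] = cpu_to_le64(pages[i]);

	return prp_list;
}

static void prp_list_put(__le64 *prp_list, dma_addr_t prp_dma)
{
	dma_pool_free(prp_page_pool, prp_list, prp_dma);
}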
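If that model is right, it is why I ruled out my completion-ordering theory for the pool itself: the CPU writes these lists and the device only ever reads them, so an early completion could leave stale data pages but should never alter the dmapool memory.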