possible regression: fs corruption on 64GB NVMe

Hi all,

We have been chasing an occasional filesystem corruption seen on 64GB NVMe devices and would like to ask for advice and ideas as to what could be going on.
The devices in question are small, cheap NVMe devices which are really eMMC behind an NVMe bridge. They appear to be quite basic compared to other devices [1].

After a lot of testing, we managed to get a repro case that triggers within 2-3 runs using the desync tool [2], cutting the repro time from a day or more down to minutes. For the repro steps see [3].
We bisected the issue to:

da9619a30e73b dmapool: link blocks across pages
https://lore.kernel.org/all/20230126215125.4069751-12-kbusch@xxxxxxxx/T/#u

With this patch applied, verification fails within 2-3 attempts.
With this patch reverted, it verifies every time.
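
For reference, the check we repeat at each step is just the repro from [3] in a short loop; roughly something like the following (a sketch, not the exact script we run, and it assumes test_file and test_file.caibx from [3] already exist on the device):

$ for i in 1 2 3; do sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"; desync verify-index test_file.caibx test_file || break; done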

This appears to be a concurrent read issue. The desync tool we use for testing fires off many reader threads.
If I first cat the file to /dev/null to prime the page cache, it verifies fine every time.
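
For clarity, the warm-cache variant that always passes is simply (paraphrasing; the exact commands are unremarkable):

$ sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
$ cat test_file > /dev/null
$ desync verify-index test_file.caibx test_file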

I am currently attempting root cause analysis. As yet I don't know whether the dmapool patch itself is at fault or whether it is just exposing an underlying issue in the nvme driver.
For now, we are shipping with that patch reverted as a temporary fix while we work towards a root cause.

The corruption was originally observed in the field after updating from 6.1-based to 6.5-based kernels, and further testing with the easily reproducible scenario confirms the same behaviour, so this looks like a regression.
Testing on Torvalds' latest tree shows the issue is still present as of 88fac17500f4ea49c7bac136cf1b27e7b9980075.

I thought I'd let you all know in case you want to issue a revert out of an abundance of caution.

Some other thoughts about the issue:

- We have received reports of occasional corruption on both btrfs and ext4 filesystems on the same disk, so this does not appear to be filesystem-specific.
- It only seems to affect these simple 64GB single-queue disks. Other devices with more capable disks have not shown the issue.
- Simple dd or md5sum testing does not show the issue; desync seems to be very parallel in its access patterns (see the rough shell sketch after this list).
- I was previously investigating a potential regression that was deemed not an issue: https://lkml.org/lkml/2023/2/21/762 . I assume nvme doesn't need its addresses to be ordered, but I'm not familiar with the spec.
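
As a rough illustration of the difference, a single sequential reader never shows it, while something like the crude xargs/dd fan-out below is closer to what I mean by parallel. This is purely a hypothetical sketch of a cold-cache concurrent read pattern; I'm not claiming this exact command reproduces the corruption, only desync has done so reliably:

$ sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
$ md5sum test_file    # simple single-threaded read; has never shown the issue here
$ sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
$ # 64 x 160MiB slices of the 10GiB test_file from [3], read concurrently
$ seq 0 63 | xargs -P 16 -I{} sh -c 'dd if=test_file of=/dev/null bs=1M skip=$(({} * 160)) count=160 status=none'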


I'd appreciate any advice you may have on why this dmapool patch could potentially cause or expose an issue with these nvme devices.
If any more info would be useful to help diagnose, I'll happily provide it.

Thanks

Bob



[1]
$ sudo nvme list
Node                  Generic               SN                   Model                                    Namespace  Usage                      Format           FW Rev  
--------------------- --------------------- -------------------- ---------------------------------------- ---------- -------------------------- ---------------- --------
/dev/nvme0n1          /dev/ng0n1            NCE777D00B21D        E2M2 64GB                                0x1         61.87  GB /  61.87  GB    512   B +  0 B   10100080

$ sudo nvme get-feature /dev/nvme0n1
get-feature:0x01 (Arbitration), Current value:00000000
get-feature:0x02 (Power Management), Current value:00000000
get-feature:0x04 (Temperature Threshold), Current value:00000000
get-feature:0x05 (Error Recovery), Current value:00000000
get-feature:0x06 (Volatile Write Cache), Current value:0x00000001
get-feature:0x07 (Number of Queues), Current value:00000000
get-feature:0x08 (Interrupt Coalescing), Current value:00000000
get-feature:0x09 (Interrupt Vector Configuration), Current value:00000000
get-feature:0x0a (Write Atomicity Normal), Current value:00000000
get-feature:0x0b (Async Event Configuration), Current value:00000000
get-feature:0x0c (Autonomous Power State Transition), Current value:00000000
       0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
0000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
get-feature:0x11 (Non-Operational Power State Config), Current value:00000000


[2]
https://github.com/folbricht/desync


[3]
$ dd if=/dev/urandom of=test_file bs=1M count=10240
$ sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
$ desync verify-index test_file.caibx test_file
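
(The test_file.caibx index is created once beforehand; from memory the command is roughly the following, though please double-check the exact syntax against the desync docs:

$ desync make test_file.caibx test_file )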

