Re: [BUG] I/O timeouts and system freezes on Kingston A2000 NVME with BCACHEFS

On 2024-01-19 21:22, Jens Axboe wrote:
On 1/19/24 5:25 AM, Mia Kanashi wrote:
This issue was originally reported here: https://github.com/koverstreet/bcachefs/issues/628

Transferring a large number of files from btrfs to bcachefs causes I/O
timeouts and freezes the whole system. This doesn't seem to be related
to btrfs, but rather to heavy I/O on the drive, as it also happens
without btrfs being mounted. Copying the files to the HDD first, and
then from there to bcachefs on the NVMe, sometimes does not trigger
the problem. The problem only happens on bcachefs, not on btrfs or
ext4. It doesn't happen on the HDD; sadly, I can't test with other
NVMe drives. While the system is frozen, any drive access that is not
served from the RAM cache stalls: apps already loaded in RAM keep
running, but freeze the moment they try to access the drive. This lasts
until the drive is reset and the abort status messages appear in dmesg.
The system then unfreezes for a moment, but if you keep copying files
the problem recurs.

This drive is known to have had problems with power management in the
past:
https://wiki.archlinux.org/title/Solid_state_drive/NVMe#Troubleshooting
But those problems have since been fixed with kernel workarounds /
firmware updates. This issue may be related: perhaps bcachefs does
something different from the other filesystems so that the workarounds
don't apply, which would cause the bug to occur only on it. It may be
a problem in the nvme subsystem, or just some edge case in bcachefs,
who knows. I tried disabling ASPM and setting the latency to 0 as was
suggested, but that didn't fix the problem, so I don't know. If this is
indeed related to that specific drive, it would be hard to reproduce.

From a quick look, looks like a broken drive/firmware. It is suspicious
that all failed IO is 256 blocks. You could try and limit the transfer
size and see if that helps:

# echo 64 > /sys/block/nvme0n1/queue/max_sectors_kb

Or maybe the transfer size is just a red herring, who knows. The error
code seems wonky:

[ 185.384762] nvme0n1: I/O Cmd(0x2) @ LBA 105272408, 256 blocks, I/O Error (sct 0x3 / sc 0x71)
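
A side note on the "256 blocks" observation: assuming the namespace uses 512-byte logical blocks (the report does not state the LBA format), 256 blocks is 128 KiB, which happens to equal the default max_sectors_kb of 128 reported in this thread. Below is a minimal Python sketch for pulling these fields out of such dmesg lines. The regex is derived only from the single line quoted above, and parse_nvme_error is a hypothetical helper name, not part of any tool.

```python
import re

# Pattern derived from the one dmesg line quoted in this thread; other
# kernel versions may format the message differently.
NVME_ERR = re.compile(
    r"(?P<dev>nvme\d+n\d+): I/O Cmd\((?P<cmd>0x[0-9a-fA-F]+)\) "
    r"@ LBA (?P<lba>\d+), (?P<blocks>\d+) blocks, "
    r"I/O Error \(sct (?P<sct>0x[0-9a-fA-F]+) / sc (?P<sc>0x[0-9a-fA-F]+)\)"
)


def parse_nvme_error(line, lba_bytes=512):
    """Return the fields of an NVMe I/O error line, or None if no match.

    lba_bytes is an ASSUMPTION (512-byte logical blocks); pass 4096 if
    the namespace is formatted with 4 KiB blocks.
    """
    m = NVME_ERR.search(line)
    if m is None:
        return None
    blocks = int(m.group("blocks"))
    return {
        "dev": m.group("dev"),
        "cmd": int(m.group("cmd"), 16),
        "lba": int(m.group("lba")),
        "blocks": blocks,
        "sct": int(m.group("sct"), 16),
        "sc": int(m.group("sc"), 16),
        # Transfer size in KiB, for comparison with max_sectors_kb.
        "size_kib": blocks * lba_bytes // 1024,
    }
```

Feeding it the saved dmesg output line by line and counting the distinct size_kib values would show whether every failed command really sits at the transfer-size limit.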

Changing max_sectors_kb to 64 does indeed seem to fix the issue at first glance; the default value is 128. I also tried changing the bcachefs format options, --btree_node_size=64k --bucket=64k,
thinking they might be related, but that didn't help.
It is really weird that this problem only occurs on bcachefs.
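
For anyone wanting to script the workaround, here is a rough Python sketch of the same sysfs write as the echo command above. It assumes root privileges and the standard queue/max_sectors_kb and queue/max_hw_sectors_kb attributes. The helper names and the sysfs_root parameter (there only so the code can be tried against a scratch directory) are made up for illustration.

```python
from pathlib import Path


def queue_attr(dev, name, sysfs_root="/sys"):
    """Path of a block-queue attribute, e.g. /sys/block/nvme0n1/queue/<name>."""
    return Path(sysfs_root) / "block" / dev / "queue" / name


def read_kb(dev, name, sysfs_root="/sys"):
    """Read an integer KiB value from a queue attribute."""
    return int(queue_attr(dev, name, sysfs_root).read_text().strip())


def set_max_sectors_kb(dev, kb, sysfs_root="/sys"):
    """Set max_sectors_kb for dev and return the previous value.

    The value is checked against max_hw_sectors_kb first, since the
    soft limit cannot exceed the hardware limit.
    """
    hw_max = read_kb(dev, "max_hw_sectors_kb", sysfs_root)
    if not 0 < kb <= hw_max:
        raise ValueError(f"{kb} KiB is outside the range 1..{hw_max} KiB")
    previous = read_kb(dev, "max_sectors_kb", sysfs_root)
    queue_attr(dev, "max_sectors_kb", sysfs_root).write_text(f"{kb}\n")
    return previous


# Example (as root): old = set_max_sectors_kb("nvme0n1", 64)
```

Note that a value written this way does not survive a reboot, so it would have to be reapplied at boot for longer testing.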



