[Bug 217599] Adaptec 71605z hangs with aacraid: Host adapter abort request after update to linux 6.4.0

bugzilla-daemon@xxxxxxxxxx · Sat, 16 Dec 2023 05:35:26 +0000

https://bugzilla.kernel.org/show_bug.cgi?id=217599

--- Comment #48 from encore2097@xxxxxxxxxxx ---
Hi Sagar,

I'm using a setup with 10 SATA disks in HBA mode, running a ZFS raidz2
filesystem (akin to RAID-6). This is a single-CPU system, so I don't believe
the CPU count is the main issue here, although it's likely related.

From examining the logs, doing some research, and drawing from my experience,
it seems that timeouts and queues are the primary culprits. My suspicion is
that under heavy load there's an overflow somewhere in the stack (could be
in the kernel driver, firmware, or hardware), causing I/O requests to get lost
and time out. After a series of these timeouts, the driver reports an error and
resets the adapter.

I stumbled upon threads dating back to around 2017 where users faced similar
issues (check this one:
https://forum.proxmox.com/threads/pve-5-1-aacraid-scsi-hang.38259/). One
suggested fix was to extend the disk I/O timeout. However, the current kernel
already uses 60s, double the previous value of 30s, which makes me think the
timeout value itself is not the root cause, although it is related.
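
For anyone who wants to confirm which timeout their disks are actually running
with, here is a minimal Python sketch. It assumes the usual sysfs layout, where
each SCSI-backed block device exposes its command timeout (in seconds) at
/sys/block/<dev>/device/timeout. The function name and the sys_block parameter
are my own, for illustration only.

```python
from pathlib import Path


def read_scsi_timeouts(sys_block="/sys/block"):
    """Return {device_name: timeout_seconds} for devices exposing device/timeout.

    Devices without that attribute (loop devices, for example) are skipped.
    """
    timeouts = {}
    root = Path(sys_block)
    if not root.is_dir():
        return timeouts
    for dev in sorted(root.iterdir()):
        timeout_file = dev / "device" / "timeout"
        try:
            timeouts[dev.name] = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            # Attribute missing or unreadable: not a SCSI disk, skip it.
            continue
    return timeouts


if __name__ == "__main__":
    for name, seconds in read_scsi_timeouts().items():
        print(f"{name}: {seconds}s")
```

Writing a new value to the same file as root changes the timeout for that
device until the next reboot.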

I'm not sure how other affected users have their disks attached to their
controllers, but I reliably see this issue with my 10-disk setup, so my
recommendation would be to increase the number of disks attached to the
controller and stress test it with simultaneous sequential and random I/O,
running tools like dd and fio at the same time.
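
A rough sketch of what I mean by running both at once, in Python. The scratch
path, sizes, job counts, and runtime are placeholders I picked for
illustration, not values I tested with; the sizes would need to exceed the
RAM cache for the load to reach the disks. The target must be a scratch file
on the pool under test, never a device holding data.

```python
import subprocess


def build_commands(target, runtime_s=300):
    """Build a sequential dd writer and a random read/write fio job.

    `target` is a hypothetical scratch file prefix on the pool under test.
    """
    dd_cmd = [
        "dd", "if=/dev/zero", f"of={target}.dd",
        "bs=1M", "count=16384", "conv=fsync",   # 16 GiB sequential write
    ]
    fio_cmd = [
        "fio", "--name=randrw", f"--filename={target}.fio",
        "--rw=randrw", "--bs=4k", "--size=8G",
        "--ioengine=psync", "--numjobs=4", "--group_reporting",
        "--time_based", f"--runtime={runtime_s}",
    ]
    return dd_cmd, fio_cmd


def run_concurrently(commands):
    """Start every command at once, wait for all, return their exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


# Example (path is a placeholder):
#   run_concurrently(build_commands("/tank/scratch/stress"))
```

While this runs, watching the kernel log for "Host adapter abort request"
shows whether the load is enough to trigger the problem.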

My specific use case involves a file server and database with multiple users. I
consistently observe the adapter aborting requests and resetting a few minutes
after boot, when the file server and database applications start and warm up
their caches (cache size is approximately 120GB in RAM).

Upon further investigation, I found that anyone experiencing this issue could
gather more information by patching aacraid to call dump_stack() inside
aac_eh_abort, around line 713 of
linux/latest/source/drivers/scsi/aacraid/linit.c (refer to this:
https://stackoverflow.com/questions/32557040/how-to-get-stack-trace-at-various-points-in-kernel-device-driver-code).

Unfortunately, the downtime was unacceptable, so I had to move my system to a
different HBA, and I have no spare systems to test with.

Best regards.
