https://bugzilla.kernel.org/show_bug.cgi?id=217599

--- Comment #48 from encore2097@xxxxxxxxxxx ---
Hi Sagar,

I'm using a setup with 10 SATA disks in HBA mode, running a ZFS raidz2 filesystem (akin to RAID-6). This is a single-CPU system, so I don't believe the CPU count is the main issue here, although it's likely related.

From examining the logs, doing some research, and drawing on my experience, timeouts and queues seem to be the primary culprits. My suspicion is that under heavy load something in the stack overflows (it could be the kernel driver, firmware, or hardware), causing I/O requests to get lost and time out. After a series of these timeouts, the driver raises an error and resets the adapter.

I stumbled upon threads dating back to around 2017 where users faced similar issues (for example: https://forum.proxmox.com/threads/pve-5-1-aacraid-scsi-hang.38259/). One suggested fix was to extend the disk timeout window for waiting on I/O. However, the current kernel (set at 60s) has already doubled the previous value of 30s, which makes me think the timeout is not the root cause, though it is related.

I'm not sure how other users have their physical disks connected to their controllers, but I reliably see this issue with my 10-disk setup. My recommendation would be to increase the number of disks attached to the controller and stress test it with simultaneous sequential and random I/O, running tools like dd and fio at the same time.

My specific use case is a file server and database with multiple users. I consistently observe the adapter aborting requests and resetting a few minutes after boot, when the file server and database applications start and warm up their caches (cache size is approximately 120GB in RAM).
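
For anyone who wants to check which disk timeout window their system is actually using (the 30s versus 60s values discussed above), here is a minimal sketch. It assumes the disks appear as SCSI block devices and expose the usual sysfs attribute /sys/block/<dev>/device/timeout. The function name read_scsi_timeouts is my own, not part of any driver or tool.

```python
from pathlib import Path


def read_scsi_timeouts(sys_block="/sys/block"):
    """Return {device_name: timeout_seconds} for block devices that
    expose a SCSI command timeout in sysfs.

    Assumption: the timeout lives at <sys_block>/<dev>/device/timeout,
    which is where SCSI disks normally publish it. Devices without
    that attribute (NVMe, loop devices, ...) are skipped.
    """
    timeouts = {}
    root = Path(sys_block)
    if not root.is_dir():
        return timeouts
    for dev in sorted(root.iterdir()):
        timeout_file = dev / "device" / "timeout"
        try:
            timeouts[dev.name] = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            # Attribute missing, unreadable, or not an integer: skip it.
            continue
    return timeouts


if __name__ == "__main__":
    for name, seconds in read_scsi_timeouts().items():
        print(f"{name}: {seconds}s")
```

Writing a new value to the same file as root should change the timeout for that disk until the next reboot. I have not verified whether doing so affects the adapter resets described here.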

Upon further investigation, I found that anyone experiencing this issue could gather more information by modifying aacraid to call dump_stack() inside aac_eh_abort, around line 713 of linux/latest/source/drivers/scsi/aacraid/linit.c (see: https://stackoverflow.com/questions/32557040/how-to-get-stack-trace-at-various-points-in-kernel-device-driver-code).

Unfortunately, due to unacceptable downtime, I had to revert my system to a different HBA, and I have no spare systems to test with.

Best regards.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.