https://bugzilla.kernel.org/show_bug.cgi?id=121531

            Bug ID: 121531
           Summary: Adaptec 7805H SAS HBA (pm80xx): hangs when writing >80MB
                    at once
           Product: IO/Storage
           Version: 2.5
    Kernel Version: 3.16.0-4-amd64
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: SCSI
          Assignee: linux-scsi@xxxxxxxxxxxxxxx
          Reporter: martin.von.wittich@xxxxxxxx
        Regression: No

Created attachment 222171
  --> https://bugzilla.kernel.org/attachment.cgi?id=222171&action=edit
dd loop output, writing 64 - 128 MB to a disk

One of our customers attempted to install our Debian 8-based distribution on a
Fujitsu PRIMERGY TX150 S8 server with an Adaptec 7805H SAS HBA. Unfortunately,
the system tended to lock up during use: almost all services stopped
responding, but it was still possible to run simple commands via SSH, e.g.
"ssh server 'cat /proc/loadavg'" or "ssh server dmesg". Everything that
required write access (such as actually logging in via SSH, or using the web
interface) appeared to hang.
Load average was extremely high (>100), and dmesg reported many sas/pm80xx
errors:

[11748.246360] sas: trying to find task 0xffff88082fcc7d40
[11748.246362] sas: sas_scsi_find_task: aborting task 0xffff88082fcc7d40
[11748.246572] pm80xx mpi_ssp_completion 1514:sas IO status 0x1
[11748.246574] pm80xx mpi_ssp_completion 1523:SAS Address of IO Failure Drive:5000c50062c1b09d
[11748.246576] sas: task done but aborted
[11748.246581] sas: sas_scsi_find_task: task 0xffff88082fcc7d40 is done
[11748.246583] sas: sas_eh_handle_sas_errors: task 0xffff88082fcc7d40 is done
[11748.246585] sas: trying to find task 0xffff88082fcc7c00
[11748.246587] sas: sas_scsi_find_task: aborting task 0xffff88082fcc7c00
[11748.246829] pm80xx mpi_ssp_completion 1514:sas IO status 0x1
[11748.246831] pm80xx mpi_ssp_completion 1523:SAS Address of IO Failure Drive:5000c50062c1b09d
[11748.246832] sas: task done but aborted
[11748.246837] sas: sas_scsi_find_task: task 0xffff88082fcc7c00 is done
[11748.246839] sas: sas_eh_handle_sas_errors: task 0xffff88082fcc7c00 is done
[11748.246841] sas: trying to find task 0xffff88082fcc7ac0
[11748.246844] sas: sas_scsi_find_task: aborting task 0xffff88082fcc7ac0
[11748.247055] pm80xx mpi_ssp_completion 1514:sas IO status 0x1
[11748.247057] pm80xx mpi_ssp_completion 1523:SAS Address of IO Failure Drive:5000c50062c1b09d
[11748.247059] sas: task done but aborted
[11748.247064] sas: sas_scsi_find_task: task 0xffff88082fcc7ac0 is done
[11748.247067] sas: sas_eh_handle_sas_errors: task 0xffff88082fcc7ac0 is done
[11748.247069] sas: trying to find task 0xffff88082fcc7840
[11748.247070] sas: sas_scsi_find_task: aborting task 0xffff88082fcc7840
[11748.247366] pm80xx mpi_ssp_completion 1514:sas IO status 0x1
[11748.247368] pm80xx mpi_ssp_completion 1523:SAS Address of IO Failure Drive:5000c50062c1b09d
[11748.247370] sas: task done but aborted
[11748.247375] sas: sas_scsi_find_task: task 0xffff88082fcc7840 is done
[11748.247377] sas: sas_eh_handle_sas_errors: task 0xffff88082fcc7840 is done
[11748.247379] sas: trying to find task 0xffff88082ff72e00
[11748.247380] sas: sas_scsi_find_task: aborting task 0xffff88082ff72e00
[11748.247591] pm80xx mpi_ssp_completion 1514:sas IO status 0x1
[11748.247593] pm80xx mpi_ssp_completion 1523:SAS Address of IO Failure Drive:5000c50062c1b09d
[11748.247595] sas: task done but aborted
[11748.247600] sas: sas_scsi_find_task: task 0xffff88082ff72e00 is done
[11748.247601] sas: sas_eh_handle_sas_errors: task 0xffff88082ff72e00 is done
[11748.247603] sas: trying to find task 0xffff88082ff72400
[11748.247605] sas: sas_scsi_find_task: aborting task 0xffff88082ff72400

At first we believed the underlying cause to be a hardware problem, but the
problem persisted after both the HBA and the backplane had been replaced (the
disks were ruled out as a possible cause because their self-tests reported no
errors). To isolate the issue, I ran the following tests in a live system on
the affected server:

1) "smartctl -t long" on both disks; both reported "Completed", so the disks
   seem to be OK.
2) "dd if=/dev/sdX of=/dev/null bs=1M" on both disks; both completed
   successfully, with an average speed of ~150 MB/s, so reading seems to be
   fine too.
3) "dd if=/dev/zero of=/dev/sdX bs=1M" on both disks. The system stopped
   responding, and dmesg started spewing lots of sas/pm80xx errors. So
   apparently writing to the disks triggers the problem.
To track it down further, I repeatedly wrote 64 MB to one disk; this works
without problems:

root@unassigned:~# for i in $(seq 1 8); do dd if=/dev/zero of=/dev/sdc bs=1M count=64; done
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 0.482716 s, 139 MB/s
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 0.482339 s, 139 MB/s
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 0.474302 s, 141 MB/s
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 0.464919 s, 144 MB/s
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 0.465673 s, 144 MB/s
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 0.465525 s, 144 MB/s
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 0.473932 s, 142 MB/s
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 0.472965 s, 142 MB/s

Then I tried to write increasing amounts of data to the disk; this
reproducibly slows down at about ~80 MB, and a few seconds later dmesg starts
spewing error messages:

root@unassigned:~# for i in $(seq 64 128); do dd if=/dev/zero of=/dev/sdc bs=1M count=$i; done
[...]
75+0 records in
75+0 records out
78643200 bytes (79 MB) copied, 0.595394 s, 132 MB/s
76+0 records in
76+0 records out
79691776 bytes (80 MB) copied, 33.6425 s, 2.4 MB/s
77+0 records in
77+0 records out
80740352 bytes (81 MB) copied, 0.631928 s, 128 MB/s
78+0 records in
78+0 records out
81788928 bytes (82 MB) copied, 0.621007 s, 132 MB/s
79+0 records in
79+0 records out
82837504 bytes (83 MB) copied, 0.651981 s, 127 MB/s
80+0 records in
80+0 records out
83886080 bytes (84 MB) copied, 0.674202 s, 124 MB/s
81+0 records in
81+0 records out
84934656 bytes (85 MB) copied, 33.7179 s, 2.5 MB/s
82+0 records in
82+0 records out
[...]

It seems to alternate between ~130 MB/s and 1-3 MB/s, and then hangs
completely after 96 records. See dd-loop.txt for the full output.
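The size sweep above can also be narrowed down with a binary search instead of
a linear loop. Below is a minimal sketch (not from the original report) that
bisects the smallest write size whose dd run stalls, treating any run that
takes longer than STALL_SECS as a stall (normal runs above finish in well
under a second, stalled ones take ~30 s). The DISK, STALL_SECS, lo and hi
names are my own; DISK deliberately defaults to a throwaway temp file so the
script is safe to run as-is.

```shell
#!/bin/sh
# Sketch: bisect the smallest write size (in MB) whose dd run stalls.
# WARNING: pointing DISK at a real device is DESTRUCTIVE -- it
# overwrites the target with zeros.
DISK=${DISK:-$(mktemp)}        # default to a temp file for safety
STALL_SECS=${STALL_SECS:-20}   # assumed stall threshold in seconds
lo=64                          # largest size known to finish quickly
hi=128                         # smallest size known to stall (per report)
while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    start=$(date +%s)
    dd if=/dev/zero of="$DISK" bs=1M count="$mid" 2>/dev/null
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -gt "$STALL_SECS" ]; then
        hi=$mid   # run stalled: threshold is at or below mid
    else
        lo=$mid   # run finished quickly: threshold is above mid
    fi
done
echo "smallest stalling write size: ${hi} MB"
```

Against a healthy target (the temp-file default) no run ever stalls, so the
search simply climbs lo up to 127 and reports 128 MB; on the affected machine
it should converge on the real threshold in about six dd runs instead of 65.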
The errors in dmesg:

[ 2645.124944] sas: Enter sas_scsi_recover_host busy: 146 failed: 146
[ 2645.124963] sas: trying to find task 0xffff88083658b200
[ 2645.124966] sas: sas_scsi_find_task: aborting task 0xffff88083658b200
[ 2647.457375] sas: task done but aborted
[ 2647.457382] sas: task done but aborted
[ 2647.457385] sas: task done but aborted
[ 2647.457833] sas: task done but aborted
[ 2647.457840] sas: task done but aborted
[ 2647.457843] sas: task done but aborted
[ 2647.457851] sas: task done but aborted
[ 2647.457853] sas: task done but aborted
[ 2647.457856] sas: task done but aborted
[ 2647.457860] sas: task done but aborted
[ 2647.457863] sas: task done but aborted
[ 2647.457865] sas: task done but aborted
[ 2647.457867] sas: task done but aborted
[ 2647.457869] sas: task done but aborted
[ 2647.457872] sas: task done but aborted
[ 2647.457874] sas: task done but aborted
[ 2647.457876] sas: task done but aborted
[ 2647.457879] sas: task done but aborted
[ 2647.457881] sas: task done but aborted
[ 2647.457883] sas: task done but aborted
[ 2647.457885] sas: task done but aborted
[ 2647.458125] pm80xx mpi_ssp_completion 1514:sas IO status 0x1
[ 2647.458130] pm80xx mpi_ssp_completion 1523:SAS Address of IO Failure Drive:5000c50062c1b09d
[ 2647.458135] sas: task done but aborted
[ 2647.458156] sas: sas_scsi_find_task: task 0xffff88083658b200 is done
[ 2647.458159] sas: sas_eh_handle_sas_errors: task 0xffff88083658b200 is done
[ 2647.458162] sas: trying to find task 0xffff880837ad30c0
[ 2647.458164] sas: sas_scsi_find_task: aborting task 0xffff880837ad30c0
[ 2647.458166] sas: sas_scsi_find_task: task 0xffff880837ad30c0 is done
[ 2647.458168] sas: sas_eh_handle_sas_errors: task 0xffff880837ad30c0 is done
[ 2647.458170] sas: trying to find task 0xffff880837ad3200
[ 2647.458172] sas: sas_scsi_find_task: aborting task 0xffff880837ad3200
[ 2647.458174] sas: sas_scsi_find_task: task 0xffff880837ad3200 is done
[ 2647.458176] sas: sas_eh_handle_sas_errors: task 0xffff880837ad3200 is done
[ 2647.458178] sas: trying to find task 0xffff880838dcfa80
[ 2647.458179] sas: sas_scsi_find_task: aborting task 0xffff880838dcfa80
[ 2647.458181] sas: sas_scsi_find_task: task 0xffff880838dcfa80 is done
[ 2647.458183] sas: sas_eh_handle_sas_errors: task 0xffff880838dcfa80 is done
[ 2647.458198] sas: trying to find task 0xffff880838d31700
[ 2647.458200] sas: sas_scsi_find_task: aborting task 0xffff880838d31700
[ 2647.458605] pm80xx mpi_ssp_completion 1514:sas IO status 0x1
[ 2647.458611] pm80xx mpi_ssp_completion 1523:SAS Address of IO Failure Drive:5000c50062c1b09d
[ 2647.458616] sas: task done but aborted
[ 2647.458638] sas: sas_scsi_find_task: task 0xffff880838d31700 is done
[ 2647.458641] sas: sas_eh_handle_sas_errors: task 0xffff880838d31700 is done
[ 2647.458644] sas: trying to find task 0xffff880838ca6e80
[ 2647.458646] sas: sas_scsi_find_task: aborting task 0xffff880838ca6e80
[ 2647.459184] pm80xx mpi_ssp_completion 1514:sas IO status 0x1
[ 2647.459190] pm80xx mpi_ssp_completion 1523:SAS Address of IO Failure Drive:5000c50062c1b09d
[ 2647.459194] sas: task done but aborted
[ 2647.459217] sas: sas_scsi_find_task: task 0xffff880838ca6e80 is done
[ 2647.459220] sas: sas_eh_handle_sas_errors: task 0xffff880838ca6e80 is done
[ 2647.459222] sas: trying to find task 0xffff88083658b480
[ 2647.459225] sas: sas_scsi_find_task: aborting task 0xffff88083658b480
[...]

To finally rule out a hardware issue, I installed Windows 10 onto one of the
disks and copied the Windows 10 installation image (~5 GB) from a USB stick
onto the first disk; then I formatted the second disk and copied the image
onto it as well. Both copies completed without problems, so I'm fairly sure
this has to be a bug in the Linux driver. I'll attach full dmesg copies and
dmidecode/lspci/smartctl/uname output after filing this bug.

-- 
You are receiving this mail because:
You are the assignee for the bug.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html