On 1/27/22 19:17, John Garry wrote: > On 27/01/2022 06:37, Damien Le Moal wrote: > > Hi Damien, > >> I did some light testing of this series (boot + some fio runs) and >> everything looks good using my "ATTO Technology, Inc. ExpressSAS 12Gb/s >> SAS/SATA HBA (rev 06)" HBA (x86_64 host). > > Yeah, unfortunately these steps prob won't exercise much of the code > changes here since I figure error handling would not kick in. > > However using this same adapter type on my arm64 system has error > handling kick in almost straight away - and the handling looks sane. A > silver lining, I suppose .. I ran some more tests. In particular, I ran libzbc compliance tests on a 20TB SMR drives. All tests pass with 5.17-rc1, but after applying your series, I see command timeout that take forever to recover from, with the drive revalidation failing after that. [ 385.102073] sas: Enter sas_scsi_recover_host busy: 1 failed: 1 [ 385.108026] sas: sas_scsi_find_task: aborting task 0x000000007068ed73 [ 405.561099] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout. [ 405.568236] sas: sas_scsi_find_task: task 0x000000007068ed73 is aborted [ 405.574930] sas: sas_eh_handle_sas_errors: task 0x000000007068ed73 is aborted [ 411.192602] ata21.00: qc timeout (cmd 0xec) [ 431.672122] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout. [ 431.679282] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4) [ 431.685544] ata21.00: revalidation failed (errno=-5) [ 441.911948] ata21.00: qc timeout (cmd 0xec) [ 462.391545] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout. [ 462.398696] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4) [ 462.404992] ata21.00: revalidation failed (errno=-5) [ 492.598769] ata21.00: qc timeout (cmd 0xec) ... So there is a problem. Need to dig into this. I see this issue only with libzbc passthrough tests. fio runs with libaio are fine. >> And sparse/make C=1 complains about: >> >> drivers/scsi/libsas/sas_port.c:77:13: warning: context imbalance in >> 'sas_form_port' - different lock contexts for basic block > > I think it's talking about the port->phy_list_lock usage - it prob > doesn't like segments where we fall out a loop with the lock held (which > was grabbed in the loop). Anyway it looks ok. Maybe we can improve this. > >> >> But I have not checked if it is something that your series touch. >> >> And there is a ton of complaints about __le32 use in the pm80xx code... >> I can try to have a look at these if you want, on top of your series. > > I really need to get make C=1 working for me - it segfaults in any env I > have :( I now have a 12 patch series that fixes *all* the sparse warnings. Some of the fixes were trivial, but most of them are simply hard bugs with the handling of le32 struct field values. There is no way that this driver is working as-is on big-endian machines. Some calculations are actually done using cpu_to_le32() values ! But even though these fixes should have essentially no effect on little-endian x86_64, with my series applied, I see the same command timeout problem as with your libsas update, and both series together result in the same timeout issue too. So it looks like "fixing" the code actually is revealing some other bug that was previously hidden... This will take some time to debug. Another problem I noticed: doing "rmmod pm80xx; modprobe pm80xx" result in a failure of device scans. I get loops of "link is slow to respond ->reset". For the above tests, I had to reboot every time I changed the driver module code. Another thing to look at. Will try to spend some more time on this next week. Cheers. > > Thanks, > John -- Damien Le Moal Western Digital Research