Re: [PATCH 00/16] scsi: libsas and users: Factor out LLDD TMF code

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 04/02/2022 03:02, Damien Le Moal wrote:
The inconsistency is this line says:
[  137.193944] pm80xx0:: pm80xx_chip_sata_req  4581:no data
Which seems to be sensical for NCQ_NON_DATA command, but then, this line
seems wrong:
[  137.228015] pm80xx0:: mpi_sata_completion  2515:FPDMA len = 8

I need to go and check the specs what the FIS reply format is for
NCQ_NON_DATA.


[  137.187184] pm80xx0:: pm8001_queue_command  408:pm8001_task_exec device
[  137.193944] pm80xx0:: pm80xx_chip_sata_req  4581:no data
[  137.199339] pm80xx0:: pm80xx_chip_sata_req  4682:Sending Normal SATA
command 0x63 inb 4
[  137.207577] pm80xx0:: pm8001_mpi_msg_consume  1446:: CI=46 PI=47
msgHeader=8104200d
[  137.215399] pm80xx0:: mpi_sata_completion  2481:IO_SUCCESS
[  137.220961] pm80xx0:: mpi_sata_completion  2503:SAS_PROTO_RESPONSE
len = 20
[  137.228015] pm80xx0:: mpi_sata_completion  2515:FPDMA len = 8
[  137.233878] pm80xx0:: pm8001_mpi_msg_free_set  1403: CI=47 PI=47
[  137.236696] pm80xx0:: pm8001_queue_command  408:pm8001_task_exec device
[  137.247102] pm80xx0:: pm80xx_chip_sata_req  4585:DMA
[  137.252186] pm80xx0:: pm80xx_chip_sata_req  4593:FPDMA
[  137.257400] pm80xx0:: pm80xx_chip_sata_req  4682:Sending Normal SATA
command 0x65 inb f
[  167.506280] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[  167.512363] sas: sas_scsi_find_task: aborting task 0x00000000aa372627
[  167.519049] pm80xx0:: pm8001_chip_abort_task  4607:cmd_tag = 2, abort
task tag = 0x1
[  187.969173] pm80xx0:: pm8001_exec_internal_task_abort  753:TMF task
timeout.

As I mentioned, having this fail is a red flag. If I was pushed to guess what has happened, I'd say the FW is faulting due to some erroneous driver behaviour.

[  187.976450] sas: sas_scsi_find_task: task 0x00000000aa372627 is aborted
[  187.983244] sas: sas_eh_handle_sas_errors: task 0x00000000aa372627 is
aborted




After these messages, the tests exit on failure (drive dropped) and
there are no more messages. Doing rmmod or anything else get stuck too.
I have to reset the machine to get back to a good state.

I am starting to think that NCQ NON DATA command is being very badly
handled... Switching the tests to run with all non-NCQ commands is
working fine, albeit horribly slow (much slower than with other HBAs,
e.g. Broadcom).

Digging...
I missed a KASAN splat during device scan on boot:

    33.725184]
==================================================================
[   33.746168] BUG: KASAN: use-after-free in __lock_acquire+0x41a5/0x5b50
[   33.764181] Read of size 8 at addr ffff88818a318660 by task
kworker/u64:6/583

...

==================================================================

This is the submission path, not completion. The code is:

(gdb) list *(pm8001_queue_command+0x842)
0x3d42 is in pm8001_queue_command (drivers/scsi/pm8001/pm8001_sas.c:491).
486				atomic_dec(&pm8001_dev->running_req);
487				goto err_out_tag;
488			}
489			/* TODO: select normal or high priority */
490			spin_lock(&t->task_state_lock);
491			t->task_state_flags |= SAS_TASK_AT_INITIATOR;
492			spin_unlock(&t->task_state_lock);
493		} while (0);
494		rc = 0;
495		goto out_done;

So the task is already completed when the submission path tries to set
the state flag ? Debugging...

Yeah, that's how it looks.

I already mentioned this problem here:

https://lore.kernel.org/linux-scsi/0cc0c435-b4f2-9c76-258d-865ba50a29dd@xxxxxxxxxx/

Maybe we should just fix it now to rule it out of possibly causing other issues... I was reluctant to fix it as many places seems to need to be touched. Let me check it.

Thanks,
John





[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux