Re: [PATCH] scsi: libsas: defer ata device eh commands to libata

On 2018/2/28 16:09, John Garry wrote:
On 28/02/2018 04:14, Jason Yan wrote:


On 2018/2/27 23:00, Jack Wang wrote:
2018-02-27 12:50 GMT+01:00 John Garry <john.garry@xxxxxxxxxx>:
On 27/02/2018 06:59, Jason Yan wrote:

When an ata device is doing EH, some commands that are still attached to
tasks are not passed to libata when the abort or the recovery fails, so
libata does not handle these commands. After these commands are done, the
sas task is freed, but the ata qc is not freed. This causes an ata qc leak
and triggers a warning like the one below:


It seems like a bug that we're just killing the ATA commands in libsas
error handling and not also deferring them to ATA EH.

And this WARN, below, in ata_eh_finish() is a long-term issue I have seen
(though maybe also because of other issues).

As mentioned to Jason privately, I wonder why Dan's patch excluded the
change here:
commit 3944f50995f947558c35fb16ae0288354756762c
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date:   Tue Nov 29 12:08:50 2011 -0800

     [SCSI] libsas: let libata handle command timeouts

     libsas-eh if it successfully aborts an ata command will hide the
     timeout condition (AC_ERR_TIMEOUT) from libata.  The command likely
     completes with the all-zero task->task_status it started with.
     Instead, interpret a TMF_RESP_FUNC_COMPLETE as the end of the
     sas_task but keep the scmd around for libata-eh to handle.

     Tested-by: Andrzej Jakowski <andrzej.jakowski@xxxxxxxxx>
     Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
     Signed-off-by: James Bottomley <JBottomley@xxxxxxxxxxxxx>



WARNING: CPU: 0 PID: 28512 at drivers/ata/libata-eh.c:4037 ata_eh_finish+0xb4/0xcc
CPU: 0 PID: 28512 Comm: kworker/u32:2 Tainted: G     W  OE 4.14.0#1
......
Call trace:
[<ffff0000088b7bd0>] ata_eh_finish+0xb4/0xcc
[<ffff0000088b8420>] ata_do_eh+0xc4/0xd8
[<ffff0000088b8478>] ata_std_error_handler+0x44/0x8c
[<ffff0000088b8068>] ata_scsi_port_error_handler+0x480/0x694
[<ffff000008875fc4>] async_sas_ata_eh+0x4c/0x80
[<ffff0000080f6be8>] async_run_entry_fn+0x4c/0x170
[<ffff0000080ebd70>] process_one_work+0x144/0x390
[<ffff0000080ec100>] worker_thread+0x144/0x418
[<ffff0000080f2c98>] kthread+0x10c/0x138
[<ffff0000080855dc>] ret_from_fork+0x10/0x18

Hi John, hi Jason,

We've seen this warning once on pm80xx with a SATA SSD in production (on
a 3.12 kernel), but failed to find the root cause.
In my case it was a chain of events: one SSD failed, which led to error
handling and stuck IO.

Do you have a reproducer?


I have hit this warning several times while our test team was running a
very large test suite. Sorry to say that I do not have a simple
reproducer yet.


Typically we would see this when commands timeout after we unplug a SATA
disk with IO in flight, right?


Yes, that's true.

John

Your change looks good to me, but it would be good to hear from Dan &
James.

Thanks,
Jack






