Handling Asynchronous Notification when IO are outstanding

Gwendal Grignou <gwendal@xxxxxxxxxx> · Mon, 8 Mar 2010 16:27:49 -0800

I am working with a Marvell 7042 controller and a SiI3276 port multiplier
[PMP] and would like to handle asynchronous notification [AN]
properly.
However, if a command is outstanding when the PMP raises an AN, the
port is frozen, which prevents the _autopsy_ error code from doing its work.

For example, here is a case where a disk has a power glitch behind a
port multiplier while a command is outstanding. The PMP detects the
signal loss and sends an AN.
In sata_mv.c, mv_err_intr() is called and detects the notification: it
pushes info into the error descriptor and calls ata_port_schedule_eh() via
sata_async_notification().

However, when we enter ata_scsi_error(), if a command is outstanding,
__ata_port_freeze() is called, which prevents sata_scr_read() from
succeeding in ata_eh_link_autopsy():

Feb 25 02:11:57 bdfl11 kernel: ata4.00: failed to read SCR 1 (Emask=0x40)
Feb 25 02:11:57 bdfl11 kernel: ata4.01: failed to read SCR 1 (Emask=0x40)
Feb 25 02:11:57 bdfl11 kernel: ata4.02: failed to read SCR 1 (Emask=0x40)
Feb 25 02:11:57 bdfl11 kernel: ata4.03: failed to read SCR 1 (Emask=0x40)
Feb 25 02:11:57 bdfl11 kernel: ata4.04: failed to read SCR 1 (Emask=0x40)
Feb 25 02:11:57 bdfl11 kernel: ata4.05: failed to read SCR 1 (Emask=0x40)
Feb 25 02:11:57 bdfl11 kernel: ata4.15: exception Emask 0x4 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 25 02:11:57 bdfl11 kernel: ata4.15: edma_err_cause=02000100 pp_flags=00000005, fis_cause=00008200
Feb 25 02:11:57 bdfl11 kernel: ata4.00: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 25 02:11:57 bdfl11 kernel: ata4.01: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 25 02:11:57 bdfl11 kernel: ata4.02: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 25 02:11:57 bdfl11 kernel: ata4.03: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 25 02:11:57 bdfl11 kernel: ata4.04: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 25 02:11:57 bdfl11 kernel: ata4.04: cmd ca/00:80:e7:78:56/00:00:00:00:00/e8 tag 3 dma 65536 out
Feb 25 02:11:57 bdfl11 kernel: res 50/00:00:4e:10:45/00:00:00:00:00/e8 Emask 0x4 (timeout)
Feb 25 02:11:57 bdfl11 kernel: ata4.04: status: { DRDY }
Feb 25 02:11:57 bdfl11 kernel: ata4.05: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 25 02:11:57 bdfl11 kernel: ata4.15: hard resetting link
Feb 25 02:11:58 bdfl11 kernel: ata4.15: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 25 02:11:58 bdfl11 kernel: ata4.00: hard resetting link
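
For reference, the Emask values in the log above can be decoded with a
small user-space sketch. The bit assignments are written from memory of
enum ata_completion_errors in include/linux/libata.h and should be checked
against the tree in use; decode_emask() and the emask_bits table are
hypothetical helpers for illustration, not libata functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed to mirror enum ata_completion_errors; verify against libata.h. */
struct emask_bit {
    unsigned int bit;
    const char *name;
};

static const struct emask_bit emask_bits[] = {
    { 1u << 0,  "AC_ERR_DEV" },        /* device reported error */
    { 1u << 1,  "AC_ERR_HSM" },        /* host state machine violation */
    { 1u << 2,  "AC_ERR_TIMEOUT" },    /* timeout */
    { 1u << 3,  "AC_ERR_MEDIA" },      /* media error */
    { 1u << 4,  "AC_ERR_ATA_BUS" },    /* ATA bus error */
    { 1u << 5,  "AC_ERR_HOST_BUS" },   /* host bus error */
    { 1u << 6,  "AC_ERR_SYSTEM" },     /* system error */
    { 1u << 7,  "AC_ERR_INVALID" },    /* invalid argument */
    { 1u << 8,  "AC_ERR_OTHER" },      /* unknown */
    { 1u << 9,  "AC_ERR_NODEV_HINT" }, /* polling device detection hint */
    { 1u << 10, "AC_ERR_NCQ" },        /* marker for offending NCQ qc */
};

/* Hypothetical helper: write a '|'-separated list of set bits into buf. */
static const char *decode_emask(unsigned int emask, char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (i = 0; i < sizeof(emask_bits) / sizeof(emask_bits[0]); i++) {
        if (!(emask & emask_bits[i].bit))
            continue;
        if (buf[0] != '\0')
            strncat(buf, "|", len - strlen(buf) - 1);
        strncat(buf, emask_bits[i].name, len - strlen(buf) - 1);
    }

    if (buf[0] == '\0')
        strncat(buf, "AC_ERR_OK", len - 1);

    return buf;
}
```

Under these assumptions, decode_emask(0x40, ...) yields AC_ERR_SYSTEM for
the failed SCR reads, 0x4 yields AC_ERR_TIMEOUT for the command on ata4.04,
and 0x100 yields AC_ERR_OTHER for the remaining links. The 0x40 on the SCR
reads would be consistent with an internal command being refused while the
port is frozen.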

I haven't found the right solution to handle this problem yet:

1: removing __ata_port_freeze() in ata_scsi_error() unilaterally is
very dangerous: it opens a new race condition and may schedule the
error handler several times.
2: in sata_mv, we cannot wait for commands to complete like we do for
NCQ, because in the case above, the command sent to the failed disk
will never come back.

I am thinking of waiting for all IO to complete on all ports but the
impacted one(s), adding a new action in the ehi descriptor to indicate an
AN is scheduled, and preventing the error from freezing the port if only
IOs to the failed ports are outstanding.
Then the _autopsy_ code would collect and decode the SError register for
the failed port.
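
The decision described above could be modelled roughly as follows. This is
a user-space sketch only, not libata code: struct port_snapshot,
healthy_links_busy() and may_skip_freeze() are hypothetical names, and the
real check would have to live under the port lock and use the existing
ehi/link structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* PMP device ports are numbered 0..14; port 15 is the control port. */
#define MAX_PMP_LINKS 15

/* Hypothetical snapshot of a port behind a PMP; not a libata structure. */
struct port_snapshot {
    unsigned int nr_links;                   /* number of fan-out links */
    unsigned int outstanding[MAX_PMP_LINKS]; /* commands in flight per link */
    unsigned int an_links;                   /* bitmask of links flagged by the AN */
};

/* True if a link NOT flagged by the AN still has commands in flight. */
static bool healthy_links_busy(const struct port_snapshot *p)
{
    unsigned int i;

    for (i = 0; i < p->nr_links; i++) {
        if (p->outstanding[i] > 0 && !(p->an_links & (1u << i)))
            return true;
    }
    return false;
}

/*
 * True if the port could be left unfrozen for autopsy: an AN is pending
 * and every outstanding command targets a link the AN flagged.
 */
static bool may_skip_freeze(const struct port_snapshot *p)
{
    assert(p != NULL && p->nr_links <= MAX_PMP_LINKS);

    if (p->an_links == 0)
        return false; /* no AN pending: keep the current freeze behaviour */

    return !healthy_links_busy(p);
}
```

In the case from the log, with one command outstanding on link 4 and the
AN flagging link 4, may_skip_freeze() returns true. If another link still
had IO in flight, it returns false and the caller would first wait for
that IO to drain.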

Is it the right approach?

Thanks,
Gwendal.