Re: [PATCH 0/2] mpt2sas,mpt3sas - PCI master abort fixups

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Mon, 13 Apr 2015 07:38:00 -0700

On Mon, 2015-04-13 at 10:06 -0400, Joe Lawrence wrote:
> On 04/12/2015 08:54 PM, James Bottomley wrote:
> > On Sun, 2015-04-12 at 20:11 -0400, Joe Lawrence wrote:
> >> On 12/30/2014 09:07 AM, Joe Lawrence wrote:
> >>> A colleague noticed that the mpt2 and mpt3sas drivers do not correctly
> >>> check the PCI master abort pattern in _base_wait_for_doorbell_ack.  This
> >>> pattern should be checked *prior* to any valid bit patterns, which would
> >>> always return true since a PCI read on master abort sets all bits high.
> >>>
> >>> The second patch adds similar checking to _base_wait_for_doorbell_int and
> >>> _base_wait_for_doorbell_not_used to avoid potentially long loops around
> >>> PCI reads.
> >>>
> >>> Joe Lawrence (2):
> >>>   mpt2sas,mpt3sas: correct master-abort checking in doorbell ack
> >>>   mpt2sas,mpt3sas: additional master abort checks
> >>>
> >>>  drivers/scsi/mpt2sas/mpt2sas_base.c |   17 ++++++++++++-----
> >>>  drivers/scsi/mpt3sas/mpt3sas_base.c |   17 ++++++++++++-----
> >>>  2 files changed, 24 insertions(+), 10 deletions(-)
> >>>
> >>
> >> Avago ping?
> >>
> >> This one was pretty straightforward: check 0xFFFFFFFF *before* any
> >> individual bit(s), i.e. before reading the doorbell register.
> > 
> > OK, Joe, explain why this patch is important: what problems could result
> > from it not being present?  If you convince everyone then no more mpt2/3
> > sas patches until this is at least commented on and a plan of action
> > proposed.
> 
> Hi James,
> 
> As currently coded:  If the PCI read returns a master abort,
> _base_wait_for_doorbell_ack will loop until it exhausts its timeout (up
> to 15 seconds).  Other parts of the driver, like the periodic watchdog
> or EEH, may detect a similar problem before such a long time and cleanup
> the mess.  However, complete device removal may be stalled until whoever
> called _base_wait_for_doorbell_ack is satisfied that it has finished.

I think we all picked this up from the changelog.  What I meant was in
what situations might a card get a master abort ... because that's when
the problem becomes user visible.  It sounds like it's something that
might not occur very often or is a bit theoretical, is that right?

> This behavior is not really a bug, but feels like one in the making.
> Should additional code be introduced, copy/pasted, etc. it may not do
> what was intended.
> 
> For future reference, would a repost have been more appropriate?  This
> changeset was so small that I figured a status ping would have sufficed.

Either works.  I was just trying to work out what sort of attention
needs to be paid to the fix based on what sort of problem it fixes for
the end user.

James

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html