Re: [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal

On Mon, Apr 15, 2019 at 02:47:01PM -0400, Dennis Dalessandro wrote:
> On 4/12/2019 9:55 AM, Jason Gunthorpe wrote:
> > On Thu, Apr 11, 2019 at 08:37:53PM +0000, Arumugam, Kamenee wrote:
> > > On Thu, Apr 11, 2019 at 06:22:45PM +0000, Arumugam, Kamenee wrote:
> > > 
> > > > This is a device bug then.
> > > 
> > > > A RDMA device must accept and respond to all TLPs that the CPU
> > > > could create for the user accessible BAR pages.
> > > 
> > > > A user process must not be able to crash the CPU or make the
> > > > device malfunction by accessing the exposed BAR page. This
> > > > > includes a broad range of topics, like mis-aligned accesses,
> > > > > SSE instructions, atomics, etc.
> > > 
> > > > Is blocking AER even enough here? If the device isn't
> > > > generating a reasonable reply I have a bad feeling worse will
> > > > happen.
> > > 
> > > After blocking unsupported request error, we don't see any other
> > > issue including no system hang.
> > 
> > Are you specifically testing all the special TLPs the CPU can
> > produce?
> 
> All the special TLPs should have been tested. This however seems to
> be a missed test case. Not that surprising though given differences
> in BIOS and things of that nature that something falls through the
> cracks and is extra hard to find.

Is there a published erratum for this?  I don't have warm fuzzies yet
that we actually know the root cause here.

Kamenee said the problem case was:

  user-level application is making spurious read accesses (invalid
  width access) to this memory mapping causing the device to report an
  unsupported request error through AER.

So I guess that means the application performed a read and got invalid
data back?  I think the Root Complex had to supply *some* data to
complete the CPU's read, and since the HFI responded with UR instead
of data, the RC probably fabricated something.  Many RCs fabricate ~0,
but I don't think that's actually required by the spec, so I'm
doubtful that the application can reliably detect this.
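
To make the detection problem concrete, here is a minimal sketch in C
of the check an application might attempt.  It assumes the Root
Complex fabricates all ones for a read completed with UR, which is
common but, as noted above, not something the spec appears to
require.  The helper name and its parameters are hypothetical and are
not taken from the hfi1 driver or any library.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical helper: report whether a value read from a mapped BAR
 * is all ones for the given access width.
 *
 * val   - value returned by the read, zero-extended to 64 bits
 * width - access width in bytes; must be 1, 2, 4, or 8
 *
 * ASSUMPTION: the Root Complex returns ~0 when the device answers
 * with Unsupported Request.  A true result is only a hint, because a
 * register may legitimately hold all ones, and a false result proves
 * nothing if the RC fabricates some other value.
 */
static bool read_looks_fabricated(uint64_t val, size_t width)
{
	uint64_t mask = (width >= 8) ? UINT64_MAX
				     : ((UINT64_C(1) << (width * 8)) - 1);

	/* Compare only the bytes that were actually read. */
	return (val & mask) == mask;
}
```

Even under that assumption the check is ambiguous in both directions,
which is why the application cannot rely on it.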

I'd be really surprised that something as obvious as an invalid width
wasn't tested, especially if this is intended for direct mapping into
user applications.

Bjorn


