Re: [PATCH] pci/probe: Enable CRS for Intel Haswell root ports

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[+cc Josh]

On Tue, Sep 2, 2014 at 1:30 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Mon, Sep 1, 2014 at 9:14 PM, Bjorn Helgaas <bhelgaas@xxxxxxxxxx> wrote:
>>
>> I'm not a fan of adding a whitelist for devices that work correctly.
>> I don't think that's a maintainable solution.  Since we haven't had
>> many systems yet that care about CRS, some kind of "enable CRS on
>> machines newer than X" might work.
>
> I'd suggest trying to just enable CRS unconditionally (ie revert my
> old commit ad7edfe04908). We've done other changes in this area since,
> and in particular, it's entirely possible that the original problem we
> had doesn't even trigger any more.
>
> In particular, looking at one of the old reports, I don't see that
> "Device %04x:%02x:%02x.%d not responding" warning that we *should*
> have triggered in pci_scan_device(). So I wonder if we had some case
> that read the vendor ID without that whole loop, and that _that_ was
> our fundamental problem.
>
> It would be great if somebody could test with the old hardware that
> triggered the problem originally, but those reports are from 2007, so
> it might be hard to find anything relevant today. But trying to just
> go back to unconditionally enabling CRS (say, in 3.18-rc1) and seeing
> if anybody hollers, might just be the right approach.

Josh actually does still have the hardware, so he *might* be able to test it.

My original theory was that the device didn't advertise CRS SV
capability, and something went wrong when we enabled it anyway.  But
lspci incorrectly decoded CRS SV from the RootCap, so we don't know
whether it was advertised, and I didn't have an explanation for what
could be going wrong if we enabled it.

But as you point out, we don't see the "not responding" message.  We
only print that if we read 0xffff0001 (device/vendor ID).  But lspci
claims the vendor:device ID is 0001:8168.  So my new theory is:

  - Root Port issues Configuration Request to read four bytes at 0x0.
  - Device generates Completion with Config Request Retry Status,
returning data of 10ec:8168.
  - Root Port (with SV disabled) retries the Config Request until it
succeeds, then returns 0x816810ec.  In this configuration, these
boards work correctly today.
  - Root Port (with SV enabled) replaces Vendor ID with 0001 but fails
to replace Device ID with ffff, so it returns 0x81680001.  This is a
hardware defect per PCIe r3.0, sec 2.3.2.
  - Linux sees 0x81680001, but we're looking for 0xffff0001, so we
think everything's fine and we don't retry or warn about it.

So I think we should try two patches:

  1) Enable CRS SV if support is advertised, i.e., Rajat's current patch
  2) Change pci_bus_read_dev_vendor_id() to test only the two bytes of
the vendor ID and ignore the device ID

I suspect that if Josh tested patch 1) alone, he would see the same
problem he originally reported, and that if he tested both together,
it would work correctly.  Obviously, if it works out this way, I would
apply them in the reverse order to avoid a bisection problem.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux