Re: `pcilib: sysfs_read_vpd: read failed: Input/output error` on ASRock E350M1

Myron Stowe <myron.stowe@xxxxxxxxx> · Tue, 17 May 2016 19:52:19 -0600

On Tue, May 17, 2016 at 3:21 PM, Paul Menzel
<paulepanter@xxxxxxxxxxxxxxxxxxxxx> wrote:
> Dear Myron,
>
>
> Am Montag, den 16.05.2016, 19:42 -0600 schrieb Myron Stowe:
>> On Mon, May 16, 2016 at 2:37 PM, Paul Menzel wrote:
>
>> > Am Montag, den 16.05.2016, 12:02 -0600 schrieb Myron Stowe:
>> > >
>> > > On Sun, May 15, 2016 at 10:56 AM, Paul Menzel wrote:
>> > […]
>> > > >
>> > > > Is that related to [1]?
>> > > There have been a number of recent kernel commits to work-around know
>> > > buggy devices -
>> > >   v4.6-rc5
>> > >     67e6587  cxgb4: Set VPD size so we can read both VPD structures
>> > >     cb92148  PCI: Add pci_set_vpd_size() to set VPD size
>> > >   v4.6 -
>> > >     7c20078  PCI: Prevent VPD access for buggy devices
>> > >     c521b01  PCI: Sleep rather than busy-wait for VPD access completion
>> > >     408641e  PCI: Fold struct pci_vpd_pci22 into struct pci_vpd
>> > >     f1cd93f  PCI: Rename VPD symbols to remove unnecessary "pci22"
>> > >     da00684  PCI: Remove struct pci_vpd_ops.release function pointer
>> > >     6437907  PCI: Move pci_vpd_release() from header file to pci/access.c
>> > >     fc0a407  PCI: Move pci_read_vpd() and pci_write_vpd() close to other code
>> > >     104daa7  PCI: Determine actual VPD size on first access
>> > >     c556388  PCI: Use bitfield instead of bool for struct pci_vpd_pci22.busy
>> > >     f52e562  PCI: Allow access to VPD attributes with size 0
>> > >     9eb45d5  PCI: Update VPD definitions
>> > >   v4.3 -
>> > >     da2d03e  PCI: Use function 0 VPD only for identical functions
>> > >     9d92407  PCI: Fix devfn for VPD access through function 0
>> > >     7aa6ca4  PCI: Add VPD function 0 quirk for Intel Ethernet devices
>> > >     932c435  PCI: Add dev_flags bit to access VPD through function 0
>> > >
>> > > What were you getting/seeing before?
>> > The error wasn’t shown. Sorry if that is not a helpful answer. If I
>> > provide more information, please tell me how I can get them.
>> I expect that prior to the kernel updates you were seeing:
>>
>> lspci -s 3:0 -vv
>> ...
>>         Capabilities: [d0] Vital Product Data
>>                 Unknown small resource type 00, will not decode more.
>> ...
>
> From logs stored in coreboot’s board status repository [2].
>
> ```
> $ more asrock/e350m1/4.3-817-g477a0d6/2016-04-22T18_41_34Z/lspci-vvxxx.txt
> […]
>        Capabilities: [d0] Vital Product Data
>                No end tag found
> […]
> ```

The above - "No end tag found" - is from 'lspci' reading the VPD area
and parsing it looking for an small resource data type "End Tag"
(0x0f) and never finding one.  Thus it seems like the VPD area is
malformed at best.

>
>> and with 'xxd', looking at the VPD area you likely would have seen:
>>
>> # xxd /sys/bus/pci/devices/0000\:03\:00.0/vpd
>> 0000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
>> 0000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
>> 0000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
>> ...
>>
>> and nothing in the 'dmesg' logs.
>
> ```
> $ more asrock/e350m1/4.3-817-g477a0d6/2016-04-22T18_41_34Z/lspci-vvxxx.txt
> […]
> c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> e0: 10 00 01 01 01 12 63 80 00 00 00 00 00 00 00 00
> f0: 00 00 00 00 00 80 80 00 00 00 00 00 00 00 00 00
> […]
> ```

Careful here, the above is _not_ the VPD area.  The above is the
device's standard PCI (type 0) configuration header registers [I'm
thinking you mistook the 0xd0 from "Capabilities: [d0] Vital Product
Data" and thought that was the offset for the VPD area - it is not.
To get to the VPD area one has to manipulate a couple of VPD registers
which is basically obtaining the VPD bytes from within the device
itself (i.e. from within the device's firmware)].

To get the VPD area you will need to run a kernel prior to the commits
I pointed out and then use 'xxd' as shown above to gain insight into
what, if anything, is really there.

>
> There is nothing in the Linux 3.16 messages.
>
>> With the recent kernel updates you see the new output from the kernel
>> and 'lspci -vv' that started this thread.  I expect that you also get
>> something close to the following from 'xxd':
>>
>> # xxd /sys/bus/pci/devices/0000\:03\:00.0/vpd
>> xxd: Input/output error
>
> Correct.
>
>> and also see the following in the 'dmesg' logs.
>>
>> # dmesg | grep VPD
>> r8169 0000:03:00.0: invalid short VPD tag 00 at offset 1
>
> Indeed similar:
>
> ```
> $ dmesg | grep VPD
> [15841.258561] r8169 0000:03:00.0: invalid large VPD tag 7f at offset 0
> ```

Humm, the above is interesting to me in that it implies that the VPD
area is not necessarily all 0's as I was expecting.  A "Large VPD tag"
is denoted by the most significant bit of the first byte being set.
However, the value denoted - 0x7f - does not have the MSB set so now
I'm a bit confused (perhaps the MSB is stripped off - I'll need to
look at the related code).  Either way, its looking like perhaps the
VPD area is not all 0's so I'd really be interested in knowing what is
there.

Ok, looking at the code - ./drivers/pci/access.c::pci_vpd_size() - the
tag value output - 7f - had its MSB stripped off via
'pci_vpd_lrdt_tag()'.  So what must have been read for the first byte
was 0xff.

>
>> The new kernel behavior is correct in its complaint as the device is
>> advertising that it has VPD via a 'capability' yet its VPD data area
>> is likely all 0's.  This is a device error - specifically an error
>> with the device's firmware: either it should _not_ advertise via the
>> 'capability' that it has VPD or, it s VPD area needs to conform to the
>> VPD data format as expressed in the "PCI Local Bus Specification, Rev.
>> 3.0" appendix I - "Vital Product Data".
>>
>> With the new kernel behavior, effectively circumventing reading of an
>> invalid VPD area, you can't get to the data via "# xxd" (as shown
>> above).  You could run a kernel prior to the changes and run the "xxd"
>> command and likely see that the VPD area is all 0's.  I would be
>> interested in confirming such if you could do so please.
>
> Please see the paste above. Please tell me if I missed something.

I have to admit I'm a bit surprised at this point (that the VPD area
is not all 0's as I was expecting).  The only way we'll know is to
obtain the VPD area - if you can get that via 'xxd' it would be
interesting to see what is there exactly.

>
> So what to do about this? Contact Realtek? Add a workaround?

The VPD area is almost certainly malformed (or doesn't exist at all).
As a colleague of mine noted today, LSI devices seem to be especially
negligent in this manner (see commit 7c20078 ("PCI: Prevent VPD access
for buggy devices")).  Assuming the VPD area of this device is at best
invalid then the kernel is correct in its current behavior - not much
more it can do.  While it would be good for the device's firmware to
be upgraded to either not advertise it has VPD or to put in valid VPD,
I very much doubt you'd get any traction there.  You could try adding
this device's signature (device and vendor id) to the 7c20078 quirk
and see if there is any subtle difference.  If there was, it'd be very
subtle at best.

All that aside, you are just questioning the error output - correct
(i.e. you are not expecting, nor needing, valid VPD content for some
reason and questioning that I assume)?

Myron

>
>
> Thanks,
>
> Paul
>
>
>> > > > [1] https://lkml.org/lkml/2016/4/15/649
>> > [2] http://review.coreboot.org/cgit/board-status.git/tree/asrock/e350m1
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html