Re: [PATCH] pci-driver: Add driver load messages

Prarit Bhargava <prarit@xxxxxxxxxx> · Thu, 18 Feb 2021 13:36:35 -0500

On 1/26/21 10:12 AM, Bjorn Helgaas wrote:
> Hi Prarit,
> 
> On Tue, Jan 26, 2021 at 09:05:23AM -0500, Prarit Bhargava wrote:
>> On 1/26/21 8:53 AM, Leon Romanovsky wrote:
>>> On Tue, Jan 26, 2021 at 08:42:12AM -0500, Prarit Bhargava wrote:
>>>> On 1/26/21 8:14 AM, Leon Romanovsky wrote:
>>>>> On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote:
>>>>>>   Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>>>>>>> On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote:
>>>>>>>> There are two situations where driver load messages are helpful.
>>>>>>>>
>>>>>>>> 1) Some drivers silently load on devices and debugging driver or system
>>>>>>>> failures in these cases is difficult.  While some drivers (networking
>>>>>>>> for example) may not completely initialize when the PCI driver probe() function
>>>>>>>> has returned, it is still useful to have some idea of driver completion.
>>>>>>>
>>>>>>> Sorry, probably it is me, but I don't understand this use case.
>>>>>>> Are you adding a global kernel command-line boot argument to debug
>>>>>>> what, and when?
>>>>>>>
>>>>>>> During boot:
>>>>>>> If the device probe succeeds, you will see it in /sys/bus/pci/[drivers|devices]/*.
>>>>>>> If device fails, you should get an error from that device (fix the
>>>>>>> device to return an error), or something immediately won't work and
>>>>>>> you won't see it in sysfs.
>>>>>>
>>>>>> What if there is a panic during boot?  There's no way to get to sysfs.
>>>>>> That's the case where this is helpful.
>>>>>
>>>>> How? If you have a kernel panic, it means you have a much worse problem
>>>>> than an unsupported device. If the kernel panic was caused by the driver,
>>>>> you will see a call trace related to it. If the kernel panic was caused by
>>>>> something else, supported/not supported won't help here.
>>>>
>>>> I still have no idea *WHICH* device it was that the panic occurred on.
>>>
>>> The kernel panic is printed from the driver. There is one driver loaded
>>> for all identical PCI devices, which are probed regardless of their
>>> number.
>>>
>>> If you have a host with ten identical cards, you will see one driver, and
>>> this is where the problem is, not in a supported/not-supported device.
>>
>> That's true, but you can also have different cards loading the same driver.
>> See, for example, any PCI_IDs list in a driver.
>>
>> For example,
>>
>> 10:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02)
>> 20:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
>>
>> Both load the megaraid driver and have different profiles within the
>> driver.  I have no idea which one actually panicked until removing
>> one card.
>>
>> It's MUCH worse when debugging new hardware and getting a panic
>> from, for example, the uncore code which binds to a PCI mapped
>> device.  One device might work and the next one doesn't.  And then
>> you can multiply that by seeing *many* panics at once and trying to
>> determine if the problem was on one specific socket, die, or core.
> 
> Would a dev_panic() interface that identified the device and driver
> help with this?
> 

^^ the more I look at this problem, the more a dev_panic() that would output a
device specific message at panic time is what I really need.

> For driver_load_messages, it doesn't seem necessarily PCI-specific.
> If we want a message like that, maybe it could be in
> driver_probe_device() or similar?  There are already a few pr_debug()
> calls in that path.  There are some enabled by initcall_debug that
> include the return value from the probe; would those be close to what
> you're looking for?

I took a look at those, and unfortunately they do not meet my requirements.
Ultimately, at panic time, I need to know that a driver was loaded on a device
at a specific location in the PCI space.

The driver_probe_device() pr_debug() calls tell me the location and the driver,
but not anything to uniquely identify the device (i.e., the PCI vendor and device
IDs).

It sounds like you've had some thoughts about a dev_panic() implementation.
Care to share them with me?  I'm more than willing to implement it but just want
to get your more experienced view of what is needed.
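
To make the requirement concrete, here is a rough userspace sketch of the
identification line a dev_panic() might emit: driver name, PCI location, and
vendor/device IDs together.  Everything in it is an assumption for
illustration: struct mock_pci_dev, dev_panic_prefix(), and the output format
are hypothetical, not existing kernel interfaces, and a real implementation
would take a struct device and end in panic().

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical stand-in for the few struct pci_dev fields that matter here. */
struct mock_pci_dev {
	uint16_t domain;
	uint8_t bus;
	uint8_t slot;
	uint8_t func;
	uint16_t vendor;
	uint16_t device;
	const char *driver;	/* NULL if no driver is bound */
};

/*
 * Format the prefix a dev_panic()-style helper could print before the
 * panic message: "<driver> <domain>:<bus>:<slot>.<func> [<vendor>:<device>]".
 * Returns the snprintf() result (length that was or would have been written).
 */
int dev_panic_prefix(char *buf, size_t len, const struct mock_pci_dev *d)
{
	return snprintf(buf, len, "%s %04x:%02x:%02x.%x [%04x:%04x]",
			d->driver ? d->driver : "(no driver)",
			(unsigned int)d->domain, (unsigned int)d->bus,
			(unsigned int)d->slot, (unsigned int)d->func,
			(unsigned int)d->vendor, (unsigned int)d->device);
}
```

With two cards bound to the same driver, as in the megaraid example above, the
location and ID pair in that prefix is what would tell them apart at panic
time.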

P.

> 
> Bjorn
>