Hi Prarit,

On Tue, Jan 26, 2021 at 09:05:23AM -0500, Prarit Bhargava wrote:
> On 1/26/21 8:53 AM, Leon Romanovsky wrote:
> > On Tue, Jan 26, 2021 at 08:42:12AM -0500, Prarit Bhargava wrote:
> >> On 1/26/21 8:14 AM, Leon Romanovsky wrote:
> >>> On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote:
> >>>> Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> >>>>> On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote:
> >>>>>> There are two situations where driver load messages are helpful.
> >>>>>>
> >>>>>> 1) Some drivers silently load on devices, and debugging driver or
> >>>>>> system failures in these cases is difficult. While some drivers
> >>>>>> (networking, for example) may not be completely initialized when
> >>>>>> the PCI driver probe() function has returned, it is still useful
> >>>>>> to have some idea of driver completion.
> >>>>>
> >>>>> Sorry, it is probably just me, but I don't understand this use
> >>>>> case. Are you adding a global, kernel-wide command-line boot
> >>>>> argument to debug what, and when?
> >>>>>
> >>>>> During boot:
> >>>>> If the device succeeds, you will see it in
> >>>>> /sys/bus/pci/[drivers|devices]/*.
> >>>>> If the device fails, you should get an error from that device (fix
> >>>>> the device to return an error), or something immediately won't
> >>>>> work and you won't see it in sysfs.
> >>>>
> >>>> What if there is a panic during boot? There's no way to get to
> >>>> sysfs. That's the case where this is helpful.
> >>>
> >>> How? If you have a kernel panic, you have a much worse problem than
> >>> an unsupported device. If the kernel panic was caused by the driver,
> >>> you will see a call trace related to it. If the kernel panic was
> >>> caused by something else, supported/not-supported won't help here.
> >>
> >> I still have no idea *WHICH* device the panic occurred on.
> >
> > The kernel panic is printed from the driver. There is one driver
> > loaded for all identical PCI devices, which are probed regardless of
> > how many there are.
> >
> > If you have a host with ten identical cards, you will see one driver,
> > and that is where the problem lies, not in a supported/not-supported
> > device.
>
> That's true, but you can also have different cards loading the same
> driver. See, for example, any PCI_IDs list in a driver.
>
> For example,
>
> 10:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02)
> 20:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
>
> Both load the megaraid driver and have different profiles within the
> driver. I have no idea which one actually panicked until removing one
> card.
>
> It's MUCH worse when debugging new hardware and getting a panic from,
> for example, the uncore code, which binds to a PCI-mapped device. One
> device might work and the next one doesn't. And then you can multiply
> that by seeing *many* panics at once and trying to determine whether
> the problem was on one specific socket, die, or core.

Would a dev_panic() interface that identified the device and driver help
with this?

For driver_load_messages, it doesn't seem necessarily PCI-specific. If
we want a message like that, maybe it could be in driver_probe_device()
or similar? There are already a few pr_debug() calls in that path.

There are some messages enabled by initcall_debug that include the
return value from the probe; would those be close to what you're
looking for?

Bjorn