On 1/26/21 10:12 AM, Bjorn Helgaas wrote:
> Hi Prarit,
>
> On Tue, Jan 26, 2021 at 09:05:23AM -0500, Prarit Bhargava wrote:
>> On 1/26/21 8:53 AM, Leon Romanovsky wrote:
>>> On Tue, Jan 26, 2021 at 08:42:12AM -0500, Prarit Bhargava wrote:
>>>> On 1/26/21 8:14 AM, Leon Romanovsky wrote:
>>>>> On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote:
>>>>>> Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>>>>>>> On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote:
>>>>>>>> There are two situations where driver load messages are helpful.
>>>>>>>>
>>>>>>>> 1) Some drivers silently load on devices, and debugging driver or
>>>>>>>> system failures in those cases is difficult. While some drivers
>>>>>>>> (networking, for example) may not be completely initialized by the
>>>>>>>> time the PCI driver probe() function returns, it is still useful
>>>>>>>> to have some indication that the probe completed.
>>>>>>>
>>>>>>> Sorry, probably it is me, but I don't understand this use case.
>>>>>>> Are you adding a global, kernel-wide command line boot argument to
>>>>>>> debug what, and when?
>>>>>>>
>>>>>>> During boot:
>>>>>>> If the device probes successfully, you will see it in
>>>>>>> /sys/bus/pci/[drivers|devices]/*. If the device fails, you should
>>>>>>> get an error from that device (fix the device to return an error),
>>>>>>> or something immediately won't work and you won't see it in sysfs.
>>>>>>
>>>>>> What if there is a panic during boot? There's no way to get to
>>>>>> sysfs. That's the case where this is helpful.
>>>>>
>>>>> How? If you have a kernel panic, you have a much worse problem than
>>>>> an unsupported device. If the panic was caused by the driver, you
>>>>> will see a call trace related to it. If the panic was caused by
>>>>> something else, supported/not-supported won't help here.
>>>>
>>>> I still have no idea *WHICH* device it was that the panic occurred on.
>>>
>>> The kernel panic is printed from the driver. There is one driver
>>> loaded for all identical PCI devices, which are probed without
>>> relation to their number.
>>>
>>> If you have a host with ten of the same card, you will see one
>>> driver, and that is where the problem lies, not in a
>>> supported/not-supported device.
>>
>> That's true, but you can also have different cards loading the same
>> driver. See, for example, any pci_device_id table in a driver.
>>
>> For example,
>>
>> 10:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02)
>> 20:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
>>
>> Both load the megaraid_sas driver and have different profiles within
>> the driver. I have no idea which one actually panicked until I remove
>> one of the cards.
>>
>> It's MUCH worse when debugging new hardware and getting a panic
>> from, for example, the uncore code, which binds to a PCI-mapped
>> device. One device might work and the next one doesn't. And then
>> you can multiply that by seeing *many* panics at once and trying to
>> determine whether the problem was on one specific socket, die, or
>> core.
>
> Would a dev_panic() interface that identified the device and driver
> help with this?

It would, but see below.

> For driver_load_messages, it doesn't seem necessarily PCI-specific.
> If we want a message like that, maybe it could be in
> driver_probe_device() or similar? There are already a few pr_debug()
> calls in that path. There are some enabled by initcall_debug that
> include the return value from the probe; would those be close to what
> you're looking for?

I think the pr_debug() at drivers/base/dd.c:727 might suffice.
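
For reference, the prints in that path look roughly like this; I'm
paraphrasing from a recent drivers/base/dd.c, so the exact line numbers
and format strings will drift between kernel versions:

	/* really_probe(): logged when a probe attempt starts */
	pr_debug("bus: '%s': %s: probing driver %s with device %s\n",
		 drv->bus->name, __func__, drv->name, dev_name(dev));

	/* really_probe(): logged on the success path, after driver_bound() */
	pr_debug("bus: '%s': %s: bound device %s to driver %s\n",
		 drv->bus->name, __func__, dev_name(dev), drv->name);

	/*
	 * really_probe_debug(): substituted for really_probe() when
	 * booting with initcall_debug; reports the probe's return
	 * value and duration.
	 */
	pr_debug("probe of %s returned %d after %lld usecs\n",
		 dev_name(dev), ret, ktime_us_delta(rettime, calltime));

Booting with initcall_debug (and dynamic debug enabled for dd.c) should
get those onto the console, which may already be enough to tell which
device a probe-time panic followed.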
Let me try some tests with that and get back to you.

Thanks for the pointers,

P.

> Bjorn