On 2/18/21 2:06 PM, Bjorn Helgaas wrote: > On Thu, Feb 18, 2021 at 01:36:35PM -0500, Prarit Bhargava wrote: >> On 1/26/21 10:12 AM, Bjorn Helgaas wrote: >>> On Tue, Jan 26, 2021 at 09:05:23AM -0500, Prarit Bhargava wrote: >>>> On 1/26/21 8:53 AM, Leon Romanovsky wrote: >>>>> On Tue, Jan 26, 2021 at 08:42:12AM -0500, Prarit Bhargava wrote: >>>>>> On 1/26/21 8:14 AM, Leon Romanovsky wrote: >>>>>>> On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote: >>>>>>>> Leon Romanovsky <leon@xxxxxxxxxx> wrote: >>>>>>>>> On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote: >>>>>>>>>> There are two situations where driver load messages are helpful. >>>>>>>>>> >>>>>>>>>> 1) Some drivers silently load on devices and debugging driver or system >>>>>>>>>> failures in these cases is difficult. While some drivers (networking >>>>>>>>>> for example) may not completely initialize when the PCI driver probe() function >>>>>>>>>> has returned, it is still useful to have some idea of driver completion. >>>>>>>>> >>>>>>>>> Sorry, probably it is me, but I don't understand this use case. >>>>>>>>> Are you adding global to whole kernel command line boot argument to debug >>>>>>>>> what and when? >>>>>>>>> >>>>>>>>> During boot: >>>>>>>>> If device success, you will see it in /sys/bus/pci/[drivers|devices]/*. >>>>>>>>> If device fails, you should get an error from that device (fix the >>>>>>>>> device to return an error), or something immediately won't work and >>>>>>>>> you won't see it in sysfs. >>>>>>>> >>>>>>>> What if there is a panic during boot? There's no way to get to sysfs. >>>>>>>> That's the case where this is helpful. >>>>>>> >>>>>>> How? If you have kernel panic, it means you have much more worse problem >>>>>>> than not-supported device. If kernel panic was caused by the driver, you >>>>>>> will see call trace related to it. If kernel panic was caused by >>>>>>> something else, supported/not supported won't help here. >>>>>> >>>>>> I still have no idea *WHICH* device it was that the panic occurred on. >>>>> >>>>> The kernel panic is printed from the driver. There is one driver loaded >>>>> for all same PCI devices which are probed without relation to their >>>>> number.> >>>>> If you have host with ten same cards, you will see one driver and this >>>>> is where the problem and not in supported/not-supported device. >>>> >>>> That's true, but you can also have different cards loading the same driver. >>>> See, for example, any PCI_IDs list in a driver. >>>> >>>> For example, >>>> >>>> 10:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02) >>>> 20:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02) >>>> >>>> Both load the megaraid driver and have different profiles within the >>>> driver. I have no idea which one actually panicked until removing >>>> one card. >>>> >>>> It's MUCH worse when debugging new hardware and getting a panic >>>> from, for example, the uncore code which binds to a PCI mapped >>>> device. One device might work and the next one doesn't. And >>>> then you can multiply that by seeing *many* panics at once and >>>> trying to determine if the problem was on one specific socket, >>>> die, or core. >>> >>> Would a dev_panic() interface that identified the device and >>> driver help with this? >> >> ^^ the more I look at this problem, the more a dev_panic() that >> would output a device specific message at panic time is what I >> really need. Bjorn, I went down this road a bit and had a realization. The issue isn't with printing something at panic time, but the *data* that is output. Each PCI device is associated with a struct device. That device struct's name is output for dev_dbg(), etc., commands. The PCI subsystem sets the device struct name at drivers/pci/probe.c: 1799 dev_set_name(&dev->dev, "%04x:%02x:%02x.%d", pci_domain_nr(dev->bus), dev->bus->number, PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn)); My problem really is that the above information is insufficient when I (or a user) need to debug a system. The complexities of debugging multiple broken driver loads would be much easier if I didn't have to constantly add this output manually :). Would you be okay with adding a *debug* parameter to expand the device name to include the vendor & device ID pair? FWIW, I'm somewhat against yet-another-kernel-option but that's really the information I need. I could then add dev_dbg() statements in the local_pci_probe() function. Thoughts? P.