Re: [PATCH v2 2/2] PCI: pciehp: Abort hot-plug if pci_hp_add_bridge() fails

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, May 06, 2024 at 10:37:01AM +0200, Nam Cao wrote:
> I just discovered a new crash scenario with shpchp:
> 
> First, hot-add a bridge:
> (qemu) device_add pci-bridge,id=br2,bus=br1,chassis_nr=1,addr=1
> 
> But with the current patch, the hot-plug is still successful, just that the
> bridge is not properly configured:
> $ lspci -t
> -[0000:00]-+-00.0
>            +-01.0
>            +-02.0
>            +-03.0-[01-02]----00.0-[02]----01.0--  <-- new hot-added bridge
>            +-1f.0
>            +-1f.2
>            \-1f.3
> 
> But! Now I leave the hot-added bridge as is, and hot-add an ethernet card:
> (qemu) device_add e1000,bus=br1,id=eth0,addr=2
> 
> this command crashes the kernel (full crash log below).
> 
> The problem is because during hot-plugging of this ethernet card,
> pci_bus_add_devices() is invoked and the previously hot-plugged bridge hasn't
> been marked as "added" yet, so the driver attempts to add this bridge
> again, leading to the crash.
> 
> Now for pciehp, this scenario cannot happen, because there is only a single
> slot, so we cannot hot-plug another device. But still, this makes me think
> perhaps we shouldn't leave the hot-plugged bridge in a improperly-configured
> state as this patch is doing. I cannot think of a scenario that would crash pciehp
> similarly to shpchp. But that's just me, maybe there is a rare scenario
> that can crash pciehp, but I just haven't seen yet.
> 
> My point is: better safe than sorry. I propose going back to the original
> solution: calling pci_stop_and_remove_bus_device() and return a negative
> error, and get rid of this improperly configured bridge. It is useless
> anyways without a subordinate bus number.
> 
> What do you think?

When the kernel runs out of BAR address space for devices downstream of
a bridge, it doesn't de-enumerate the bridge.  The devices are just
unusable.

So my first intuition is that running out of bus numbers shouldn't result
in de-enumeration either.

Remind me, how exactly does the NULL pointer deref occur?  I think it's
because no struct pci_bus was allocated for the subordinate bus of the
hot-plugged bridge, right?  Because AFAICS that would happen in

pci_hp_add_bridge()
  pci_can_bridge_extend()
    pci_add_new_bus()
      pci_alloc_child_bus()

but we never get that far because pci_hp_add_bridge() bails out with -1.
So the subordinate pointer remains a NULL pointer.

Perhaps we should allocate an empty struct pci_bus instead.
Or check for a NULL subordinate pointer instead of crashing.

I can't really say off the top of my head which solution is appropriate
here, it requires going through the code paths and identifying the
complexity associated with each solution.

Thanks,

Lukas




[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux