Re: Question about supporting AMD eGPU hot plug case

On Fri, 2021-03-05 at 12:13 -0500, Andrey Grodzovsky wrote:
> 
> On 2021-03-05 11:08 a.m., Sergei Miroshnichenko wrote:
> > On Thu, 2021-03-04 at 14:49 -0500, Andrey Grodzovsky wrote:
> > > + linux-pci
> > > 
> > > On 2021-02-26 1:44 a.m., Sergei Miroshnichenko wrote:
> > > > On Thu, 2021-02-25 at 13:28 -0500, Andrey Grodzovsky wrote:
> > > > > On 2021-02-25 2:00 a.m., Sergei Miroshnichenko wrote:
> > > > > > On Wed, 2021-02-24 at 17:51 -0500, Andrey Grodzovsky wrote:
> > > > > > > On 2021-02-24 1:23 p.m., Sergei Miroshnichenko wrote:
> > > > > > > > ...
> > > > > > > Are you saying that even without hot-plugging, while both
> > > > > > > the nvme and the AMD card are present right from boot, you
> > > > > > > still get BARs moving and MMIO ranges reassigned for the
> > > > > > > NVMe BARs, just because the amdgpu driver will start a
> > > > > > > resize of the AMD card's BARs and this will trigger the
> > > > > > > NVMe's BARs to move, to allow the AMD card's BARs to cover
> > > > > > > the full range of VIDEO RAM?
> > > > > > Yes. Unconditionally, because it is unknown beforehand if
> > > > > > the NVMe's BAR movement will help. In this particular case
> > > > > > the BAR movement is not needed, but is done anyway.
> > > > > > 
> > > > > > BARs are not moved one by one; instead, the kernel releases
> > > > > > all the releasable ones, and then recalculates a new BAR
> > > > > > layout to fit them all. The kernel's algorithm is different
> > > > > > from the BIOS's, so the NVMe has appeared in a new place.
> > > > > > 
> > > > > > This is triggered by the following:
> > > > > > - at boot, if the BIOS hasn't assigned every BAR;
> > > > > > - during pci_resize_resource();
> > > > > > - during pci_rescan_bus() -- after a pciehp event or a
> > > > > > manual rescan via sysfs.
> > > > > 
> > > > > By manual via sysfs you mean something like this - 'echo 1 >
> > > > > /sys/bus/pci/drivers/amdgpu/0000\:0c\:00.0/remove && echo 1 >
> > > > > /sys/bus/pci/rescan' ? I am looking into how to most reliably
> > > > > trigger the PCI code to call my callbacks even without having
> > > > > an external PCI cage for the GPU (it will take me some time to
> > > > > get it).
> > > > 
> > > > Yeah, this is our way to go when a device can't be physically
> > > > removed or unpowered remotely. With just a bit shorter path:
> > > > 
> > > >     sudo sh -c 'echo 1 > /sys/bus/pci/devices/0000\:0c\:00.0/remove'
> > > >     sudo sh -c 'echo 1 > /sys/bus/pci/rescan'
> > > > 
> > > > Or, just the second command (rescan) is enough: a BAR movement
> > > > attempt will be triggered even if there were no changes in the
> > > > PCI topology.
> > > > 
> > > > Serge
> > > > 
> > > 
> > > Hi Sergei
> > > 
> > > Here is a link to the initial implementation on top of your tree
> > > (movable_bars_v9.1):
> > > https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=05d6abceed650181bb7fe0a49884a26e378b908e
> > > I am able to pass one re-scan cycle and can use the card
> > > afterwards (see log1.log).
> > > But according to your prints, only BAR 5, which is the registers
> > > BAR, was updated (amdgpu 0000:0b:00.0: BAR 5 updated: 0xfcc00000
> > > -> 0xfc100000), while I am interested in testing a BAR 0 (graphics
> > > RAM) move, since this is where most of the complexity is. Is there
> > > a way to hack your code to force this?
> > 
> > Hi Andrey,
> > 
> > Regarding the amdgpu's BAR0 remaining in its place: it seems this
> > is because of the fixed BARs starting from fc600000. The kernel
> > tends to group the BARs close to each other, making a bridge window
> > as compact as possible. So the BAR0 had occupied the closest
> > "comfortable" slot 0xe0000000-0xefffffff, with the resulting bridge
> > window of bus 00 covering all the BARs:
> > 
> >      pci_bus 0000:00: resource 10 [mem 0xe0000000-0xfec2ffff window]
> > 
> > I'll let you know if I get an idea of how to rearrange that manually.
> > 
> > Two GPUs can actually swap their places.
> 
> What do you mean ?

I was thinking: once the scenario of a PCI rescan with two GPUs (as
described below) starts working, BAR0 of GPU0 can take the place of
BAR0 of GPU1 after the first rescan.
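
A quick way to watch for such a swap (a sketch only; the device
addresses are taken from the logs in this thread, so adjust them to
your topology). The first line of a device's sysfs `resource` file is
"start end flags" for BAR 0:

```shell
# Print BAR 0 of each GPU; run once before and once after the rescan,
# then compare the two outputs.
for dev in 0000:0b:00.0 0000:0c:00.0; do
    f=/sys/bus/pci/devices/$dev/resource
    # skip silently if the device isn't present on this machine
    [ -r "$f" ] && printf '%s BAR0: %s\n' "$dev" "$(head -n1 "$f")" || true
done
```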

> > What also can make a BAR movable is rmmod'ing its driver. It could
> > be some hack from within a tmux, like:
> > 
> >    rmmod igb; \
> >    rmmod xhci_hcd; \
> >    rmmod ahci; \
> >    echo 1 > /sys/bus/pci/rescan; \
> >    modprobe igb; \
> >    modprobe xhci_hcd; \
> >    modprobe ahci
> 
> But should I also rmmod amdgpu? Or should modprobing the other
> drivers back (hopefully) cause BAR0 of the AMD graphics card to
> move?

You have already made the amdgpu movable, so no need to rmmod it --
just those with fixed BARs:

    xhci_hcd 0000:0c:00.3: BAR 0: assigned fixed [mem 0xfc600000-0xfc6fffff 64bit]
    igb 0000:07:00.0: BAR 0: assigned fixed [mem 0xfc900000-0xfc91ffff]
    igb 0000:07:00.0: BAR 3: assigned fixed [mem 0xfc920000-0xfc923fff]
    ahci 0000:02:00.1: BAR 6: assigned fixed [mem 0xfcb00000-0xfcb7ffff pref]
    ahci 0000:02:00.1: BAR 5: assigned fixed [mem 0xfcb80000-0xfcb9ffff]
    xhci_hcd 0000:02:00.0: BAR 0: assigned fixed [mem 0xfcba0000-0xfcba7fff 64bit]
    xhci_hcd 0000:05:00.0: BAR 0: assigned fixed [mem 0xfca00000-0xfca07fff 64bit]
    ahci 0000:0d:00.2: BAR 5: assigned fixed [mem 0xfce08000-0xfce08fff]

The expected result is that they all move closer to the start of the
PCI address space.
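
One way to confirm that from userspace (a sketch; the `bars` helper
is ours, nothing from the kernel, and the rescan line must be run as
root):

```shell
# Snapshot the fixed-BAR drivers' lines from /proc/iomem around a rescan;
# any BAR that moved shows up in the diff as a changed address range.
bars() { grep -E ' : (igb|ahci|xhci_hcd)$' /proc/iomem || true; }

bars > /tmp/bars.before
# sh -c 'echo 1 > /sys/bus/pci/rescan'    # run this in between, as root
bars > /tmp/bars.after
diff /tmp/bars.before /tmp/bars.after && echo "no BARs moved"
```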

> > I think pci_release_resource() should not be in
> > amdgpu_device_unmap_mmio() -- the patched kernel will do that itself
> > for BARs for which amdgpu_device_bar_fixed() returns false. Even
> > more -- the kernel will ensure that all BARs which were working
> > before are reassigned properly, so it needs them to be assigned
> > before the procedure.
> > The same goes for pci_assign_unassigned_bus_resources() in
> > amdgpu_device_remap_mmio(): this callback is invoked from
> > pci_rescan_bus() after pci_assign_unassigned_root_bus_resources().
> 
> This seems to me in contrast to your documentation (see 
> https://github.com/YADRO-KNS/linux/commit/5bc12ba7c74f1c19c11db29b4807bd32acfac2c2 
> step 1), although step 2 also seems to contradict step 1 with regard
> to BARs release - so now I am a bit confused. Also, looking at 
> nvme_dev_unmap - it calls pci_release_mem_regions. The symmetrical 
> acquisition happens in nvme_dev_map.

Ah, there is a difference between pci_release_region() and
pci_release_resource(), so subtle that I had to refresh my memory. You
are right, this has to be explained in the documentation!

$ sudo cat /proc/iomem
...
f0000000-fcffffff : PCI Bus 0000:00     -- root bus resource
...
  fcf00000-fcffffff : PCI Bus 0000:01   -- bridge window
    fcf00000-fcf03fff : 0000:01:00.0    -- pci resource (BAR)
      fcf00000-fcf03fff : nvme          -- pci region (reserved by
                                           a driver, has its name).

So nvme_dev_unmap() uses pci_release_region() to record that the BAR
is not used by the driver anymore -- this actually should be called in
every rescan_prepare().
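
The nesting in that /proc/iomem excerpt can even be checked
mechanically: each level of containment adds two spaces of indent. A
tiny sketch over those sample lines (not a general /proc/iomem
parser):

```shell
# Depth 0 = root bus resource, 1 = bridge window, 2 = BAR (pci resource),
# 3 = the driver's named region; the depth is just indent / 2.
cat <<'EOF' | awk '{ match($0, /[^ ]/); print (RSTART - 1) / 2, $1, $NF }'
f0000000-fcffffff : PCI Bus 0000:00
  fcf00000-fcffffff : PCI Bus 0000:01
    fcf00000-fcf03fff : 0000:01:00.0
      fcf00000-fcf03fff : nvme
EOF
```

The last output line is "3 fcf00000-fcf03fff nvme" -- the innermost,
driver-owned level, which is what pci_release_region() gives back.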

But pci_release_resource() tells the PCI subsystem that the BAR is
"released" from the device and has to be assigned to some address
before it can be used again, and it makes pci_resource_start(pdev,
released_barno) invalid.

Why the quotes: pci_release_resource() doesn't turn off the BAR or
write the registers -- that happens later.

I thought at first that pci_release_resource() was not safe in a
rescan_prepare(), but then double-checked and found it's fine, just
not needed, as the kernel will do it anyway. And
pci_bus_check_bars_assigned(), which compares the bitmasks of
successfully assigned BARs, is called *before* the hook.
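
For intuition only, the bitmask comparison can be pictured with plain
shell arithmetic (the masks below are made up; the real
pci_bus_check_bars_assigned() walks the bus and its devices):

```shell
# bit n set = BAR n successfully assigned; a bit that was set before the
# reassignment but is clear after it means a working BAR was lost.
before=0x3f    # BARs 0-5 were assigned
after=0x3d     # BAR 1 failed to be reassigned
lost=$(( before & ~after ))
printf 'lost BARs mask: 0x%x\n' "$lost"    # -> lost BARs mask: 0x2
```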

> > > When testing with 2 graphics cards and triggering a rescan, a
> > > hard hang of the system happens during rescan_prepare of the
> > > second card when stopping the HW (see log2.log) - I don't
> > > understand why this would happen, as each of them passes fine
> > > when tested standalone, and there should be no interdependence
> > > between them as far as I know. Do you have any idea?
> > 
> > What happens with two GPUs is unclear to me as well; nothing looks
> > suspicious.
> > 
> > Serge
> > 



