Re: Question about supporting AMD eGPU hot plug case

Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> · Thu, 11 Mar 2021 17:40:55 -0500

On 2021-03-05 6:13 p.m., Andrey Grodzovsky wrote:

On 2021-03-05 2:12 p.m., Sergei Miroshnichenko wrote:
On Fri, 2021-03-05 at 12:13 -0500, Andrey Grodzovsky wrote:

On 2021-03-05 11:08 a.m., Sergei Miroshnichenko wrote:
On Thu, 2021-03-04 at 14:49 -0500, Andrey Grodzovsky wrote:
+ linux-pci

On 2021-02-26 1:44 a.m., Sergei Miroshnichenko wrote:
On Thu, 2021-02-25 at 13:28 -0500, Andrey Grodzovsky wrote:
On 2021-02-25 2:00 a.m., Sergei Miroshnichenko wrote:
On Wed, 2021-02-24 at 17:51 -0500, Andrey Grodzovsky wrote:
On 2021-02-24 1:23 p.m., Sergei Miroshnichenko wrote:
...
Are you saying that even without hot-plugging, while both
nvme
and
AMD
card are present
right from boot, you still get BARs moving and MMIO
ranges
reassigned
for NVME BARs
just because amdgpu driver will start resize of AMD card
BARs
and
this
will trigger NVMEs BARs move to
allow AMD card BARs to cover full range of VIDEO RAM ?
Yes. Unconditionally, because it is unknown beforehand if
NVMe's
BAR
movement will help. In this particular case BAR movement is
not
needed,
but is done anyway.

BARs are not moved one by one, but the kernel releases all
the
releasable ones, and then recalculates a new BAR layout to
fit
them
all. Kernel's algorithm is different from BIOS's, so NVME
has
appeared
at a new place.

This is triggered by following:
- at boot, if BIOS had assigned not every BAR;
- during pci_resize_resource();
- during pci_rescan_bus() -- after a pciehp event or a
manual
via
sysfs.

By manual via sysfs you mean something like this - 'echo 1 >
/sys/bus/pci/drivers/amdgpu/0000\:0c\:00.0/remove && echo 1 >
/sys/bus/pci/rescan ' ? I am looking into how most reliably
trigger
PCI
code to call my callbacks even without having external PCI
cage
for
GPU
(will take me some time to get it).

Yeah, this is our way to go when a device can't be physically
removed
or unpowered remotely. With just a bit shorter path:

     sudo sh -c 'echo 1 >
/sys/bus/pci/devices/0000\:0c\:00.0/remove'
     sudo sh -c 'echo 1 > /sys/bus/pci/rescan'

Or, just a second command (rescan) is enough: a BAR movement
attempt
will be triggered even if there were no changes in PCI
topology.

Serge

Hi Segrei

Here is a link to initial implementation on top of your tree
(movable_bars_v9.1) -
https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Dyadro%2Fpcie_hotplug%2Fmovable_bars_v9.1%26id%3D05d6abceed650181bb7fe0a49884a26e378b908e&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7C6658f0cc7c344791ce0f08d8e00a96bf%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637505683386334114%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=qEC3qIAM8h1vU4gGEgT6sThXsaCuatTI2UjM9Bb8KGI%3D&amp;reserved=0 

I am able to pass one re-scan cycle and can use the card
afterwards
(see
log1.log).
But, according to your prints only BAR5 which is registers BAR
was
updated (amdgpu 0000:0b:00.0: BAR 5 updated: 0xfcc00000 ->
0xfc100000)
while I am interested to test BAR0 (Graphic RAM) move since this
is
where most of the complexity is. Is there a way to hack your code
to
force this ?

Hi Andrey,

Regarding the amdgpu's BAR0 remaining on its place: it seems this
is
because of fixed BARs starting from fc600000. The kernel tends to
group
the BARs close to each other, making a bridge window as compact as
possible. So the BAR0 had occupied the closest "comfortable" slots
0xe0000000-0xefffffff, with the resulting bridge window of bus 00
covering all the BARs:

      pci_bus 0000:00: resource 10 [mem 0xe0000000-0xfec2ffff
window]

I'll let you know if I get an idea how to rearrange that manually.

Two GPUs can actually swap their places.

What do you mean ?

I was thinking: when the scenario of a PCI rescan with two GPUs (as was
described below) will start working, BAR0 of GPU0 can take place of
BAR0 of GPU1 after the first rescan.

What also can make a BAR movable -- is rmmod'ing its driver. It
could
be some hack from within a tmux, like:

    rmmod igb; \
    rmmod xhci_hcd; \
    rmmod ahci; \
    echo 1 > /sys/bus/pci/rescan; \
    modprobe igb; \
    modprobe xhci_hcd; \
    modprobe ahci

But should I also rmmod amdgpu ? Or modprobing back the other
drivers
should cause (hopefully) BAR0 move in AMD graphic card ?

You have already made the amdgpu movable, so no need to rmmod it --
just those with fixed BARs:

     xhci_hcd 0000:0c:00.3: BAR 0: assigned fixed [mem 0xfc600000-
0xfc6fffff 64bit]
     igb 0000:07:00.0: BAR 0: assigned fixed [mem 0xfc900000-0xfc91ffff]
     igb 0000:07:00.0: BAR 3: assigned fixed [mem 0xfc920000-0xfc923fff]
     ahci 0000:02:00.1: BAR 6: assigned fixed [mem 0xfcb00000-0xfcb7ffff
pref]
     ahci 0000:02:00.1: BAR 5: assigned fixed [mem 0xfcb80000-
0xfcb9ffff]
     xhci_hcd 0000:02:00.0: BAR 0: assigned fixed [mem 0xfcba0000-
0xfcba7fff 64bit]
     xhci_hcd 0000:05:00.0: BAR 0: assigned fixed [mem 0xfca00000-
0xfca07fff 64bit]
     ahci 0000:0d:00.2: BAR 5: assigned fixed [mem 0xfce08000-
0xfce08fff]

The expected result is they all move closer to the start of PCI address
space.

Ok, I updated as you described. Also I removed PCI conf command to stop
address decoding and restart later as I noticed PCI core does it itself
when needed.
I tested now also with graphic desktop enabled while submitting
3d draw commands and seems like under this scenario everything still
works. Again, this all needs to be tested with VRAM BAR move as then
I believe I will see more issues like handling of MMIO mapped VRAM 
objects (like GART table). In case you do have an AMD card you could 
also maybe give it a try. In the meanwhile I will add support to 
ioremapping of those VRAM objects.

Andrey

Just an update, added support for unmaping/remapping of all VRAM
objects, both user space mmaped and kernel ioremaped. Seems to work
ok but again, without forcing VRAM BAR to move I can't be sure.
Alex, Chsristian - take a look when you have some time to give me some
initial feedback on the amdgpu side.

The code is at 
https://cgit.freedesktop.org/~agrodzov/linux/log/?h=yadro%2Fpcie_hotplug%2Fmovable_bars_v9.1

Andrey

I think pci_release_resource() should not be in
amdgpu_device_unmap_mmio() -- the patched kernel will do that
itself
for BARs the amdgpu_device_bar_fixed() returns false. Even more --
the
kernel will ensure that all BARs which were working before, are
reassigned properly, so it needs them to be assigned before the
procedure.
The same for pci_assign_unassigned_bus_resources() in
amdgpu_device_remap_mmio(): this callback is invoked from
pci_rescan_bus() after pci_assign_unassigned_root_bus_resources().

This seems to me in contrast to your documentation (see
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FYADRO-KNS%2Flinux%2Fcommit%2F5bc12ba7c74f1c19c11db29b4807bd32acfac2c2&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7C6658f0cc7c344791ce0f08d8e00a96bf%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637505683386334114%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=kO1OlRL8iHMTcijuV0jDpODCtXpCCTpJv6YIn%2FuypNQ%3D&amp;reserved=0 

step 1) although step 2 seems also to contradict step 1 with regard
to
BARs release - so now I am a bit confused. Also looking at
nvme_dev_unmap - it calls pci_release_mem_regions. Symmetrical
acquisition happens in nvme_dev_unmap.

Ah, there is a difference between pci_release_region() and
pci_release_resource(), so subtle that I had to refresh my memory. You
are right, this has to be explained in the documentation!

$ sudo cat /proc/iomem
...
f0000000-fcffffff : PCI Bus 0000:00     -- root bus resource
...
   fcf00000-fcffffff : PCI Bus 0000:01   -- bridge window
     fcf00000-fcf03fff : 0000:01:00.0    -- pci resource (BAR)
       fcf00000-fcf03fff : nvme          -- pci region (reserved by
                                            a driver, has its name).

So the nvme_dev_unmap() reflects with pci_release_region() that the BAR
is not used by the driver anymore -- this actually should be called in
every rescan_prepare().

But the pci_release_resource() tells to the PCI subsystem that the BAR
is "released" from the device and has to be assigned to some address
before using again, and makes the pci_resource_start(pdev,
relased_barno) invalid.

Why the quotes: pci_release_resource() doesn't turn off the BAR,
doesn't write the registers -- this happens later.

I thouht at first that pci_release_resource() is not safe in a
rescan_prepare(), but then double-checked, and found it's fine, just
not needed, as the kernel will do it anyway. And the
pci_bus_check_bars_assigned() to compare the bitmasks of successfully
assigned BARs is called *before* the hook.

When testing with 2 graphic cards and triggering rescan, hard
hang of
the system happens during rescan_prepare of the second card  when
stopping the HW (see log2.log) - I don't understand why this
would
happen as each of them passes fine when they are standalone
tested
and
there should be no interdependence between them as far as i know.
Do you have any idea ?

What happens with two GPUs is unclear for me as well, nothing looks
suspicious.

Serge