Adding the PCI mailing list per Bjorn's advice (sorry, skipped the address
last time).
On 2021-03-02 12:22 p.m., Christian König wrote:
Hi Andrey,
Can you elaborate more on why such a lock is needed at the global level,
rather than a per-driver (per-device, more specifically) local lock
acquired in rescan_prepare (before the BAR shuffling starts) and released
in rescan_done (after the BAR shuffling is done)?
Because this adds inter-device dependencies.
E.g. when we have devices A, B and C and each takes its own lock, we
end up with 3 locks.
If those devices have interdependencies of their own, for example
because of DMA-buf, we will end up in quite a locking mess.
Apart from that, if we don't always keep the order A, B, C but instead
end up with A, C, B, lockdep will certainly start to complain.
Regards,
Christian.
But as far as I understand, those locks are agnostic to each other - each
lock is taken only by its respective device and not by others. So I
don't see a deadlock danger here, but I do suspect lockdep might throw a
false warning...
Andrey
Am 02.03.21 um 18:16 schrieb Andrey Grodzovsky:
Per Bjorn's advice, adding the PCI mailing list.
Christian, please reply on top of this mail and not on top of the
earlier, original one.
Andrey
On 2021-03-02 11:49 a.m., Bjorn Helgaas wrote:
Please add linux-pci@xxxxxxxxxxxxxxx to the cc list. There's a lot of
useful context in this thread and it needs to be archived.
On Tue, Mar 02, 2021 at 11:37:25AM -0500, Andrey Grodzovsky wrote:
Adding Sergei and Bjorn since you are suggesting some PCI related
changes.
On 2021-03-02 6:28 a.m., Christian König wrote:
Yeah, I've been thinking about that as well for quite a while now.
I'm not 100% sure what Sergei does in his patch set will work for
us. He basically tells the driver to stop any processing, then shuffles
around the BARs, and then tells it to start again.
Because of the necessity to reprogram bridges there isn't really
another way to do this, but we would need something like taking a lock,
returning, and dropping the lock later on. If you do this for multiple
drivers at the same time, this results in a locking nightmare.
What we need instead is an upper-level PCI-layer rw-lock which we can
take on the read side in the driver, similar to our internal reset
lock.
Can you elaborate more on why such a lock is needed at the global level,
rather than a per-driver (per-device, more specifically) local lock
acquired in rescan_prepare (before the BAR shuffling starts) and released
in rescan_done (after the BAR shuffling is done)?
Andrey
Wiring up changing the mapping is the smaller problem in that picture.
Christian.
Am 01.03.21 um 22:14 schrieb Andrey Grodzovsky:
Hi, I started looking into an actual implementation for this. As a
reference point I am using Sergei's NVMe implementation (see
https://github.com/YADRO-KNS/linux/commit/58c375df086502538706ee23e8ef7c8bb5a4178f)
and our own amdgpu_device_resize_fb_bar. Our use case being more
complicated than NVMe's raises some questions and thoughts:

Before the PCI core moves the BARs (VRAM and registers) we have to
stop the HW and SW and then unmap all MMIO-mapped entities in the
driver. This includes registers of all types and VRAM-placed BOs with
CPU visibility. My concern is how to prevent accesses to all MMIO
ranges between iounmap in rescan_prepare and ioremap in rescan_done.
For register accesses we are back to the same issue we had with device
unplug and GPU reset that we are discussing with Denis now. So here, at
least as a first solution just to confirm the feature works, I can
introduce low-level RW locking (in the register accessors) to block
access until all register-access driver pointers are remapped after the
BAR move.

What concerns me more is how to handle the BOs. In rescan_prepare:
collect all CPU-accessible BOs placed in VRAM (I can use an
amdgpu-specific list like the one I currently use for device unplug),
lock them to prevent any further use, unpin them (and unmap). In
rescan_done: remap them back and unlock them for further use. Also,
there is a difference between kernel BOs, where iounmap/ioremap should
indeed be enough, and user-space mapped BOs, where it seems that, as in
hot unplug, we need in rescan_prepare to only unmap them, in between
drop new page faults with VM_FAULT_RETRY, and after rescan_done allow
page faults to go through to refill all the user page tables with the
updated MMIO addresses.
Let me know your thoughts.
Andrey
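For the user-space mapped BOs, the fault-path decision described above could be simulated like this. All names here are hypothetical stand-ins for the real amdgpu fault handler, and the real code would need proper synchronization with the rescan path rather than a plain flag.

```c
/* Simulation of the page-fault policy for user-mapped BOs during a BAR
 * move: while the move is in flight the handler returns a retry result
 * instead of filling page tables (the kernel analog is returning
 * VM_FAULT_RETRY); once rescan_done has run, faults go through and
 * refill the user page tables at the new MMIO address. */
#include <stdbool.h>

enum fault_result { FAULT_OK, FAULT_RETRY };

static bool bars_moving; /* set in rescan_prepare, cleared in rescan_done */

enum fault_result bo_fault(void)
{
    if (bars_moving)
        return FAULT_RETRY;   /* drop the fault; caller faults again later */
    /* ...insert PFNs for the (possibly relocated) MMIO range here... */
    return FAULT_OK;
}

void rescan_prepare(void) { bars_moving = true;  /* + zap user mappings */ }
void rescan_done(void)    { bars_moving = false; /* faults refill PTEs  */ }
```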
On 2021-02-26 4:03 p.m., Andrey Grodzovsky wrote:
On 2021-02-26 1:44 a.m., Sergei Miroshnichenko wrote:
On Thu, 2021-02-25 at 13:28 -0500, Andrey Grodzovsky wrote:
On 2021-02-25 2:00 a.m., Sergei Miroshnichenko wrote:
On Wed, 2021-02-24 at 17:51 -0500, Andrey Grodzovsky wrote:
On 2021-02-24 1:23 p.m., Sergei Miroshnichenko wrote:
...
Are you saying that even without hot-plugging, while both the NVMe and
AMD cards are present right from boot, you still get BARs moving and
MMIO ranges reassigned for the NVMe BARs, just because the amdgpu driver
will start a resize of the AMD card's BARs and this will trigger the
NVMe's BARs to move to allow the AMD card's BARs to cover the full range
of video RAM?
Yes. Unconditionally, because it is unknown beforehand whether NVMe's
BAR movement will help. In this particular case the BAR movement is not
needed, but it is done anyway.
BARs are not moved one by one; the kernel releases all the releasable
ones and then recalculates a new BAR layout to fit them all. The
kernel's algorithm is different from the BIOS's, so the NVMe has ended
up at a new place.
This is triggered by the following:
- at boot, if the BIOS hasn't assigned every BAR;
- during pci_resize_resource();
- during pci_rescan_bus() -- after a pciehp event or a manual rescan
via sysfs.
By a manual rescan via sysfs you mean something like this: 'echo 1 >
/sys/bus/pci/drivers/amdgpu/0000\:0c\:00.0/remove && echo 1 >
/sys/bus/pci/rescan'? I am looking into how to most reliably trigger
the PCI code to call my callbacks even without having an external PCI
cage for the GPU (it will take me some time to get one).
Yeah, this is our way to go when a device can't be physically removed
or unpowered remotely, with just a slightly shorter path:
sudo sh -c 'echo 1 > /sys/bus/pci/devices/0000\:0c\:00.0/remove'
sudo sh -c 'echo 1 > /sys/bus/pci/rescan'
Or just the second command (rescan) alone is enough: a BAR movement
attempt will be triggered even if there were no changes in the PCI
topology.
Serge
Ok, I'm able to trigger the stub callback calls after doing a rescan
from sysfs. I also see the BARs moving:
[ 1860.968036] amdgpu 0000:09:00.0: BAR 0: assigned [mem
0xc0000000-0xcfffffff 64bit pref]
[ 1860.968050] amdgpu 0000:09:00.0: BAR 0 updated: 0xe0000000 ->
0xc0000000
[ 1860.968054] amdgpu 0000:09:00.0: BAR 2: assigned [mem
0xd0000000-0xd01fffff 64bit pref]
[ 1860.968068] amdgpu 0000:09:00.0: BAR 2 updated: 0xf0000000 ->
0xd0000000
[ 1860.968071] amdgpu 0000:09:00.0: BAR 5: assigned [mem
0xf0100000-0xf013ffff]
[ 1860.968078] amdgpu 0000:09:00.0: BAR 5 updated: 0xfce00000 ->
0xf0100000
As expected, after the move the device becomes unresponsive, since
unmap/ioremap wasn't done.
Andrey