Re: Remapping BOs and registers post BAR movements [was - Re: Question about supporting AMD eGPU hot plug case]

Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> · Fri, 5 Mar 2021 18:19:33 -0500

On 2021-03-05 3:36 a.m., Christian König wrote:
Hi Andrey,

Am 03.03.21 um 17:05 schrieb Andrey Grodzovsky:
Ping

Andrey

On 2021-03-02 12:48 p.m., Andrey Grodzovsky wrote:
Adding PCI mailing list per Bjorn's advise (sorry, skipped the 
address last time).

On 2021-03-02 12:22 p.m., Christian König wrote:
Hi Andrey,

Can you elaborate more why you need such a lock on a global level 
and not
enough per driver (per device more specifically) local lock 
acquired in
rescan_preapare (before BARs shuffling starts) and released in 
rescan_done
(after BARs shuffling done) ? 

because this adds inter device dependencies.

E.g. when we have device A, B and C and each is taking it's own lock 
we end up with 3 locks.

If those device have inter dependencies on their own, for example 
because of DMA-buf we will end up in quite a locking mess.

Apart from that if we don't always keep the order A, B, C but 
instead end up with A, C, B lockdep will certainly start to complain.

Regards,
Christian.

But as far as I understand those locks are agnostic to each other - 
each lock is taken only by it's respective device and not by others.

Well they should be agnostic to each other, but are they really? I mean 
we could have inter device lock dependencies because of DMA-buf or good 
knows what.

So I don't see deadlock danger here but, I do suspect lockdep might 
throw a false warning...

Yeah and disabling or ignoring the lockdep warning could cause problems 
because we don't see issues caused by inter device lock dependencies any 
more.

I just have the feeling that this would cause a lot of problems.

Christian.

Sergei, can you please address Christian's suggestion bellow - quote
"
What we need instead is a upper level PCI layer rw-lock which we
can take on the read side in the driver
"

Andrey

Andrey

Am 02.03.21 um 18:16 schrieb Andrey Grodzovsky:
Per Bjorn's advise adding PCI mailing list.

Christian, please reply on top of this mail and not on top the 
earlier, original one.

Andrey

On 2021-03-02 11:49 a.m., Bjorn Helgaas wrote:
Please add linux-pci@xxxxxxxxxxxxxxx to the cc list.  There's a 
lot of
useful context in this thread and it needs to be archived.
On Tue, Mar 02, 2021 at 11:37:25AM -0500, Andrey Grodzovsky wrote:
Adding Sergei and Bjorn since you are suggesting some PCI related 
changes.

On 2021-03-02 6:28 a.m., Christian König wrote:
Yeah, I'm thinking about that as well for quite a while now.

I'm not 100% sure what Sergei does in his patch set will work 
for us. He
basically tells the driver to stop any processing then shuffle 
around
the BARs and then tells him to start again.

Because of the necessity to reprogram bridges there isn't much 
other way
how to do this, but we would need something like taking a lock,
returning and dropping the lock later on. If you do this for 
multiple
drivers at the same time this results in a locking nightmare.

What we need instead is a upper level PCI layer rw-lock which we 
can
take on the read side in the driver, similar to our internal 
reset lock.

Can you elaborate more why you need such a lock on a global level 
and not
enough per driver (per device more specifically) local lock 
acquired in
rescan_preapare (before BARs shuffling starts) and released in 
rescan_done
(after BARs shuffling done) ?

Andrey

Wiring up changing the mapping is the smaller problem in that 
picture.

Christian.

Am 01.03.21 um 22:14 schrieb Andrey Grodzovsky:
Hi, I started looking into actual implementation for this, as
reference point I am using Sege's NVME implementations (see 
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FYADRO-KNS%2Flinux%2Fcommit%2F58c375df086502538706ee23e8ef7c8bb5a4178f&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7C44bc68a68feb4b71836c08d8dd9b306a%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637503005898302093%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=nuTptAjGewG2T7tXOQr2h%2Bdr42lyg3m1%2Bfv0mJzd8tw%3D&amp;reserved=0) 

and our own amdgpu_device_resize_fb_bar. Our use case being more
complicated then NVME's arises some questions or thoughts:

Before the PCI core will move the BARs (VRAM and registers) we 
have
to stop the HW and SW and then unmap all MMIO mapped entitles 
in the
driver. This includes, registers of all types and VRAM placed BOs
with CPU visibility. My concern is how to prevent accesses to all
MMIO ranges between iounmap in rescan_preapare and ioremap in
rescan_done. For register accesses we are again back to same issue
we had with device unplug and GPU reset we are discussing with 
Denis
now. So here at least as first solution just to confirm the 
feature
works I can introduce low level (in register accessors) RW locking
to avoid access until all register access driver pointers are
remapped post BARs move. What more concerns me is how to - In
rescan_preapare: collect all CPU accessible BOs placed in VRAM (i
can use amdgpu specific list like I currently use for device
unplug), lock them to prevent any further use of them, unpin them
(and unmap). In rescan_preapare: remap them back and unlock 
them for
further use. Also, there is a difference between kernel BOs where
indeed iounmap, ioremap should be enough and user space mapped BOs
where it seems like in hot unplug, we need in rescan_preapare to
only unamp them, in between drop new page faults with 
VM_FAULT_RETRY
and post rescan_done allow page faults to go through to refill all
the user page tables with updated MMIO addresses.

Let me know your thoughts.

Andrey

On 2021-02-26 4:03 p.m., Andrey Grodzovsky wrote:

On 2021-02-26 1:44 a.m., Sergei Miroshnichenko wrote:
On Thu, 2021-02-25 at 13:28 -0500, Andrey Grodzovsky wrote:
On 2021-02-25 2:00 a.m., Sergei Miroshnichenko wrote:
On Wed, 2021-02-24 at 17:51 -0500, Andrey Grodzovsky wrote:
On 2021-02-24 1:23 p.m., Sergei Miroshnichenko wrote:
...
Are you saying that even without hot-plugging, while both 
nvme
and
AMD
card are present
right from boot, you still get BARs moving and MMIO ranges
reassigned
for NVME BARs
just because amdgpu driver will start resize of AMD card 
BARs and
this
will trigger NVMEs BARs move to
allow AMD card BARs to cover full range of VIDEO RAM ?
Yes. Unconditionally, because it is unknown beforehand if 
NVMe's
BAR
movement will help. In this particular case BAR movement is 
not
needed,
but is done anyway.

BARs are not moved one by one, but the kernel releases all the
releasable ones, and then recalculates a new BAR layout to 
fit them
all. Kernel's algorithm is different from BIOS's, so NVME has
appeared
at a new place.

This is triggered by following:
- at boot, if BIOS had assigned not every BAR;
- during pci_resize_resource();
- during pci_rescan_bus() -- after a pciehp event or a 
manual via
sysfs.
By manual via sysfs you mean something like this - 'echo 1 >
/sys/bus/pci/drivers/amdgpu/0000\:0c\:00.0/remove && echo 1 >
/sys/bus/pci/rescan ' ? I am looking into how most reliably 
trigger
PCI
code to call my callbacks even without having external PCI 
cage for
GPU
(will take me some time to get it).
Yeah, this is our way to go when a device can't be physically 
removed
or unpowered remotely. With just a bit shorter path:

    sudo sh -c 'echo 1 > 
/sys/bus/pci/devices/0000\:0c\:00.0/remove'
    sudo sh -c 'echo 1 > /sys/bus/pci/rescan'

Or, just a second command (rescan) is enough: a BAR movement 
attempt
will be triggered even if there were no changes in PCI topology.

Serge

Ok, able to trigger stub callbacks call after doing rescan from
sysfs. Also I see BARs movement

[ 1860.968036] amdgpu 0000:09:00.0: BAR 0: assigned [mem
0xc0000000-0xcfffffff 64bit pref]
[ 1860.968050] amdgpu 0000:09:00.0: BAR 0 updated: 0xe0000000 ->
0xc0000000
[ 1860.968054] amdgpu 0000:09:00.0: BAR 2: assigned [mem
0xd0000000-0xd01fffff 64bit pref]
[ 1860.968068] amdgpu 0000:09:00.0: BAR 2 updated: 0xf0000000 ->
0xd0000000
[ 1860.968071] amdgpu 0000:09:00.0: BAR 5: assigned [mem
0xf0100000-0xf013ffff]
[ 1860.968078] amdgpu 0000:09:00.0: BAR 5 updated: 0xfce00000 ->
0xf0100000

As expected, after the move the device becomes unresponsive as
unmap/ioremap wasn't done.

Andrey