Adding Michal from the compute userspace team, who can share references
to the code.

Quoting Christian König (2024-11-19 12:00:44)
> On 19.11.24 at 00:37, Matthew Brost wrote:
> > From: Tejas Upadhyay <tejas.upadhyay@xxxxxxxxx>
> >
> > In order to avoid having userspace to use MI_MEM_FENCE,
> > we are adding a mechanism for userspace to generate a
> > PCI memory barrier with low overhead (avoiding IOCTL call
> > as well as writing to VRAM will adds some overhead).
> >
> > This is implemented by memory-mapping a page as uncached
> > that is backed by MMIO on the dGPU and thus allowing userspace
> > to do memory write to the page without invoking an IOCTL.
> > We are selecting the MMIO so that it is not accessible from
> > the PCI bus so that the MMIO writes themselves are ignored,
> > but the PCI memory barrier will still take action as the MMIO
> > filtering will happen after the memory barrier effect.
> >
> > When we detect special defined offset in mmap(), We are mapping
> > 4K page which contains the last of page of doorbell MMIO range
> > to userspace for same purpose.
>
> Well that is quite a hack, but don't you still need a memory barrier
> instruction? E.g. m_fence?

I guess you are referring to the userspace usage directions?

Yeah, userspace definitely has to make sure that the write has actually
propagated to the PCI bus before it can assume that the serialization
has happened on the GPU. I think the userspace folks should be able to
explain how exactly they orchestrate that.

Michal, can you or somebody else share the respective lines of code in
the userspace driver?

At this time, userspace only enables this on x86, but it could also
support other, more exotic platforms via libpciaccess.

> And why don't you expose the real doorbell instead of the last (unused?)
> page of the MMIO region?

Doorbells are a complete red herring here. The chosen page just happens
to be a full 4K MMIO page where any writes coming over the PCI bus get
dropped (and reads return zero) by the GPU.

Such a dummy (from the CPU's point of view) 4K MMIO page allows doing a
CPU write that generates a PCI bus transaction, where the transaction
itself is essentially a NOP. But as the transaction falls into the MMIO
address range, it will trigger a serialization of the incoming traffic
on the GPU side before being ignored.

Regards, Joonas
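For illustration only, the userspace side described above could look
roughly like the sketch below. This is not the actual userspace driver
code (Michal or others would need to point to that). The helper names,
the `fd` and the `barrier_offset` parameters are all hypothetical: the
real value of the "special defined offset" is not stated in this thread.
Whether a CPU fence is needed after the store, and which one, is exactly
the open question above, so the `sfence` here is an assumption and not a
statement of what the driver does.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

/* "4K page" as described in the patch text. */
#define BARRIER_PAGE_SIZE 4096u

/*
 * Map the dummy MMIO page through an already opened device fd.
 * HYPOTHETICAL: barrier_offset stands in for the special mmap() offset
 * that the kernel recognizes; its real value is not given here.
 * Returns NULL on failure.
 */
static inline volatile uint32_t *barrier_page_map(int fd, off_t barrier_offset)
{
    void *p = mmap(NULL, BARRIER_PAGE_SIZE, PROT_WRITE, MAP_SHARED,
                   fd, barrier_offset);

    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}

/*
 * Issue one write to the page. The written value does not matter, since
 * the GPU drops the write; only the resulting bus transaction is wanted.
 */
static inline void barrier_page_emit(volatile uint32_t *page)
{
    *page = 0;

    /* ASSUMPTION: push the store out before the caller proceeds. */
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("sfence" ::: "memory");
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

/* Release the mapping; accepts NULL so error paths stay simple. */
static inline void barrier_page_unmap(volatile uint32_t *page)
{
    if (page)
        munmap((void *)page, BARRIER_PAGE_SIZE);
}
```

A caller would map the page once at device initialization, call
barrier_page_emit() wherever it would otherwise have needed
MI_MEM_FENCE, and unmap on teardown, so no IOCTL is involved on the hot
path.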