On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote: > On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote: > > On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote: > > > > > The madness is only that device B's mmu notifier might need to wait > > > for fence_B so that the dma operation finishes. Which in turn has to > > > wait for device A to finish first. > > > > So, it sound, fundamentally you've got this graph of operations across > > an unknown set of drivers and the kernel cannot insert itself in > > dma_fence hand offs to re-validate any of the buffers involved? > > Buffers which by definition cannot be touched by the hardware yet. > > > > That really is a pretty horrible place to end up.. > > > > Pinning really is right answer for this kind of work flow. I think > > converting pinning to notifers should not be done unless notifier > > invalidation is relatively bounded. > > > > I know people like notifiers because they give a bit nicer performance > > in some happy cases, but this cripples all the bad cases.. > > > > If pinning doesn't work for some reason maybe we should address that? > > Note that the dma fence is only true for user ptr buffer which predate > any HMM work and thus were using mmu notifier already. You need the > mmu notifier there because of fork and other corner cases. I wonder if we should try to fix the fork case more directly - RDMA has this same problem and added MADV_DONTFORK a long time ago as a hacky way to deal with it. Some crazy page pin that resolved COW in a way that always kept the physical memory with the mm that initiated the pin? (isn't this broken for O_DIRECT as well anyhow?) How does mmu_notifiers help the fork case anyhow? Block fork from progressing? > I probably need to warn AMD folks again that using HMM means that you > must be able to update the GPU page table asynchronously without > fence wait. It is kind of unrelated to HMM, it just shouldn't be using mmu notifiers to replace page pinning.. > The issue for AMD is that they already update their GPU page table > using DMA engine. I believe this is still doable if they use a > kernel only DMA engine context, where only kernel can queue up jobs > so that you do not need to wait for unrelated things and you can > prioritize GPU page table update which should translate in fast GPU > page table update without DMA fence. Make sense I'm not sure I saw this in the AMD hmm stuff - it would be good if someone would look at that. Every time I do it looks like the locking is wrong. Jason