On Mon, May 06, 2024 at 07:22:07PM -0700, Tomasz Jeznach wrote:
> On Sun, May 5, 2024 at 8:46 AM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> >
> > On Fri, May 03, 2024 at 12:44:09PM -0700, Tomasz Jeznach wrote:
> > > > For detach I think yes:
> > > >
> > > >        Inv CPU                        Detach CPU
> > > >
> > > >  write io_pte                    Update device descriptor
> > > >  rcu_read_lock
> > > >    list_for_each
> > > >      <make invalidation command>  <make description inv cmd>
> > > >      dma_wmb()                    dma_wmb()
> > > >      <doorbell>                   <cmd doorbell>
> > > >  rcu_read_unlock
> > > >                                   list_del_rcu()
> > > >                                   <wipe ASID>
> > > >
> > > > In this case I think we never miss an invalidation; the list_del is
> > > > always after the HW has been fully fenced, so I don't think we can
> > > > have any issue. Maybe a spurious invalidation if the ASID gets
> > > > re-used, but who cares.
> > > >
> > > > Attach is different..
> > > >
> > > >        Inv CPU                        Attach CPU
> > > >
> > > >  write io_pte
> > > >  rcu_read_lock
> > > >    list_for_each // empty
> > > >                                   list_add_rcu()
> > > >                                   Update device descriptor
> > > >                                   <make description inv cmd>
> > > >                                   dma_wmb()
> > > >                                   <cmd doorbell>
> > > >  rcu_read_unlock
> > > >
> > > > As the above shows, we can "miss" an invalidation. The issue is
> > > > narrow: the io_pte could still be sitting in write buffers on the
> > > > "Inv CPU" and not yet globally visible. The "Attach CPU" could get
> > > > the device descriptor installed in the IOMMU, and the IOMMU could
> > > > walk an io_pte that is in the old state. Effectively this is
> > > > because there is no release/acquire barrier passing the io_pte
> > > > store from the Inv CPU to the Attach CPU to the IOMMU.
> > > >
> > > > It seems like it should be solvable somehow:
> > > >  1) Inv CPU releases all the io_ptes
> > > >  2) Attach CPU acquires the io_ptes before updating the DDT
> > > >  3) Inv CPU acquires the RCU list in such a way that either the
> > > >     Attach CPU will acquire the io_pte or the Inv CPU will acquire
> > > >     the RCU list.
> > > >  4) Either the invalidation works or we release the new io_pte to
> > > >     the IOMMU and don't need it.
> > > >
> > > > But #3 is a really weird statement. smp_mb() on both sides may do
> > > > the job??
> > > >
> > >
> > > The actual attach sequence is slightly different.
> > >
> > >        Inv CPU                        Attach CPU
> > >
> > >  write io_pte
> > >  rcu_read_lock
> > >    list_for_each // empty
> > >                                   list_add_rcu()
> > >                                   IOTLB.INVAL(PSCID)
> > >                                   <make description inv cmd>
> > >                                   dma_wmb()
> > >                                   <cmd doorbell>
> > >  rcu_read_unlock
> > >
> > > I've tried to cover this case with riscv_iommu_iotlb_inval() called
> > > before the attached domain is visible to the device.
> >
> > That invalidation shouldn't do anything. If this is the first attach
> > of a PSCID then the PSCID had better already be empty; it won't
> > become non-empty until the DDT entry is installed.
> >
> > And if it is the second attach then the Inv CPU is already taking
> > care of things, no need to invalidate at all.
> >
> > Regardless, there is still a theoretical race that the IOPTEs haven't
> > been made visible yet, because there is still no synchronization with
> > the CPU writing them.
> >
> > So, I don't think this solves any problem. I believe you need the
> > appropriate kind of CPU barrier here instead of an invalidation.
> >
>
> Yes. There was a small, but still plausible, race with IOPTE
> visibility to the IOMMU.
> For v5 I'm adding two barriers to the inval/attach flow, which I
> believe should cover it.
>
> 1) In riscv_iommu_iotlb_inval(), an unconditional dma_wmb() to make
> any pending writes to the PTEs visible to the IOMMU device.
> This should cover the case where the list_add_rcu() update is not yet
> visible in the _iotlb_inval() sequence, the first time the domain is
> attached to the IOMMU.
>
>        Inv CPU                        Attach CPU
>
>  write io_pte
>  dma_wmb (1)
>  rcu_read_lock
>    list_for_each // empty        list_add_rcu()
>                                  smp_wmb (2)
>                                  Update device descriptor
>                                  <make description inv cmd>
>                                  // PTEs are visible to the HW (*1)
>                                  dma_wmb()
>                                  <cmd doorbell>
>  rcu_read_unlock
>
> 2) In riscv_iommu_bond_link(), a write memory barrier to ensure the
> list update is visible before the IOMMU descriptor update. If stale
> data has been fetched by the HW, the inval CPU will run the IOTLB
> invalidation sequence. There is a possibility that the IOMMU will
> fetch correct PTEs and still receive an unnecessary IOTLB inval, but
> I don't think anyone would care.
>
>        Inv CPU                        Attach CPU
>
>  write io_pte                    list_add_rcu()
>                                  smp_wmb (2)
>                                  Update device descriptor
>                                  <make description inv cmd>
>                                  // HW might fetch stale PTEs
>                                  dma_wmb()
>                                  <cmd doorbell>
>  dma_wmb (1)
>  rcu_read_lock
>    list_for_each // non-empty (*2)
>      <make iotlb inval cmd>
>      dma_wmb()
>      <cmd doorbell>
>  rcu_read_unlock
>
> 3) I've also updated riscv_iommu_bond_unlink() to wipe the PSCID
> cache on the last domain unlink from the IOMMU.
>
> Thank you for pointing this out. Let me know if that makes sense.

I'm not an expert in barriers, but I think you need the more expensive
"mb" in both cases.

The inv side is both releasing the write and acquiring the list read.
IIRC READ_ONCE is not a full acquire?

The Attach side is both releasing the list_add_rcu() and acquiring the
io_pte.

RCU is still a benefit: there is no cache line sharing, and there is
only one full barrier, not two as with a spinlock.

And a big fat comment on both sides explaining this :)

Jason
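
[As a concrete illustration of the barrier pairing discussed above,
here is a minimal sketch in kernel-style C. The struct layouts, the
bonds list, and the riscv_iommu_cmd_inval_send() stub are assumptions
made for this example, not the actual driver code; the real v5 patch
may name and place things differently.]

#include <linux/device.h>
#include <linux/list.h>
#include <linux/rculist.h>
#include <linux/slab.h>
#include <asm/barrier.h>

struct riscv_iommu_domain {
	struct list_head bonds;	/* RCU list of attached devices */
	unsigned int pscid;
};

struct riscv_iommu_bond {
	struct list_head list;
	struct device *dev;
	struct rcu_head rcu;
};

/* Placeholder: build IOTLB.INVAL(PSCID), dma_wmb(), ring doorbell. */
static void riscv_iommu_cmd_inval_send(struct device *dev,
				       unsigned int pscid)
{
	/* Elided; see the command sequences in the diagrams above. */
}

/* Invalidation side: pairs with the smp_mb() in bond_link() below. */
static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain)
{
	struct riscv_iommu_bond *bond;

	/*
	 * Full barrier: this CPU is both releasing the io_pte stores
	 * and acquiring the bond list read. A dma_wmb() alone orders
	 * the PTE writes against the doorbell, but not against the
	 * list read, hence the "more expensive mb".
	 */
	smp_mb();

	rcu_read_lock();
	list_for_each_entry_rcu(bond, &domain->bonds, list)
		riscv_iommu_cmd_inval_send(bond->dev, domain->pscid);
	rcu_read_unlock();
}

/* Attach side: pairs with the smp_mb() in iotlb_inval() above. */
static int riscv_iommu_bond_link(struct riscv_iommu_domain *domain,
				 struct device *dev)
{
	struct riscv_iommu_bond *bond;

	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
	if (!bond)
		return -ENOMEM;
	bond->dev = dev;

	list_add_rcu(&bond->list, &domain->bonds);

	/*
	 * Full barrier: this CPU is both releasing the list_add_rcu()
	 * and acquiring the io_ptes before the descriptor update makes
	 * the page table live. Either this CPU observes the final
	 * io_ptes, or the inval CPU observes the new bond and issues
	 * the IOTLB invalidation itself; the race is closed either way.
	 */
	smp_mb();

	/* ... update the DDT entry so the device can use the domain. */
	return 0;
}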