On Thu, Oct 10, 2024 at 12:48:09PM -0700, Tomasz Jeznach wrote:
> Introduce device command submission and fault reporting queues, as
> described in Chapters 3.1 and 3.2 of the RISC-V IOMMU Architecture
> Specification.
>
> Command and fault queues are instantiated in contiguous system memory
> local to the IOMMU device domain, or mapped from the fixed I/O space
> provided by the hardware implementation. Detection of the location and
> maximum allowed size of the queue uses the WARL properties of the queue
> base control register. The driver will try to allocate up to 128KB of
> system memory, while respecting the hardware-supported maximum queue
> size.
>
> Interrupt allocation is based on interrupt vector availability and is
> distributed to all queues in a simple round-robin fashion. For hardware
> implementations with a fixed event-type-to-interrupt-vector assignment,
> the IVEC WARL property is used to discover such mappings.
>
> Address translation, command, and queue fault handling in this change
> are limited to simple fault reporting without taking any action.
>
> Reviewed-by: Lu Baolu <baolu.lu@xxxxxxxxxxxxxxx>
> Reviewed-by: Zong Li <zong.li@xxxxxxxxxx>
> Signed-off-by: Tomasz Jeznach <tjeznach@xxxxxxxxxxxx>
> ---
>  drivers/iommu/riscv/iommu-bits.h |  75 +++++
>  drivers/iommu/riscv/iommu.c      | 507 ++++++++++++++++++++++++++++++-
>  drivers/iommu/riscv/iommu.h      |  21 ++
>  3 files changed, 601 insertions(+), 2 deletions(-)

[...]

> +/* Enqueue an entry and wait to be processed if timeout_us > 0
> + *
> + * Error handling for IOMMU hardware not responding in reasonable time
> + * will be added as separate patch series along with other RAS features.
> + * For now, only report hardware failure and continue.
> + */
> +static unsigned int riscv_iommu_queue_send(struct riscv_iommu_queue *queue,
> +					   void *entry, size_t entry_size)
> +{
> +	unsigned int prod;
> +	unsigned int head;
> +	unsigned int tail;
> +	unsigned long flags;
> +
> +	/* Do not preempt submission flow. */
> +	local_irq_save(flags);
> +
> +	/* 1. Allocate some space in the queue */
> +	prod = atomic_inc_return(&queue->prod) - 1;
> +	head = atomic_read(&queue->head);
> +
> +	/* 2. Wait for space availability. */
> +	if ((prod - head) > queue->mask) {
> +		if (readx_poll_timeout(atomic_read, &queue->head,
> +				       head, (prod - head) < queue->mask,
> +				       0, RISCV_IOMMU_QUEUE_TIMEOUT))
> +			goto err_busy;
> +	} else if ((prod - head) == queue->mask) {
> +		const unsigned int last = Q_ITEM(queue, head);
> +
> +		if (riscv_iommu_readl_timeout(queue->iommu, Q_HEAD(queue), head,
> +					      !(head & ~queue->mask) && head != last,
> +					      0, RISCV_IOMMU_QUEUE_TIMEOUT))
> +			goto err_busy;
> +		atomic_add((head - last) & queue->mask, &queue->head);
> +	}
> +
> +	/* 3. Store entry in the ring buffer. */
> +	memcpy(queue->base + Q_ITEM(queue, prod) * entry_size, entry, entry_size);
> +
> +	/* 4. Wait for all previous entries to be ready */
> +	if (readx_poll_timeout(atomic_read, &queue->tail, tail, prod == tail,
> +			       0, RISCV_IOMMU_QUEUE_TIMEOUT))
> +		goto err_busy;
> +
> +	/* 5. Complete submission and restore local interrupts */
> +	dma_wmb();
> +	riscv_iommu_writel(queue->iommu, Q_TAIL(queue), Q_ITEM(queue, prod + 1));

Please explain why a dma_wmb() is sufficient to order the memcpy() stores
before the tail update.

> +	atomic_inc(&queue->tail);

I think this can be reordered before the relaxed MMIO write to tail,
causing other CPUs to exit their polling early.

Will
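
One possible shape for step 5 that makes both orderings explicit is
sketched below. It reuses the names from the quoted patch and assumes
riscv_iommu_writel() is a relaxed accessor along the lines of
writel_relaxed(); the local_irq_restore() is only implied by the step 5
comment, and the choice of mandatory wmb()/mb() barriers here is a
suggestion to be checked against the RISC-V I/O ordering rules, not a
tested fix:

	/* 5. Complete submission and restore local interrupts */

	/* Order the step 3 ring-buffer stores before the MMIO doorbell.
	 * dma_wmb() only orders normal memory writes against each other,
	 * so use a mandatory write barrier ahead of the relaxed accessor.
	 */
	wmb();
	riscv_iommu_writel(queue->iommu, Q_TAIL(queue), Q_ITEM(queue, prod + 1));

	/* Keep the shadow tail update after the doorbell write, so that
	 * CPUs polling queue->tail in step 4 cannot ring their own
	 * doorbell before this one has been issued to the device.
	 */
	mb();
	atomic_inc(&queue->tail);

	local_irq_restore(flags);

Switching to a non-relaxed writel()-style accessor would make the first
barrier implicit; the full barrier before atomic_inc() is deliberately
heavy, and whether something cheaper is sufficient on RISC-V is a
separate question.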