On Tue, Jan 30, 2024 at 05:58:24PM +0000, Mark Rutland wrote: > This patch might be ok (it looks reasonable as an optimization), but I think > the description of wmb() and smp_wmb() is incorrect. I also think that you're > missing an rmb()/smp_rmb()eor equivalent on the reader side. Sorry, the above should have said: an rmb()/smp_rmb() *or* equivalent > > On Mon, Jan 29, 2024 at 06:58:06PM -0800, Fenghua Yu wrote: > > wmb() is used to ensure status in the completion record is written > > after the rest of the completion record, making it visible to the user. > > However, on SMP systems, this may not guarantee visibility across > > different CPUs. > > > > Considering this scenario that event log handler is running on CPU1 while > > user app is polling completion record (cr) status on CPU2: > > > > CPU1 CPU2 > > event log handler user app > > > > 1. cr = 0 (status = 0) > > 2. copy X to user cr except "status" > > 3. wmb() > > 4. copy Y to user cr "status" > > 5. poll status value Y > > 6. read rest cr which is still 0. > > cr handling fails > > 7. cr value X visible now > > > > Although wmb() ensure value Y is written and visible after X is written > > on CPU1, the order is not guaranteed on CPU2. So user app may see status > > value Y while cr value X is still not visible yet on CPU2. This will > > cause reading 0 from the rest of cr and cr handling fails. > > The wmb() on CPU1 ensures the order of the reads, but you need an rmb() on CPU2 Sorry again, the above should have said: The wmb() on CPU1 ensures the order of the *writes* Apologies for any confusion resulting from those mistakes. Mark. > between reading the 'status' and 'rest' parts; otherwise CPU2 (or the > compiler!) is permitted to hoist the read of 'rest' early, before reading from > 'status', and hence you can end up with a sequence that is effectively: > > CPU1 CPU2 > event log handler user app > > 1. cr = 0 (status = 0) > 6a. read rest cr which is still 0. > 2. copy X to user cr except "status" > 3. wmb() > 4. copy Y to user cr "status" > 5. poll status value Y > 6b. cr handling fails > 7. cr value X visible now > > Since this is all to regular cacheable memory, it's *sufficient* to use > smp_wmb() and smp_rmb(), but that's an optimization rather than an ordering > fix. > > Note that on x86_64, TSO means that the stores are in-order (and so smp_wmb() > is just a compiler barrier), and IIUC loads are not reordered w.r.t. other > loads (and so smp_rmb() is also just a compiler barrier). > > > Changing wmb() to smp_wmb() ensures Y is written after X on both CPU1 > > and CPU2. This guarantees that user app can consume cr in right order. > > This implies that smp_wmb() is *stronger* than wmb(), whereas smp_wmb() is > actually *weaker* (e.g. on x86_64 wmb() is an sfence, whereas smp_wmb() is a > barrier()). > > Thanks, > Mark. > > > > > Fixes: b022f59725f0 ("dmaengine: idxd: add idxd_copy_cr() to copy user completion record during page fault handling") > > Suggested-by: Nikhil Rao <nikhil.rao@xxxxxxxxx> > > Tested-by: Tony Zhu <tony.zhu@xxxxxxxxx> > > Signed-off-by: Fenghua Yu <fenghua.yu@xxxxxxxxx> > > --- > > drivers/dma/idxd/cdev.c | 5 +++-- > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/dma/idxd/cdev.c b/drivers/dma/idxd/cdev.c > > index 77f8885cf407..9b7388a23cbe 100644 > > --- a/drivers/dma/idxd/cdev.c > > +++ b/drivers/dma/idxd/cdev.c > > @@ -681,9 +681,10 @@ int idxd_copy_cr(struct idxd_wq *wq, ioasid_t pasid, unsigned long addr, > > * Ensure that the completion record's status field is written > > * after the rest of the completion record has been written. > > * This ensures that the user receives the correct completion > > - * record information once polling for a non-zero status. > > + * record information on any CPU once polling for a non-zero > > + * status. > > */ > > - wmb(); > > + smp_wmb(); > > status = *(u8 *)cr; > > if (put_user(status, (u8 __user *)addr)) > > left += status_size; > > -- > > 2.37.1 > > > > >