On Fri, Jun 25, 2021 at 04:33:40PM +0800, Yunsheng Lin wrote: > On 2021/6/25 15:30, Michael S. Tsirkin wrote: > > On Fri, Jun 25, 2021 at 03:21:33PM +0800, Yunsheng Lin wrote: > >> On 2021/6/25 14:32, Michael S. Tsirkin wrote: > >>> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote: > >>>> Currently r->queue[] is cleared after r->consumer_head is moved > >>>> forward, which makes the __ptr_ring_empty() checking called in > >>>> page_pool_refill_alloc_cache() unreliable if the checking is done > >>>> after the r->queue clearing and before the consumer_head moving > >>>> forward. > >>> > >>> > >>> Well the documentation for __ptr_ring_empty clearly states is > >>> is not guaranteed to be reliable. > >> > >> Yes, this patch does not make __ptr_ring_empty() strictly reliable > >> without taking the r->consumer_lock, as the disscuission in [1]. > >> > >> 1. https://patchwork.kernel.org/project/netdevbpf/patch/1622032173-11883-1-git-send-email-linyunsheng@xxxxxxxxxx/#24207011 > >> > >>> > >>> * > >>> * NB: This is only safe to call if ring is never resized. > >>> * > >>> * However, if some other CPU consumes ring entries at the same time, the value > >>> * returned is not guaranteed to be correct. > >>> * > >>> * In this case - to avoid incorrectly detecting the ring > >>> * as empty - the CPU consuming the ring entries is responsible > >>> * for either consuming all ring entries until the ring is empty, > >>> * or synchronizing with some other CPU and causing it to > >>> * re-test __ptr_ring_empty and/or consume the ring enteries > >>> * after the synchronization point. > >>> * > >>> > >>> Is it then the case that page_pool_refill_alloc_cache violates > >>> this requirement? How? > >> > >> As my understanding: > >> page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid > >> taking r->consumer_lock, when the above data race happens, it will > >> exit out and allocate page from the page allocator instead of reusing > >> the page in ptr_ring, which *may* not be happening if __ptr_ring_empty() > >> is more reliable. > > > > Question is how do we know it's more reliable? > > It would be nice if we did actually made it more reliable, > > as it is we are just shifting races around. > > > > > >>> > >>> It looks like you are trying to make the guarantee stronger and ensure > >>> no false positives. > >>> > >>> If yes please document this as such, update the comment so all > >>> code can be evaluated with the eye towards whether the new stronger > >>> guarantee is maintained. In particular I think I see at least one > >>> issue with this immediately. > >>> > >>> > >>>> Move the r->queue[] clearing after consumer_head moving forward > >>>> to make __ptr_ring_empty() checking more reliable. > >>>> > >>>> As a side effect of above change, a consumer_head checking is > >>>> avoided for the likely case, and it has noticeable performance > >>>> improvement when it is tested using the ptr_ring_test selftest > >>>> added in the previous patch. > >>>> > >>>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" > >>>> to test the case of single thread doing both the enqueuing and > >>>> dequeuing: > >>>> > >>>> arch unpatched patched delta > >>>> arm64 4648 ms 4464 ms +3.9% > >>>> X86 2562 ms 2401 ms +6.2% > >>>> > >>>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" > >>>> to test the case of one thread doing enqueuing and another thread > >>>> doing dequeuing concurrently, also known as single-producer/single- > >>>> consumer: > >>>> > >>>> arch unpatched patched delta > >>>> arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% > >>>> x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6% > >>>> > >>>> Signed-off-by: Yunsheng Lin <linyunsheng@xxxxxxxxxx> > >>>> --- > >>>> V2: Add performance data. > >>>> --- > >>>> include/linux/ptr_ring.h | 25 ++++++++++++++++--------- > >>>> 1 file changed, 16 insertions(+), 9 deletions(-) > >>>> > >>>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h > >>>> index 808f9d3..db9c282 100644 > >>>> --- a/include/linux/ptr_ring.h > >>>> +++ b/include/linux/ptr_ring.h > >>>> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) > >>>> /* Note: we must keep consumer_head valid at all times for __ptr_ring_empty > >>>> * to work correctly. > >>>> */ > >>>> - int consumer_head = r->consumer_head; > >>>> - int head = consumer_head++; > >>>> + int consumer_head = r->consumer_head + 1; > >>>> > >>>> /* Once we have processed enough entries invalidate them in > >>>> * the ring all at once so producer can reuse their space in the ring. > >>>> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) > >>>> */ > >>>> if (unlikely(consumer_head - r->consumer_tail >= r->batch || > >>>> consumer_head >= r->size)) { > >>>> + int tail = r->consumer_tail; > >>>> + > >>>> + if (unlikely(consumer_head >= r->size)) { > >>>> + r->consumer_tail = 0; > >>>> + WRITE_ONCE(r->consumer_head, 0); > >>>> + } else { > >>>> + r->consumer_tail = consumer_head; > >>>> + WRITE_ONCE(r->consumer_head, consumer_head); > >>>> + } > >>>> + > >>>> /* Zero out entries in the reverse order: this way we touch the > >>>> * cache line that producer might currently be reading the last; > >>>> * producer won't make progress and touch other cache lines > >>>> * besides the first one until we write out all entries. > >>>> */ > >>>> - while (likely(head >= r->consumer_tail)) > >>>> - r->queue[head--] = NULL; > >>>> - r->consumer_tail = consumer_head; > >>>> - } > >>>> - if (unlikely(consumer_head >= r->size)) { > >>>> - consumer_head = 0; > >>>> - r->consumer_tail = 0; > >>>> + while (likely(--consumer_head >= tail)) > >>>> + r->queue[consumer_head] = NULL; > >>>> + > >>>> + return; > >>> > >>> > >>> So if now we need this to be reliable then > >>> we also need smp_wmb before writing r->queue[consumer_head], > >>> there could be other gotchas. > >> > >> Yes, This patch does not make it strictly reliable. > >> T think I could mention that in the commit log? > > > > OK so it's not that it makes it more reliable - this patch simply makes > > a possible false positive less likely while making a false negative > > more likely. Our assumption is that a false negative is cheaper then? > > > > How do we know that it is? > > > > And even if we prove the ptr_ring itself is faster now, > > how do we know what affects callers in a better way a > > false positive or a false negative? > > > > I would rather we worked on actually making it reliable > > e.g. if we can guarantee no false positives, that would be > > a net win. > I thought deeper about the case you mentioned above, it > seems for the above to happen, the consumer_head need to > be rolled back to zero and incremented to the point when > caller of __ptr_ring_empty() is still *not* able to see the > r->queue[] which has been set to NULL in __ptr_ring_discard_one(). > > It seems smp_wmb() only need to be done once when consumer_head > is rolled back to zero, and maybe that is enough to make sure the > case you mentioned is fixed too? > > And the smp_wmb() is only done once in a round of producing/ > consuming, so the performance impact should be minimized?(of > course we need to test it too). Sorry I don't really understand the question here. I think I agree it's enough to do one smp_wmb between the write of r->queue and write of consumer_head to help guarantee no false positives. What other code changes are necessary I can't yet say without more a deeper code review. > > > >> > >>> > >>>> } > >>>> + > >>>> /* matching READ_ONCE in __ptr_ring_empty for lockless tests */ > >>>> WRITE_ONCE(r->consumer_head, consumer_head); > >>>> } > >>>> -- > >>>> 2.7.4 > >>> > >>> > >>> . > >>> > > > > > > . > >