On 2021/6/25 15:30, Michael S. Tsirkin wrote: > On Fri, Jun 25, 2021 at 03:21:33PM +0800, Yunsheng Lin wrote: >> On 2021/6/25 14:32, Michael S. Tsirkin wrote: >>> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote: >>>> Currently r->queue[] is cleared after r->consumer_head is moved >>>> forward, which makes the __ptr_ring_empty() checking called in >>>> page_pool_refill_alloc_cache() unreliable if the checking is done >>>> after the r->queue clearing and before the consumer_head moving >>>> forward. >>> >>> >>> Well the documentation for __ptr_ring_empty clearly states is >>> is not guaranteed to be reliable. >> >> Yes, this patch does not make __ptr_ring_empty() strictly reliable >> without taking the r->consumer_lock, as the disscuission in [1]. >> >> 1. https://patchwork.kernel.org/project/netdevbpf/patch/1622032173-11883-1-git-send-email-linyunsheng@xxxxxxxxxx/#24207011 >> >>> >>> * >>> * NB: This is only safe to call if ring is never resized. >>> * >>> * However, if some other CPU consumes ring entries at the same time, the value >>> * returned is not guaranteed to be correct. >>> * >>> * In this case - to avoid incorrectly detecting the ring >>> * as empty - the CPU consuming the ring entries is responsible >>> * for either consuming all ring entries until the ring is empty, >>> * or synchronizing with some other CPU and causing it to >>> * re-test __ptr_ring_empty and/or consume the ring enteries >>> * after the synchronization point. >>> * >>> >>> Is it then the case that page_pool_refill_alloc_cache violates >>> this requirement? How? >> >> As my understanding: >> page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid >> taking r->consumer_lock, when the above data race happens, it will >> exit out and allocate page from the page allocator instead of reusing >> the page in ptr_ring, which *may* not be happening if __ptr_ring_empty() >> is more reliable. > > Question is how do we know it's more reliable? > It would be nice if we did actually made it more reliable, > as it is we are just shifting races around. > > >>> >>> It looks like you are trying to make the guarantee stronger and ensure >>> no false positives. >>> >>> If yes please document this as such, update the comment so all >>> code can be evaluated with the eye towards whether the new stronger >>> guarantee is maintained. In particular I think I see at least one >>> issue with this immediately. >>> >>> >>>> Move the r->queue[] clearing after consumer_head moving forward >>>> to make __ptr_ring_empty() checking more reliable. >>>> >>>> As a side effect of above change, a consumer_head checking is >>>> avoided for the likely case, and it has noticeable performance >>>> improvement when it is tested using the ptr_ring_test selftest >>>> added in the previous patch. >>>> >>>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" >>>> to test the case of single thread doing both the enqueuing and >>>> dequeuing: >>>> >>>> arch unpatched patched delta >>>> arm64 4648 ms 4464 ms +3.9% >>>> X86 2562 ms 2401 ms +6.2% >>>> >>>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" >>>> to test the case of one thread doing enqueuing and another thread >>>> doing dequeuing concurrently, also known as single-producer/single- >>>> consumer: >>>> >>>> arch unpatched patched delta >>>> arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% >>>> x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6% >>>> >>>> Signed-off-by: Yunsheng Lin <linyunsheng@xxxxxxxxxx> >>>> --- >>>> V2: Add performance data. >>>> --- >>>> include/linux/ptr_ring.h | 25 ++++++++++++++++--------- >>>> 1 file changed, 16 insertions(+), 9 deletions(-) >>>> >>>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h >>>> index 808f9d3..db9c282 100644 >>>> --- a/include/linux/ptr_ring.h >>>> +++ b/include/linux/ptr_ring.h >>>> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) >>>> /* Note: we must keep consumer_head valid at all times for __ptr_ring_empty >>>> * to work correctly. >>>> */ >>>> - int consumer_head = r->consumer_head; >>>> - int head = consumer_head++; >>>> + int consumer_head = r->consumer_head + 1; >>>> >>>> /* Once we have processed enough entries invalidate them in >>>> * the ring all at once so producer can reuse their space in the ring. >>>> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) >>>> */ >>>> if (unlikely(consumer_head - r->consumer_tail >= r->batch || >>>> consumer_head >= r->size)) { >>>> + int tail = r->consumer_tail; >>>> + >>>> + if (unlikely(consumer_head >= r->size)) { >>>> + r->consumer_tail = 0; >>>> + WRITE_ONCE(r->consumer_head, 0); >>>> + } else { >>>> + r->consumer_tail = consumer_head; >>>> + WRITE_ONCE(r->consumer_head, consumer_head); >>>> + } >>>> + >>>> /* Zero out entries in the reverse order: this way we touch the >>>> * cache line that producer might currently be reading the last; >>>> * producer won't make progress and touch other cache lines >>>> * besides the first one until we write out all entries. >>>> */ >>>> - while (likely(head >= r->consumer_tail)) >>>> - r->queue[head--] = NULL; >>>> - r->consumer_tail = consumer_head; >>>> - } >>>> - if (unlikely(consumer_head >= r->size)) { >>>> - consumer_head = 0; >>>> - r->consumer_tail = 0; >>>> + while (likely(--consumer_head >= tail)) >>>> + r->queue[consumer_head] = NULL; >>>> + >>>> + return; >>> >>> >>> So if now we need this to be reliable then >>> we also need smp_wmb before writing r->queue[consumer_head], >>> there could be other gotchas. >> >> Yes, This patch does not make it strictly reliable. >> T think I could mention that in the commit log? > > OK so it's not that it makes it more reliable - this patch simply makes > a possible false positive less likely while making a false negative > more likely. Our assumption is that a false negative is cheaper then? > > How do we know that it is? > > And even if we prove the ptr_ring itself is faster now, > how do we know what affects callers in a better way a > false positive or a false negative? > > I would rather we worked on actually making it reliable > e.g. if we can guarantee no false positives, that would be > a net win. I thought deeper about the case you mentioned above, it seems for the above to happen, the consumer_head need to be rolled back to zero and incremented to the point when caller of __ptr_ring_empty() is still *not* able to see the r->queue[] which has been set to NULL in __ptr_ring_discard_one(). It seems smp_wmb() only need to be done once when consumer_head is rolled back to zero, and maybe that is enough to make sure the case you mentioned is fixed too? And the smp_wmb() is only done once in a round of producing/ consuming, so the performance impact should be minimized?(of course we need to test it too). > >> >>> >>>> } >>>> + >>>> /* matching READ_ONCE in __ptr_ring_empty for lockless tests */ >>>> WRITE_ONCE(r->consumer_head, consumer_head); >>>> } >>>> -- >>>> 2.7.4 >>> >>> >>> . >>> > > > . >