Re: [PATCH RFC net-next/mm V3 1/2] page_pool: Remove workqueue in new shutdown scheme

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 04/05/2023 04.42, Yunsheng Lin wrote:
On 2023/4/29 0:16, Jesper Dangaard Brouer wrote:
  void page_pool_release_page(struct page_pool *pool, struct page *page)
  {
+	unsigned int flags = READ_ONCE(pool->p.flags);
  	dma_addr_t dma;
-	int count;
+	u32 release_cnt;
+	u32 hold_cnt;
if (!(pool->p.flags & PP_FLAG_DMA_MAP))
  		/* Always account for inflight pages, even if we didn't
@@ -490,11 +503,15 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
  skip_dma_unmap:
  	page_pool_clear_pp_info(page);
- /* This may be the last page returned, releasing the pool, so
-	 * it is not safe to reference pool afterwards.
-	 */
-	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
-	trace_page_pool_state_release(pool, page, count);

There is a time window between "unsigned int flags = READ_ONCE(pool->p.flags)"
and flags checking, if page_pool_destroy() is called concurrently during that
time window, it seems we will have a pp instance leaking problem here?


Nope, that is resolved by the code changes in page_pool_destroy(), see below.

It seems it is very hard to aovid this kind of corner case when using both
flags & PP_FLAG_SHUTDOWN and release_cnt/hold_cnt checking to decide if pp
instance can be freed.
Can we use something like biased reference counting, which used by frag support
in page pool? So that we only need to check only one variable and avoid cache
bouncing as much as possible.


See below, I believe we are doing an equivalent refcnt bias trick, that
solves these corner cases in page_pool_destroy().
In short: hold_cnt is increased, prior to setting PP_FLAG_SHUTDOWN.
Thus, if this code READ_ONCE flags without PP_FLAG_SHUTDOWN, we know it
will not be the last to release pool->pages_state_release_cnt.
Below: Perhaps, we should add a RCU grace period to make absolutely
sure, that this code completes before page_pool_destroy() call completes.


+	if (flags & PP_FLAG_SHUTDOWN)
+		hold_cnt = pp_read_hold_cnt(pool);
+

I would like to avoid above code, and I'm considering using call_rcu(),
which I think will resolve the race[0] this code deals with.
As I explained here[0], this code deals with another kind of race.

[0] https://lore.kernel.org/all/f671f5da-d9bc-a559-2120-10c3491e6f6d@xxxxxxxxxx/

+	release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
+	trace_page_pool_state_release(pool, page, release_cnt);
+
+	/* In shutdown phase, last page will free pool instance */
+	if (flags & PP_FLAG_SHUTDOWN)
+		page_pool_free_attempt(pool, hold_cnt, release_cnt);
  }
  EXPORT_SYMBOL(page_pool_release_page);


...

void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
@@ -856,6 +884,10 @@ EXPORT_SYMBOL(page_pool_unlink_napi);
void page_pool_destroy(struct page_pool *pool)
  {
+	unsigned int flags;
+	u32 release_cnt;
+	u32 hold_cnt;
+
  	if (!pool)
  		return;
@@ -868,11 +900,39 @@ void page_pool_destroy(struct page_pool *pool)
  	if (!page_pool_release(pool))
  		return;
- pool->defer_start = jiffies;
-	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
+	/* PP have pages inflight, thus cannot immediately release memory.
+	 * Enter into shutdown phase, depending on remaining in-flight PP
+	 * pages to trigger shutdown process (on concurrent CPUs) and last
+	 * page will free pool instance.
+	 *
+	 * There exist two race conditions here, we need to take into
+	 * account in the following code.
+	 *
+	 * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
+	 *    pages into the ptr_ring.  Thus, it missed triggering shutdown
+	 *    process, which can then be stalled forever.
+	 *
+	 * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
+	 *    page, which triggered shutdown process and freed pool
+	 *    instance. Thus, its not safe to dereference *pool afterwards.
+	 *
+	 * Handling races by holding a fake in-flight count, via
+	 * artificially bumping pages_state_hold_cnt, which assures pool
+	 * isn't freed under us.  For race(1) its safe to recheck ptr_ring
+	 * (it will not free pool). Race(2) cannot happen, and we can
+	 * release fake in-flight count as last step.
+	 */
+	hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
+	smp_store_release(&pool->pages_state_hold_cnt, hold_cnt);

I assume the smp_store_release() is used to ensure the correct order
between the above store operations?
There is data dependency between those two store operations, do we
really need the smp_store_release() here?

+	barrier();
+	flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;

Do we need a stronger barrier like smp_rmb() to prevent cpu from
executing "flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN"
before "smp_store_release(&pool->pages_state_hold_cnt, hold_cnt)"
even if there is a smp_store_release() barrier here?

I do see you point and how it is related to your above comment for
page_pool_release_page().

I think we need to replace barrier() with synchronize_rcu().
Meaning we add a RCU grace period to "wait" for above code (in
page_pool_release_page) that read the old flags value to complete.


+	smp_store_release(&pool->p.flags, flags);

When doing a synchronize_rcu(), I assume this smp_store_release() is
overkill, right?
Will a WRITE_ONCE() be sufficient?

Hmm, the synchronize_rcu(), shouldn't that be *after* storing the flags?

+
+	/* Concurrent CPUs could have returned last pages into ptr_ring */
+	page_pool_empty_ring(pool);
- INIT_DELAYED_WORK(&pool->release_dw, page_pool_release_retry);
-	schedule_delayed_work(&pool->release_dw, DEFER_TIME);
+	release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
+	page_pool_free_attempt(pool, hold_cnt, release_cnt);
  }
  EXPORT_SYMBOL(page_pool_destroy);




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux