Re: [PATCH RFC net-next/mm V4 2/2] page_pool: Remove workqueue in new shutdown scheme

Jesper Dangaard Brouer <jbrouer@xxxxxxxxxx> · Wed, 24 May 2023 18:42:35 +0200






On 24/05/2023 14.00, Yunsheng Lin wrote:
On 2023/5/24 0:16, Toke Høiland-Jørgensen wrote:
  void page_pool_destroy(struct page_pool *pool)
  {
+	unsigned int flags;
+	u32 release_cnt;
+	u32 hold_cnt;
+
  	if (!pool)
  		return;
  
@@ -868,11 +894,45 @@ void page_pool_destroy(struct page_pool *pool)
  	if (!page_pool_release(pool))
  		return;
  
-	pool->defer_start = jiffies;
-	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
+	/* PP have pages inflight, thus cannot immediately release memory.
+	 * Enter into shutdown phase, depending on remaining in-flight PP
+	 * pages to trigger shutdown process (on concurrent CPUs) and last
+	 * page will free pool instance.
+	 *
+	 * There exist two race conditions here, we need to take into
+	 * account in the following code.
+	 *
+	 * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
+	 *    pages into the ptr_ring.  Thus, it missed triggering shutdown
+	 *    process, which can then be stalled forever.
+	 *
+	 * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
+	 *    page, which triggered shutdown process and freed pool
+	 *    instance. Thus, its not safe to dereference *pool afterwards.
+	 *
+	 * Handling races by holding a fake in-flight count, via artificially
+	 * bumping pages_state_hold_cnt, which assures pool isn't freed under
+	 * us.  Use RCU Grace-Periods to guarantee concurrent CPUs will
+	 * transition safely into the shutdown phase.
+	 *
+	 * After safely transition into this state the races are resolved.  For
+	 * race(1) its safe to recheck and empty ptr_ring (it will not free
+	 * pool). Race(2) cannot happen, and we can release fake in-flight count
+	 * as last step.
+	 */
+	hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
+	WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
+	synchronize_rcu();
+
+	flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
+	WRITE_ONCE(pool->p.flags, flags);
+	synchronize_rcu();

Hmm, synchronize_rcu() can be quite expensive; why do we need two of
them? Should be fine to just do one after those two writes, as long as
the order of those writes is correct (which WRITE_ONCE should ensure)?

I am not sure rcu is the right scheme to fix the problem, as rcu is usually
for one doing freeing/updating and many doing reading, while the case we
try to fix here is all doing the reading and trying to do the freeing.

And there might still be data race here as below:
      cpu0 calling page_pool_destroy()                cpu1 caling page_pool_release_page()

WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
       WRITE_ONCE(pool->p.flags, flags);
            synchronize_rcu();
                                                              atomic_inc_return()

         release_cnt = atomic_inc_return();
       page_pool_free_attempt(pool, release_cnt);
         rcu call page_pool_free_rcu()

				                     if (READ_ONCE(pool->p.flags) & PP_FLAG_SHUTDOWN)
                                                                page_pool_free_attempt()

As the rcu_read_[un]lock are only in page_pool_free_attempt(), cpu0
will see the inflight being zero and triger the rcu to free the pp,
and cpu1 see the pool->p.flags with PP_FLAG_SHUTDOWN set, it will
access pool->pages_state_hold_cnt in __page_pool_inflight(), causing
a use-after-free problem?



Also, if we're adding this (blocking) operation in the teardown path we
risk adding latency to that path (network interface removal,
BPF_PROG_RUN syscall etc), so not sure if this actually ends up being an
improvement anymore, as opposed to just keeping the workqueue but
dropping the warning?

we might be able to remove the workqueue from the destroy path, a
workqueue might be still needed to be trigered to call page_pool_free()
in non-atomic context instead of calling page_pool_free() directly in
page_pool_release_page(), as page_pool_release_page() might be called
in atomic context and page_pool_free() requires a non-atomic context
for put_device() and pool->disconnect using the mutex_lock() in
mem_allocator_disconnect().


I thought the call_rcu() callback provided the right context, but
skimming call_rcu() I think it doesn't.  Argh, I think you are right, we
cannot avoid the workqueue, as we need the non-atomic context.

Thanks for catching and pointing this out :-)

--Jesper