On 2023/5/4 21:48, Jesper Dangaard Brouer wrote:
> On 04/05/2023 04.42, Yunsheng Lin wrote:
>> On 2023/4/29 0:16, Jesper Dangaard Brouer wrote:
>>>  void page_pool_release_page(struct page_pool *pool, struct page *page)
>>>  {
>>> +    unsigned int flags = READ_ONCE(pool->p.flags);
>>>      dma_addr_t dma;
>>> -    int count;
>>> +    u32 release_cnt;
>>> +    u32 hold_cnt;
>>>      if (!(pool->p.flags & PP_FLAG_DMA_MAP))
>>>          /* Always account for inflight pages, even if we didn't
>>> @@ -490,11 +503,15 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
>>>  skip_dma_unmap:
>>>      page_pool_clear_pp_info(page);
>>> -    /* This may be the last page returned, releasing the pool, so
>>> -     * it is not safe to reference pool afterwards.
>>> -     */
>>> -    count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
>>> -    trace_page_pool_state_release(pool, page, count);
>>
>> There is a time window between "unsigned int flags = READ_ONCE(pool->p.flags)"
>> and flags checking, if page_pool_destroy() is called concurrently during that
>> time window, it seems we will have a pp instance leaking problem here?
>>
>
> Nope, that is resolved by the code changes in page_pool_destroy(), see below.

Maybe I did not describe the data race clearly enough.

          CPU 0                                          CPU1
            .                                              .
unsigned int flags = READ_ONCE(pool->p.flags);             .
            .                                              .
            .                                     page_pool_destroy()
            .
atomic_inc_return(&pool->pages_state_release_cnt)
            .
            .
            .
if (flags & PP_FLAG_SHUTDOWN)
    page_pool_free_attempt();

The above data race may leak a pp instance:
CPU0 is releasing the last page of a pp but does not see PP_FLAG_SHUTDOWN
set in pool->p.flags, because page_pool_destroy() is called after CPU0 has
read pool->p.flags, so page_pool_free_attempt() is not called to free the pp.

CPU1, calling page_pool_destroy(), does not free the pp either, as CPU0 has
not yet done the atomic_inc_return() on pool->pages_state_release_cnt.

Or did I miss something obvious here?

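To make the interleaving above concrete, here is a minimal user-space replay of it, run on a single thread so that the ordering is deterministic. It is only a sketch under stated assumptions: struct toy_pool, TOY_FLAG_SHUTDOWN and all toy_*() helpers are hypothetical stand-ins, not kernel code, and toy_destroy() is deliberately simplified (it only sets the flag and makes one free attempt, leaving out the fake in-flight count from the patch).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_FLAG_SHUTDOWN (1u << 0)	/* stand-in for PP_FLAG_SHUTDOWN */

/* Toy model of the fields under discussion; not struct page_pool. */
struct toy_pool {
	unsigned int flags;
	uint32_t hold_cnt;	/* pages handed out */
	uint32_t release_cnt;	/* pages returned */
	bool freed;
};

/* Free the pool only once every held page has been released. */
static void toy_free_attempt(struct toy_pool *pool)
{
	if (pool->hold_cnt == pool->release_cnt)
		pool->freed = true;
}

/* CPU0, first half: snapshot the flags before touching the counter. */
static unsigned int toy_release_read_flags(const struct toy_pool *pool)
{
	return pool->flags;
}

/* CPU0, second half: count the release, then act on the old snapshot. */
static void toy_release_finish(struct toy_pool *pool, unsigned int flags)
{
	pool->release_cnt++;
	if (flags & TOY_FLAG_SHUTDOWN)
		toy_free_attempt(pool);
}

/* CPU1: simplified destroy, sets the flag and tries to free once. */
static void toy_destroy(struct toy_pool *pool)
{
	pool->flags |= TOY_FLAG_SHUTDOWN;
	toy_free_attempt(pool);
}

/*
 * Replay one page being released while the pool is destroyed.
 * Returns true if the pool is left unfreed although all pages are back.
 */
bool toy_replay_leaks(bool destroy_in_window)
{
	struct toy_pool pool = { .flags = 0, .hold_cnt = 1, .release_cnt = 0 };
	unsigned int snapshot;

	if (!destroy_in_window)
		toy_destroy(&pool);			/* CPU1 runs first */
	snapshot = toy_release_read_flags(&pool);	/* CPU0 */
	if (destroy_in_window)
		toy_destroy(&pool);			/* CPU1 inside the window */
	toy_release_finish(&pool, snapshot);		/* CPU0 */

	assert(pool.hold_cnt == pool.release_cnt);
	return !pool.freed;
}
```

Calling toy_replay_leaks(true) replays the diagram: neither side frees the pool. Calling toy_replay_leaks(false) runs the destroy before the flags are read, and the last release then frees the pool as intended.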
>
>> It seems it is very hard to avoid this kind of corner case when using both
>> flags & PP_FLAG_SHUTDOWN and release_cnt/hold_cnt checking to decide if pp
>> instance can be freed.
>> Can we use something like biased reference counting, which used by frag support
>> in page pool? So that we only need to check only one variable and avoid cache
>> bouncing as much as possible.
>>
>
> See below, I believe we are doing an equivalent refcnt bias trick, that
> solves these corner cases in page_pool_destroy().
> In short: hold_cnt is increased, prior to setting PP_FLAG_SHUTDOWN.
> Thus, if this code READ_ONCE flags without PP_FLAG_SHUTDOWN, we know it
> will not be the last to release pool->pages_state_release_cnt.

It is not exactly the kind of refcnt bias trick I had in mind. I was thinking
about using pool->pages_state_hold_cnt as the refcnt bias and merging it into
pool->pages_state_release_cnt as needed. Maybe I need to try implementing
that to see if it turns out to be what I want.

> Below: Perhaps, we should add a RCU grace period to make absolutely
> sure, that this code completes before page_pool_destroy() call completes.
>
>
>>> +    if (flags & PP_FLAG_SHUTDOWN)
>>> +        hold_cnt = pp_read_hold_cnt(pool);
>>> +
>
> I would like to avoid above code, and I'm considering using call_rcu(),
> which I think will resolve the race[0] this code deals with.
> As I explained here[0], this code deals with another kind of race.

Yes, I understand that.
I even went to check if the below tracepoint trace_page_pool_state_release()
was causing a use-after-free problem as it is passing 'pool' :)

>
> [0] https://lore.kernel.org/all/f671f5da-d9bc-a559-2120-10c3491e6f6d@xxxxxxxxxx/
>
>>> +    release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
>>> +    trace_page_pool_state_release(pool, page, release_cnt);
>>> +
>>> +    /* In shutdown phase, last page will free pool instance */
>>> +    if (flags & PP_FLAG_SHUTDOWN)
>>> +        page_pool_free_attempt(pool, hold_cnt, release_cnt);
>>>  }
>>>  EXPORT_SYMBOL(page_pool_release_page);
>>>
>>
>> ...
>>
>>>  void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
>>> @@ -856,6 +884,10 @@ EXPORT_SYMBOL(page_pool_unlink_napi);
>>>  void page_pool_destroy(struct page_pool *pool)
>>>  {
>>> +    unsigned int flags;
>>> +    u32 release_cnt;
>>> +    u32 hold_cnt;
>>> +
>>>      if (!pool)
>>>          return;
>>> @@ -868,11 +900,39 @@ void page_pool_destroy(struct page_pool *pool)
>>>      if (!page_pool_release(pool))
>>>          return;
>>> -    pool->defer_start = jiffies;
>>> -    pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
>>> +    /* PP have pages inflight, thus cannot immediately release memory.
>>> +     * Enter into shutdown phase, depending on remaining in-flight PP
>>> +     * pages to trigger shutdown process (on concurrent CPUs) and last
>>> +     * page will free pool instance.
>>> +     *
>>> +     * There exist two race conditions here, we need to take into
>>> +     * account in the following code.
>>> +     *
>>> +     * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
>>> +     *    pages into the ptr_ring. Thus, it missed triggering shutdown
>>> +     *    process, which can then be stalled forever.
>>> +     *
>>> +     * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
>>> +     *    page, which triggered shutdown process and freed pool
>>> +     *    instance. Thus, its not safe to dereference *pool afterwards.
>>> +     *
>>> +     * Handling races by holding a fake in-flight count, via
>>> +     * artificially bumping pages_state_hold_cnt, which assures pool
>>> +     * isn't freed under us. For race(1) its safe to recheck ptr_ring
>>> +     * (it will not free pool). Race(2) cannot happen, and we can
>>> +     * release fake in-flight count as last step.
>>> +     */
>>> +    hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
>>> +    smp_store_release(&pool->pages_state_hold_cnt, hold_cnt);
>>
>> I assume the smp_store_release() is used to ensure the correct order
>> between the above store operations?
>> There is data dependency between those two store operations, do we
>> really need the smp_store_release() here?
>>
>>> +    barrier();
>>> +    flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
>>
>> Do we need a stronger barrier like smp_rmb() to prevent cpu from
>> executing "flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN"
>> before "smp_store_release(&pool->pages_state_hold_cnt, hold_cnt)"
>> even if there is a smp_store_release() barrier here?
>>
> I do see your point and how it is related to your above comment for
> page_pool_release_page().
>
> I think we need to replace barrier() with synchronize_rcu().
> Meaning we add a RCU grace period to "wait" for above code (in
> page_pool_release_page) that read the old flags value to complete.
>
>
>>> +    smp_store_release(&pool->p.flags, flags);
>
> When doing a synchronize_rcu(), I assume this smp_store_release() is
> overkill, right?
> Will a WRITE_ONCE() be sufficient?
>
> Hmm, the synchronize_rcu(), shouldn't that be *after* storing the flags?

Yes. As I understand it, we probably do not need any of those *_ONCE() and
barrier() calls when using RCU.

But I am not yet convinced that we need to go for RCU.
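Coming back to the biased refcount idea mentioned earlier in this mail, here is a rough user-space sketch of one possible reading of it. It is untested as a design and not taken from any patch: struct bias_pool and the bias_pool_*() helpers are hypothetical names, C11 atomics stand in for the kernel's atomic_t API, and a signed 64-bit counter is assumed so that wraparound can be ignored. The hold count stays a plain variable on the allocation side and is merged into the release counter once, at destroy time, so a single atomic result decides who frees the pool.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy pool; field names echo the thread but this is not struct page_pool. */
struct bias_pool {
	int64_t hold_cnt;			/* allocation side only, not atomic */
	atomic_int_fast64_t release_cnt;	/* releases, minus bias after destroy */
	bool freed;
};

void bias_pool_init(struct bias_pool *pool)
{
	pool->hold_cnt = 0;
	atomic_init(&pool->release_cnt, 0);
	pool->freed = false;
}

/* Allocation path: plain increment, the shared counter is not touched. */
void bias_pool_hold(struct bias_pool *pool)
{
	pool->hold_cnt++;
}

/* Release path: the result of one atomic op says if this was the last page. */
void bias_pool_release(struct bias_pool *pool)
{
	/* fetch_add returns the old value, so old + 1 is the new value. */
	int_fast64_t old = atomic_fetch_add(&pool->release_cnt, 1);

	/* Before destroy the new value is >= 1, so this cannot trigger. */
	if (old + 1 == 0)
		pool->freed = true;
}

/* Destroy: merge the bias, the counter becomes -(pages still in flight). */
void bias_pool_destroy(struct bias_pool *pool)
{
	int_fast64_t old = atomic_fetch_sub(&pool->release_cnt, pool->hold_cnt);

	if (old - pool->hold_cnt == 0)
		pool->freed = true;
}

/* Two pages held, one released before destroy and one after it. */
bool bias_demo_last_release_frees(void)
{
	struct bias_pool pool;

	bias_pool_init(&pool);
	bias_pool_hold(&pool);
	bias_pool_hold(&pool);
	bias_pool_release(&pool);
	bias_pool_destroy(&pool);	/* one page in flight: must not free */
	if (pool.freed)
		return false;
	bias_pool_release(&pool);	/* last page: counter reaches zero */
	return pool.freed;
}
```

In this sketch the release path never reads a separate flag, so there is no window between a flag read and the counter update. Whether that carries over to the real page pool (u32 counters, ptr_ring rechecking, not touching the pool after it is freed) is exactly what would need to be tried out.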