Re: [PATCH net-next v7 3/8] page_pool: fix IOMMU crash when driver has already unbound

Jesper Dangaard Brouer <hawk@xxxxxxxxxx> · Thu, 16 Jan 2025 17:09:50 +0100

On 16/01/2025 13.52, Yunsheng Lin wrote:
On 2025/1/16 0:29, Jesper Dangaard Brouer wrote:

On 10/01/2025 14.06, Yunsheng Lin wrote:
[...]
In order not to call DMA APIs to do DMA unmmapping after driver
has already unbound and stall the unloading of the networking
driver, use some pre-allocated item blocks to record inflight
pages including the ones which are handed over to network stack,
so the page_pool can do the DMA unmmapping for those pages when
page_pool_destroy() is called. As the pre-allocated item blocks
need to be large enough to avoid performance degradation, add a
'item_fast_empty' stat to indicate the unavailability of the
pre-allocated item blocks.

...

+
+static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
+                             netmem_ref netmem,
+                             bool destroyed)
+{
+    struct page_pool_item *item;
+    dma_addr_t dma;
+
+    if (!pool->dma_map)
+        /* Always account for inflight pages, even if we didn't
+         * map them
+         */
+        return;
+
+    dma = page_pool_get_dma_addr_netmem(netmem);
+    item = netmem_get_pp_item(netmem);
+
+    /* dma unmapping is always needed when page_pool_destory() is not called
+     * yet.
+     */
+    DEBUG_NET_WARN_ON_ONCE(!destroyed && !page_pool_item_is_mapped(item));
+    if (unlikely(destroyed && !page_pool_item_is_mapped(item)))
+        return;
+
+    /* When page is unmapped, it cannot be returned to our pool */
+    dma_unmap_page_attrs(pool->p.dev, dma,
+                 PAGE_SIZE << pool->p.order, pool->p.dma_dir,
+                 DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
+    page_pool_set_dma_addr_netmem(netmem, 0);
+    page_pool_item_clear_mapped(item);
+}
+

I have a hard time reading/reviewing/maintaining below code, without
some design description.  This code needs more comments on what is the
*intend* and design it's trying to achieve.

 From patch description the only hint I have is:
  "use some pre-allocated item blocks to record inflight pages"

E.g. Why is it needed/smart to hijack the page->pp pointer?

Mainly because there is no space available for keeping tracking of inflight
pages, using page->pp can only find the page_pool owning the page, but page_pool
is not able to keep track of the inflight page when the page is handled by
networking stack.

By using page_pool_item as below, the state is used to tell if a specific
item is being used/dma mapped or not by scanning all the item blocks in
pool->item_blocks. If a specific item is used by a page, then 'pp_netmem'
will point to that page so that dma unmapping can be done for that page
when page_pool_destroy() is called, otherwise free items sit in the
pool->hold_items or pool->release_items by using 'lentry':

struct page_pool_item {
	unsigned long state;

	union {
		netmem_ref pp_netmem;
		struct llist_node lentry;
	};
};

pahole  -C page_pool_item vmlinux
struct page_pool_item {
	/* An 'encoded_next' is a pointer to next item, lower 2 bits is used to
	 * indicate the state of current item.
	 */	
	long unsigned int          encoded_next;     /*     0     8 */
	union {
		netmem_ref         pp_netmem;        /*     8     8 */
		struct llist_node  lentry;           /*     8     8 */
	};                                           /*     8     8 */

	/* size: 16, cachelines: 1, members: 2 */
	/* last cacheline: 16 bytes */
};

When a page is added to the page_pool, a item is deleted from pool->hold_items
or pool->release_items and set the 'pp_netmem' pointing to that page and set
'state' accordingly in order to keep track of that page.

When a page is deleted from the page_pool, it is able to tell which page_pool
this page belong to by using the below function, and after clearing the 'state',
the item is added back to pool->release_items so that the item is reused for new
pages.

To understand below, I'm listing struct page_pool_item_block for other
reviewers:

pahole  -C page_pool_item_block vmlinux
struct page_pool_item_block {
	struct page_pool *         pp;               /*     0     8 */
	struct list_head           list;             /*     8    16 */
	unsigned int               flags;            /*    24     4 */
	refcount_t                 ref;              /*    28     4 */
	struct page_pool_item      items[];          /*    32     0 */

	/* size: 32, cachelines: 1, members: 5 */
	/* last cacheline: 32 bytes */
};

static inline struct page_pool_item_block *
page_pool_item_to_block(struct page_pool_item *item)
{
	return (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK);

This trick requires some comments explaining what is going on!
Please correct me if I'm wrong: Here you a masking off the lower bits of
the pointer to page_pool_item *item, as you know that a struct
page_pool_item_block is stored in the top of a struct page.  This trick
is like a "container_of" for going from page_pool_item to
page_pool_item_block, right?

I do notice that you have a comment above struct page_pool_item_block
(that says "item_block is always PAGE_SIZE"), which is nice, but to be
more explicit/clear:
 I want a big comment block (placed above the main code here) that
explains the design and intention behind this newly invented
"item-block" scheme, like e.g. the connection between
page_pool_item_block and page_pool_item. Like the advantage/trick that
allows page->pp pointer to be an "item" and be mapped back to a "block"
to find the page_pool object it belongs to.  Don't write *what* the code
does, but write about the intended purpose and design reasons behind the
code.

}

  static inline struct page_pool *page_pool_get_pp(struct page *page)
  {
       return page_pool_item_to_block(page->pp_item)->pp;
  }

+static void __page_pool_item_init(struct page_pool *pool, struct page *page)
+{

Function name is confusing.  First I though this was init'ing a single
item, but looking at the code it is iterating over ITEMS_PER_PAGE.

Maybe it should be called page_pool_item_block_init ?

The __page_pool_item_init() is added to make the below
page_pool_item_init() function more readable or maintainable, changing
it to page_pool_item_block_init doesn't seems consistent?

You (of-cause) also have to rename the other function, I though that was
implicitly understood.

BUT does my suggested rename make sense?  What I'm seeing is that all
the *items* in the "block" is getting inited. But we are also setting up
the "block" (e.g.  "block->pp=pool").

+    struct page_pool_item_block *block = page_address(page);
+    struct page_pool_item *items = block->items;
+    unsigned int i;
+
+    list_add(&block->list, &pool->item_blocks);
+    block->pp = pool;
+
+    for (i = 0; i < ITEMS_PER_PAGE; i++) {
+        page_pool_item_init_state(&items[i]);
+        __llist_add(&items[i].lentry, &pool->hold_items);
+    }
+}
+
+static int page_pool_item_init(struct page_pool *pool)
+{
+#define PAGE_POOL_MIN_INFLIGHT_ITEMS        512
+    struct page_pool_item_block *block;
+    int item_cnt;
+
+    INIT_LIST_HEAD(&pool->item_blocks);
+    init_llist_head(&pool->hold_items);
+    init_llist_head(&pool->release_items);
+
+    item_cnt = pool->p.pool_size * 2 + PP_ALLOC_CACHE_SIZE +
+        PAGE_POOL_MIN_INFLIGHT_ITEMS;
+    while (item_cnt > 0) {
+        struct page *page;
+
+        page = alloc_pages_node(pool->p.nid, GFP_KERNEL, 0);
+        if (!page)
+            goto err;
+
+        __page_pool_item_init(pool, page);
+        item_cnt -= ITEMS_PER_PAGE;
+    }
+
+    return 0;
+err:
+    list_for_each_entry(block, &pool->item_blocks, list)
+        put_page(virt_to_page(block));

This one also have used-after-free problem as the page_pool_item_uninit
in the previous version.

+
+    return -ENOMEM;
+}
+