On 06/08/2024 10:51 am, Jesper Dangaard Brouer wrote:
[...]
The iommu_group is released whether the page_pool is still using it or not, so once page_pool_return_page() is called (why does this occur when the device is reloaded and packets are transmitted?), this crash will happen.
I tried the following patch, but it doesn't work :(
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index f4444b4e39e6..d03a87407ca8 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -21,6 +21,7 @@
 #include <linux/poison.h>
 #include <linux/ethtool.h>
 #include <linux/netdevice.h>
+#include <linux/iommu.h>
 #include <trace/events/page_pool.h>

@@ -306,6 +307,9 @@ page_pool_create_percpu(const struct page_pool_params *params, int cpuid)
 	if (err)
 		goto err_uninit;

+	if (pool->dma_map)
+		iommu_group_get(pool->p.dev);
+
 	return pool;

 err_uninit:

@@ -974,8 +978,11 @@ static int page_pool_release(struct page_pool *pool)
 	page_pool_scrub(pool);
 	inflight = page_pool_inflight(pool, true);

-	if (!inflight)
+	if (!inflight) {
 		__page_pool_destroy(pool);
+		if (pool->dma_map)
+			iommu_group_put(pool->p.dev->iommu_group);
+	}

 	return inflight;
 }

The idea of taking a refcount on the IOMMU side, to avoid dev->iommu_group getting freed, makes sense to me. The question is whether iommu_group_get() and iommu_group_put() are the correct APIs to use in this case?
The page_pool already has a system/workqueue that waits for in-flight "packet" pages, and it calls the struct device APIs get_device() and put_device(). Why doesn't the patch add the code together with those struct device API calls? Like this:
$ git diff
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 2abe6e919224..686ff1d31aff 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -265,8 +265,10 @@ static int page_pool_init(struct page_pool *pool,
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
 	refcount_set(&pool->user_cnt, 1);

-	if (pool->dma_map)
+	if (pool->dma_map) {
+		iommu_group_get(pool->p.dev);
 		get_device(pool->p.dev);
+	}

 	return 0;
 }
@@ -275,8 +277,10 @@ static void page_pool_uninit(struct page_pool *pool)
 {
 	ptr_ring_cleanup(&pool->ring, NULL);

-	if (pool->dma_map)
+	if (pool->dma_map) {
+		iommu_group_put(pool->p.dev->iommu_group);
 		put_device(pool->p.dev);
+	}
--Jesper
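For what it's worth, both patches above share two fragile spots: they discard the pointer that iommu_group_get() returns and later pass pool->p.dev->iommu_group to iommu_group_put(), which may already be NULL by the time the pool is torn down; and the page_pool_release() hunk dereferences pool after __page_pool_destroy() has freed it. A minimal sketch of the pointer-holding alternative, assuming a hypothetical 'struct iommu_group *iommu_group' member added to struct page_pool (not in the current structure; helper names are also hypothetical):

#include <linux/iommu.h>

/* Sketch only, not in-tree code: pin the group by keeping the pointer
 * that iommu_group_get() hands back, so teardown does not have to
 * re-read dev->iommu_group (which may be gone by then).
 */
static void page_pool_pin_iommu_group(struct page_pool *pool)
{
	/* Takes a reference; returns NULL if the device is not behind
	 * an IOMMU, so the teardown side must tolerate NULL.
	 */
	pool->iommu_group = iommu_group_get(pool->p.dev);
}

static void page_pool_unpin_iommu_group(struct page_pool *pool)
{
	/* Drop exactly the reference we took earlier */
	iommu_group_put(pool->iommu_group);	/* NULL-safe */
	pool->iommu_group = NULL;
}

Even then, this only keeps the iommu_group object itself alive; as noted at the end of the thread, it does not keep the DMA translation state for still-mapped pages valid.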
The page_pool bumps the refcnt via get_device() + put_device() on the DMA 'struct device', to avoid it going away, but I guess there is also some IOMMU code that we need to make sure doesn't go away (until all inflight pages are returned)???
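As a side note on what that existing refcount actually guarantees (standard driver-core semantics, not something established in this thread; the helper names below are hypothetical): get_device()/put_device() only pin the struct device object itself, they do not keep the driver bound or the IOMMU state alive. A tiny sketch of the distinction:

#include <linux/device.h>

/* Sketch: what the existing page_pool refcount pins, and what it
 * does not.
 */
static void pool_pin_dev(struct page_pool *pool)
{
	/* The memory of *pool->p.dev stays valid after this... */
	get_device(pool->p.dev);
	/* ...but the driver can still be unbound, and state hanging
	 * off the device (dev->iommu_group, dev->dma_range_map) can
	 * still be torn down while we hold this reference.
	 */
}

static void pool_unpin_dev(struct page_pool *pool)
{
	put_device(pool->p.dev);
}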
[ 4407.212119] process_one_work+0x164/0x3e0
[ 4407.216116] worker_thread+0x310/0x420
[ 4407.219851] kthread+0x120/0x130
[ 4407.223066] ret_from_fork+0x10/0x20
[ 4407.226630] Code: ffffc318 aa1e03e9 d503201f f9416c00 (f9405400)
[ 4407.232697] ---[ end trace 0000000000000000 ]---
The hns3 driver uses page_pool like this, calling it just once when the driver initializes:
static void hns3_alloc_page_pool(struct hns3_enet_ring *ring)
{
	struct page_pool_params pp_params = {
		.flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG |
			 PP_FLAG_DMA_SYNC_DEV,
		.order = hns3_page_order(ring),
		.pool_size = ring->desc_num * hns3_buf_size(ring) /
				(PAGE_SIZE << hns3_page_order(ring)),
		.nid = dev_to_node(ring_to_dev(ring)),
		.dev = ring_to_dev(ring),
		.dma_dir = DMA_FROM_DEVICE,
		.offset = 0,
		.max_len = PAGE_SIZE << hns3_page_order(ring),
	};

	ring->page_pool = page_pool_create(&pp_params);
	if (IS_ERR(ring->page_pool)) {
		dev_warn(ring_to_dev(ring), "page pool creation failed: %ld\n",
			 PTR_ERR(ring->page_pool));
		ring->page_pool = NULL;
	}
}
And it calls page_pool_destroy(ring->page_pool) when the driver is uninitialized.
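For completeness, the teardown side would look roughly like the sketch below (the helper name is hypothetical; the point is only that page_pool_destroy() starts an asynchronous shutdown rather than freeing the pool immediately):

/* Hypothetical helper mirroring hns3_alloc_page_pool() above */
static void hns3_free_page_pool(struct hns3_enet_ring *ring)
{
	if (!ring->page_pool)
		return;

	/* Drops the driver's reference; the pool itself is only freed
	 * once all in-flight pages have been returned, possibly much
	 * later via the page_pool release workqueue.
	 */
	page_pool_destroy(ring->page_pool);
	ring->page_pool = NULL;
}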
We use two devices, with the net ports connected directly, and the steps of the test case are as below:
1. enable a vf of '7d:00.0': echo 1 > /sys/class/net/eno1/device/sriov_numvfs
2. use iperf to produce some flows (the problem happens on the side which runs 'iperf -s')
3. use ifconfig down/up on the vf
4. kill iperf
5. disable the vf: echo 0 > /sys/class/net/eno1/device/sriov_numvfs
6. run 1~5 with another port, bd:00.0
7. repeat 1~6
And when running this test case, we found another related message (I replaced pr_warn() with dev_warn()):

pci 0000:7d:01.0: page_pool_release_retry() stalled pool shutdown: id 949, 98 inflight 1449 sec

Even when we stop the traffic, stop the test case, and disable the vf, this message is still being printed.
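That is consistent with how the shutdown path works: the pool cannot be freed while pages are still counted as in-flight, so a delayed work item keeps retrying and re-printing the warning. Roughly, paraphrased from net/core/page_pool.c (field names, constants, and the exact message format may differ between kernel versions):

/* Paraphrased sketch of the shutdown retry logic, not verbatim kernel
 * source: while in-flight pages remain, the delayed work re-arms
 * itself and periodically emits the "stalled pool shutdown" warning
 * seen above.
 */
static void page_pool_release_retry(struct work_struct *wq)
{
	struct delayed_work *dwq = to_delayed_work(wq);
	struct page_pool *pool = container_of(dwq, typeof(*pool), release_dw);
	int inflight;

	inflight = page_pool_release(pool);
	if (!inflight)
		return;		/* all pages returned; pool freed */

	/* Periodically warn while pages remain in flight */
	if (time_after_eq(jiffies, pool->defer_warn)) {
		int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;

		pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n",
			__func__, pool->user.id, inflight, sec);
		pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
	}

	/* Still pages in flight: try again later */
	schedule_delayed_work(&pool->release_dw, DEFER_TIME);
}

So if some pages are never returned, the work item re-arms forever, which matches the message continuing to print after the traffic has stopped.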
We must run the test case for about two hours to reproduce the problem.
Is there any advice on how to solve or debug the problem?
Now do the one where there is no IOMMU, and dma_unmap_page() corrupts random unrelated memory because the mapped DMA address was relative to dev->dma_range_map, which has since become NULL.

In other words, no, hacking one particular IOMMU API symptom does not solve the fundamental lifecycle problem that you have here.

Thanks,
Robin.