On 2024/7/22 1:34, Alexander Duyck wrote: > On Fri, Jul 19, 2024 at 2:36 AM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote: >> >> Basing on the lib/objpool.c, change it to something like a >> ptrpool, so that we can utilize that to test the correctness >> and performance of the page_frag. >> >> The testing is done by ensuring that the fragment allocated >> from a frag_frag_cache instance is pushed into a ptrpool >> instance in a kthread binded to a specified cpu, and a kthread >> binded to a specified cpu will pop the fragment from the >> ptrpool and free the fragment. >> >> We may refactor out the common part between objpool and ptrpool >> if this ptrpool thing turns out to be helpful for other place. >> >> CC: Alexander Duyck <alexander.duyck@xxxxxxxxx> >> Signed-off-by: Yunsheng Lin <linyunsheng@xxxxxxxxxx> >> --- >> mm/Kconfig.debug | 8 + >> mm/Makefile | 1 + >> mm/page_frag_test.c | 393 ++++++++++++++++++++++++++++++++++++++++++++ >> 3 files changed, 402 insertions(+) >> create mode 100644 mm/page_frag_test.c > > I might have missed it somewhere. Is there any reason why this isn't > in the selftests/mm/ directory? Seems like that would be a better fit > for this. > >> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug >> index afc72fde0f03..1ebcd45f47d4 100644 >> --- a/mm/Kconfig.debug >> +++ b/mm/Kconfig.debug >> @@ -142,6 +142,14 @@ config DEBUG_PAGE_REF >> kernel code. However the runtime performance overhead is virtually >> nil until the tracepoints are actually enabled. >> >> +config DEBUG_PAGE_FRAG_TEST > > This isn't a "DEBUG" feature. This is a test feature. > >> + tristate "Test module for page_frag" >> + default n >> + depends on m && DEBUG_KERNEL > > I am not sure it is valid to have a tristate depend on being built as a module. Perhaps I was copying the wrong pattern from TEST_OBJPOOL in lib/Kconfig.debug. Perhaps mm/dmapool_test.c and DMAPOOL_TEST* *was more appropriate pattern for test module for page_frag? > > I think if you can set it up as a selftest it will have broader use as > you could compile it against any target kernel going forward and add > it as a module rather than having to build it as a part of a debug > kernel. It seems tools/testing/selftests/mm/* are all about userspace testing tool, and testing kernel module seems to be in the same directory with the code to be tested? > >> + help >> + This builds the "page_frag_test" module that is used to test the >> + correctness and performance of page_frag's implementation. >> + >> config DEBUG_RODATA_TEST >> bool "Testcase for the marking rodata read-only" ... >> + >> + /* >> + * here we allocate percpu-slot & objs together in a single >> + * allocation to make it more compact, taking advantage of >> + * warm caches and TLB hits. in default vmalloc is used to >> + * reduce the pressure of kernel slab system. as we know, >> + * minimal size of vmalloc is one page since vmalloc would >> + * always align the requested size to page size >> + */ >> + if (gfp & GFP_ATOMIC) >> + slot = kmalloc_node(size, gfp, cpu_to_node(i)); >> + else >> + slot = __vmalloc_node(size, sizeof(void *), gfp, >> + cpu_to_node(i), >> + __builtin_return_address(0)); > > When would anyone ever call this with atomic? This is just for your > test isn't it? > >> + if (!slot) >> + return -ENOMEM; >> + >> + memset(slot, 0, size); >> + pool->cpu_slots[i] = slot; >> + >> + objpool_init_percpu_slot(pool, slot); >> + } >> + >> + return 0; >> +} ... >> +/* release whole objpool forcely */ >> +static void objpool_free(struct objpool_head *pool) >> +{ >> + if (!pool->cpu_slots) >> + return; >> + >> + /* release percpu slots */ >> + objpool_fini_percpu_slots(pool); >> +} >> + > > Why add all this extra objpool overhead? This seems like overkill for > what should be a simple test. Seems like you should just need a simple > array located on one of your CPUs. I'm not sure what is with all the > extra overhead being added here. As mentioned in the commit log: "We may refactor out the common part between objpool and ptrpool if this ptrpool thing turns out to be helpful for other place." The next thing I am trying to do is to use ptrpool to optimization the pcp for mm subsystem. so I would rather not tailor the ptrpool for page_frag_test, and it doesn't seem to affect the testing that much. > >> +static struct objpool_head ptr_pool; >> +static int nr_objs = 512; >> +static atomic_t nthreads; >> +static struct completion wait; >> +static struct page_frag_cache test_frag; >> + >> +static int nr_test = 5120000; >> +module_param(nr_test, int, 0); >> +MODULE_PARM_DESC(nr_test, "number of iterations to test"); >> + >> +static bool test_align; >> +module_param(test_align, bool, 0); >> +MODULE_PARM_DESC(test_align, "use align API for testing"); >> + >> +static int test_alloc_len = 2048; >> +module_param(test_alloc_len, int, 0); >> +MODULE_PARM_DESC(test_alloc_len, "alloc len for testing"); >> + >> +static int test_push_cpu; >> +module_param(test_push_cpu, int, 0); >> +MODULE_PARM_DESC(test_push_cpu, "test cpu for pushing fragment"); >> + >> +static int test_pop_cpu; >> +module_param(test_pop_cpu, int, 0); >> +MODULE_PARM_DESC(test_pop_cpu, "test cpu for popping fragment"); >> + >> +static int page_frag_pop_thread(void *arg) >> +{ >> + struct objpool_head *pool = arg; >> + int nr = nr_test; >> + >> + pr_info("page_frag pop test thread begins on cpu %d\n", >> + smp_processor_id()); >> + >> + while (nr > 0) { >> + void *obj = objpool_pop(pool); >> + >> + if (obj) { >> + nr--; >> + page_frag_free(obj); >> + } else { >> + cond_resched(); >> + } >> + } >> + >> + if (atomic_dec_and_test(&nthreads)) >> + complete(&wait); >> + >> + pr_info("page_frag pop test thread exits on cpu %d\n", >> + smp_processor_id()); >> + >> + return 0; >> +} >> + >> +static int page_frag_push_thread(void *arg) >> +{ >> + struct objpool_head *pool = arg; >> + int nr = nr_test; >> + >> + pr_info("page_frag push test thread begins on cpu %d\n", >> + smp_processor_id()); >> + >> + while (nr > 0) { >> + void *va; >> + int ret; >> + >> + if (test_align) { >> + va = page_frag_alloc_align(&test_frag, test_alloc_len, >> + GFP_KERNEL, SMP_CACHE_BYTES); >> + >> + WARN_ONCE((unsigned long)va & (SMP_CACHE_BYTES - 1), >> + "unaligned va returned\n"); >> + } else { >> + va = page_frag_alloc(&test_frag, test_alloc_len, GFP_KERNEL); >> + } >> + >> + if (!va) >> + continue; >> + >> + ret = objpool_push(va, pool); >> + if (ret) { >> + page_frag_free(va); >> + cond_resched(); >> + } else { >> + nr--; >> + } >> + } >> + >> + pr_info("page_frag push test thread exits on cpu %d\n", >> + smp_processor_id()); >> + >> + if (atomic_dec_and_test(&nthreads)) >> + complete(&wait); >> + >> + return 0; >> +} >> + > > So looking over these functions they seem to overlook how the network > stack works in many cases. One of the main motivations for the page > frags approach is page recycling. For example with GRO enabled the > headers allocated to record the frags might be freed for all but the > first. As such you can end up with 17 fragments being allocated, and > 16 freed within the same thread as NAPI will just be recycling the > buffers. > > With this setup it doesn't seem very likely to be triggered since you > are operating in two threads. One test you might want to look at > adding is a test where you are allocating and freeing in the same > thread at a fairly constant rate to test against the "ideal" scenario. I am not sure if the above is still the "ideal" scenario, as you mentioned that most drivers are turning to use page_pool for rx, the page frag is really mostly for tx or skb->data for rx. >