On 2024/7/16 20:58, Yunsheng Lin wrote:

...

> 
> Option 1 assuming nc->remaining as a negative value does not seems to
> make it a more maintainable solution than option 2. How about something
> like below if using a negative value to enable some optimization like LEA
> does not have a noticeable performance difference?

Suppose the below is option 3. Option 3 seems to have better performance
than option 2, and option 2 better performance than option 1, using the
ko introduced in patch 1.

Option 1:
 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=5120000' (500 runs):

         17.757768      task-clock (msec)         #    0.001 CPUs utilized            ( +-  0.17% )
                 5      context-switches          #    0.288 K/sec                    ( +-  0.28% )
                 0      cpu-migrations            #    0.007 K/sec                    ( +- 12.36% )
                82      page-faults               #    0.005 M/sec                    ( +-  0.06% )
          46128280      cycles                    #    2.598 GHz                      ( +-  0.17% )
          60938595      instructions              #    1.32  insn per cycle           ( +-  0.02% )
          14783794      branches                  #  832.525 M/sec                    ( +-  0.02% )
             20393      branch-misses             #    0.14% of all branches          ( +-  0.13% )

      24.556644680 seconds time elapsed                                          ( +-  0.07% )

Option 2:
 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=5120000' (500 runs):

         18.443508      task-clock (msec)         #    0.001 CPUs utilized            ( +-  0.61% )
                 6      context-switches          #    0.342 K/sec                    ( +-  0.57% )
                 0      cpu-migrations            #    0.025 K/sec                    ( +-  4.89% )
                82      page-faults               #    0.004 M/sec                    ( +-  0.06% )
          47901207      cycles                    #    2.597 GHz                      ( +-  0.61% )
          60985019      instructions              #    1.27  insn per cycle           ( +-  0.05% )
          14787177      branches                  #  801.755 M/sec                    ( +-  0.05% )
             21099      branch-misses             #    0.14% of all branches          ( +-  0.14% )

      24.413183804 seconds time elapsed                                          ( +-  0.06% )

Option 3:
 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=5120000' (500 runs):

         17.847031      task-clock (msec)         #    0.001 CPUs utilized            ( +-  0.23% )
                 5      context-switches          #    0.305 K/sec                    ( +-  0.55% )
                 0      cpu-migrations            #    0.017 K/sec                    ( +-  6.86% )
                82      page-faults               #    0.005 M/sec                    ( +-  0.06% )
          46355974      cycles                    #    2.597 GHz                      ( +-  0.23% )
          60848779      instructions              #    1.31  insn per cycle           ( +-  0.03% )
          14758941      branches                  #  826.969 M/sec                    ( +-  0.03% )
             20728      branch-misses             #    0.14% of all branches          ( +-  0.15% )

      24.376161069 seconds time elapsed                                          ( +-  0.06% )

> 
> struct page_frag_cache {
> 	/* encoded_va consists of the virtual address, pfmemalloc bit and order
> 	 * of a page.
> 	 */
> 	unsigned long encoded_va;
> 
> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32)
> 	__u16 remaining;
> 	__u16 pagecnt_bias;
> #else
> 	__u32 remaining;
> 	__u32 pagecnt_bias;
> #endif
> };
> 
> void *__page_frag_alloc_va_align(struct page_frag_cache *nc,
> 				 unsigned int fragsz, gfp_t gfp_mask,
> 				 unsigned int align_mask)
> {
> 	unsigned int size = page_frag_cache_page_size(nc->encoded_va);
> 	unsigned int remaining;
> 
> 	remaining = nc->remaining & align_mask;
> 	if (unlikely(remaining < fragsz)) {
> 		if (unlikely(fragsz > PAGE_SIZE)) {
> 			/*
> 			 * The caller is trying to allocate a fragment
> 			 * with fragsz > PAGE_SIZE but the cache isn't big
> 			 * enough to satisfy the request, this may
> 			 * happen in low memory conditions.
> 			 * We don't release the cache page because
> 			 * it could make memory pressure worse
> 			 * so we simply return NULL here.
> 			 */
> 			return NULL;
> 		}
> 
> 		if (!__page_frag_cache_refill(nc, gfp_mask))
> 			return NULL;
> 
> 		size = page_frag_cache_page_size(nc->encoded_va);
> 		remaining = size;
> 	}
> 
> 	nc->pagecnt_bias--;
> 	nc->remaining = remaining - fragsz;
> 
> 	return encoded_page_address(nc->encoded_va) + (size - remaining);
> }
> 