On 2024/7/22 23:32, Alexander Duyck wrote:
> On Mon, Jul 22, 2024 at 5:55 AM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
>>
>> On 2024/7/22 7:40, Alexander H Duyck wrote:
>>> On Fri, 2024-07-19 at 17:33 +0800, Yunsheng Lin wrote:
>>>> Refactor common codes from __page_frag_alloc_va_align()
>>>> to __page_frag_cache_refill(), so that the new API can
>>>> make use of them.
>>>>
>>>> CC: Alexander Duyck <alexander.duyck@xxxxxxxxx>
>>>> Signed-off-by: Yunsheng Lin <linyunsheng@xxxxxxxxxx>
>>>> ---
>>>>  include/linux/page_frag_cache.h |  2 +-
>>>>  mm/page_frag_cache.c            | 93 +++++++++++++++++----------------
>>>>  2 files changed, 49 insertions(+), 46 deletions(-)
>>>>
>>>> diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
>>>> index 12a16f8e8ad0..5aa45de7a9a5 100644
>>>> --- a/include/linux/page_frag_cache.h
>>>> +++ b/include/linux/page_frag_cache.h
>>>> @@ -50,7 +50,7 @@ static inline void *encoded_page_address(unsigned long encoded_va)
>>>>
>>>>  static inline void page_frag_cache_init(struct page_frag_cache *nc)
>>>>  {
>>>> -	nc->encoded_va = 0;
>>>> +	memset(nc, 0, sizeof(*nc));
>>>>  }
>>>>
>>>
>>> I do not like requiring the entire structure to be reset as a part of
>>> init. If encoded_va is 0 then we have reset the page and the flags.
>>> There shouldn't be anything else we need to reset as remaining and bias
>>> will be reset when we reallocate.
>>
>> The argument is about avoiding one check in the fast path by doing the
>> memset in the slow path, which you might already know according to your
>> comment on the previous version.
>>
>> It is just sometimes hard to understand your preference for maintainability
>> over performance here, as sometimes your comments seem to prefer performance
>> over maintainability, like the LEA trick you mentioned and the offset
>> count-down before this patchset. It would be good to be more consistent
>> about this, otherwise it is sometimes confusing when doing the refactoring.
>
> The use of a negative offset is arguably more maintainable in my mind
> rather than being a performance trick. Essentially if you use the
> negative value you can just mask off the upper bits and it is the
> offset in the page. As such it is actually easier for me to read
> versus "remaining" which is an offset from the end of the page.
> Assuming you read the offset in hex anyway.

Reading the above doesn't seem more maintainable to me :(

>
>>>
>>>>  static inline bool page_frag_cache_is_pfmemalloc(struct page_frag_cache *nc)
>>>> diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
>>>> index 7928e5d50711..d9c9cad17af7 100644
>>>> --- a/mm/page_frag_cache.c
>>>> +++ b/mm/page_frag_cache.c
>>>> @@ -19,6 +19,28 @@
>>>>  #include <linux/page_frag_cache.h>
>>>>  #include "internal.h"
>>>>
>>>> +static struct page *__page_frag_cache_recharge(struct page_frag_cache *nc)
>>>> +{
>>>> +	unsigned long encoded_va = nc->encoded_va;
>>>> +	struct page *page;
>>>> +
>>>> +	page = virt_to_page((void *)encoded_va);
>>>> +	if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
>>>> +		return NULL;
>>>> +
>>>> +	if (unlikely(encoded_page_pfmemalloc(encoded_va))) {
>>>> +		VM_BUG_ON(compound_order(page) !=
>>>> +			  encoded_page_order(encoded_va));
>>>> +		free_unref_page(page, encoded_page_order(encoded_va));
>>>> +		return NULL;
>>>> +	}
>>>> +
>>>> +	/* OK, page count is 0, we can safely set it */
>>>> +	set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
>>>> +
>>>> +	return page;
>>>> +}
>>>> +
>>>>  static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>>>>  					     gfp_t gfp_mask)
>>>>  {
>>>> @@ -26,6 +48,14 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>>>>  	struct page *page = NULL;
>>>>  	gfp_t gfp = gfp_mask;
>>>>
>>>> +	if (likely(nc->encoded_va)) {
>>>> +		page = __page_frag_cache_recharge(nc);
>>>> +		if (page) {
>>>> +			order = encoded_page_order(nc->encoded_va);
>>>> +			goto out;
>>>> +		}
>>>> +	}
>>>> +
>>>
>>> This code has no business here.
>>> This is refill, you just dropped
>>> recharge in here which will make a complete mess of the ordering and be
>>> confusing to say the least.
>>>
>>> The expectation was that if we are calling this function it is going to
>>> overwrite the virtual address to NULL on failure so we discard the old
>>> page if there is one present. This changes that behaviour. What you
>>> effectively did is made __page_frag_cache_refill into the recharge
>>> function.
>>
>> The idea is to reuse the below for both __page_frag_cache_refill() and
>> __page_frag_cache_recharge(), which seems to be about maintainability,
>> to avoid having duplicated code. If there is a better idea to avoid that
>> duplicated code while keeping the old behaviour, I am happy to change it.
>>
>> 	/* reset page count bias and remaining to start of new frag */
>> 	nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
>> 	nc->remaining = PAGE_SIZE << order;
>>
>
> The only piece that is really reused here is the pagecnt_bias
> assignment. What is obfuscated away is that the order is gotten
> through one of two paths. Really order isn't order here, it is size,
> which should have been fetched already. What you end up doing with
> this change is duplicating a bunch of code throughout the function.
> You end up having to fetch size multiple times in multiple ways: here
> you are generating it with order, then you have to turn around and get
> it again at the start of the function, and again after calling this
> function in order to pull it back out.

I am assuming you would like to preserve the old behavior as below?

	if (!encoded_va) {
refill:
		__page_frag_cache_refill();
	}

	if (remaining < fragsz) {
		if (!__page_frag_cache_recharge())
			goto refill;
	}

As we are adding new APIs, are we expecting the new APIs to also duplicate
the above pattern?
>
>>>
>>>> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>>>  	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
>>>>  		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
>>>> @@ -35,7 +65,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>>>>  	if (unlikely(!page)) {
>>>>  		page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
>>>>  		if (unlikely(!page)) {
>>>> -			nc->encoded_va = 0;
>>>> +			memset(nc, 0, sizeof(*nc));
>>>>  			return NULL;
>>>>  		}
>>>>
>>>
>>> The memset will take a few more instructions than the existing code
>>> did. I would prefer to keep this as is if at all possible.
>>
>> It will not take more instructions for arm64, as it has the 'stp'
>> instruction when __HAVE_ARCH_MEMSET is set.
>> Is there something similar for x64?
>
> The x64 does not, last I knew, without getting into the SSE/AVX type
> stuff. This becomes two separate 8B store instructions.

I can check later when I get hold of an x64 server. But doesn't it make
sense to have one extra 8B store instruction in the slow path to avoid a
check in the fast path?

>
>>>
>>>> @@ -45,6 +75,16 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>>>>  	nc->encoded_va = encode_aligned_va(page_address(page), order,
>>>>  					   page_is_pfmemalloc(page));
>>>>
>>>> +	/* Even if we own the page, we do not use atomic_set().
>>>> +	 * This would break get_page_unless_zero() users.
>>>> +	 */
>>>> +	page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
>>>> +
>>>> +out:
>>>> +	/* reset page count bias and remaining to start of new frag */
>>>> +	nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
>>>> +	nc->remaining = PAGE_SIZE << order;
>>>> +
>>>>  	return page;
>>>>  }
>>>>
>>>
>>> Why bother returning a page at all? It doesn't seem like you use
>>> it anymore.
>>> It looks like the use cases you have for it in patches 11/12
>>> all appear to be broken from what I can tell, as you are adding page as
>>> a variable when we don't need to be passing internal details to the
>>> callers of the function when just a simple error return code would do.
>>
>> It would be good to be more specific about the 'broken' part here.
>
> We are passing internals to the caller. Basically this is generally
> frowned upon for many implementations of things, as the general idea is
> that the internal page we are using should be a pseudo-private value.

It is an implementation detail, and it is about avoiding calling
virt_to_page() as mentioned below. I am not sure why it is referred to as
'broken'; it would be better to provide more explanation of why it is a bad
idea here, as the 'pseudo-private' wording doesn't seem to justify the
'broken' part.

> I understand that you have one or two callers that need it for the use
> cases you have in patches 11/12, but it also seems like you are just
> passing it regardless. For example, I noticed in a few cases you added
> the page pointer in patch 12 to handle the return value, but then just
> used it to check for NULL. My thought would be that rather than returning
> the page here you would be better off just returning 0 or an error and
> then doing the virt_to_page() translation for all the cases where the
> page is actually needed, since you have to go that route for a cached
> page anyway.

Yes, it is about avoiding calling virt_to_page() as much as possible.