Re: [PATCH net-next v9 06/13] mm: page_frag: reuse existing space for 'size' and 'pfmemalloc'

Yunsheng Lin <linyunsheng@xxxxxxxxxx> · Wed, 3 Jul 2024 20:33:27 +0800

On 2024/7/2 22:55, Alexander H Duyck wrote:

...

> 
>>>
>>>> +#define PAGE_FRAG_CACHE_ORDER_MASK		GENMASK(7, 0)
>>>> +#define PAGE_FRAG_CACHE_PFMEMALLOC_BIT		BIT(8)
>>>> +#define PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT	8
>>>> +
>>>> +static inline struct encoded_va *encode_aligned_va(void *va,
>>>> +						   unsigned int order,
>>>> +						   bool pfmemalloc)
>>>> +{
>>>>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>>> -	__u16 offset;
>>>> -	__u16 size;
>>>> +	return (struct encoded_va *)((unsigned long)va | order |
>>>> +			pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
>>>>  #else
>>>> -	__u32 offset;
>>>> +	return (struct encoded_va *)((unsigned long)va |
>>>> +			pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
>>>> +#endif
>>>> +}
>>>> +
> 
> This is missing any and all protection of the example you cited. If I
> want to pass order as a 32b value I can and I can corrupt the virtual
> address. Same thing with pfmemalloc. Lets only hope you don't hit an
> architecture where a bool is a u8 in size as otherwise that shift is
> going to wipe out the value, and if it is a u32 which is usually the
> case lets hope they stick to the values of 0 and 1.

I explicitly checked that the protection is not really needed due to
performance consideration:
1. For the 'pfmemalloc' part, it does always stick to the values of 0
   and 1 as below:
https://elixir.bootlin.com/linux/v6.10-rc6/source/Documentation/process/coding-style.rst#L1053

2. For the 'order' part, its range can only be within 0~3.

> 
>>>> +static inline unsigned long encoded_page_order(struct encoded_va *encoded_va)
>>>> +{
>>>> +#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>>> +	return PAGE_FRAG_CACHE_ORDER_MASK & (unsigned long)encoded_va;
>>>> +#else
>>>> +	return 0;
>>>> +#endif
>>>> +}
>>>> +
>>>> +static inline bool encoded_page_pfmemalloc(struct encoded_va *encoded_va)
>>>> +{
>>>> +	return PAGE_FRAG_CACHE_PFMEMALLOC_BIT & (unsigned long)encoded_va;
>>>> +}
>>>> +
>>>
>>> My advice is that if you just make encoded_va an unsigned long this
>>> just becomes some FIELD_GET and bit operations.
>>
>> As above.
>>
> 
> The code you mentioned had one extra block of bits that was in it and
> had strict protections on what went into and out of those bits. You
> don't have any of those protections.

As above, the protection masking/checking is explicitly avoided due
to performance consideration and reasons as above for encoded_va.

But I guess it doesn't hurt to have a VM_BUG_ON() checking to catch
possible future mistake.

> 
> I suggest you just use a long and don't bother storing this as a
> pointer.
> 

...

>>>> -
>>>> +	remaining = nc->remaining & align_mask;
>>>> +	remaining -= fragsz;
>>>> +	if (unlikely(remaining < 0)) {
>>>
>>> Now this is just getting confusing. You essentially just added an
>>> additional addition step and went back to the countdown approach I was
>>> using before except for the fact that you are starting at 0 whereas I
>>> was actually moving down through the page.
>>
>> Does the 'additional addition step' mean the additional step to calculate
>> the offset using the new 'remaining' field? I guess that is the disadvantage
>> by changing 'offset' to 'remaining', but it also some advantages too:
>>
>> 1. it is better to replace 'offset' with 'remaining', which
>>    is the remaining size for the cache in a 'page_frag_cache'
>>    instance, we are able to do a single 'fragsz > remaining'
>>    checking for the case of cache not being enough, which should be
>>    the fast path if we ensure size is zoro when 'va' == NULL by
>>    memset'ing 'struct page_frag_cache' in page_frag_cache_init()
>>    and page_frag_cache_drain().
>> 2. It seems more convenient to implement the commit/probe API too
>>    when using 'remaining' instead of 'offset' as those API needs
>>    the remaining size of the page_frag_cache anyway.
>>
>> So it is really a trade-off between using 'offset' and 'remaining',
>> it is like the similar argument about trade-off between allocating
>> fragment 'countdown' and 'countup' way.
>>
>> About confusing part, as the nc->remaining does mean how much cache
>> is left in the 'nc', and nc->remaining does start from
>> PAGE_FRAG_CACHE_MAX_SIZE/PAGE_SIZE to zero naturally if that was what
>> you meant by 'countdown', but it is different from the 'offset countdown'
>> before this patchset as my understanding.
>>
>>>
>>> What I would suggest doing since "remaining" is a negative offset
>>> anyway would be to look at just storing it as a signed negative number.
>>> At least with that you can keep to your original approach and would
>>> only have to change your check to be for "remaining + fragsz <= 0".
>>
>> Did you mean by using s16/s32 for 'remaining'? And set nc->remaining like
>> below?
>> nc->remaining = -PAGE_SIZE or
>> nc->remaining = -PAGE_FRAG_CACHE_MAX_SIZE
> 
> Yes. Basically if we used it as a signed value then you could just work
> your way up and do your aligned math still.

For the aligned math, I am not sure how can 'align_mask' be appiled to
a signed value yet. It seems that after masking nc->remaining leans
towards -PAGE_SIZE/-PAGE_FRAG_CACHE_MAX_SIZE instead of zero for
a unsigned value, for example:

nc->remaining = -4094;
nc->remaining &= -64;

It seems we got -4096 for above, is that what we are expecting?

> 
>> struct page_frag_cache {
>>         struct encoded_va *encoded_va;
>>
>> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32)
>>         u16 pagecnt_bias;
>>         s16 remaining;
>> #else
>>         u32 pagecnt_bias;
>>         s32 remaining;
>> #endif
>> };
>>
>> If I understand above correctly, it seems we really need a better name
>> than 'remaining' to reflect that.
> 
> It would effectively work like a countdown. Instead of T - X in this
> case it is size - X.
> 
>>> With that you can still do your math but it becomes an addition instead
>>> of a subtraction.
>>
>> And I am not really sure what is the gain here by using an addition
>> instead of a subtraction here.
>>
> 
> The "remaining" as it currently stands is doing the same thing so odds
> are it isn't too big a deal. It is just that the original code was
> already somewhat confusing and this is just making it that much more
> complex.
> 
>>>> +		page = virt_to_page(encoded_va);
>>>>  		if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
>>>>  			goto refill;
>>>>  
>>>> -		if (unlikely(nc->pfmemalloc)) {
>>>> -			free_unref_page(page, compound_order(page));
>>>> +		if (unlikely(encoded_page_pfmemalloc(encoded_va))) {
>>>> +			VM_BUG_ON(compound_order(page) !=
>>>> +				  encoded_page_order(encoded_va));
>>>> +			free_unref_page(page, encoded_page_order(encoded_va));
>>>>  			goto refill;
>>>>  		}
>>>>  
>>>>  		/* OK, page count is 0, we can safely set it */
>>>>  		set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
>>>>  
>>>> -		/* reset page count bias and offset to start of new frag */
>>>> +		/* reset page count bias and remaining of new frag */
>>>>  		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
>>>> -		offset = 0;
>>>> -		if (unlikely(fragsz > PAGE_SIZE)) {
>>>> +		nc->remaining = remaining = page_frag_cache_page_size(encoded_va);
>>>> +		remaining -= fragsz;
>>>> +		if (unlikely(remaining < 0)) {
>>>>  			/*
>>>>  			 * The caller is trying to allocate a fragment
>>>>  			 * with fragsz > PAGE_SIZE but the cache isn't big
>>>
>>> I find it really amusing that you went to all the trouble of flipping
>>> the logic just to flip it back to being a countdown setup. If you were
>>> going to bother with all that then why not just make the remaining
>>> negative instead? You could save yourself a ton of trouble that way and
>>> all you would need to do is flip a few signs.
>>
>> I am not sure I understand the 'a ton of trouble' part here, if 'flipping
>> a few signs' does save a ton of trouble here, I would like to avoid 'a
>> ton of trouble' here, but I am not really understand the gain here yet as
>> mentioned above.
> 
> It isn't about flipping the signs. It is more the fact that the logic
> has now become even more complex then it originally was. With my work
> backwards approach the advantage was that the offset could be updated
> and then we just recorded everything and reported it up. Now we have to

I am supposing the above is referring to 'offset countdown' way
before this patchset, right?

> keep a temporary "remaining" value, generate our virtual address and
> store that to a temp variable, we can record the new "remaining" value,
> and then we can report the virtual address. If we get the ordering on

Yes, I noticed it when coding too, that is why current virtual address is
generated in page_frag_cache_current_va() basing on nc->remaining instead
of the local variable 'remaining' before assigning the local variable
'remaining' to nc->remaining. But I am not sure I understand how using a
signed negative number for 'remaining' will change the above steps. If
not, I still fail to see the gain of using a signed negative number for
'nc->remaining'.

> the steps 2 and 3 in this it will cause issues as we will be pointing
> to the wrong values. It is something I can see someone easily messing
> up.

Yes, agreed. It would be good to be more specific about how to avoid
the above problem using a signed negative number for 'remaining' as
I am not sure how it can be done yet.

>