On 2024/4/17 23:11, Alexander H Duyck wrote:
> On Wed, 2024-04-17 at 21:19 +0800, Yunsheng Lin wrote:
>> On 2024/4/17 0:22, Alexander H Duyck wrote:
>>> On Mon, 2024-04-15 at 21:19 +0800, Yunsheng Lin wrote:
>>>> The '(PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)' case is for a
>>>> system with page size less than 32KB, which is 0x8000 bytes
>>>> requiring 16 bits of space. Change 'size' to 'size_mask' to avoid
>>>> using the MSB, and change the 'pfmemalloc' field to reuse that
>>>> MSB, so that we remove the original space needed by 'pfmemalloc'.
>>>>
>>>> For the other case, the MSB of 'offset' is reused for 'pfmemalloc'.
>>>>
>>>> Signed-off-by: Yunsheng Lin <linyunsheng@xxxxxxxxxx>
>>>> ---
>>>>  include/linux/page_frag_cache.h | 13 ++++++++-----
>>>>  mm/page_frag_cache.c            |  5 +++--
>>>>  2 files changed, 11 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
>>>> index fe5faa80b6c3..40a7d6da9ef0 100644
>>>> --- a/include/linux/page_frag_cache.h
>>>> +++ b/include/linux/page_frag_cache.h
>>>> @@ -12,15 +12,16 @@ struct page_frag_cache {
>>>>  	void *va;
>>>>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>>>  	__u16 offset;
>>>> -	__u16 size;
>>>> +	__u16 size_mask:15;
>>>> +	__u16 pfmemalloc:1;
>>>>  #else
>>>> -	__u32 offset;
>>>> +	__u32 offset:31;
>>>> +	__u32 pfmemalloc:1;
>>>>  #endif
>>>
>>> This seems like a really bad idea. Using a bit-field like this seems
>>> like a waste as it means that all the accesses now have to add
>>> additional operations to access either offset or size. It wasn't as if
>>> this is an oversized struct, or one that we are allocating a ton of. As
>>> such I am not sure why we need to optimize for size like this.
>>
>> For the old 'struct page_frag' use case, there is one 'struct page_frag'
>> for every socket and task_struct, so there may be tens of thousands of
>> them even on a 32-bit system, which might mean a lot of memory even for
>> a single-byte saving, see patch 13.
>>
>
> Yeah, I finally had time to finish getting through the patch set last
> night. Sorry for quick firing reviews but lately I haven't had much
> time to work on upstream work, and as you mentioned last time I only
> got to 3 patches so I was trying to speed through.
>
> I get that you are trying to reduce the size but in the next patch you
> actually end up overshooting that on x86_64 systems. I am assuming that
> is to try to account for the 32b use case? On 64b I am pretty sure you
> don't get any gain since a long followed by two u16s and an int will
> still be 16B. What we probably need to watch out for is the
> optimization for size versus having to add instructions to extract and
> insert the data back into the struct.
>
> Anyway as far as this layout I am not sure it is the best way to go.
> You are combining pfmemalloc with either size *OR* offset, and then

Does it really matter whether pfmemalloc is combined with size or
offset, as we are using a bitfield for pfmemalloc either way?

> combining the pagecnt_bias with the va. I'm wondering if it wouldn't
> make more sense to look at putting together the structure something
> like:
>
> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> typedef u16 page_frag_bias_t;
> #else
> typedef u32 page_frag_bias_t;
> #endif
>
> struct page_frag_cache {
> 	/* page address and offset */
> 	void *va;

Generally I agree with combining the virtual address with the offset
for the reason you mentioned below.

> 	page_frag_bias_t pagecnt_bias;
> 	u8 pfmemalloc;
> 	u8 page_frag_order;
> }

The issue I see with 'page_frag_order' is that we might need to do a
'PAGE_SIZE << page_frag_order' to get the actual size, and we might
also need to do 'size - 1' to get the size_mask if we want to mask out
the offset from the 'va'.
For page_frag_order, we need to do:
size = PAGE_SIZE << page_frag_order
size_mask = size - 1

For size_mask, it seems we only need to do:
size = size_mask + 1

And as PAGE_FRAG_CACHE_MAX_SIZE = 32K, the value fits into 15 bits if
we store size_mask instead of size.

Does it make sense to use the below, so that we only need to use a
bitfield for PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE on a 32-bit system?
And 'struct page_frag' is using a similar '(BITS_PER_LONG > 32)'
checking trick.

struct page_frag_cache {
	/* page address and offset */
	void *va;

#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32)
	u16 pagecnt_bias;
	u16 size_mask:15;
	u16 pfmemalloc:1;
#else
	u32 pagecnt_bias;
	u16 size_mask;
	u16 pfmemalloc;
#endif
};

>
> The basic idea would be that we would be able to replace the size mask
> with just a shift value representing the page order of the page being
> fragmented. With that we can reduce the size to just a single byte. In
> addition we could probably leave it there regardless of build as the
> order should be initialized to 0 when this is allocated, so it would be
> correct even in the case where it isn't used (and there isn't much we
> can do about the hole anyway).
>
> In addition by combining the virtual address with the offset we can
> just use the combined result for what we need. The only item that has
> to be worked out is how to deal with the end of a page in the count up
> case. However the combination seems the most logical one since they are
> meant to be combined ultimately anyway. It does put limits on when we
> can align things as we don't want to align ourselves into the next

I guess that means we need to mask the offset out of 'va' before doing
the aligning operation and the 'offset + fragsz > size' checking,
right?

> page, but I think it makes more sense than the other limits that had to
> be put on allocations and such in order to allow us to squeeze
> pagecnt_bias into the virtual address.
>
> Anyway I pulled in your patches and plan to do a bit of testing, after
> I figure out what the nvme disk ID regression is I am seeing. My main
> concern can be summed up as the NIC driver use case
> (netdev/napi_alloc_frag callers) versus the socket/vhost use case. The
> main thing in the case of the NIC driver callers is that we have a need
> for isolation and guarantees that we won't lose cache line alignment. I
> think those are the callers you are missing in your benchmarks, but
> arguably that might be something you cannot test as I don't know what
> NICs you have access to and if you have any that are using those calls.

I guess we just need to replace the API used by socket/vhost with the
one used by the netdev/napi_alloc_frag callers in mm/page_frag_test.c
in patch 1, which is introduced to test the performance of the
page_frag implementation, see:
https://lore.kernel.org/all/20240415131941.51153-2-linyunsheng@xxxxxxxxxx/

> .
>