On 2024/4/17 23:11, Alexander H Duyck wrote:
> On Wed, 2024-04-17 at 21:19 +0800, Yunsheng Lin wrote:
>> On 2024/4/17 0:22, Alexander H Duyck wrote:
>>> On Mon, 2024-04-15 at 21:19 +0800, Yunsheng Lin wrote:
>>>> The '(PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)' case is for a
>>>> system with page size less than 32KB, which is 0x8000 bytes
>>>> requiring 16 bits of space. Change 'size' to 'size_mask' to avoid
>>>> using the MSB, and change the 'pfmemalloc' field to reuse that
>>>> MSB, so that we remove the original space needed by 'pfmemalloc'.
>>>>
>>>> For the other case, the MSB of 'offset' is reused for 'pfmemalloc'.
>>>>
>>>> Signed-off-by: Yunsheng Lin <linyunsheng@xxxxxxxxxx>
>>>> ---
>>>>  include/linux/page_frag_cache.h | 13 ++++++++-----
>>>>  mm/page_frag_cache.c            |  5 +++--
>>>>  2 files changed, 11 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
>>>> index fe5faa80b6c3..40a7d6da9ef0 100644
>>>> --- a/include/linux/page_frag_cache.h
>>>> +++ b/include/linux/page_frag_cache.h
>>>> @@ -12,15 +12,16 @@ struct page_frag_cache {
>>>>  	void *va;
>>>>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>>>  	__u16 offset;
>>>> -	__u16 size;
>>>> +	__u16 size_mask:15;
>>>> +	__u16 pfmemalloc:1;
>>>>  #else
>>>> -	__u32 offset;
>>>> +	__u32 offset:31;
>>>> +	__u32 pfmemalloc:1;
>>>>  #endif
>>>
>>> This seems like a really bad idea. Using a bit-field like this seems
>>> like a waste as it means that all the accesses now have to add
>>> additional operations to access either offset or size. It wasn't as if
>>> this is an oversized struct, or one that we are allocating a ton of. As
>>> such I am not sure why we need to optimize for size like this.
>>
>> For the old 'struct page_frag' use case, there is one 'struct page_frag'
>> for every socket and task_struct, so there may be tens of thousands of
>> them even on a 32-bit system, which might mean a lot of memory even for
>> a single-byte saving, see patch 13.
>>
>
> Yeah, I finally had time to finish getting through the patch set last
> night. Sorry for quick firing reviews but lately I haven't had much
> time to work on upstream work, and as you mentioned last time I only
> got to 3 patches so I was trying to speed through.
>
> I get that you are trying to reduce the size but in the next patch you
> actually end up overshooting that on x86_64 systems. I am assuming that
> is to try to account for the 32b use case? On 64b I am pretty sure you
> don't get any gain since a long followed by two u16s and an int will
> still be 16B. What we probably need to watch out for is the
> optimization for size versus having to add instructions to extract and
> insert the data back into the struct.
>
> Anyway as far as this layout I am not sure it is the best way to go.
> You are combining pfmemalloc with either size *OR* offset, and then

Does it really matter whether pfmemalloc is combined with size or
offset, as we are using a bitfield for pfmemalloc either way?

> combining the pagecnt_bias with the va. I'm wondering if it wouldn't
> make more sense to look at putting together the structure something
> like:
>
> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> typedef u16 page_frag_bias_t;
> #else
> typedef u32 page_frag_bias_t;
> #endif
>
> struct page_frag_cache {
> 	/* page address and offset */
> 	void *va;

Generally I agree with combining the virtual address with the offset
for the reason you mentioned below.

> 	page_frag_bias_t pagecnt_bias;
> 	u8 pfmemalloc;
> 	u8 page_frag_order;
> }

The issue I see with 'page_frag_order' is that we might need to do a
'PAGE_SIZE << page_frag_order' to get the actual size, and we might
also need to do 'size - 1' to get the size_mask if we want to mask out
the offset from the 'va'.
For page_frag_order, we need to do:
size = PAGE_SIZE << page_frag_order
size_mask = size - 1

For size_mask, it seems we only need to do:
size = size_mask + 1

And as PAGE_FRAG_CACHE_MAX_SIZE = 32K, the value fits into 15 bits if
we store size_mask instead of size.

Does it make sense to use the below, so that we only need to use a
bitfield for PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE on a 32-bit system?
And 'struct page_frag' is using a similar '(BITS_PER_LONG > 32)'
checking trick.

struct page_frag_cache {
	/* page address and offset */
	void *va;

#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32)
	u16 pagecnt_bias;
	u16 size_mask:15;
	u16 pfmemalloc:1;
#else
	u32 pagecnt_bias;
	u16 size_mask;
	u16 pfmemalloc;
#endif
};

>
> The basic idea would be that we would be able to replace the size mask
> with just a shift value representing the page order of the page being
> fragmented. With that we can reduce the size to just a single byte. In
> addition we could probably leave it there regardless of build as the
> order should be initialized to 0 when this is allocated, so it would be
> correct even in the case where it isn't used (and there isn't much we
> can do about the hole anyway).
>
> In addition by combining the virtual address with the offset we can
> just use the combined result for what we need. The only item that has
> to be worked out is how to deal with the end of a page in the count up
> case. However the combination seems the most logical one since they are
> meant to be combined ultimately anyway. It does put limits on when we
> can align things as we don't want to align ourselves into the next

I guess that means we need to mask the offset out of 'va' before doing
the aligning operation and the 'offset + fragsz > size' checking,
right?

> page, but I think it makes more sense than the other limits that had to
> be put on allocations and such in order to allow us to squeeze
> pagecnt_bias into the virtual address.
>
> Anyway I pulled in your patches and plan to do a bit of testing, after
> I figure out what the nvme disk ID regression is I am seeing. My main
> concern can be summed up as the NIC driver use case
> (netdev/napi_alloc_frag callers) versus the socket/vhost use case. The
> main thing in the case of the NIC driver callers is that we have a need
> for isolation and guarantees that we won't lose cache line alignment. I
> think those are the callers you are missing in your benchmarks, but
> arguably that might be something you cannot test as I don't know what
> NICs you have access to and if you have any that are using those calls.

I guess we just need to replace the API used by socket/vhost with the
one used by the netdev/napi_alloc_frag callers in mm/page_frag_test.c
in patch 1, which is introduced to test the performance of the
page_frag implementation, see:
https://lore.kernel.org/all/20240415131941.51153-2-linyunsheng@xxxxxxxxxx/

> .
>