On 2024/7/2 22:55, Alexander H Duyck wrote:

...

>
>>>
>>>> +#define PAGE_FRAG_CACHE_ORDER_MASK		GENMASK(7, 0)
>>>> +#define PAGE_FRAG_CACHE_PFMEMALLOC_BIT		BIT(8)
>>>> +#define PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT	8
>>>> +
>>>> +static inline struct encoded_va *encode_aligned_va(void *va,
>>>> +						   unsigned int order,
>>>> +						   bool pfmemalloc)
>>>> +{
>>>>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>>> -	__u16 offset;
>>>> -	__u16 size;
>>>> +	return (struct encoded_va *)((unsigned long)va | order |
>>>> +			pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
>>>>  #else
>>>> -	__u32 offset;
>>>> +	return (struct encoded_va *)((unsigned long)va |
>>>> +			pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
>>>> +#endif
>>>> +}
>>>> +
>
> This is missing any and all protection of the example you cited. If I
> want to pass order as a 32b value I can and I can corrupt the virtual
> address. Same thing with pfmemalloc. Lets only hope you don't hit an
> architecture where a bool is a u8 in size as otherwise that shift is
> going to wipe out the value, and if it is a u32 which is usually the
> case lets hope they stick to the values of 0 and 1.

I had explicitly checked that the protection is not really needed here,
and avoided it out of performance consideration:
1. For the 'pfmemalloc' part, a bool does always stick to the values of
   0 and 1, see:
   https://elixir.bootlin.com/linux/v6.10-rc6/source/Documentation/process/coding-style.rst#L1053
2. For the 'order' part, its range can only be within 0~3.

>
>>>> +static inline unsigned long encoded_page_order(struct encoded_va *encoded_va)
>>>> +{
>>>> +#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>>> +	return PAGE_FRAG_CACHE_ORDER_MASK & (unsigned long)encoded_va;
>>>> +#else
>>>> +	return 0;
>>>> +#endif
>>>> +}
>>>> +
>>>> +static inline bool encoded_page_pfmemalloc(struct encoded_va *encoded_va)
>>>> +{
>>>> +	return PAGE_FRAG_CACHE_PFMEMALLOC_BIT & (unsigned long)encoded_va;
>>>> +}
>>>> +
>>>
>>> My advice is that if you just make encoded_va an unsigned long this
>>> just becomes some FIELD_GET and bit operations.
>>
>> As above.
>>
>
> The code you mentioned had one extra block of bits that was in it and
> had strict protections on what went into and out of those bits. You
> don't have any of those protections.

As above, the protection masking/checking is explicitly avoided out of
performance consideration, for the same reasons given above for
encoded_va. But I guess it doesn't hurt to have a VM_BUG_ON() check to
catch possible future mistakes.
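Something like the below is what I have in mind; it is only an untested
sketch on top of this patch, so take the exact checks with a grain of
salt:

static inline struct encoded_va *encode_aligned_va(void *va,
						   unsigned int order,
						   bool pfmemalloc)
{
	/* Catch the corrupting cases mentioned above: a 'va' with low
	 * bits set, or an 'order' that overflows its 8-bit field and
	 * would clobber the pfmemalloc bit or the address itself.
	 */
	VM_BUG_ON((unsigned long)va & ~PAGE_MASK);
	VM_BUG_ON(order > PAGE_FRAG_CACHE_ORDER_MASK);

#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
	return (struct encoded_va *)((unsigned long)va | order |
			((unsigned long)pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT));
#else
	return (struct encoded_va *)((unsigned long)va |
			((unsigned long)pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT));
#endif
}

The explicit cast of 'pfmemalloc' to 'unsigned long' before the shift
should also take care of the concern about the width of a bool.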
>
> I suggest you just use a long and don't bother storing this as a
> pointer.
>

...

>>>> -
>>>> +	remaining = nc->remaining & align_mask;
>>>> +	remaining -= fragsz;
>>>> +	if (unlikely(remaining < 0)) {
>>>
>>> Now this is just getting confusing. You essentially just added an
>>> additional addition step and went back to the countdown approach I was
>>> using before except for the fact that you are starting at 0 whereas I
>>> was actually moving down through the page.
>>
>> Does the 'additional addition step' mean the additional step to calculate
>> the offset using the new 'remaining' field? I guess that is the
>> disadvantage of changing 'offset' to 'remaining', but it also has some
>> advantages too:
>>
>> 1. By replacing 'offset' with 'remaining', which is the remaining size
>>    of the cache in a 'page_frag_cache' instance, we are able to do a
>>    single 'fragsz > remaining' check for the case of the cache not
>>    being big enough, which should be the fast path if we ensure the
>>    size is zero when 'va' == NULL by memset'ing 'struct page_frag_cache'
>>    in page_frag_cache_init() and page_frag_cache_drain().
>> 2. It seems more convenient to implement the commit/probe API too when
>>    using 'remaining' instead of 'offset', as those APIs need the
>>    remaining size of the page_frag_cache anyway.
>>
>> So it is really a trade-off between using 'offset' and 'remaining',
>> much like the argument about the trade-off between allocating
>> fragments in a 'countdown' way and in a 'countup' way.
>>
>> As for the confusing part: nc->remaining does mean how much cache is
>> left in the 'nc', and it does naturally count down from
>> PAGE_FRAG_CACHE_MAX_SIZE/PAGE_SIZE to zero, if that was what you meant
>> by 'countdown'; but that is different from the 'offset countdown' used
>> before this patchset, as I understand it.
>>
>>>
>>> What I would suggest doing since "remaining" is a negative offset
>>> anyway would be to look at just storing it as a signed negative number.
>>> At least with that you can keep to your original approach and would
>>> only have to change your check to be for "remaining + fragsz <= 0".
>>
>> Did you mean using s16/s32 for 'remaining', and setting nc->remaining
>> like below?
>> nc->remaining = -PAGE_SIZE or
>> nc->remaining = -PAGE_FRAG_CACHE_MAX_SIZE
>
> Yes. Basically if we used it as a signed value then you could just work
> your way up and do your aligned math still.

For the aligned math, I am not sure how 'align_mask' can be applied to
a signed value yet. It seems that after masking, nc->remaining leans
towards -PAGE_SIZE/-PAGE_FRAG_CACHE_MAX_SIZE instead of towards zero,
as it would for an unsigned value, for example:

nc->remaining = -4094;
nc->remaining &= -64;

It seems we get -4096 for the above; is that what we are expecting?
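A quick userspace check, assuming two's complement as the kernel does,
seems to confirm that direction of rounding:

#include <stdio.h>

int main(void)
{
	int remaining = -4094;

	/* align_mask for a 64-byte alignment is ~(64 - 1), i.e. -64 */
	remaining &= -64;

	/* Prints -4096: masking rounds a negative 'remaining' away from
	 * zero, towards -PAGE_SIZE, whereas masking an unsigned offset
	 * rounds it towards zero.
	 */
	printf("%d\n", remaining);

	return 0;
}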
>
>> struct page_frag_cache {
>>	struct encoded_va *encoded_va;
>>
>> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32)
>>	u16 pagecnt_bias;
>>	s16 remaining;
>> #else
>>	u32 pagecnt_bias;
>>	s32 remaining;
>> #endif
>> };
>>
>> If I understand the above correctly, it seems we really need a better
>> name than 'remaining' to reflect that.
>
> It would effectively work like a countdown. Instead of T - X in this
> case it is size - X.
>
>>> With that you can still do your math but it becomes an addition instead
>>> of a subtraction.
>>
>> And I am not really sure what the gain is by using an addition instead
>> of a subtraction here.
>>
>
> The "remaining" as it currently stands is doing the same thing so odds
> are it isn't too big a deal. It is just that the original code was
> already somewhat confusing and this is just making it that much more
> complex.

>
>>>> +	page = virt_to_page(encoded_va);
>>>>  	if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
>>>>  		goto refill;
>>>>
>>>> -	if (unlikely(nc->pfmemalloc)) {
>>>> -		free_unref_page(page, compound_order(page));
>>>> +	if (unlikely(encoded_page_pfmemalloc(encoded_va))) {
>>>> +		VM_BUG_ON(compound_order(page) !=
>>>> +			  encoded_page_order(encoded_va));
>>>> +		free_unref_page(page, encoded_page_order(encoded_va));
>>>>  		goto refill;
>>>>  	}
>>>>
>>>>  	/* OK, page count is 0, we can safely set it */
>>>>  	set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
>>>>
>>>> -	/* reset page count bias and offset to start of new frag */
>>>> +	/* reset page count bias and remaining of new frag */
>>>>  	nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
>>>> -	offset = 0;
>>>> -	if (unlikely(fragsz > PAGE_SIZE)) {
>>>> +	nc->remaining = remaining = page_frag_cache_page_size(encoded_va);
>>>> +	remaining -= fragsz;
>>>> +	if (unlikely(remaining < 0)) {
>>>>  		/*
>>>>  		 * The caller is trying to allocate a fragment
>>>>  		 * with fragsz > PAGE_SIZE but the cache isn't big
>>>
>>> I find it really amusing that you went to all the trouble of flipping
>>> the logic just to flip it back to being a countdown setup. If you were
>>> going to bother with all that then why not just make the remaining
>>> negative instead? You could save yourself a ton of trouble that way and
>>> all you would need to do is flip a few signs.
>>
>> I am not sure I understand the 'a ton of trouble' part here. If
>> 'flipping a few signs' does save a ton of trouble, I would like to
>> avoid that trouble, but I do not really understand the gain here yet,
>> as mentioned above.
>
> It isn't about flipping the signs. It is more the fact that the logic
> has now become even more complex then it originally was. With my work
> backwards approach the advantage was that the offset could be updated
> and then we just recorded everything and reported it up. Now we have to

I suppose the above is referring to the 'offset countdown' way used
before this patchset, right?

> keep a temporary "remaining" value, generate our virtual address and
> store that to a temp variable, we can record the new "remaining" value,
> and then we can report the virtual address. If we get the ordering on

Yes, I noticed that when coding too; that is why the current virtual
address is generated in page_frag_cache_current_va() from nc->remaining
rather than from the local variable 'remaining', before the local
variable 'remaining' is assigned to nc->remaining.

But I am not sure I understand how using a signed negative number for
'remaining' would change the above steps. If it doesn't change them, I
still fail to see the gain of using a signed negative number for
'nc->remaining'.

> the steps 2 and 3 in this it will cause issues as we will be pointing
> to the wrong values. It is something I can see someone easily messing
> up.

Yes, agreed. It would be good to be more specific about how to avoid
the above problem using a signed negative number for 'remaining', as I
am not sure how it can be done yet.

>
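To make the ordering pitfall concrete, below is a toy userspace model
of how I understand the fast path; it is not the patch code itself,
just an illustration of why the virtual address has to be derived from
the aligned 'remaining' before nc->remaining is overwritten:

#include <stdio.h>
#include <stdlib.h>

struct frag_cache {
	char *va;
	int size;
	int remaining;	/* counts down from size to 0 */
};

static void *frag_alloc(struct frag_cache *nc, int fragsz, int align_mask)
{
	int remaining = nc->remaining & align_mask;
	void *va;

	remaining -= fragsz;
	if (remaining < 0)
		return NULL;	/* refill path elided */

	/* Step 2: generate the virtual address from the aligned value.
	 * Step 3: only then record the new 'remaining'. Swapping these
	 * two steps would hand out an address shifted by the alignment
	 * padding, which is the easy mess-up being discussed above.
	 */
	va = nc->va + (nc->size - remaining - fragsz);
	nc->remaining = remaining;

	return va;
}

int main(void)
{
	struct frag_cache nc = { .size = 4096, .remaining = 4096 };

	nc.va = malloc(nc.size);
	/* first fragment lands at offset 0, the second is aligned up
	 * to offset 128
	 */
	printf("%td\n", (char *)frag_alloc(&nc, 100, ~63) - nc.va);
	printf("%td\n", (char *)frag_alloc(&nc, 100, ~63) - nc.va);
	free(nc.va);

	return 0;
}

I would still be interested in seeing how the signed variant keeps this
ordering from being easy to get wrong.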