On 2024/7/10 23:28, Alexander H Duyck wrote: ... >>>> >>>>> >>>>> What I would suggest doing since "remaining" is a negative offset >>>>> anyway would be to look at just storing it as a signed negative number. >>>>> At least with that you can keep to your original approach and would >>>>> only have to change your check to be for "remaining + fragsz <= 0". >>>> >>>> Did you mean by using s16/s32 for 'remaining'? And set nc->remaining like >>>> below? >>>> nc->remaining = -PAGE_SIZE or >>>> nc->remaining = -PAGE_FRAG_CACHE_MAX_SIZE >>> >>> Yes. Basically if we used it as a signed value then you could just work >>> your way up and do your aligned math still. >> >> For the aligned math, I am not sure how can 'align_mask' be appiled to >> a signed value yet. It seems that after masking nc->remaining leans >> towards -PAGE_SIZE/-PAGE_FRAG_CACHE_MAX_SIZE instead of zero for >> a unsigned value, for example: >> >> nc->remaining = -4094; >> nc->remaining &= -64; >> >> It seems we got -4096 for above, is that what we are expecting? > > No, you have to do an addition and then the mask like you were before > using __ALIGN_KERNEL. > > As it stands I realized your alignment approach in this patch is > broken. You should be aligning the remaining at the start of this > function and then storing it before you call > page_frag_cache_current_va. Instead you do it after so the start of > your block may not be aligned to the requested mask if you have > multiple callers sharing this function or if you are passing an > unaligned size in the request. Yes, you seems right about it, the intention is to do the below as patch 3 does in v10: aligned_remaining = nc->remaining & align_mask; remaining = aligned_remaining - fragsz; nc->remaining = remaining; return encoded_page_address(nc->encoded_va) + (size - aligned_remaining); > >>> >>>> struct page_frag_cache { >>>> struct encoded_va *encoded_va; >>>> >>>> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32) >>>> u16 pagecnt_bias; >>>> s16 remaining; >>>> #else >>>> u32 pagecnt_bias; >>>> s32 remaining; >>>> #endif >>>> }; >>>> >>>> If I understand above correctly, it seems we really need a better name >>>> than 'remaining' to reflect that. >>> >>> It would effectively work like a countdown. Instead of T - X in this >>> case it is size - X. >>> >>>>> With that you can still do your math but it becomes an addition instead >>>>> of a subtraction. >>>> >>>> And I am not really sure what is the gain here by using an addition >>>> instead of a subtraction here. >>>> >>> >>> The "remaining" as it currently stands is doing the same thing so odds >>> are it isn't too big a deal. It is just that the original code was >>> already somewhat confusing and this is just making it that much more >>> complex. >>> >>>>>> + page = virt_to_page(encoded_va); >>>>>> if (!page_ref_sub_and_test(page, nc->pagecnt_bias)) >>>>>> goto refill; >>>>>> >>>>>> - if (unlikely(nc->pfmemalloc)) { >>>>>> - free_unref_page(page, compound_order(page)); >>>>>> + if (unlikely(encoded_page_pfmemalloc(encoded_va))) { >>>>>> + VM_BUG_ON(compound_order(page) != >>>>>> + encoded_page_order(encoded_va)); >>>>>> + free_unref_page(page, encoded_page_order(encoded_va)); >>>>>> goto refill; >>>>>> } >>>>>> >>>>>> /* OK, page count is 0, we can safely set it */ >>>>>> set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1); >>>>>> >>>>>> - /* reset page count bias and offset to start of new frag */ >>>>>> + /* reset page count bias and remaining of new frag */ >>>>>> nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1; >>>>>> - offset = 0; >>>>>> - if (unlikely(fragsz > PAGE_SIZE)) { >>>>>> + nc->remaining = remaining = page_frag_cache_page_size(encoded_va); >>>>>> + remaining -= fragsz; >>>>>> + if (unlikely(remaining < 0)) { >>>>>> /* >>>>>> * The caller is trying to allocate a fragment >>>>>> * with fragsz > PAGE_SIZE but the cache isn't big >>>>> >>>>> I find it really amusing that you went to all the trouble of flipping >>>>> the logic just to flip it back to being a countdown setup. If you were >>>>> going to bother with all that then why not just make the remaining >>>>> negative instead? You could save yourself a ton of trouble that way and >>>>> all you would need to do is flip a few signs. >>>> >>>> I am not sure I understand the 'a ton of trouble' part here, if 'flipping >>>> a few signs' does save a ton of trouble here, I would like to avoid 'a >>>> ton of trouble' here, but I am not really understand the gain here yet as >>>> mentioned above. >>> >>> It isn't about flipping the signs. It is more the fact that the logic >>> has now become even more complex then it originally was. With my work >>> backwards approach the advantage was that the offset could be updated >>> and then we just recorded everything and reported it up. Now we have to >> >> I am supposing the above is referring to 'offset countdown' way >> before this patchset, right? >> >>> keep a temporary "remaining" value, generate our virtual address and >>> store that to a temp variable, we can record the new "remaining" value, >>> and then we can report the virtual address. If we get the ordering on >> >> Yes, I noticed it when coding too, that is why current virtual address is >> generated in page_frag_cache_current_va() basing on nc->remaining instead >> of the local variable 'remaining' before assigning the local variable >> 'remaining' to nc->remaining. But I am not sure I understand how using a >> signed negative number for 'remaining' will change the above steps. If >> not, I still fail to see the gain of using a signed negative number for >> 'nc->remaining'. >> >>> the steps 2 and 3 in this it will cause issues as we will be pointing >>> to the wrong values. It is something I can see someone easily messing >>> up. >> >> Yes, agreed. It would be good to be more specific about how to avoid >> the above problem using a signed negative number for 'remaining' as >> I am not sure how it can be done yet. >> > > My advice would be to go back to patch 3 and figure out how to do this > re-ordering without changing the alignment behaviour. The old code > essentially aligned both the offset and fragsz by combining the two and > then doing the alignment. Since you are doing a count up setup you will I am not sure I understand 'aligned both the offset and fragsz ' part, as 'fragsz' being aligned or not seems to depend on last caller' align_mask, for the same page_frag_cache instance, suppose offset = 32768 initially for the old code: Step 1: __page_frag_alloc_align() is called with fragsz=7 and align_mask=~0u the offset after this is 32761, the true fragsz is 7 too. Step 2: __page_frag_alloc_align() is called with fragsz=7 and align_mask=-16 the offset after this is 32752, the true fragsz is 9, which does not seems to be aligned. The above is why I added the below paragraph in the doc to make the semantic more explicit: "Depending on different aligning requirement, the page_frag API caller may call page_frag_alloc*_align*() to ensure the returned virtual address or offset of the page is aligned according to the 'align/alignment' parameter. Note the size of the allocated fragment is not aligned, the caller needs to provide an aligned fragsz if there is an alignment requirement for the size of the fragment." And existing callers of page_frag aligned API does seems to follow the above rule last time I checked. Or did I miss something obvious here? > need to align the remaining, then add fragsz, and then I guess you > could store remaining and then subtract fragsz from your final virtual > address to get back to where the starting offset is actually located. remaining = __ALIGN_KERNEL_MASK(nc->remaining, ~align_mask); remaining += fragsz; nc->remaining = remaining; return encoded_page_address(nc->encoded_va) + (size + remaining) - fragsz; If yes, I am not sure what is the point of doing the above yet, it just seem to make thing more complicated and harder to understand. > > Basically your "remaining" value isn't a safe number to use for an > offset since it isn't aligned to your starting value at any point. Does using 'aligned_remaining' local variable to make it more obvious seem reasonable to you? > > As far as the negative value, it is more about making it easier to keep > track of what is actually going on. Basically we can use regular > pointer math and as such I suspect the compiler is having to do extra > instructions to flip your value negative before it can combine the > values via something like the LEA (load effective address) assembler > call. I am not an asm expert here, I am not sure I understand the optimization trick here.