On 3/17/2023 9:00 PM, Ryan Roberts wrote: > On 17/03/2023 08:19, Yin, Fengwei wrote: >> >> >> On 3/17/2023 4:00 PM, Ryan Roberts wrote: >>> On 17/03/2023 06:33, Yin, Fengwei wrote: >>>> >>>> >>>> On 3/17/2023 11:44 AM, Matthew Wilcox wrote: >>>>> On Fri, Mar 17, 2023 at 09:58:17AM +0800, Yin, Fengwei wrote: >>>>>> >>>>>> >>>>>> On 3/17/2023 1:52 AM, Matthew Wilcox wrote: >>>>>>> On Thu, Mar 16, 2023 at 04:38:58PM +0000, Ryan Roberts wrote: >>>>>>>> On 16/03/2023 16:23, Yin, Fengwei wrote: >>>>>>>>>> I think you are changing behavior here - is this intentional? Previously this >>>>>>>>>> would be evaluated per page, now its evaluated once for the whole range. The >>>>>>>>>> intention below is that directly faulted pages are mapped young and prefaulted >>>>>>>>>> pages are mapped old. But now a whole range will be mapped the same. >>>>>>>>> >>>>>>>>> Yes. You are right here. >>>>>>>>> >>>>>>>>> Look at the prefault and cpu_has_hw_af for ARM64, it looks like we >>>>>>>>> can avoid to handle vmf->address == addr specially. It's OK to >>>>>>>>> drop prefault and change the logic here a little bit to: >>>>>>>>> if (arch_wants_old_prefaulted_pte()) >>>>>>>>> entry = pte_mkold(entry); >>>>>>>>> else >>>>>>>>> entry = pte_sw_mkyong(entry); >>>>>>>>> >>>>>>>>> It's not necessary to use pte_sw_mkyong for vmf->address == addr >>>>>>>>> because HW will set the ACCESS bit in page table entry. >>>>>>>>> >>>>>>>>> Add Will Deacon in case I missed something here. Thanks. >>>>>>>> >>>>>>>> I'll defer to Will's response, but not all arm HW supports HW access flag >>>>>>>> management. In that case it's done by SW, so I would imagine that by setting >>>>>>>> this to old initially, we will get a second fault to set the access bit, which >>>>>>>> will slow things down. I wonder if you will need to split this into (up to) 3 >>>>>>>> calls to set_ptes()? >>>>>>> >>>>>>> I don't think we should do that. The limited information I have from >>>>>>> various microarchitectures is that the PTEs must differ only in their >>>>>>> PFN bits in order to use larger TLB entries. That includes the Accessed >>>>>>> bit (or equivalent). So we should mkyoung all the PTEs in the same >>>>>>> folio, at least initially. >>>>>>> >>>>>>> That said, we should still do this conditionally. We'll prefault some >>>>>>> other folios too. So I think this should be: >>>>>>> >>>>>>> bool prefault = (addr > vmf->address) || ((addr + nr) < vmf->address); >>>>>>> >>>>>> According to commit 46bdb4277f98e70d0c91f4289897ade533fe9e80, if hardware access >>>>>> flag is supported on ARM64, there is benefit if prefault PTEs is set as "old". >>>>>> If we change prefault like above, the PTEs is set as "yong" which loose benefit >>>>>> on ARM64 with hardware access flag. >>>>>> >>>>>> ITOH, if from "old" to "yong" is cheap, why not leave all PTEs of folio as "old" >>>>>> and let hardware to update it to "yong"? >>>>> >>>>> Because we're tracking the entire folio as a single entity. So we're >>>>> better off avoiding the extra pagefaults to update the accessed bit, >>>>> which won't actually give us any information (vmscan needs to know "were >>>>> any of the accessed bits set", not "how many of them were set"). >>>> There is no extra pagefaults to update the accessed bit. There are three cases here: >>>> 1. hardware support access flag and cheap from "old" to "yong" without extra fault >>>> 2. hardware support access flag and expensive from "old" to "yong" without extra fault >>>> 3. no hardware support access flag (extra pagefaults from "old" to "yong". Expensive) >>>> >>>> For #2 and #3, it's expensive from "old" to "yong", so we always set PTEs "yong" in >>>> page fault. >>>> For #1, It's cheap from "old" to "yong", so it's OK to set PTEs "old" in page fault. >>>> And hardware will set it to "yong" when access memory. Actually, ARM64 with hardware >>>> access bit requires to set PTEs "old". >>> >>> Your logic makes sense, but it doesn't take into account the HPA >>> micro-architectural feature present in some ARM CPUs. HPA can transparently >>> coalesce multiple pages into a single TLB entry when certain conditions are met >>> (roughly; upto 4 pages physically and virtually contiguous and all within a >>> 4-page natural alignment). But as Matthew says, this works out better when all >>> pte attributes (including access and dirty) match. Given the reason for setting >>> the prefault pages to old is so that vmscan can do a better job of finding cold >>> pages, and given vmscan will now be looking for folios and not individual pages >>> (I assume?), I agree with Matthew that we should make whole folios young or old. >>> It will marginally increase our chances of the access and dirty bits being >>> consistent across the whole 4-page block that the HW tries to coalesce. If we >>> unconditionally make everything old, the hw will set accessed for the single >>> page that faulted, and we therefore don't have consistency for that 4-page block. >> My concern was that the benefit of "old" PTEs for ARM64 with hardware access bit >> will be lost. The workloads (application launch latency and direct reclaim according >> to commit 46bdb4277f98e70d0c91f4289897ade533fe9e80) can show regression with this >> series. Thanks. > > My (potentially incorrect) understanding of the reason that marking the > prefaulted ptes as old was because it made it easier/quicker for vmscan to > identify those prefaulted pages and reclaim them under memory pressure. I > _assume_ now that we have large folios, that vmscan will be trying to pick > folios for reclaim, not individual subpages within the folio? In which case, > vmscan will only consider the folio as old if _all_ pages within are old. So > marking all the pages of a folio young vs marking 1 page in the folio young > won't make a difference from this perspective. But it will make a difference > from the perspective a HPA. (Please Matthew or somebody else, correct me if my > understanding is incorrect!) Thanks a lot for your patient explanation. I got the point here. For the first access, we mark the all PTEs of folio "yong". So later access will get large TLB. Regards Yin, Fengwei > >> >> BTW, with TLB merge feature, should hardware update coalesce multiple pages access >> bit together? otherwise, it's avoidable that only one page access is set by hardware >> finally. > > No, the HW will only update the access flag for the single page that is > accessed. So yes, in the long run the value of the flags across the 4-page block > will diverge - that's why I said "marginal" above. > >> >> Regards >> Yin, Fengwei >> >>> >>>> >>>>> >>>>> Anyway, hopefully Ryan can test this and let us know if it fixes the >>>>> regression he sees. >>>> I highly suspect the regression Ryan saw is not related with this but another my >>>> stupid work. I will send out the testing patch soon. Thanks. >>> >>> I tested a version of this where I made everything unconditionally young, >>> thinking it might be the source of the perf regression, before I reported it. It >>> doesn't make any difference. So I agree the regression is somewhere else. >>> >>> Thanks, >>> Ryan >>> >>>> >>>> >>>> Regards >>>> Yin, Fengwei >>> >