On Wed, Oct 25, 2023 at 2:17 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
>
> On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple <apopple@xxxxxxxxxx> wrote:
> >
> >
> > Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> writes:
> >
> > > On 10/25/2023 9:58 AM, Alistair Popple wrote:
> > >> Barry Song <21cnbao@xxxxxxxxx> writes:
> > >>
> > >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@xxxxxxxxxx> wrote:
> > >>>>
> > >>>>
> > >>>> Barry Song <21cnbao@xxxxxxxxx> writes:
> > >>>>
> > >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > >>>>>>
> > >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
> > >>>>>> <baolin.wang@xxxxxxxxxxxxxxxxx> wrote:
> > >> [...]
> > >>
> > >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
> > >>>>>> page, B might be a correct choice.
> > >>>>>
> > >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to
> > >>>>> the head of lru list or new generation, it needs a long time to slide back
> > >>>>> to the inactive list tail or to the candidate generation of mglru. the time
> > >>>>> should have been large enough for tlb to be flushed. If the page is really
> > >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an
> > >>>>> access flag in the long time in which the page is re-moved to the tail
> > >>>>> as the page can be accessed multiple times if it is really hot.
> > >>>>
> > >>>> This might not be true if you have external hardware sharing the page
> > >>>> tables with software through either HMM or hardware supported ATS
> > >>>> though.
> > >>>>
> > >>>> In those cases I think it's much more likely hardware can still be
> > >>>> accessing the page even after a context switch on the CPU say.
> > >>>> So those pages will tend to get reclaimed even though hardware is still
> > >>>> actively using them which would be quite expensive and I guess could lead
> > >>>> to thrashing as each page is reclaimed and then immediately faulted back
> > >>>> in.
> > >
> > > That's possible, but the chance should be relatively low. At least on
> > > x86, I have not heard of this issue.
> >
> > Personally I've never seen any x86 system that shares page tables with
> > external devices, other than with HMM. More on that below.
> >
> > >>> i am not quite sure i got your point. has the external hardware sharing cpu's
> > >>> pagetable the ability to set access flag in page table entries by
> > >>> itself? if yes,
> > >>> I don't see how our approach will hurt as folio_referenced can notify the
> > >>> hardware driver and the driver can flush its own tlb. If no, i don't see
> > >>> either as the external hardware can't set access flags, that means we
> > >>> have ignored its reference and only knows cpu's access even in the current
> > >>> mainline code. so we are not getting worse.
> > >>>
> > >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can
> > >>> also broadcast to external hardware, then external hardware sees the
> > >>> cleared access flag, thus, it can set access flag in page table when the
> > >>> hardware access it? If this is the case, I feel what you said is true.
> > >> Perhaps it would help if I gave a concrete example. Take for example
> > >> the ARM SMMU. It has its own TLB. Invalidating this TLB is done in one of
> > >> two ways depending on the specific HW implementation.
> > >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU TLB
> > >> invalidations. If BTM is not supported it relies on SW to explicitly
> > >> forward TLB invalidations via MMU notifiers.
> > >
> > > On our ARM64 hardware, we rely on BTM to maintain TLB coherency.
> >
> > Lucky you :-)
> >
> > ARM64 SMMU architecture specification supports the possibility of both,
> > as does the driver. Not that I think whether or not BTM is supported has
> > much relevance to this issue.
> >
> > >> Now consider the case where some external device is accessing mappings
> > >> via the SMMU. The access flag will be cached in the SMMU TLB. If we
> > >> clear the access flag without a TLB invalidate the access flag in the
> > >> CPU page table will not get updated because it's already set in the SMMU
> > >> TLB.
> > >> As an aside access flag updates happen in one of two ways. If the SMMU
> > >> HW supports hardware translation table updates (HTTU) then hardware will
> > >> manage updating access/dirty flags as required. If this is not supported
> > >> then SW is relied on to update these flags which in practice means
> > >> taking a minor fault. But I don't think that is relevant here - in
> > >> either case without a TLB invalidate neither of those things will
> > >> happen.
> > >> I suppose drivers could implement the clear_flush_young() MMU notifier
> > >> callback (none do at the moment AFAICT) but then won't that just lead to
> > >> the opposite problem - that every page ever used by an external device
> > >> remains active and unavailable for reclaim because the access flag never
> > >> gets cleared? I suppose they could do the flush then which would ensure
> > >
> > > Yes, I think so too. The reason there is currently no problem, perhaps
> > > I think, there are no actual use cases at the moment? At least on our
> > > Alibaba's fleet, SMMU and MMU do not share page tables now.
> >
> > We have systems that do.
>
> Just curious: do those systems run the Linux kernel? If so, are pages
> shared with SMMU pinned? If not, then how are IO PFs handled after
> pages are reclaimed?

The I/O page fault path calls
handle_mm_fault(vma, prm->addr, fault_flags, NULL), so it ends up running
the same code as a CPU page fault to get the page back.
Years ago, we recommended a pin solution, but obviously there was a lot
of pushback:
https://lore.kernel.org/linux-mm/1612685884-19514-1-git-send-email-wangzhou1@xxxxxxxxxxxxx/

Thanks
Barry