> On Oct 12, 2021, at 4:20 PM, Peter Xu <peterx@xxxxxxxxxx> wrote:
> 
> On Tue, Oct 12, 2021 at 10:31:45AM -0700, Nadav Amit wrote:
>> 
>> 
>>> On Oct 12, 2021, at 3:16 AM, Peter Xu <peterx@xxxxxxxxxx> wrote:
>>> 
>>> On Sat, Sep 25, 2021 at 01:54:22PM -0700, Nadav Amit wrote:
>>>> @@ -338,25 +344,25 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
>>>> 	struct mm_struct *mm = vma->vm_mm;
>>>> 	pgd_t *pgd;
>>>> 	unsigned long next;
>>>> -	unsigned long start = addr;
>>>> 	unsigned long pages = 0;
>>>> +	struct mmu_gather tlb;
>>>> 
>>>> 	BUG_ON(addr >= end);
>>>> 	pgd = pgd_offset(mm, addr);
>>>> 	flush_cache_range(vma, addr, end);
>>>> 	inc_tlb_flush_pending(mm);
>>>> +	tlb_gather_mmu(&tlb, mm);
>>>> +	tlb_start_vma(&tlb, vma);
>>> 
>>> Pure question:
>>> 
>>> I actually have no idea why tlb_start_vma() is needed here, as protection range
>>> can be just a single page, but anyway.. I do see that tlb_start_vma() contains
>>> a whole-vma flush_cache_range() when the arch needs it, then does it mean that
>>> besides the inc_tlb_flush_pending() to be dropped, so as to the other call to
>>> flush_cache_range() above?
>> 
>> Good point.
>> 
>> tlb_start_vma() and tlb_end_vma() are required since some archs do not
>> batch TLB flushes across VMAs (e.g., ARM).
> 
> Sorry I didn't follow here - as change_protection() is per-vma anyway, so I
> don't see why it needs to consider vma crossing.
> 
> In all cases, it'll be great if you could add some explanation into commit
> message on why we need tlb_{start|end}_vma(), as I think it could not be
> obvious to all people.

tlb_start_vma() is required when we switch away from flush_tlb_range()
because certain properties of the VMA (e.g., whether it is executable) are
needed on certain architectures. That is the reason flush_tlb_range()
requires the VMA being invalidated to be provided.

Regardless, this is the interface and this is how it is meant to be used. I
am not inclined to break it, even if that were possible, for unclear
performance benefits.

As I discussed offline with Andrea and David, switching to the
tlb_gather_mmu() interface has advantages beyond batching and avoiding
unnecessary flushes on PTE permission promotion (as done in patch 2). If a
single PTE is updated out of a bigger range, flush_tlb_range() currently
flushes the whole range instead of the single page. In addition, once I fix
this patch-set, updating a THP would (at least on x86) flush a single PTE
instead of 512 entries (which, today, is actually done with a full TLB
flush).

I would say that, as I mentioned in a different thread and was not upfront
about before, one of my motivations behind this patch is that I need a
vectored UFFDIO_WRITEPROTECTV interface for performance. Nevertheless, I
think these two patches stand by themselves and have independent value.
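
For reference, below is a rough sketch (not the actual patch) of what the
converted function could look like with the gather interface, pieced
together from the hunk quoted above and the generic
include/asm-generic/tlb.h API. Threading the mmu_gather through
change_p4d_range() and the exact placement of tlb_end_vma()/tlb_finish_mmu()
are illustrative assumptions here; note also that tlb_gather_mmu() and
tlb_finish_mmu() take care of inc/dec_tlb_flush_pending() themselves, so the
explicit calls should be droppable:

/*
 * Illustrative sketch only: assumes the generic mmu_gather interface and
 * that change_p4d_range() has been taught to take the gather state.
 */
#include <linux/mm.h>
#include <asm/tlb.h>
#include <asm/cacheflush.h>

static unsigned long change_protection_range(struct vm_area_struct *vma,
		unsigned long addr, unsigned long end, pgprot_t newprot,
		unsigned long cp_flags)
{
	struct mm_struct *mm = vma->vm_mm;
	struct mmu_gather tlb;
	unsigned long next;
	unsigned long pages = 0;
	pgd_t *pgd;

	BUG_ON(addr >= end);
	pgd = pgd_offset(mm, addr);
	flush_cache_range(vma, addr, end);

	/*
	 * tlb_gather_mmu() does inc_tlb_flush_pending() internally;
	 * tlb_start_vma() records the VMA properties (executable etc.)
	 * that flush_tlb_range()-style flushes need on some arches.
	 */
	tlb_gather_mmu(&tlb, mm);
	tlb_start_vma(&tlb, vma);

	do {
		next = pgd_addr_end(addr, end);
		if (pgd_none_or_clear_bad(pgd))
			continue;
		/*
		 * The lower levels record each range they actually modify
		 * (e.g. via tlb_flush_pte_range()), so only touched PTEs
		 * end up being flushed, not the whole [addr, end) range.
		 */
		pages += change_p4d_range(&tlb, vma, pgd, addr, next,
					  newprot, cp_flags);
	} while (pgd++, addr = next, addr != end);

	tlb_end_vma(&tlb, vma);	/* per-VMA flush on arches that need it */
	tlb_finish_mmu(&tlb);	/* flush what was batched; dec_tlb_flush_pending() */

	return pages;
}

The point being: the final flush covers only what the walk actually changed,
instead of an unconditional flush_tlb_range() over the whole range.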