On 6/26/23 22:46, Lorenzo Stoakes wrote:
> When mprotect() is used to make unwritable VMAs writable, they have the
> VM_ACCOUNT flag applied and memory accounted accordingly.
> 
> If the VMA has had no pages faulted in and is then made unwritable once
> again, it will remain accounted for, despite not being capable of extending
> memory usage.
> 
> Consider:-
> 
> ptr = mmap(NULL, page_size * 3, PROT_READ, MAP_ANON | MAP_PRIVATE, -1, 0);
> mprotect(ptr + page_size, page_size, PROT_READ | PROT_WRITE);
> mprotect(ptr + page_size, page_size, PROT_READ);

In Mike's original example there were actual pages populated; in that case
we still won't merge the VMAs, right? Guess that can't be helped.

> The first mprotect() splits the range into 3 VMAs and the second fails to
> merge the three as the middle VMA has VM_ACCOUNT set and the others do not,
> rendering them unmergeable.
> 
> This is unnecessary, since no pages have actually been allocated and the
> middle VMA is not capable of utilising more memory, thereby introducing
> unnecessary VMA fragmentation (and accounting for more memory than is
> necessary).
> 
> Since we cannot efficiently determine which pages map to an anonymous VMA,
> we have to be very conservative - determining whether any pages at all have
> been faulted in, by checking whether vma->anon_vma is NULL.
> 
> We can see that the lack of anon_vma implies that no anonymous pages are
> present as evidenced by vma_needs_copy() utilising this on fork to
> determine whether page tables need to be copied.
> 
> The only place where anon_vma is set NULL explicitly is on fork with
> VM_WIPEONFORK set, however since this flag is intended to cause the child
> process to not CoW on a given memory range, it is right to interpret this
> as indicating the VMA has no faulted-in anonymous memory mapped.
> 
> If the VMA was forked without VM_WIPEONFORK set, then anon_vma_fork() will
> have ensured that a new anon_vma is assigned (and correctly related to its
> parent anon_vma) should any pages be CoW-mapped.
> 
> The overall operation is safe against races as we hold a write lock against
> mm->mmap_lock.
> 
> If we could efficiently look up the VMA's faulted-in pages then we would
> unaccount all those pages not yet faulted in. However as the original
> comment alludes this simply isn't currently possible, so we remain
> conservative and account all pages or none at all.
> 
> Signed-off-by: Lorenzo Stoakes <lstoakes@xxxxxxxxx>

So in practice programs will likely do the PROT_WRITE in order to actually
populate the area, so this won't trigger as I commented above. But it can
still help in some cases and is cheap to do, so:

Acked-by: Vlastimil Babka <vbabka@xxxxxxx>

> ---
>  mm/mprotect.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 6f658d483704..9461c936082b 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -607,8 +607,11 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
>          /*
>           * If we make a private mapping writable we increase our commit;
>           * but (without finer accounting) cannot reduce our commit if we
> -         * make it unwritable again. hugetlb mapping were accounted for
> -         * even if read-only so there is no need to account for them here
> +         * make it unwritable again except in the anonymous case where no
> +         * anon_vma has yet been assigned.
> +         *
> +         * hugetlb mapping were accounted for even if read-only so there is
> +         * no need to account for them here.
>          */
>         if (newflags & VM_WRITE) {
>                 /* Check space limits when area turns into data. */
> @@ -622,6 +625,9 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
>                                 return -ENOMEM;
>                         newflags |= VM_ACCOUNT;
>                 }
> +       } else if ((oldflags & VM_ACCOUNT) && vma_is_anonymous(vma) &&
> +                  !vma->anon_vma) {
> +               newflags &= ~VM_ACCOUNT;
>         }
> 
>         /*
> @@ -652,6 +658,9 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
>         }
> 
>  success:
> +       if ((oldflags & VM_ACCOUNT) && !(newflags & VM_ACCOUNT))
> +               vm_unacct_memory(nrpages);
> +
>         /*
>          * vm_flags and vm_page_prot are protected by the mmap_lock
>          * held in write mode.
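
In case it helps anyone reproduce this, here is a quick userspace sketch of
the sequence from the commit message. It just dumps the lines of
/proc/self/maps covering the mapping after each step, so you can see whether
the range shows up as one VMA or three; checking VmFlags in /proc/self/smaps
for the "ac" bit should likewise show VM_ACCOUNT being dropped on a patched
kernel. The sysconf() page-size lookup and the dump_maps() helper are my
additions for illustration, not part of the patch.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Print only the /proc/self/maps lines covering [start, start + len) so
 * the VMA split/merge around our mapping is easy to see. */
static void dump_maps(const char *label, void *start, size_t len)
{
        FILE *f = fopen("/proc/self/maps", "r");
        unsigned long lo = (unsigned long)start, hi = lo + len;
        char line[256];

        printf("--- %s ---\n", label);
        while (f && fgets(line, sizeof(line), f)) {
                unsigned long vm_start, vm_end;

                if (sscanf(line, "%lx-%lx", &vm_start, &vm_end) == 2 &&
                    vm_start < hi && vm_end > lo)
                        fputs(line, stdout);
        }
        if (f)
                fclose(f);
}

int main(void)
{
        size_t page_size = sysconf(_SC_PAGESIZE);
        char *ptr = mmap(NULL, page_size * 3, PROT_READ,
                         MAP_ANON | MAP_PRIVATE, -1, 0);

        if (ptr == MAP_FAILED)
                return EXIT_FAILURE;

        dump_maps("after mmap", ptr, page_size * 3);

        /* Splits the range into three VMAs; only the middle one gets
         * VM_ACCOUNT. No page is ever faulted in. */
        mprotect(ptr + page_size, page_size, PROT_READ | PROT_WRITE);
        dump_maps("after mprotect(PROT_READ | PROT_WRITE)", ptr, page_size * 3);

        /* On a patched kernel the middle VMA drops VM_ACCOUNT here and the
         * three VMAs can merge back into one; unpatched, three remain. */
        mprotect(ptr + page_size, page_size, PROT_READ);
        dump_maps("after mprotect(PROT_READ)", ptr, page_size * 3);

        return EXIT_SUCCESS;
}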