On Tue, May 17, 2016 at 4:26 AM, Ashish Srivastava
<ashish0srivastava0@xxxxxxxxx> wrote:
> Yes, the original repro was using a custom allocator, but I was seeing the
> issue with malloc'd memory as well on my (ARMv7) platform.
> I agree that the repro code won't reliably work, so I have modified the
> repro code attached to the bug to use file-backed memory.

Ah, I was going to ask if you were doing this on some platform other than
x86. I followed your reasoning, but when I tested the unpatched kernel, I
couldn't reproduce the problem. I used perf to count page faults and still
didn't see a difference.

> That really is the root cause of the problem. I can make the following
> change in the kernel that makes the slow-writes problem go away.
> It makes vma_set_page_prot() return the value of vma_wants_writenotify()
> to the caller after setting vma->vm_page_prot.
>
> In vma_set_page_prot():
>
> -void vma_set_page_prot(struct vm_area_struct *vma)
> +bool vma_set_page_prot(struct vm_area_struct *vma)
>  {
>  	unsigned long vm_flags = vma->vm_flags;
>
>  	vma->vm_page_prot = vm_pgprot_modify(vma->vm_page_prot, vm_flags);
>  	if (vma_wants_writenotify(vma)) {
>  		vm_flags &= ~VM_SHARED;
>  		vma->vm_page_prot = vm_pgprot_modify(vma->vm_page_prot,
>  						     vm_flags);
> +		return true;
>  	}
> +	return false;
>  }
>
> In mprotect_fixup():
>
>  	 * held in write mode.
>  	 */
>  	vma->vm_flags = newflags;
> -	dirty_accountable = vma_wants_writenotify(vma);
> -	vma_set_page_prot(vma);
> +	dirty_accountable = vma_set_page_prot(vma);
>
>  	change_protection(vma, start, end, vma->vm_page_prot,
>  			  dirty_accountable, 0);
>
> Thanks!
> Ashish
>
> On Mon, May 16, 2016 at 7:05 PM, Kirill A. Shutemov <kirill@xxxxxxxxxxxxx>
> wrote:
>>
>> On Fri, May 06, 2016 at 03:01:12PM -0700, Andrew Morton wrote:
>> >
>> > (switched to email. Please respond via emailed reply-to-all, not via
>> > the bugzilla web interface).
>> >
>> > Great bug report, thanks.
>> >
>> > I assume the breakage was caused by
>> >
>> > commit 64e455079e1bd7787cc47be30b7f601ce682a5f6
>> > Author:     Peter Feiner <pfeiner@xxxxxxxxxx>
>> > AuthorDate: Mon Oct 13 15:55:46 2014 -0700
>> > Commit:     Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>> > CommitDate: Tue Oct 14 02:18:28 2014 +0200
>> >
>> >     mm: softdirty: enable write notifications on VMAs after
>> >     VM_SOFTDIRTY cleared
>> >
>> > Could someone (Peter, Kirill?) please take a look?
>> >
>> > On Fri, 06 May 2016 13:15:19 +0000 bugzilla-daemon@xxxxxxxxxxxxxxxxxxx
>> > wrote:
>> >
>> > > https://bugzilla.kernel.org/show_bug.cgi?id=117731
>> > >
>> > >             Bug ID: 117731
>> > >            Summary: Doing mprotect for PROT_NONE and then for
>> > >                     PROT_READ|PROT_WRITE reduces CPU write B/W on
>> > >                     buffer
>> > >            Product: Memory Management
>> > >            Version: 2.5
>> > >     Kernel Version: 3.18 and beyond
>> > >           Hardware: All
>> > >                 OS: Linux
>> > >               Tree: Mainline
>> > >             Status: NEW
>> > >           Severity: high
>> > >           Priority: P1
>> > >          Component: Other
>> > >           Assignee: akpm@xxxxxxxxxxxxxxxxxxxx
>> > >           Reporter: ashish0srivastava0@xxxxxxxxx
>> > >         Regression: No
>> > >
>> > > Created attachment 215401
>> > >   --> https://bugzilla.kernel.org/attachment.cgi?id=215401&action=edit
>> > > Repro code
>>
>> The code is somewhat broken: malloc() is not guaranteed to return a
>> page-aligned pointer, and in my case that leads to -EINVAL from
>> mprotect().
>>
>> Do you have a custom malloc()?
>>
>> > > This is a regression that is present in kernel 3.18 and beyond, and
>> > > not in previous ones.
>> > > Attached is a simple repro case. It measures the time taken to write
>> > > and then read all pages in a buffer; then it does mprotect for
>> > > PROT_NONE and then mprotect for PROT_READ|PROT_WRITE; then it again
>> > > measures the time taken to write and then read all pages in the
>> > > buffer. The second time is much larger (20 to 30 times) than the
>> > > first.
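[Editor's note: on Kirill's alignment point — mprotect() requires a page-aligned address, and plain malloc() gives no such guarantee, hence the -EINVAL he saw. Below is a sketch of one portable way to get a page-aligned buffer; the helper name `alloc_page_aligned` is mine.]

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* malloc() may return any alignment; mprotect() rejects addresses that
 * are not page-aligned with -EINVAL.  posix_memalign() lets the caller
 * request page alignment explicitly. */
static void *alloc_page_aligned(size_t len)
{
	void *p = NULL;
	size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);

	if (posix_memalign(&p, pagesz, len) != 0)
		return NULL;
	return p;
}
```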
>> > >
>> > > I have looked at the code in the kernel tree that is causing this,
>> > > and it is because writes are causing faults, as pte_mkwrite() is not
>> > > being done during mprotect_fixup() for PROT_READ|PROT_WRITE.
>> > >
>> > > This is the code inside mprotect_fixup() in v3.16.35 or older:
>> > >
>> > > 	/*
>> > > 	 * vm_flags and vm_page_prot are protected by the mmap_sem
>> > > 	 * held in write mode.
>> > > 	 */
>> > > 	vma->vm_flags = newflags;
>> > > 	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
>> > > 					  vm_get_page_prot(newflags));
>> > >
>> > > 	if (vma_wants_writenotify(vma)) {
>> > > 		vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
>> > > 		dirty_accountable = 1;
>> > > 	}
>> > >
>> > > This is the code in the same region inside mprotect_fixup() in a
>> > > recent tree:
>> > >
>> > > 	/*
>> > > 	 * vm_flags and vm_page_prot are protected by the mmap_sem
>> > > 	 * held in write mode.
>> > > 	 */
>> > > 	vma->vm_flags = newflags;
>> > > 	dirty_accountable = vma_wants_writenotify(vma);
>> > > 	vma_set_page_prot(vma);
>> > >
>> > > The difference is the setting of dirty_accountable. The result of
>> > > vma_wants_writenotify() does not depend on vma->vm_flags alone but
>> > > also on vma->vm_page_prot, and the following check will make it
>> > > return 0, because in the newer code we are setting dirty_accountable
>> > > before setting vma->vm_page_prot:
>> > >
>> > > 	/* The open routine did something to the protections that
>> > > 	 * pgprot_modify won't preserve? */
>> > > 	if (pgprot_val(vma->vm_page_prot) !=
>> > > 	    pgprot_val(vm_pgprot_modify(vma->vm_page_prot, vm_flags)))
>> > > 		return 0;
>>
>> The test-case will never hit this, as a normal malloc() returns
>> anonymous memory, which is handled by the first check in
>> vma_wants_writenotify().
>>
>> The only case where this can change anything for you is if your
>> malloc() returns file-backed memory. Which is possible, I guess, with a
>> custom malloc().
>> >
>> > > Now, suppose we change the code by calling vma_set_page_prot()
>> > > before setting dirty_accountable:
>> > >
>> > > 	vma->vm_flags = newflags;
>> > > 	vma_set_page_prot(vma);
>> > > 	dirty_accountable = vma_wants_writenotify(vma);
>> > >
>> > > Still, dirty_accountable will be 0. This is because the following
>> > > code in vma_set_page_prot() modifies vma->vm_page_prot without
>> > > modifying vma->vm_flags:
>> > >
>> > > 	if (vma_wants_writenotify(vma)) {
>> > > 		vm_flags &= ~VM_SHARED;
>> > > 		vma->vm_page_prot = vm_pgprot_modify(vma->vm_page_prot,
>> > > 						     vm_flags);
>> > > 	}
>> > >
>> > > so this check in vma_wants_writenotify() will again return 0:
>> > >
>> > > 	/* The open routine did something to the protections that
>> > > 	 * pgprot_modify won't preserve? */
>> > > 	if (pgprot_val(vma->vm_page_prot) !=
>> > > 	    pgprot_val(vm_pgprot_modify(vma->vm_page_prot, vm_flags)))
>> > > 		return 0;
>> > >
>> > > So dirty_accountable is still 0.
>> > >
>> > > This code in change_pte_range() decides whether to call
>> > > pte_mkwrite():
>> > >
>> > > 	/* Avoid taking write faults for known dirty pages */
>> > > 	if (dirty_accountable && pte_dirty(ptent) &&
>> > > 	    (pte_soft_dirty(ptent) ||
>> > > 	     !(vma->vm_flags & VM_SOFTDIRTY))) {
>> > > 		ptent = pte_mkwrite(ptent);
>> > > 	}
>> > >
>> > > If dirty_accountable is 0, pte_mkwrite() will not be done even
>> > > though the pte was already dirty.
>> > >
>> > > I think the correct fix is for dirty_accountable to be set with the
>> > > value of vma_wants_writenotify() queried before vma->vm_page_prot is
>> > > set with VM_SHARED removed from the flags. One way to do so would be
>> > > to have vma_set_page_prot() return the value of dirty_accountable,
>> > > which it can set right after the vma_wants_writenotify() check.
>> > > Another way would be to do
>> > >
>> > > 	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
>> > > 					  vm_get_page_prot(newflags));
>> > >
>> > > then set dirty_accountable based on vma_wants_writenotify(), and
>> > > then call vma_set_page_prot().
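[Editor's note: the ordering trap described above can be modeled outside the kernel. The toy below uses stand-in types and a fake pgprot derivation — none of it is kernel code — to show why querying wants_writenotify() after the prot downgrade returns false, while returning the value from set_page_prot() itself, as the proposed patch does, captures it correctly.]

```c
#include <assert.h>
#include <stdbool.h>

#define VM_SHARED 0x8UL

/* Toy stand-in for struct vm_area_struct. */
struct toy_vma {
	unsigned long vm_flags;
	unsigned long vm_page_prot;
};

/* Toy stand-in for vm_pgprot_modify(): protection bits derived purely
 * from the flags. */
static unsigned long prot_from_flags(unsigned long flags)
{
	return flags & 0xf;
}

/* Toy vma_wants_writenotify(): mirrors the real check "did someone do
 * something to the protections that pgprot_modify won't preserve?" */
static bool wants_writenotify(const struct toy_vma *vma)
{
	if (!(vma->vm_flags & VM_SHARED))
		return false;
	if (vma->vm_page_prot != prot_from_flags(vma->vm_flags))
		return false;	/* prot already diverged from flags */
	return true;
}

/* Toy vma_set_page_prot() with the proposed fix: it returns the
 * writenotify decision captured BEFORE the prot downgrade. */
static bool set_page_prot(struct toy_vma *vma)
{
	unsigned long flags = vma->vm_flags;

	vma->vm_page_prot = prot_from_flags(flags);
	if (wants_writenotify(vma)) {
		flags &= ~VM_SHARED;
		vma->vm_page_prot = prot_from_flags(flags); /* downgrade */
		return true;
	}
	return false;
}
```

After set_page_prot() runs, vm_page_prot no longer matches what prot_from_flags() derives from vm_flags, so a post-hoc wants_writenotify() query returns false even though the returned value was true — exactly the mismatch the thread describes.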
>>
>> Looks like a good catch, but I'm not sure it's the root cause of your
>> problem.
>>
>> --
>>  Kirill A. Shutemov
>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .