On Fri, Feb 01, 2019 at 06:57:38PM -0500, Andrea Arcangeli wrote: > Hello everyone, > > On Thu, Jan 31, 2019 at 01:37:02PM -0500, Jerome Glisse wrote: > > From: Jérôme Glisse <jglisse@xxxxxxxxxx> > > > > This patchset is on top of my patchset to add context information to > > mmu notifier [1] you can find a branch with everything [2]. I have not > > tested it but i wanted to get the discussion started. I believe it is > > correct but i am not sure what kind of kvm test i can run to exercise > > this. > > > > The idea is that since kvm will invalidate the secondary MMUs within > > invalidate_range callback then the change_pte() optimization is lost. > > With this patchset everytime core mm is using set_pte_at_notify() and > > thus change_pte() get calls then we can ignore the invalidate_range > > callback altogether and only rely on change_pte callback. > > > > Note that this is only valid when either going from a read and write > > pte to a read only pte with same pfn, or from a read only pte to a > > read and write pte with different pfn. The other side of the story > > is that the primary mmu pte is clear with ptep_clear_flush_notify > > before the call to change_pte. > > If it's cleared with ptep_clear_flush_notify, change_pte still won't > work. The above text needs updating with > "ptep_clear_flush". set_pte_at_notify is all about having > ptep_clear_flush only before it or it's the same as having a range > invalidate preceding it. > > With regard to the code, wp_page_copy() needs > s/ptep_clear_flush_notify/ptep_clear_flush/ before set_pte_at_notify. > > change_pte relies on the ptep_clear_flush preceding the > set_pte_at_notify that will make sure if the secondary MMU mapping > randomly disappears between ptep_clear_flush and set_pte_at_notify, > gup_fast will wait and block on the PT lock until after > set_pte_at_notify is completed before trying to re-establish a > secondary MMU mapping. > > So then we've only to worry about what happens because we left the > secondary MMU mapping potentially intact despite we flushed the > primary MMU mapping with ptep_clear_flush (as opposed to > ptep_clear_flush_notify which would teardown the secondary MMU mapping > too). So all the above is moot since as you pointed out in the other email ptep_clear_flush_notify does not invalidate kvm secondary mmu hence. > > In you wording above at least the "with a different pfn" is > superflous. I think it's ok if the protection changes from read-only > to read-write and the pfn remains the same. Like when we takeover a > page because it's not shared anymore (fork child quit). > > It's also ok to change pfn if the mapping is read-only and remains > read-only, this is what KSM does in replace_page. Yes i thought this was obvious i will reword and probably just do a list of every case that is fine. > > The read-write to read-only case must not change pfn to avoid losing > coherency from the secondary MMU point of view. This isn't so much > about change_pte itself, but the fact that the page-copy generally > happens well before the pte mangling starts. This case never presents > itself in the code because KSM is first write protecting the page and > only later merging it, regardless of change_pte or not. > > The important thing is that the secondary MMU must be updated first > (unlike the invalidates) to be sure the secondary MMU already points > to the new page when the pfn changes and the protection changes from > read-only to read-write (COW fault). The primary MMU cannot read/write > to the page anyway while we update the secondary MMU because we did > ptep_clear_flush() before calling set_pte_at_notify(). So this > ordering of "ptep_clear_flush; change_pte; set_pte_at" ensures > whenever the CPU can access the memory, the access is synchronous > with the secondary MMUs because they've all been updated already. > > If (in set_pte_at_notify) we were to call change_pte() after > set_pte_at() what would happen is that the CPU could write to the page > through a TLB fill without page fault while the secondary MMUs still > read the old memory in the old readonly page. Yeah, between do you have any good workload for me to test this ? I was thinking of running few same VM and having KSM work on them. Is there some way to trigger KVM to fork ? As the other case is breaking COW after fork. Cheers, Jérôme