On Tue, Dec 22, 2020 at 09:56:11PM -0500, Andrea Arcangeli wrote: > On Tue, Dec 22, 2020 at 04:39:46PM -0700, Yu Zhao wrote: > > We are talking about non-COW anon pages here -- they can't be mapped > > more than once. So why not just identify them by checking > > page_mapcount == 1 and then unconditionally reuse them? (This is > > probably where I've missed things.) > > The problem in depending on page_mapcount to decide if it's COW or > non-COW (respectively wp_page_copy or wp_page_reuse) is that is GUP > may elevate the count of a COW anon page that become a non-COW anon > page. > > This is Jann's idea not mine. > > The problem is we have an unprivileged long term GUP like vmsplice > that facilitates elevating the page count indefinitely, until the > parent finally writes a secret to it. Theoretically a short term pin > would do it too so it's not just vmpslice, but the short term pin > would be incredibly more challenging to become a concern since it'd > kill a phone battery and flash before it can read any data. > > So what happens with your page_mapcount == 1 check is that it doesn't > mean non-COW (we thought it did until it didn't for the long term gup > pin in vmsplice). > > Jann's testcases does fork() and set page_mapcount 2 and page_count to > 2, vmsplice, take unprivileged infinitely long GUP pin to set > page_count to 3, queue the page in the pipe with page_count elevated, > munmap to drop page_count to 2 and page_mapcount to 1. > > page_mapcount is 1, so you'd think the page is non-COW and owned by > the parent, but the child can still read it so it's very much still > wp_page_copy material if the parent tries to modify it. Otherwise the > child can read the content. > > This was supposed to be solvable by just doing the COW in gup(write=0) > case if page_mapcount > 1 with commit 17839856fd58. I'm not exactly > sure why that didn't fly and it had to be reverted by Peter in > a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a but at the time this was > happening I was side tracked by urgent issues and I didn't manage to > look back of how we ended up with the big hammer page_count == 1 check > instead to decide if to call wp_page_reuse or wp_page_shared. > > So anyway, the only thing that is clear to me is that keeping the > child from reading the page_mapcount == 1 pages of the parent, is the > only reason why wp_page_reuse(vmf) will only be called on > page_count(page) == 1 and not on page_mapcount(page) == 1. > > It's also the reason why your page_mapcount assumption will risk to > reintroduce the issue, and I only wish we could put back page_mapcount > == 1 back there. > > Still even if we put back page_mapcount there, it is not ok to leave > the page fault with stale TLB entries and to rely on the fact > wp_page_shared won't run. It'd also avoid the problem but I think if > you leave stale TLB entries in change_protection just like NUMA > balancing does, it also requires a catcher just like NUMA balancing > has, or it'd truly work by luck. > > So until we can put a page_mapcount == 1 check back there, the > page_count will be by definition unreliable because of the speculative > lookups randomly elevating all non zero page_counts at any time in the > background on all pages, so you will never be able to tell if a page > is true COW or if it's just a spurious COW because of a speculative > lookup. It is impossible to differentiate a speculative lookup from a > vmsplice ref in a child. Thanks for the details. In your patch, do we need to take wrprotect_rwsem in handle_userfault() as well? Otherwise, it seems userspace would have to synchronize between its wrprotect ioctl and fault handler? i.e., the fault hander needs to be aware that the content of write- protected pages can actually change before the iotcl returns.