Hello Kirill and Vlastimil, On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote: > On 3/1/19 10:37 AM, Kirill A. Shutemov wrote: > > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote: > >> Hello, > >> > >> This was a well known issue for more than a decade, but until a few > >> months ago we relied on the compiler to stick to atomic accesses and > >> updates while walking and updating pagetables. > >> > >> However now the 64bit native_set_pte finally uses WRITE_ONCE and > >> gup_pmd_range uses READ_ONCE as well. > >> > >> This convert more racy VM places to avoid depending on the expected > >> compiler behavior to achieve kernel runtime correctness. > >> > >> It mostly guarantees gcc to do atomic updates at 64bit granularity > >> (practically not needed) and it also prevents gcc to emit code that > >> risks getting confused if the memory unexpectedly changes under it > >> (unlikely to ever be needed). > >> > >> The list of vm_start/end/pgoff to update isn't complete, I covered the > >> most obvious places, but before wasting too much time at doing a full > >> audit I thought it was safer to post it and get some comment. More > >> updates can be posted incrementally anyway. > > > > The intention is described well to my eyes. > > > > Do I understand correctly, that it's attempt to get away with modifying > > vma's fields under down_read(mmap_sem)? The issue is that we already get away with it, but we do it without READ/WRITE_ONCE. The patch should changes nothing, it should only reduce the dependency on the compiler to do what we expect. > If that's the intention, then IMHO it's not that well described. It > talks about "racy VM places" but e.g. the __mm_populate() changes are > for code protected by down_read(). So what's going on here? expand_stack can move anonymous vma vm_end up or vm_start/pgoff down, while we hold the mmap_sem for writing. See the location of the three WRITE_ONCE in the patch. So whenever we deal with a vma that we don't know if it's filebacked (filebacked vmas cannot growsup/down) and that we don't know if it has VM_GROWSDOWN/UP set, we shall use READ_ONCE to access vm_start/end/pgoff. This is the only thing the patch is about, and it should make no runtime difference at all, but then the WRITE_ONCE in native_set_pte also should make no runtime difference just like the READ_ONCE in gup_pmd_range should make no runtime difference. I mean we don't trust the compiler with gup_fast but then we trust it with expand_stack vs find_vma.