On Mon 04-03-19 13:12:10, Kirill A. Shutemov wrote: > On Fri, Mar 01, 2019 at 11:54:52AM -0500, Andrea Arcangeli wrote: > > Hello Kirill and Vlastimil, > > > > On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote: > > > On 3/1/19 10:37 AM, Kirill A. Shutemov wrote: > > > > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote: > > > >> Hello, > > > >> > > > >> This was a well known issue for more than a decade, but until a few > > > >> months ago we relied on the compiler to stick to atomic accesses and > > > >> updates while walking and updating pagetables. > > > >> > > > >> However now the 64bit native_set_pte finally uses WRITE_ONCE and > > > >> gup_pmd_range uses READ_ONCE as well. > > > >> > > > >> This convert more racy VM places to avoid depending on the expected > > > >> compiler behavior to achieve kernel runtime correctness. > > > >> > > > >> It mostly guarantees gcc to do atomic updates at 64bit granularity > > > >> (practically not needed) and it also prevents gcc to emit code that > > > >> risks getting confused if the memory unexpectedly changes under it > > > >> (unlikely to ever be needed). > > > >> > > > >> The list of vm_start/end/pgoff to update isn't complete, I covered the > > > >> most obvious places, but before wasting too much time at doing a full > > > >> audit I thought it was safer to post it and get some comment. More > > > >> updates can be posted incrementally anyway. > > > > > > > > The intention is described well to my eyes. > > > > > > > > Do I understand correctly, that it's attempt to get away with modifying > > > > vma's fields under down_read(mmap_sem)? > > > > The issue is that we already get away with it, but we do it without > > READ/WRITE_ONCE. The patch should changes nothing, it should only > > reduce the dependency on the compiler to do what we expect. > > Yes, it is pre-existing problem. And yes, complier may screw this up. > The patch may reduce dependency on the compiler, but it doesn't mean it > reduces chance of race. > > Consider your changes into __mm_populate() and populate_vma_page_range(). > You put READ_ONCE() in both functions. But populate_vma_page_range() gets > called from __mm_populate(). Before your change compiler may optimize the > code and load from the memory once for a field. With your changes complier > will issue two loads. > > It *increases* chances of the race, not reduces them. > > The current locking scheme doesn't allow modifying VMA field without > down_write(mmap_sem). > > We do have hacks[1] that try to bypass the limitation, but AFAIK we never > had a solid explanation why this should work. Sparkling READ_ONCE() > doesn't help with this, but makes it appears legitimate. I do agree with Kirill here. Sprinkling {READ,WRITE}_ONCE around just doesn't solve anything. I am pretty sure that people will not think about it and we will end up in a similar half covered situation in few years again. I would rather remove all those hacks and use a saner locking scheme instead. > [1] I believe we also touch vm_flags without proper locking to set/clear > VM_LOCKED. -- Michal Hocko SUSE Labs