Re: [PATCH 2/12] mm: rmap use pte lock not mmap_sem to set PageMlocked

Vlastimil Babka <vbabka@xxxxxxx> · Mon, 19 Oct 2015 23:25:56 +0200

On 10/19/2015 10:10 PM, Kirill A. Shutemov wrote:
On Mon, Oct 19, 2015 at 12:53:17PM -0700, Hugh Dickins wrote:
On Mon, 19 Oct 2015, Kirill A. Shutemov wrote:
On Mon, Oct 19, 2015 at 04:20:05AM -0700, Hugh Dickins wrote:
Note how munlock_vma_pages_range() via __munlock_pagevec() does
TestClearPageMlocked() without (or "between") pte or page lock. But the pte
lock is being taken after clearing VM_LOCKED, so perhaps it's safe against
try_to_unmap_one...

A mind-trick I found helpful for understanding the barriers here, is
to imagine that the munlocker repeats its "vma->vm_flags &= ~VM_LOCKED"
every time it takes the pte lock: it does not actually do that, it
doesn't need to of course; but that does help show that ~VM_LOCKED
must be visible to anyone getting that pte lock afterwards.

How can you make sure that any other codepath that changes vm_flags would
not make (vm_flags & VM_LOCKED) temporary true while dealing with other
flags?

Compiler can convert things like "vma->vm_flags &= ~VM_FOO;" to whatever
it wants as long as end result is the same. It's very unlikely that it
will generate code to set all bits to one and then clear all which should
be cleared, but it's theoretically possible.

I think Linus would be very vocal about such compiler implementation. 
And I can imagine a lot of things in the kernel would break by those 
spuriously set bits. There must be a lot of stuff that's "theoretically 
possible within the standard" but no sane compiler does. I believe even 
compiler guys are not that insane. IIRC we've seen bugs like this and 
they were always treated as bugs and fixed.
The example I've heard often used for theoretically possible but insane 
stuff is that the compiler could make code randomly write over anything 
that's not volatile, as long as it restored the original values upon 
e.g. returning from the function. That just can't happen.

I think that's in the realm of the fanciful.  But yes, it quite often
turns out that what I think is fanciful, is something that Paul has
heard compiler writers say they want to do, even if he has managed
to discourage them from doing it so far.

Paul always has links to pdfs with this kind of horror. ;)

But more to the point, when you write of "end result", the compiler
would have no idea that releasing mmap_sem is the point at which
end result must be established:

Isn't releasing a lock one of those "release" barriers where previously
issued writes must become visible before the unlock takes place?

wouldn't it have to establish end
result before the next unlock operation, and before the end of the
compilation unit?

Now I'm lost in what you mean.

pte unlock being the relevant unlock operation
in this case, at least with my patch if not without.

Hm so IIUC Kirill's point is that try_to_unmap_one() is checking 
VM_LOCKED under pte lock, but somebody else might be modifying vm_flags 
under mmap_sem, and thus we have no protection.

I think we need to have at lease WRITE_ONCE() everywhere we update
vm_flags and READ_ONCE() where we read it without mmap_sem taken.

It wouldn't hurt to check if seeing a stale value or using non-atomic 
RMW can be a problem somewhere. In this case it's testing, not changing, 
so RMW is not an issue. But the check shouldn't consider arbitrary 
changes made by a potentially crazy compiler.

Not a series I'll embark upon myself,
and the patch at hand doesn't make things worse.

I think it does.

So what's the alternative? Hm could we keep the trylock on mmap_sem 
under pte lock? The ordering is wrong, but it's a trylock, so no danger 
of deadlock?

The patch changes locking rules for ->vm_flags without proper preparation
and documentation. It will strike back one day.
I know we have few other cases when we access ->vm_flags without mmap_sem,
but this doesn't justify introducing one more potentially weak codepath.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>