Re: [PATCH 0/4] Some more lock_page work..

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Wed, 14 Oct 2020 09:53:35 -0700

On Wed, Oct 14, 2020 at 6:05 AM Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote:
>
> So we remove strict serialization against truncation in
> filemap_map_pages() if the mapping is private.

Well, that particular patch does.

But as mentioned, that's the wrong way of doing it. It's just that the
right way - which keeps the serialization - requires some changes to
how we actually install the pte.

Right now that goes through alloc_set_pte(), and that one is basically
written with the page lock in mind. So the current logic is

 - lock page

 - check page->mapping etc that is now stable

 - lock page tables if required (alloc_set_pte -> pte_alloc_one_map ->
pte_offset_map_lock)

 - insert pte

but that whole alloc_set_pte() thing is actually much too generic for
what "map_pages()" really wants, and I think we could re-organize it.

In particular, what I _think_ we could do is:

 - lock the page tables

 - check that the page isn't locked

 - increment the page mapcount (atomic and ordered)

 - check that the page still isn't locked

 - insert pte

without taking the page lock. And the reason that's safe is that *if*
we're racing with something that is about the remove the page mapping,
then *that* code will

 (a) hold the page lock

 (b) before it removes the mapping, it has to remove it from any
existing mappings, which involves checking the mapcount and going off
and getting the page table locks to remove any ptes.

but my patch did *not* do that, because you have to re-organize things a bit.

Note that the above still keeps the "optimistic" model - if the page
is locked, we'll skip that pte, and we'll end up doing the heavy thing
(which will then take the pte lock if required). Which is why we
basically get away with that "only test page lock, don't take it"
approach with some ordering.

But my theory is that the actual truncate case is *so* rare that we
basically don't need to even worry about it, and that once the
(common) page fault case doesn't need the page lock, then that whole
"fall back to slow case" simply doesn't happen in practice.

> IIUC, we can end up with a page with ->mapping == NULL set up in a PTE for
> such mappings. The "page->mapping != mapping" check makes the race window
> smaller, but doesn't remove it.

Yes, that patch I posted as an RFC does exactly that. But I think we
can keep the serialization if we just make slightly more involved
changes.

And they aren't necessarily a _lot_ more involved. In fact, I think we
may already hold the page table lock due to doing that
"pte_alloc_one_map()" thing over all of filemap_map_pages(). So I
think the only _real_ problem is that I think we increment the
page_mapcount() too late in alloc_set_pte().

And yeah, I realize this is a bit handwavy. But that's why I sent
these things out as an RFC. To see if people can shoot holes in this,
and maybe do that proper patch instead of my "let's get discussion
going" one.

Also, I really do think that our current "private mappings are
coherent with fiel changes" is a big QoI issue: it's the right thing
to do. So I wouldn't actually really want to take advantage of
"private mappings don't really _have_ to be coherent according to
POSIX". That's a lazy cop-out.

So I dislike my patch, and don't think it's really acceptable, but as
a "let's try this" I think it's a good first step.

> I'm not sure all codepaths are fine with this. For instance, looks like
> migration will back off such pages: __unmap_and_move() doesn't know how to
> deal with mapped pages with ->mapping == NULL.

So as outlined above, I think we can avoid the ->mapping == NULL case,
and we can remove all the races.

But it would still do new things, like incremnt the page mapcount
without holding the page lock. I think that's ok - we already
_decrement_ it without holding the lock, so it's not that the count is
stable without locking - but at a minimum we'd have that "need to make
sure the memory ordering between testing the page lock initially,
incrementing the count, and re-testing is ok". And maybe some code
that expects things to be stable.

So I'm not at all claiming that things are obviously fine. I'm happy
to see that Hugh started some practical stress-tests on that RFC
patch, and I think we can improve on the situation.

               Linus