Re: [PATCH] mm: Skip opportunistic reclaim for dma pinned pages

John Hubbard <jhubbard@xxxxxxxxxx> · Wed, 24 Jun 2020 17:11:30 -0700

On 2020-06-24 16:20, Jason Gunthorpe wrote:
...
I think Yang explained it - the page is removed from the mappings but
freeing it does not happen because page_ref_freeze() does not succeed
due to the pin.

Presumably the mappings can reconnect to the same physical page if
it is re-faulted to avoid any data corruption.

So, the issue here is the mappings are trashed while the page remains
- and trashing the mapping triggers a mmu notifier which upsets i915.

What's less clear is why the comment and the commit description
only talk about reclaim, when there are additional things that call
try_to_unmap(), including:

     migrate_vma_unmap()
     split_huge_page_to_list() --> unmap_page()

It looks like the same unmap first then abort if the refcount is still
elevated design as shrink_page_list() ?

Yes. I was just wondering why the documentation here seems to ignore the
other, non-reclaim cases. Anyway, though...

I do like this code change, though. And I *think* it's actually safe to
do this, as it stays away from writeback or other filesystem activity.
But let me double check that, in case I'm forgetting something.

...OK, I've checked, and I like it a little bit less now. Mainly for
structural reasons, though. I think it would work correctly. But
here's a concern: try_to_unmap() should only fail to unmap if there is a
reason to not unmap. Having a page be pinned for dma is a reason to not
*free* a page, and it's also a reason to be careful about writeback and
page buffers for writeback and such. But I'm not sure that it's a reason
to fail to remove mappings.

True, most (all?) of the reasons that we remove mappings, generally are
for things that are not allowed while a page is dma-pinned...at least,
today. But still, there's nothing fundamental about a mapping that
should prevent it from coming or going while a page is undergoing
dma.

So, it's merely a convenient, now-misnamed location in the call stack
to fail out. That's not great. It might be better, as Jason hints at
below, to fail out a little earlier, instead. That would lead to a more
places to call page_maybe_dma_pinned(), but that's not a real problem,
because it's still a small number of places.

After writing all of that...I don't feel strongly about it, because
TTU is kind of synonymous with "I'm about to do a dma-pin-unfriendly
operation".

Maybe some of the more experienced fs or mm people have strong opinions
one way or the other?

It would be nice to have an explanation why it is OK now to change
it..

Yes. Definitely good to explain that in the commit log. I think
it's triggered by the existence of page_maybe_dma_pinned(). Until
that was added, figuring out if dma was involved required basically
just guesswork. Now we have a way to guess much more accurately. :)

I don't know, but could it be that try_to_unmap() has to be done
before checking the refcount as each mapping is included in the
refcount? ie we couldn't know a DMA pin was active in advance?

Now that we have your pin stuff we can detect a DMA pin without doing
all the unmaps?

Once something calls pin_user_page*(), then the pages will be marked
as dma-pinned, yes. So no, there is no need to wait until try_to_unmap()
to find out.

A final note: depending on where page_maybe_dma_pinned() ends up
getting called, this might prevent a fair number of the problems that
Jan originally reported [1], and that I also reported separately!

Well, not all of the problems, and only after the filesystems get
converted to call pin_user_pages() (working on that next), but...I think
it would actually avoid the crash our customer reported back in early
2018. Even though we don't have the full file lease + pin_user_pages()
solution in place.

That's because reclaim is what triggers the problems that we saw. And
with this patch, we bail out of reclaim early.

[1] https://www.spinics.net/lists/linux-mm/msg142700.html

thanks,
--
John Hubbard
NVIDIA