Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

John Hubbard <jhubbard@xxxxxxxxxx> · Wed, 12 Dec 2018 13:56:00 -0800

On 12/12/18 1:30 PM, Jerome Glisse wrote:
> On Wed, Dec 12, 2018 at 08:27:35AM -0800, Dan Williams wrote:
>> On Wed, Dec 12, 2018 at 7:03 AM Jerome Glisse <jglisse@xxxxxxxxxx> wrote:
>>>
>>> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
>>>> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
>>>>> Another crazy idea, why not treating GUP as another mapping of the page
>>>>> and caller of GUP would have to provide either a fake anon_vma struct or
>>>>> a fake vma struct (or both for PRIVATE mapping of a file where you can
>>>>> have a mix of both private and file page thus only if it is a read only
>>>>> GUP) that would get added to the list of existing mapping.
>>>>>
>>>>> So the flow would be:
>>>>>     somefunction_thatuse_gup()
>>>>>     {
>>>>>         ...
>>>>>         GUP(_fast)(vma, ..., fake_anon, fake_vma);
>>>>>         ...
>>>>>     }
>>>>>
>>>>>     GUP(vma, ..., fake_anon, fake_vma)
>>>>>     {
>>>>>         if (vma->flags == ANON) {
>>>>>             // Add the fake anon vma to the anon vma chain as a child
>>>>>             // of current vma
>>>>>         } else {
>>>>>             // Add the fake vma to the mapping tree
>>>>>         }
>>>>>
>>>>>         // The existing GUP except that now it inc mapcount and not
>>>>>         // refcount
>>>>>         GUP_old(..., &nanonymous, &nfiles);
>>>>>
>>>>>         atomic_add(&fake_anon->refcount, nanonymous);
>>>>>         atomic_add(&fake_vma->refcount, nfiles);
>>>>>
>>>>>         return nanonymous + nfiles;
>>>>>     }
>>>>
>>>> Thanks for your idea! This is actually something like I was suggesting back
>>>> at LSF/MM in Deer Valley. There were two downsides to this I remember
>>>> people pointing out:
>>>>
>>>> 1) This cannot really work with __get_user_pages_fast(). You're not allowed
>>>> to get necessary locks to insert new entry into the VMA tree in that
>>>> context. So essentially we'd loose get_user_pages_fast() functionality.
>>>>
>>>> 2) The overhead e.g. for direct IO may be noticeable. You need to allocate
>>>> the fake tracking VMA, get VMA interval tree lock, insert into the tree.
>>>> Then on IO completion you need to queue work to unpin the pages again as you
>>>> cannot remove the fake VMA directly from interrupt context where the IO is
>>>> completed.
>>>>
>>>> You are right that the cost could be amortized if gup() is called for
>>>> multiple consecutive pages however for small IOs there's no help...
>>>>
>>>> So this approach doesn't look like a win to me over using counter in struct
>>>> page and I'd rather try looking into squeezing HMM public page usage of
>>>> struct page so that we can fit that gup counter there as well. I know that
>>>> it may be easier said than done...
>>>
>>> So i want back to the drawing board and first i would like to ascertain
>>> that we all agree on what the objectives are:
>>>
>>>     [O1] Avoid write back from a page still being written by either a
>>>          device or some direct I/O or any other existing user of GUP.
>>>          This would avoid possible file system corruption.
>>>
>>>     [O2] Avoid crash when set_page_dirty() is call on a page that is
>>>          considered clean by core mm (buffer head have been remove and
>>>          with some file system this turns into an ugly mess).
>>>
>>>     [O3] DAX and the device block problems, ie with DAX the page map in
>>>          userspace is the same as the block (persistent memory) and no
>>>          filesystem nor block device understand page as block or pinned
>>>          block.
>>>
>>> For [O3] i don't think any pin count would help in anyway. I believe
>>> that the current long term GUP API that does not allow GUP of DAX is
>>> the only sane solution for now.
>>
>> No, that's not a sane solution, it's an emergency hack.
>>
>>> The real fix would be to teach file-
>>> system about DAX/pinned block so that a pinned block is not reuse
>>> by filesystem.
>>
>> We already have taught filesystems about pinned dax pages, see
>> dax_layout_busy_page(). As much as possible I want to eliminate the
>> concept of "dax pages" as a special case that gets sprinkled
>> throughout the mm.
> 
> So thinking on O3 issues what about leveraging the recent change i
> did to mmu notifier. Add a event for truncate or any other file
> event that need to invalidate the file->page for a range of offset.
> 
> Add mmu notifier listener to GUP user (except direct I/O) so that
> they invalidate they hardware mapping or switch the hardware mapping
> to use a crappy page. When such event happens what ever user do to
> the page through that driver is broken anyway. So it is better to
> be loud about it then trying to make it pass under the radar.
> 
> This will put the burden on broken user and allow you to properly
> recycle your DAX page.
> 
> Think of it as revoke through mmu notifier.
> 
> So patchset would be:
>     enum mmu_notifier_event {
> +       MMU_NOTIFY_TRUNCATE,
>     };
> 
> +   Change truncate code path to emit MMU_NOTIFY_TRUNCATE
> 

That part looks good.

> Then for each user of GUP (except direct I/O or other very short
> term GUP):

but, why is there a difference between how we handle long- and
short-term callers? Aren't we just leaving a harder-to-reproduce race
condition, if we ignore the short-term gup callers?

So, how does activity (including direct IO and other short-term callers)
get quiesced (stopped, and guaranteed not to restart or continue), so 
that truncate or umount can continue on?

> 
>     Patch 1: register mmu notifier
>     Patch 2: listen to MMU_NOTIFY_TRUNCATE and MMU_NOTIFY_UNMAP
>              when that happens update the device page table or
>              usage to point to a crappy page and do put_user_page
>              on all previously held page

Minor point, this sequence should be done within a wrapper around existing 
get_user_pages(), such as get_user_pages_revokable() or something.

thanks,
-- 
John Hubbard
NVIDIA

> 
> So this would solve the revoke side of thing without adding a burden
> on GUP user like direct I/O. Many existing user of GUP already do
> listen to mmu notifier and already behave properly. It is just about
> making every body list to that. Then we can even add the mmu notifier
> pointer as argument to GUP just to make sure no new user of GUP forget
> about registering a notifier (argument as a teaching guide not as a
> something actively use).
> 
> 
> So does that sounds like a plan to solve your concern with long term
> GUP user ? This does not depend on DAX or anything it would apply to
> any file back pages.
> 
> 
> Cheers,
> Jérôme
>