Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 14 Dec 2018 17:00:12 +1100

On Wed, Dec 12, 2018 at 09:02:29PM -0500, Jerome Glisse wrote:
> On Thu, Dec 13, 2018 at 11:51:19AM +1100, Dave Chinner wrote:
> > On Wed, Dec 12, 2018 at 04:59:31PM -0500, Jerome Glisse wrote:
> > > On Thu, Dec 13, 2018 at 08:46:41AM +1100, Dave Chinner wrote:
> > > > On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> > > > > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > > > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > > > > So this approach doesn't look like a win to me over using counter in struct
> > > > > > page and I'd rather try looking into squeezing HMM public page usage of
> > > > > > struct page so that we can fit that gup counter there as well. I know that
> > > > > > it may be easier said than done...
> > > > > 
> > > > > So i want back to the drawing board and first i would like to ascertain
> > > > > that we all agree on what the objectives are:
> > > > > 
> > > > >     [O1] Avoid write back from a page still being written by either a
> > > > >          device or some direct I/O or any other existing user of GUP.
> > 
> > IOWs, you need to mark pages being written to by a GUP as
> > PageWriteback, so all attempts to write the page will block on
> > wait_on_page_writeback() before trying to write the dirty page.
> 
> No you don't and you can't for the simple reasons is that the GUP
> of some device driver can last days, weeks, months, years ... so
> it is not something you want to do. Here is what happens today:
>     - user space submit directio read from a file and writing to
>       virtual address and the problematic case is when that virtual
>       address is actualy a mmap of a file itself
>     - kernel do GUP on the virtual address, if the page has write
>       permission in the CPU page table mapping then the page
>       refcount is incremented and the page is return to directio
>       kernel code that do memcpy
> 
>       It means that the page already went through page_mkwrite so
>       all is fine from fs point of view.
>       If page does not have write permission then a page fault is
>       triggered and page_mkwrite will happen and prep the page
>       accordingly

Yes, the short term GUP references do the right thing. They aren't
the issue - the problem is the long term GUP references that dirty
clean pages without first having called ->page_mkwrite.

> In the above scheme a page write back might happens after we looked
> up the page from the CPU page table and before directio finish with
> memcpy so that the page content during the write back might not be
> stable. This is a small window for things to go bad and i do not
> think we know if anybody ever experience a bug because of that.
> 
> For other GUP users the flow is the same except that device driver
> that keep the page around and do continuous dma to it might last
> days, weeks, months, years ... so for those the race window is big
> enough for bad things to happen. Jan have report of such bugs.

i.e. this case.

GUP faults the page, gets marked dirty, time passes, page
writeback occurs, it's now mapped clean, time passes, another RDMA
hits those pages, it calls set_page_dirty() again and things go
boom.

Basically, you are saying that the problem here is that writeback
of a dirty page occurred while there was an active GUP, and that
you want us to ....

> So what i am proposing to fix the above is have page_mkclean return
> a is_pin boolean if page is pin than the fs code use a bounce page
> to do the write back giving a stable bounce page. More over fs will
> need to keep around all buffer_head, blocks, ... ie whatever is
> associated with that file offset so that any latter set_page_dirty
> would not freak out and would not need to reallocate blocks or do
> anything heavy weight.

.... keep the dirty page pinned and never written back until the GUP
is released.

Which, quite frankly, is insanity.  The whole point of
->page_mkwrite() is that we can clean file backed mapped pages at
any point in time and have the next write access correctly mark it
dirty again so it can be written back.

This is *absolutely necessary* for data integrity (i.e. fsync,
sync(), etc) as well as filesystem management operations (e.g.
filesystem freeze) to work correctly and not lose data if the system
crashes or generate corrupt snapshots for backup or migration
purposes.

> We have a separate discussion on what to do about truncate and other
> fs event that inherently invalidate portion of file so i do not
> want to complexify present discussion with those but we also have
> that in mind.
> 
> Do you see any fundamental issues with that ? It abides by all
> existing fs standard AFAICT (you have a page_mkwrite and we ask
> fs to keep the result of that around).

The fundamental issue is that ->page_mkwrite must be called on every
write access to a clean file backed page, not just the first one.
How long the GUP reference lasts is irrelevant, if the page is clean
and you need to dirty it, you must call ->page_mkwrite before it is
marked writeable and dirtied. Every. Time.

> > Think ENOSPC - that has to be handled before we do the DMA, not
> > after. Before the DMA it is a recoverable error, after the DMA it is
> > data loss/corruption failure.
> 
> Yes agree and i hope that the above explaination properly explains
> that it would become legal to do set_page_dirty in put_user_page
> thanks to page_mkclean telling fs code not to recycle anything
> after write back finish.

No, page_mkclean doesn't help at all. Every time the page is dirtied
it may require block allocation (think COW filesystems) and so
ENOSPC (and block allocation) must be done /before/ the page is
dirtied. YOU can't just keep re-dirtying the same page and assuming
that the filesystem will just work with that - that's essentially
what the current code does with long term GUP references, and that's
why it's so broken.

/me is getting tired of explaining the same thing over and over
again.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx