RFC: Page cache coherency in dio write path (was: [LSF/MM/BPF TOPIC] Bcachefs update)

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Wed, 18 Dec 2019 14:11:14 -0500

On Wed, Dec 18, 2019 at 01:40:52PM +0100, Jan Kara wrote:
> On Mon 16-12-19 14:38:52, Kent Overstreet wrote:
> > Pagecache consistency:
> > 
> > I recently got rid of my pagecache add lock; that added locking to core paths in
> > filemap.c and some found my locking scheme to be distasteful (and I never liked
> > it enough to argue). I've recently switched to something closer to XFS's locking
> > scheme (top of the IO paths); however, I do still need one patch to the
> > get_user_pages() path to avoid deadlock via recursive page fault - patch is
> > below:
> > 
> > (This would probably be better done as a new argument to get_user_pages(); I
> > didn't do it that way initially because the patch would have been _much_
> > bigger.)
> > 
> > Yee haw.
> > 
> > commit 20ebb1f34cc9a532a675a43b5bd48d1705477816
> > Author: Kent Overstreet <kent.overstreet@xxxxxxxxx>
> > Date:   Wed Oct 16 15:03:50 2019 -0400
> > 
> >     mm: Add a mechanism to disable faults for a specific mapping
> >     
> >     This will be used to prevent a nasty cache coherency issue for O_DIRECT
> >     writes; O_DIRECT writes need to shoot down the range of the page cache
> >     corresponding to the part of the file being written to - but, if the
> >     file is mapped in, userspace can pass in an address in that mapping to
> >     pwrite(), causing those pages to be faulted back into the page cache
> >     in get_user_pages().
> >     
> >     Signed-off-by: Kent Overstreet <kent.overstreet@xxxxxxxxx>
> 
> I'm not really sure about the exact nature of the deadlock since the
> changelog doesn't explain it but if you need to take some lockA in your
> page fault path and you already hold lockA in your DIO code, then this
> patch isn't going to cut it. Just think of a malicious scheme with two
> tasks one doing DIO from fileA (protected by lockA) to buffers mapped from
> fileB and the other process the other way around...

Ooh, yeah, good catch...

The lock in question is, for the purposes of this discussion, an RW lock (call
it the map lock here): the fault handler and the buffered IO paths take it in
read mode, and the DIO path takes it in write mode to block new pages being
added to the page cache.

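But the DIO write path calls get_user_pages() while holding the map lock in
write mode, and a page fault taken from there invokes the fault handler, which
wants the map lock in read mode, hence deadlock. My patch was for avoiding this
deadlock when the fault handler tries locking the same inode's map lock, but as
you note this is a more general problem...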
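
To make the failure mode concrete, here is a toy, single-threaded userspace
model of the locking rules described above. It is only a sketch, not kernel or
bcachefs code: struct map_lock, the try-lock helpers and
dio_write_would_progress() are all hypothetical names, and "would block" is
modelled as a failed trylock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the per-inode "map lock"; hypothetical, not kernel code. */
struct map_lock {
	int  readers;	/* fault handler / buffered IO holders */
	bool writer;	/* true while a DIO write holds the lock */
};

/* Read mode: allowed unless a DIO writer holds the lock. */
static bool map_lock_tryread(struct map_lock *l)
{
	if (l->writer)
		return false;
	l->readers++;
	return true;
}

/* Write mode: needs the lock completely free. */
static bool map_lock_trywrite(struct map_lock *l)
{
	if (l->writer || l->readers > 0)
		return false;
	l->writer = true;
	return true;
}

static void map_lock_unlock_write(struct map_lock *l)
{
	assert(l->writer);
	l->writer = false;
}

/*
 * Models a DIO write to `file` whose user buffer is an mmap of `buf_file`.
 * Returns true if the fault taken from gup() could get its read lock,
 * false where a real blocking lock would have hung.
 */
static bool dio_write_would_progress(struct map_lock *file,
				     struct map_lock *buf_file)
{
	bool ok;

	if (!map_lock_trywrite(file))
		return false;

	ok = map_lock_tryread(buf_file);	/* fault handler, via gup() */
	if (ok)
		buf_file->readers--;		/* fault done, drop read lock */

	map_lock_unlock_write(file);
	return ok;
}
```

With two idle locks a and b, dio_write_would_progress(&a, &b) returns true,
while dio_write_would_progress(&a, &a) returns false: that is the same-inode
case my patch handled. Setting b.writer = true first (another task doing DIO
to fileB) makes the (&a, &b) case fail as well, and if that other task's
buffer is mapped from fileA you have the two-task scenario from your mail.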

This is starting to smell like possibly what wound/wait mutexes were invented
for, a situation where we need deadlock avoidance because lock ordering is under
userspace control.

So for that we need some state describing what locks are held that we can refer
to when taking the next lock of this class - and since it's got to be shared
between the dio write path and then (via gup()) the fault handler, that means
it's pretty much going to have to hang off of task_struct. Then in the fault
handler, when we go to take the map lock we:
 - return -EFAULT if it's the same lock the dio write path took
 - trylock; if that fails and lock ordering is wrong (pointer comparison of the
   locks works here) then we have to do a dance that involves bailing out and
   retrying from the top of the dio write path.
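
Roughly, the decision in the fault handler could look like the userspace
sketch below. Everything here is hypothetical: struct dio_lock_ctx stands in
for the state that would hang off of task_struct, the enum and function names
are made up, and two things are assumptions on my part rather than anything
decided: that "correct" order means ascending lock address, and that it is
fine to simply block when the trylock fails but the order is correct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the map lock; only tracks a DIO writer. */
struct map_lock {
	bool write_locked;
};

/* Hypothetical per-task state; NULL when not inside a DIO write. */
struct dio_lock_ctx {
	const struct map_lock *held;
};

enum fault_action {
	FAULT_PROCEED,		/* trylock succeeded, handle the fault */
	FAULT_EFAULT,		/* same lock the dio write path took */
	FAULT_BLOCK,		/* order is fine (assumed), safe to sleep */
	FAULT_RESTART_DIO,	/* bail out, retry from top of dio write */
};

static enum fault_action
fault_take_map_lock(const struct dio_lock_ctx *ctx,
		    const struct map_lock *want)
{
	assert(want != NULL);

	/* Faulting on the very mapping we are doing DIO to. */
	if (ctx->held == want)
		return FAULT_EFAULT;

	/* Models a successful trylock: nobody holds it in write mode. */
	if (!want->write_locked)
		return FAULT_PROCEED;

	/*
	 * Trylock failed. Ordinary fault (nothing held), or held lock
	 * sorts before the wanted one: assumed safe to block.
	 */
	if (ctx->held == NULL || (uintptr_t)ctx->held < (uintptr_t)want)
		return FAULT_BLOCK;

	/* Wrong order: unwind and restart the whole dio write. */
	return FAULT_RESTART_DIO;
}
```

The pointer comparison is what breaks the two-task cycle: of the two tasks
holding each other's lock, only the one holding the higher-addressed lock gets
FAULT_RESTART_DIO, so exactly one of them backs off.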

I dunno. On the plus side, if other filesystems don't want this I think this can
all be implemented in bcachefs code with only a pointer added to task_struct to
hang this lock state off of, but I would much rather either get buy-in from the
other filesystem people and make this a general purpose facility or not do it at
all.

And I'm not sure if anyone else cares about the page cache consistency issues
inherent to dio writes as much as I do... so I'd like to hear from other people
on that.