On Tue 09-05-23 12:56:31, Kent Overstreet wrote: > From: Kent Overstreet <kent.overstreet@xxxxxxxxx> > > This is used by bcachefs to fix a page cache coherency issue with > O_DIRECT writes. > > Also relevant: mapping->invalidate_lock, see below. > > O_DIRECT writes (and other filesystem operations that modify file data > while bypassing the page cache) need to shoot down ranges of the page > cache - and additionally, need locking to prevent those pages from > pulled back in. > > But O_DIRECT writes invoke the page fault handler (via get_user_pages), > and the page fault handler will need to take that same lock - this is a > classic recursive deadlock if userspace has mmaped the file they're DIO > writing to and uses those pages for the buffer to write from, and it's a > lock ordering deadlock in general. > > Thus we need a way to signal from the dio code to the page fault handler > when we already are holding the pagecache add lock on an address space - > this patch just adds a member to task_struct for this purpose. For now > only bcachefs is implementing this locking, though it may be moved out > of bcachefs and made available to other filesystems in the future. It would be nice to have at least a link to the code that's actually using the field you are adding. Also I think we were already through this discussion [1] and we ended up agreeing that your scheme actually solves only the AA deadlock but a malicious userspace can easily create AB BA deadlock by running direct IO to file A using mapped file B as a buffer *and* direct IO to file B using mapped file A as a buffer. [1] https://lore.kernel.org/all/20191218124052.GB19387@xxxxxxxxxxxxxx > --------------------------------- > > The closest current VFS equivalent is mapping->invalidate_lock, which > comes from XFS. However, it's not used by direct IO. Instead, direct IO > paths shoot down the page cache twice - before starting the IO and at > the end, and they're still technically racy w.r.t. page cache coherency. > > This is a more complete approach: in the future we might consider > replacing mapping->invalidate_lock with the bcachefs code. Yes, and this is because we never provided 100% consistent buffered VS direct IO behavior on the same file exactly because we never found the complexity worth the usefulness... Honza > > Signed-off-by: Kent Overstreet <kent.overstreet@xxxxxxxxx> > Cc: Jan Kara <jack@xxxxxxx> > Cc: Darrick J. Wong <djwong@xxxxxxxxxx> > Cc: linux-fsdevel@xxxxxxxxxxxxxxx > --- > include/linux/sched.h | 1 + > init/init_task.c | 1 + > 2 files changed, 2 insertions(+) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 63d242164b..f2a56f64f7 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -869,6 +869,7 @@ struct task_struct { > > struct mm_struct *mm; > struct mm_struct *active_mm; > + struct address_space *faults_disabled_mapping; > > int exit_state; > int exit_code; > diff --git a/init/init_task.c b/init/init_task.c > index ff6c4b9bfe..f703116e05 100644 > --- a/init/init_task.c > +++ b/init/init_task.c > @@ -85,6 +85,7 @@ struct task_struct init_task > .nr_cpus_allowed= NR_CPUS, > .mm = NULL, > .active_mm = &init_mm, > + .faults_disabled_mapping = NULL, > .restart_block = { > .fn = do_no_restart_syscall, > }, > -- > 2.40.1 > -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR