Hi all, I'd like to poke my head up and let everyone know where bcachefs is at, and talk about finally upstreaming it.

Last LSF (two years ago), bcachefs was coming along quite nicely and was quite usable, but there was still some core work unfinished (primarily persistent alloc info; we still had to walk all metadata at mount time). Additionally, there were some unresolved discussions around locking for pagecache consistency.

The core features one would expect from a posix filesystem are now done, and then some. Reflink was finished recently, and I'm now starting to work towards snapshots.

If there's interest I may talk a bit about the plan for snapshots in bcachefs. The short version is: all metadata in bcachefs are keys in various btrees (extents/inodes/dirents/xattrs btrees) indexed by inode:offset; for snapshots we extend the key so that the low bits are a snapshot id, i.e. inode:offset:snapshot.

Snapshots form a tree where the root has id U32_MAX and children always have smaller IDs than their parent, so to read from a given snapshot we do a lookup as normal, including the snapshot ID of the snapshot we want, and then filter out keys from unrelated (non ancestor) snapshots.

This will give us excellent overall performance when there are many snapshots that each have only a small number of overwrites; when we end up in a situation where a given part of the keyspace has many keys from unrelated snapshots we'll want to arrange metadata differently. This scheme does get considerably trickier when you add extents; that's what I've been focusing on recently.

Pagecache consistency: I recently got rid of my pagecache add lock; that added locking to core paths in filemap.c, and some found my locking scheme distasteful (and I never liked it enough to argue).
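To make the snapshot filtering concrete, here's a minimal userspace sketch of the ancestor walk described above. All names here are invented for illustration (they are not the actual bcachefs identifiers); it only assumes what the text states: the root snapshot has id U32_MAX, children have smaller ids than their parents, and a key is visible to a lookup iff its snapshot is the lookup's snapshot or an ancestor of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only -- not the real bcachefs structures. */
struct snapshot {
	uint32_t id;
	uint32_t parent;	/* 0 for the root */
};

static uint32_t snapshot_parent(const struct snapshot *table, size_t n,
				uint32_t id)
{
	/* Linear scan for the sketch; a real implementation would index. */
	for (size_t i = 0; i < n; i++)
		if (table[i].id == id)
			return table[i].parent;
	return 0;
}

/*
 * Walk from @id toward the root. Since children always have smaller
 * ids than their parents, the walk strictly increases and can stop as
 * soon as we pass @ancestor.
 */
static bool snapshot_is_ancestor(const struct snapshot *table, size_t n,
				 uint32_t id, uint32_t ancestor)
{
	while (id && id < ancestor)
		id = snapshot_parent(table, n, id);
	return id == ancestor;
}

/*
 * The filtering step from the lookup: a key written in snapshot
 * @key_snapshot is visible when reading snapshot @read_snapshot iff
 * the key's snapshot is @read_snapshot itself or one of its ancestors.
 */
static bool key_visible_in_snapshot(const struct snapshot *table, size_t n,
				    uint32_t key_snapshot,
				    uint32_t read_snapshot)
{
	return snapshot_is_ancestor(table, n, read_snapshot, key_snapshot);
}
```

So with a root (U32_MAX), a child 10, and two siblings 5 and 7 under it, a read from snapshot 5 sees keys stamped 5, 10, or U32_MAX, and filters out keys stamped 7 (a non-ancestor).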
I've recently switched to something closer to XFS's locking scheme (top of the IO paths); however, I do still need one patch to the get_user_pages() path to avoid deadlock via recursive page fault - patch is below. (This would probably be better done as a new argument to get_user_pages(); I didn't do it that way initially because the patch would have been _much_ bigger.)

Yee haw.

commit 20ebb1f34cc9a532a675a43b5bd48d1705477816
Author: Kent Overstreet <kent.overstreet@xxxxxxxxx>
Date:   Wed Oct 16 15:03:50 2019 -0400

    mm: Add a mechanism to disable faults for a specific mapping

    This will be used to prevent a nasty cache coherency issue for O_DIRECT
    writes; O_DIRECT writes need to shoot down the range of the page cache
    corresponding to the part of the file being written to - but, if the
    file is mapped in, userspace can pass in an address in that mapping to
    pwrite(), causing those pages to be faulted back into the page cache in
    get_user_pages().

    Signed-off-by: Kent Overstreet <kent.overstreet@xxxxxxxxx>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ebfa046b2d..3b4d9689ef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -740,6 +740,7 @@ struct task_struct {

 	struct mm_struct		*mm;
 	struct mm_struct		*active_mm;
+	struct address_space		*faults_disabled_mapping;

 	/* Per-thread vma caching: */
 	struct vmacache			vmacache;
diff --git a/init/init_task.c b/init/init_task.c
index ee3d9d29b8..706abd9547 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -77,6 +77,7 @@ struct task_struct init_task
 	.nr_cpus_allowed= NR_CPUS,
 	.mm		= NULL,
 	.active_mm	= &init_mm,
+	.faults_disabled_mapping = NULL,
 	.restart_block	= {
 		.fn = do_no_restart_syscall,
 	},
diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37b..9cc1479201 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -849,6 +849,13 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		}
 		cond_resched();

+		if (current->faults_disabled_mapping &&
+		    vma->vm_file &&
+		    vma->vm_file->f_mapping == current->faults_disabled_mapping) {
+			ret = -EFAULT;
+			goto out;
+		}
+
 		page = follow_page_mask(vma, start, foll_flags, &ctx);
 		if (!page) {
 			ret = faultin_page(tsk, vma, start, &foll_flags,
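For anyone who wants to see the intended usage pattern without reading the dio code, here's a toy userspace analogue of the mechanism the patch adds. Everything except the field name faults_disabled_mapping is invented for the sketch: the caller marks the file's own mapping as fault-disabled around the get_user_pages() call, so a user buffer that lives in that same mapping gets -EFAULT instead of recursively faulting pages back in and deadlocking.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Userspace stand-in for current->faults_disabled_mapping: one
 * "mapping" per thread whose pages must not be faulted in.
 */
static _Thread_local const void *faults_disabled_mapping;

/* Stand-in for the check the patch adds to __get_user_pages(). */
static int toy_gup(const void *vma_mapping)
{
	if (faults_disabled_mapping &&
	    vma_mapping == faults_disabled_mapping)
		return -EFAULT;
	return 0;	/* the real code would pin the page here */
}

/*
 * Sketch of how an O_DIRECT write path might use it: disable faults
 * on the file's own mapping for the duration of the gup call, so the
 * cache-coherency shootdown can't be undone by a recursive fault.
 */
static int toy_dio_write(const void *file_mapping, const void *buf_mapping)
{
	int ret;

	faults_disabled_mapping = file_mapping;
	ret = toy_gup(buf_mapping);
	faults_disabled_mapping = NULL;
	return ret;
}
```

A write whose user buffer comes from some other mapping succeeds; a write whose buffer is an mmap of the file being written gets -EFAULT back (which the caller can then handle without holding the locks that would otherwise deadlock).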