Hi all, I'd like to poke my head up and let everyone know where bcachefs is at, and talk about finally upstreaming it.

Last LSF (two years ago), bcachefs was coming along quite nicely and was quite usable, but there was still some core work unfinished (primarily persistent alloc info; we still had to walk all metadata at mount time). Additionally, there were some unresolved discussions around locking for pagecache consistency.

The core features one would expect from a posix filesystem are now done, and then some. Reflink was finished recently, and I'm now starting to work towards snapshots.

If there's interest I may talk a bit about the plan for snapshots in bcachefs. The short version is: all metadata in bcachefs are keys in various btrees (extents/inodes/dirents/xattrs btrees) indexed by inode:offset; for snapshots we extend the key so that the low bits are a snapshot id, i.e. inode:offset:snapshot.

Snapshots form a tree where the root has id U32_MAX and children always have smaller IDs than their parent, so to read from a given snapshot we do a lookup as normal, including the snapshot ID of the snapshot we want, and then filter out keys from unrelated (non ancestor) snapshots.

This will give us excellent overall performance when there are many snapshots that each have only a small number of overwrites; when we end up in a situation where a given part of the keyspace has many keys from unrelated snapshots we'll want to arrange metadata differently. This scheme does get considerably trickier when you add extents; that's what I've been focusing on recently.

Pagecache consistency: I recently got rid of my pagecache add lock; that added locking to core paths in filemap.c, and some found my locking scheme distasteful (and I never liked it enough to argue).
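To make the snapshot filtering concrete, here's a minimal userspace sketch of the ancestor walk described above. All names here are invented for illustration (they are not the actual bcachefs identifiers); it only assumes what the text states: the root snapshot has id U32_MAX, children have smaller ids than their parents, and a key is visible to a lookup iff its snapshot is the lookup's snapshot or an ancestor of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only -- not the real bcachefs structures. */
struct snapshot {
	uint32_t id;
	uint32_t parent;	/* 0 for the root */
};

static uint32_t snapshot_parent(const struct snapshot *table, size_t n,
				uint32_t id)
{
	/* Linear scan for the sketch; a real implementation would index. */
	for (size_t i = 0; i < n; i++)
		if (table[i].id == id)
			return table[i].parent;
	return 0;
}

/*
 * Walk from @id toward the root. Since children always have smaller
 * ids than their parents, the walk strictly increases and can stop as
 * soon as we pass @ancestor.
 */
static bool snapshot_is_ancestor(const struct snapshot *table, size_t n,
				 uint32_t id, uint32_t ancestor)
{
	while (id && id < ancestor)
		id = snapshot_parent(table, n, id);
	return id == ancestor;
}

/*
 * The filtering step from the lookup: a key written in snapshot
 * @key_snapshot is visible when reading snapshot @read_snapshot iff
 * the key's snapshot is @read_snapshot itself or one of its ancestors.
 */
static bool key_visible_in_snapshot(const struct snapshot *table, size_t n,
				    uint32_t key_snapshot,
				    uint32_t read_snapshot)
{
	return snapshot_is_ancestor(table, n, read_snapshot, key_snapshot);
}
```

So with a root (U32_MAX), a child 10, and two siblings 5 and 7 under it, a read from snapshot 5 sees keys stamped 5, 10, or U32_MAX, and filters out keys stamped 7 (a non-ancestor).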
I've recently switched to something closer to XFS's locking scheme (top of the IO paths); however, I do still need one patch to the get_user_pages() path to avoid deadlock via recursive page fault - patch is below. (This would probably be better done as a new argument to get_user_pages(); I didn't do it that way initially because the patch would have been _much_ bigger.)

Yee haw.

commit 20ebb1f34cc9a532a675a43b5bd48d1705477816
Author: Kent Overstreet <kent.overstreet@xxxxxxxxx>
Date:   Wed Oct 16 15:03:50 2019 -0400

    mm: Add a mechanism to disable faults for a specific mapping

    This will be used to prevent a nasty cache coherency issue for O_DIRECT
    writes; O_DIRECT writes need to shoot down the range of the page cache
    corresponding to the part of the file being written to - but, if the
    file is mapped in, userspace can pass in an address in that mapping to
    pwrite(), causing those pages to be faulted back into the page cache in
    get_user_pages().

    Signed-off-by: Kent Overstreet <kent.overstreet@xxxxxxxxx>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ebfa046b2d..3b4d9689ef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -740,6 +740,7 @@ struct task_struct {

 	struct mm_struct		*mm;
 	struct mm_struct		*active_mm;
+	struct address_space		*faults_disabled_mapping;

 	/* Per-thread vma caching: */
 	struct vmacache			vmacache;
diff --git a/init/init_task.c b/init/init_task.c
index ee3d9d29b8..706abd9547 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -77,6 +77,7 @@ struct task_struct init_task
 	.nr_cpus_allowed= NR_CPUS,
 	.mm		= NULL,
 	.active_mm	= &init_mm,
+	.faults_disabled_mapping = NULL,
 	.restart_block	= {
 		.fn = do_no_restart_syscall,
 	},
diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37b..9cc1479201 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -849,6 +849,13 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		}
 		cond_resched();

+		if (current->faults_disabled_mapping &&
+		    vma->vm_file &&
+		    vma->vm_file->f_mapping == current->faults_disabled_mapping) {
+			ret = -EFAULT;
+			goto out;
+		}
+
 		page = follow_page_mask(vma, start, foll_flags, &ctx);
 		if (!page) {
 			ret = faultin_page(tsk, vma, start, &foll_flags,
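For anyone who wants to see the intended usage pattern without reading the dio code, here's a toy userspace analogue of the mechanism the patch adds. Everything except the field name faults_disabled_mapping is invented for the sketch: the caller marks the file's own mapping as fault-disabled around the get_user_pages() call, so a user buffer that lives in that same mapping gets -EFAULT instead of recursively faulting pages back in and deadlocking.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Userspace stand-in for current->faults_disabled_mapping: one
 * "mapping" per thread whose pages must not be faulted in.
 */
static _Thread_local const void *faults_disabled_mapping;

/* Stand-in for the check the patch adds to __get_user_pages(). */
static int toy_gup(const void *vma_mapping)
{
	if (faults_disabled_mapping &&
	    vma_mapping == faults_disabled_mapping)
		return -EFAULT;
	return 0;	/* the real code would pin the page here */
}

/*
 * Sketch of how an O_DIRECT write path might use it: disable faults
 * on the file's own mapping for the duration of the gup call, so the
 * cache-coherency shootdown can't be undone by a recursive fault.
 */
static int toy_dio_write(const void *file_mapping, const void *buf_mapping)
{
	int ret;

	faults_disabled_mapping = file_mapping;
	ret = toy_gup(buf_mapping);
	faults_disabled_mapping = NULL;
	return ret;
}
```

A write whose user buffer comes from some other mapping succeeds; a write whose buffer is an mmap of the file being written gets -EFAULT back (which the caller can then handle without holding the locks that would otherwise deadlock).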