On Fri, Sep 07, 2018 at 11:04:28AM -0700, Omar Sandoval wrote: > On Fri, Sep 07, 2018 at 11:39:25AM +0300, Nikolay Borisov wrote: > > > > > > On 7.09.2018 10:39, Omar Sandoval wrote: > > > From: Omar Sandoval <osandov@xxxxxx> > > > > > > Implement the swap file a_ops on Btrfs. Activation needs to make sure > > > that the file can be used as a swap file, which currently means it must > > > be fully allocated as nocow with no compression on one device. It must > > > also do the proper tracking so that ioctls will not interfere with the > > > swap file. Deactivation clears this tracking. > > > > > > Signed-off-by: Omar Sandoval <osandov@xxxxxx> > > > --- > > > fs/btrfs/inode.c | 316 +++++++++++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 316 insertions(+) > > > > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > > > index 9357a19d2bff..55aba2d7074c 100644 > > > --- a/fs/btrfs/inode.c > > > +++ b/fs/btrfs/inode.c > > > @@ -27,6 +27,7 @@ > > > #include <linux/uio.h> > > > #include <linux/magic.h> > > > #include <linux/iversion.h> > > > +#include <linux/swap.h> > > > #include <asm/unaligned.h> > > > #include "ctree.h" > > > #include "disk-io.h" > > > @@ -10437,6 +10438,319 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end) > > > } > > > } > > > > > > +/* > > > + * Add an entry indicating a block group or device which is pinned by a > > > + * swapfile. Returns 0 on success, 1 if there is already an entry for it, or a > > > + * negative errno on failure. > > > + */ > > > +static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr, > > > + bool is_block_group) > > > +{ > > > + struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info; > > > + struct btrfs_swapfile_pin *sp, *entry; > > > + struct rb_node **p; > > > + struct rb_node *parent = NULL; > > > + > > > + sp = kmalloc(sizeof(*sp), GFP_NOFS); > > > + if (!sp) > > > + return -ENOMEM; > > > + sp->ptr = ptr; > > > + sp->inode = inode; > > > + sp->is_block_group = is_block_group; > > > + > > > + spin_lock(&fs_info->swapfile_pins_lock); > > > + p = &fs_info->swapfile_pins.rb_node; > > > + while (*p) { > > > + parent = *p; > > > + entry = rb_entry(parent, struct btrfs_swapfile_pin, node); > > > + if (sp->ptr < entry->ptr || > > > + (sp->ptr == entry->ptr && sp->inode < entry->inode)) { > > > + p = &(*p)->rb_left; > > > + } else if (sp->ptr > entry->ptr || > > > + (sp->ptr == entry->ptr && sp->inode > entry->inode)) { > > > + p = &(*p)->rb_right; > > > + } else { > > > + spin_unlock(&fs_info->swapfile_pins_lock); > > > + kfree(sp); > > > + return 1; > > > + } > > > > > > I have to admit this is creative use of pointers but I dislike it: > > > > 1. You are not really doing an interval tree of any sorts so rb seems a > > bit of an overkill. How many block groups/devices do you expect to have > > in the rb tree i.e how many swap files per file system so that the logn > > search behavior really matter? Why not a simple linked list and just an > > equality comparison of pointers? > > We know there's at least one block group per gigabyte of swap, but there > can be many more if the file is fragmented. We could probably get away > with a linked list in most cases, but the number of entries is only > bounded by the size of the filesystem * number of swapfiles. With a > linked list, checking n block groups for balance becomes O(n^2) instead > of O(n * log n), so I'd rather be on the safe side here. The rbtree > manipulation isn't that much more complicated than using a linked list, > after all. > > > 2. The code self-admits that using pointers for lt/gr comparison is a > > hack since in case pointers match you fall back to checking the inode > > pointer > > It's not a fallback. A block group or device can contain more than one > swapfile, so it's really a separate entry. > > > 3. There is a discrepancy between the keys used for adding (ptr + inode) > > and deletion (just inode) > > Well, yeah, we add an entry that says block group/device X is pinned by > inode Y, and we delete all of the entries pinned by inode Y. > > > At the very least this hack needs to be at least mentioned in the > > changelog. > > I'll add these comments: > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index e37ce40db380..1c258ee4be24 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -719,6 +719,11 @@ struct btrfs_delayed_root; > /* > * Block group or device which contains an active swapfile. Used for preventing > * unsafe operations while a swapfile is active. > + * > + * These are sorted on (ptr, inode) (note that a block group or device can > + * contain more than one swapfile). We compare the pointer values because we > + * don't actually care what the object is, we just need a quick check whether > + * the object exists in the rbtree. > */ > struct btrfs_swapfile_pin { > struct rb_node node; > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 55aba2d7074c..e103e81c6533 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -10481,6 +10481,7 @@ static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr, > return 0; > } > > +/* Free all of the entries pinned by this swapfile. */ > static void btrfs_free_swapfile_pins(struct inode *inode) > { > struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info; > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 514932c47bcd..062ad86358ad 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -7545,6 +7545,10 @@ int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info) > return ret; > } > > +/* > + * Check whether the given block group or device is pinned by any inode being > + * used as a swapfile. > + */ > bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr) > { > struct btrfs_swapfile_pin *sp; Pushed to https://github.com/osandov/linux/tree/btrfs-swap, along with moving the docstring in patch 5.