Re: [PATCH] ext4: Add XIP functionality

Jan Kara <jack@xxxxxxx> · Mon, 2 Dec 2013 09:34:59 +0100

  Hello,

On Mon 18-11-13 14:51:32, Ross Zwisler wrote:
> This is a port of the XIP functionality found in the current version of
> ext2.  This patch set is intended to achieve feature parity with XIP in
> ext2 rather than non-XIP in ext4.  In particular, it lacks support for
> splice and AIO.  We'll be submitting patches in the future to add that
> functionality, but we think this is a good start.
> 
> There are also a couple of bugs that also appear in ext2 around handling
> of the xip mount option; we're currently investigating and will submit
> patches to fix both in ext2 and ext4, but didn't want to delay getting
> this patch out for comment.
> 
> The motivation behind this work is that we believe that the XIP feature
> will begin to find new uses as various persistent memory devices and
> technologies come on to the market.  Having direct, byte-addressable
> access to persistent memory without having an additional copy in the
> page cache can be a win in terms of I/O latency and overall memory
> usage.
  Yes, I believe implementing XIP in ext4 is desirable. It is the only
ext2 feature I'm aware of that is missing from ext4.

> This patch applies cleanly to v3.12, and was tested using brd as our
> block driver.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
> Reviewed-by: Andreas Dilger <andreas.dilger@xxxxxxxxx>
> ---
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e274e9c..dea66bb 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
...
> @@ -4645,11 +4673,19 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
>  			} else
>  				ext4_wait_for_tail_page_commit(inode);
>  		}
> -		/*
> -		 * Truncate pagecache after we've waited for commit
> -		 * in data=journal mode to make pages freeable.
> -		 */
> +
> +		if (mapping_is_xip(inode->i_mapping)) {
> +			error = xip_truncate_page(inode->i_mapping,
> +						  inode->i_size);
> +			if (error)
> +				goto err_out;
> +		} else {
> +			/*
> +			 * Truncate pagecache after we've waited for commit
> +			 * in data=journal mode to make pages freeable.
> +			 */
>  			truncate_pagecache(inode, inode->i_size);
> +		}
>  	}
>  	/*
>  	 * We want to call ext4_truncate() even if attr->ia_size ==
  A much more logical place for this would be in ext4_truncate(), at the
place where we call ext4_block_truncate_page(), because xip_truncate_page()
does the same job as ext4_block_truncate_page().

Also, having thought about it for a while: you must call
truncate_pagecache() in XIP mode as well, to unmap the PTEs of the pages
removed by truncate. In ext2 this is hidden in the truncate_setsize()
call...

Also you seem to be missing hole punching support entirely. For that
you'd need to modify xip_truncate_page() to accept not only the offset but
also the length of the truncated area (a separate patch please). Then you
will need to call that function from ext4_punch_hole() at the place where
ext4_zero_partial_blocks() is used.
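
To illustrate what an offset-plus-length variant would have to do, here is a
rough user-space model. It is only a sketch: zero_partial_range() and its
parameters are made up for this example and are not kernel interfaces, and
'mem' merely stands in for the directly mapped file data. The idea is to zero
only the partial blocks at the head and tail of the punched range and leave
the whole blocks in between for the caller to free.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical user-space model, not kernel code.
 * 'mem' stands in for directly mapped (XIP) file data and 'blocksize' for
 * the filesystem block size. Only the partial blocks at the edges of
 * [offset, offset + length) are zeroed; whole blocks inside the range are
 * left untouched for the caller to free.
 */
static void zero_partial_range(unsigned char *mem, size_t blocksize,
			       size_t offset, size_t length)
{
	size_t end, first_full, last_full;

	assert(mem != NULL && blocksize > 0);
	if (length == 0)
		return;

	end = offset + length;
	/* First block boundary at or after 'offset'. */
	first_full = (offset + blocksize - 1) / blocksize * blocksize;
	/* Last block boundary at or before 'end'. */
	last_full = end / blocksize * blocksize;

	/* The range sits strictly inside one block: zero just that piece. */
	if (first_full > last_full) {
		memset(mem + offset, 0, length);
		return;
	}

	/* Partial block at the head of the range. */
	if (offset < first_full)
		memset(mem + offset, 0, first_full - offset);

	/* Partial block at the tail of the range. */
	if (end > last_full)
		memset(mem + last_full, 0, end - last_full);
}
```

With blocksize 8, offset 3 and length 18, this zeroes bytes 3..7 and 16..20
and leaves the full block 8..15 alone.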

Finally, as Matthew Wilcox pointed out
(http://www.spinics.net/lists/linux-fsdevel/msg70582.html), there's a race
between truncate and mmap in the xip support because xip lacks
serialization on page locks. So I believe we should solve that when we are
adding XIP support to another filesystem... Using mmap_sem for that might
be viable, but I have to try.

> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 2c2e6cb..18e70d2 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
...
> @@ -3525,11 +3532,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  		}
>  		if (test_opt(sb, DELALLOC))
>  			clear_opt(sb, DELALLOC);
> +		if (test_opt(sb, XIP)) {
> +			ext4_msg(sb, KERN_ERR, "can't mount with "
> +				 "both data=journal and xip");
> +			goto failed_mount;
> +		}
>  	}
>  
>  	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
>  		(test_opt(sb, POSIX_ACL) ? MS_POSIXACL : 0);
>  
> +	ext4_xip_verify_sb(sb); /* see if bdev supports xip, unset
> +				    EXT4_MOUNT_XIP if not */
> +
  I don't like clearing the flag inside this function. Please just open-code
the function here (I don't think the other call site in ext4_remount()
makes sense at all).
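
As a rough illustration of the open-coded form, here is a reduced user-space
model. Everything in it (struct fake_sb, struct fake_bdev_ops,
FAKE_MOUNT_XIP, fake_fill_super()) is a hypothetical stand-in, not the real
kernel structures, and keeping the "warn and clear" behaviour is an
assumption carried over from the helper. The point is only that the flag
modification becomes visible at the call site instead of being hidden in a
function whose name suggests a pure check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define FAKE_MOUNT_XIP 0x1UL	/* hypothetical stand-in for EXT4_MOUNT_XIP */

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct fake_bdev_ops {
	int (*direct_access)(void);	/* NULL if the device cannot do XIP */
};

struct fake_sb {
	unsigned long mount_opt;
	const struct fake_bdev_ops *ops;
};

/* Dummy callback representing a device that supports direct access. */
static int fake_direct_access(void)
{
	return 0;
}

/* Model of the mount path with the check written out at the call site. */
static int fake_fill_super(struct fake_sb *sb)
{
	assert(sb != NULL && sb->ops != NULL);

	/* ... earlier option parsing would happen here ... */

	if ((sb->mount_opt & FAKE_MOUNT_XIP) && !sb->ops->direct_access) {
		fprintf(stderr, "warning: ignoring xip option - "
				"not supported by bdev\n");
		sb->mount_opt &= ~FAKE_MOUNT_XIP;
	}

	/* ... the rest of the mount would continue here ... */
	return 0;
}
```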

>  	if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV &&
>  	    (EXT4_HAS_COMPAT_FEATURE(sb, ~0U) ||
>  	     EXT4_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
> @@ -3576,6 +3591,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  		goto failed_mount;
>  	}
>  
> +	if (ext4_use_xip(sb) && blocksize != PAGE_SIZE) {
> +		if (!silent)
> +			ext4_msg(sb, KERN_ERR,
> +				"error: unsupported blocksize for xip");
> +		goto failed_mount;
> +	}
> +
>  	if (sb->s_blocksize != blocksize) {
>  		/* Validate the filesystem blocksize */
>  		if (!sb_set_blocksize(sb, blocksize)) {
> @@ -4707,6 +4729,7 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
>  	struct ext4_super_block *es;
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
>  	unsigned long old_sb_flags;
> +	unsigned long old_mount_opt = sbi->s_mount_opt;
>  	struct ext4_mount_options old_opts;
>  	int enable_quota = 0;
>  	ext4_group_t g;
> @@ -4773,7 +4796,23 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
>  	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
>  		(test_opt(sb, POSIX_ACL) ? MS_POSIXACL : 0);
>  
> +	ext4_xip_verify_sb(sb); /* see if bdev supports xip, unset
> +				    EXT4_MOUNT_XIP if not */
> +
> +	if (ext4_use_xip(sb) && sb->s_blocksize != PAGE_SIZE) {
> +		ext4_msg(sb, KERN_WARNING,
> +			"warning: unsupported blocksize for xip");
> +		err = -EINVAL;
> +		goto restore_opts;
> +	}
> +
>  	es = sbi->s_es;
> +	if ((sbi->s_mount_opt ^ old_mount_opt) & EXT4_MOUNT_XIP) {
> +		ext4_msg(sb, KERN_WARNING, "warning: refusing change of "
> +			 "xip flag with busy inodes while remounting");
> +		sbi->s_mount_opt &= ~EXT4_MOUNT_XIP;
> +		sbi->s_mount_opt |= old_mount_opt & EXT4_MOUNT_XIP;
> +	}
  So why do you bother with ext4_xip_verify_sb() and the other checks when
you disallow changing the xip flag on remount anyway (which I think makes
sense)?
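
To make the redundancy concrete, here is a small user-space model of the
restore logic from the hunk above. The bit values and the helper name
keep_old_xip() are invented for the example. Whatever the earlier code did
to the XIP bit, the result always carries the old XIP bit, so any
verification done on the new value before this point is thrown away.

```c
#include <assert.h>

#define FAKE_MOUNT_XIP   0x1UL	/* hypothetical bit values */
#define FAKE_MOUNT_OTHER 0x2UL

/*
 * Returns the option word to keep after a remount attempt: every bit comes
 * from 'requested' except the XIP bit, which is always taken from 'old'.
 * *changed is set to 1 if the caller tried to flip the XIP bit, else 0.
 */
static unsigned long keep_old_xip(unsigned long old, unsigned long requested,
				  int *changed)
{
	assert(changed != 0);

	/* XOR exposes the bits that differ; mask down to the XIP bit. */
	*changed = ((requested ^ old) & FAKE_MOUNT_XIP) != 0;

	/* Drop the requested XIP bit and put the old one back. */
	requested &= ~FAKE_MOUNT_XIP;
	requested |= old & FAKE_MOUNT_XIP;
	return requested;
}
```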

>  	if (sbi->s_journal) {
>  		ext4_init_journal_params(sb, sbi->s_journal);
> diff --git a/fs/ext4/xip.c b/fs/ext4/xip.c
> new file mode 100644
> index 0000000..e0a430a
> --- /dev/null
> +++ b/fs/ext4/xip.c
> @@ -0,0 +1,91 @@
> +/*
> + *  linux/fs/ext4/xip.c
> + *
> + * Copyright (C) 2005 IBM Corporation
> + * Author: Carsten Otte (cotte@xxxxxxxxxx)
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/fs.h>
> +#include <linux/genhd.h>
> +#include <linux/buffer_head.h>
> +#include <linux/blkdev.h>
> +#include "ext4.h"
> +#include "xip.h"
> +
> +static inline int
> +__inode_direct_access(struct inode *inode, sector_t block,
> +		      void **kaddr, unsigned long *pfn)
> +{
> +	struct block_device *bdev = inode->i_sb->s_bdev;
> +	const struct block_device_operations *ops = bdev->bd_disk->fops;
> +	sector_t sector;
> +
> +	sector = block * (PAGE_SIZE / 512); /* ext4 block to bdev sector */
> +
> +	BUG_ON(!ops->direct_access);
> +	return ops->direct_access(bdev, sector, kaddr, pfn);
> +}
> +
> +static inline int
> +__ext4_get_block(struct inode *inode, pgoff_t pgoff, int create,
> +		   sector_t *result)
> +{
> +	struct buffer_head tmp;
> +	int rc;
> +
> +	memset(&tmp, 0, sizeof(struct buffer_head));
> +	tmp.b_size = inode->i_sb->s_blocksize;
> +	rc = ext4_get_block(inode, pgoff, &tmp, create);
> +	*result = tmp.b_blocknr;
  Please use ext4_map_blocks() directly. There's no need to go via
ext4_get_block() with its suboptimal buffer_head interface...

> +	/* did we get a sparse block (hole in the file)? */
> +	if (!tmp.b_blocknr && !rc) {
> +		BUG_ON(create);
> +		rc = -ENODATA;
> +	}
> +
> +	return rc;
> +}
> +

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR