Hi Darrick,

On Tue, 2020-01-28 at 09:46 -0800, Darrick J. Wong wrote:
[...]
> > +/*
> > + * Update a file inode access permissions based on the file zone condition.
> > + */
> > +static void zonefs_update_file_perm(struct inode *inode, struct blk_zone *zone)
> > +{
> > +        if (zone->cond == BLK_ZONE_COND_OFFLINE) {
> > +                /*
> > +                 * Dead zone: make the inode immutable, disable all accesses
> > +                 * and set the file size to 0 (zone wp set to zone start).
> > +                 */
> > +                inode->i_flags |= S_IMMUTABLE;
>
> One annoying nit about setting S_IMMUTABLE: the generic vfs write
> routines do not check S_IMMUTABLE, which means that zonefs will have to
> do that on its own.
>
> I tried to fix it last year, but there were complaints that it could
> break existing workloads (open O_TMPFILE for write, mark it immutable,
> link it into the filesystem, continue to write it since you're the only
> writer...)

OK. Understood. Adding checks where appropriate.

> > +                inode->i_mode &= ~0777;
> > +                zone->wp = zone->start;
> > +        } else if (zone->cond == BLK_ZONE_COND_READONLY) {
> > +                /* Do not allow writes in read-only zones */
> > +                inode->i_flags |= S_IMMUTABLE;
> > +                inode->i_mode &= ~0222;
> > +        }
> > +}
> > +
> > +struct zonefs_ioerr_data {
> > +        struct inode    *inode;
> > +        bool            write;
> > +};
> > +
> > +static int zonefs_io_err_cb(struct blk_zone *zone, unsigned int idx, void *data)
> > +{
> > +        struct zonefs_ioerr_data *ioerr = data;
> > +        struct inode *inode = ioerr->inode;
> > +        struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +        struct super_block *sb = inode->i_sb;
> > +        loff_t isize, wp_ofst;
> > +
> > +        /*
> > +         * The condition of the zone may have change. Fix the file access
> > +         * permissions if necessary.
> > +         */
> > +        zonefs_update_file_perm(inode, zone);
> > +
> > +        /*
> > +         * There is no write pointer on conventional zones and read operations
> > +         * do not change a zone write pointer. So there is nothing more to do
> > +         * for these two cases.
> > +         */
> > +        if (zi->i_ztype == ZONEFS_ZTYPE_CNV || !ioerr->write)
> > +                return 0;
> > +
> > +        /*
> > +         * For sequential zones write, make sure that the zone write pointer
> > +         * position is as expected, that is, in sync with the inode size.
> > +         */
> > +        wp_ofst = (zone->wp - zone->start) << SECTOR_SHIFT;
> > +        zi->i_wpoffset = wp_ofst;
> > +        isize = i_size_read(inode);
> > +
> > +        if (isize == wp_ofst)
> > +                return 0;
> > +
> > +        /*
> > +         * The inode size and the zone write pointer are not in sync.
> > +         * If the inode size is below the zone write pointer, then data was
>
> I'm a little confused about what events these states reflect.
>
> "inode size is below the zone wp" -- let's say we have a partially
> written sequential zone:
>
>     isize
> ----v---------------
> DDDDD
> ----^---------------
>     WP
>
> Then we tried to write to the end of the sequential zone:
>
>     isize
> ----v---------------
> DDDDDWWWW
> --------^-----------
>         WP
>
> Then an error happens so we didn't update the isize, and now we see that
> the write pointer is beyond isize (pretend the write failed to the '?'
> area):
>
>     isize
> ----v---------------
> DDDDDD?DD
> --------^-----------
>         WP

If the write failed at the "?" location, then the zone write pointer points
to that location since nothing after that location can be written unless
that location itself is first written. So with your example, the drive will
give back:

    isize
----v---------------
DDDDDD?XX
------^-------------
      WP

With XX denoting the unwritten part of the issued write.
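Put differently, the write pointer rule means that the amount of readable
data in a sequential zone file can always be derived directly from the
reported zone. A minimal restatement in code (the helper name is mine, for
illustration only; this is what the wp_ofst computation in
zonefs_io_err_cb() above relies on):

/*
 * Illustration only: with the ZBC/ZAC write pointer semantics, all sectors
 * in [zone->start, zone->wp) have been written and are readable, so the
 * recovered file size after a failed write is simply the write pointer
 * offset within the zone, converted from 512B sectors to bytes.
 */
static inline loff_t zonefs_zone_readable_bytes(struct blk_zone *zone)
{
        return (zone->wp - zone->start) << SECTOR_SHIFT;
}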
> So if we increase isize to match the WP, what happens when userspace
> tries to read the question-mark area? Do they get read errors? Stale
> contents?

Nope, see above: the write pointer always points to the sector following
the last sector correctly written. So increasing isize to the write pointer
location only exposes the data that was actually written and is readable.
No stale data.

> Or am I misunderstanding SMR firmware, and the drive only advances the
> write pointer once it has written a block? i.e. if a write fails in
> the middle, the drive ends up in this state, not the one I drew above:
>
>     isize
> ----v---------------
> DDDDDD?
> -----^--------------
>      WP
>
> In which case it would be fine to push isize up to the write pointer?

Exactly. This is how the ZBC & ZAC (and upcoming ZNS) specifications define
the write pointer behavior. That makes error recovery a lot easier and does
not result in stale data accesses. Just note the one-off difference for the
WP position compared to your example: the WP points at the error location,
not at the last written location. Indexing from 0, (wp - zone start) is
thus always equal to isize, with all written and readable data in the
sector range between the zone start and the zone write pointer.

> Aha, you /did/ say exactly this in the v8 thread.
>
> > +         * writen at the end of the file. This can happen in the case of a
> > +         * partial failure of a large multi-bio DIO. No data is lost. Simply fix
> > +         * the inode size to reflect the partial write.

Yes. I further improved this comment to make it, I hope this time, super
easy to understand.

> > +         * On the other hand, if the inode size is over the zone write pointer,
> > +         * then there was an external corruption, e.g. an application reset the
> > +         * file zone directly, or the device has a problem.
> So I guess this case "isize is greater than WP" means we start with
> this appending write to what we think is the end of the zone:
>
>     isize
> ----v---------------
> DDDDDWWWW
> --------------------
>
> (The position of the WP is irrelevant here)
>
> Then we get a disk error, so we query the WP and discover that it's
> actually below isize:
>
>     isize
> ----v---------------
> DDDDDDD
> -^------------------
>  WP
>
> So now we conclude that either the drive is broken or someone is messing
> with the zones behind our back, so we'd rather just shut down and let
> the sysadmin figure it out? Because while we could truncate the zone
> file down to the WP, this is a sign that something could be seriously
> broken?

Yes. Exactly. The figure for the file after such an error would be:

    isize
----v---------------
DDXXX
-^------------------
 WP

With the XXX sectors being garbage data, since read accesses to sectors
after a zone write pointer return zeroes, or the drive format pattern if
one is set. This also means that the "DD" data above cannot be trusted:
since we started with isize beyond the WP, we must have seen WP == isize on
mount, and with the SMR specifications, the only way to get into the
situation above is if the zone was reset and rewritten behind our back.
It is hard to decide on the best action to take here considering the simple
nature of zonefs (i.e. essentially just a simpler interface for raw block
device file accesses).

Including your comments on mount options, I came up with these actions that
the user can choose with mount options:

* repair: Truncate the inode size only, nothing else
* remount-ro (default): Truncate the inode size and remount read-only
* zone-ro: Truncate the inode size and set the inode read-only
* zone-offline: Truncate the inode size to 0 and assume that its zone is
  offline (no reads nor writes possible)

This gives, I think, a good range of possible behaviors that the user may
want, from doing almost nothing (repair) to the extreme of preventing any
access to potentially bad data (zone-offline).
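To make these actions concrete, the option handling in the error callback
boils down to something like the sketch below. This is only an illustration
of the behaviors listed above, not the final patch; the
zonefs_handle_error() name and the ZONEFS_MNTOPT_ERRORS_ZRO/ZOL macro names
are placeholders.

/*
 * Sketch only (not the actual patch): per "errors=" option actions taken
 * when zonefs_io_err_cb() finds the inode size and the zone write pointer
 * out of sync. All options fix the inode size; the stricter ones also
 * restrict further access to the bad zone or to the whole filesystem.
 */
static void zonefs_handle_error(struct inode *inode, loff_t wp_ofst)
{
        struct super_block *sb = inode->i_sb;
        struct zonefs_sb_info *sbi = ZONEFS_SB(sb);

        if (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_RO) {
                /* errors=remount-ro (default): fix isize, remount read-only */
                if (!sb_rdonly(sb))
                        sb->s_flags |= SB_RDONLY;
        } else if (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_ZRO) {
                /* errors=zone-ro: fix isize and forbid writes to this zone */
                inode->i_flags |= S_IMMUTABLE;
                inode->i_mode &= ~0222;
        } else if (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_ZOL) {
                /* errors=zone-offline: drop the zone content altogether */
                inode->i_flags |= S_IMMUTABLE;
                inode->i_mode &= ~0777;
                wp_ofst = 0;
        }
        /* errors=repair: nothing more than the inode size fix below */

        zonefs_update_stats(inode, wp_ofst);
        i_size_write(inode, wp_ofst);
}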
> (Oh, you said this in the v8 thread too.)
>
> > +         */
> > +        zonefs_warn(sb, "inode %lu: size %lld should be %lld\n",
> > +                    inode->i_ino, isize, wp_ofst);
> > +        if (isize > wp_ofst) {
> > +                struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
> > +
> > +                if ((sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_RO) &&
>
> Mount options? Hey, wait a minute, this didn't exist in v8...

Yes, this is an improvement in v9 to better handle all error cases
(indirectly suggested by Dave who pointed out deficiencies in that area).

> > +                    !sb_rdonly(sb)) {
> > +                        zonefs_warn(sb,
> > +                                "Zone %lu corruption detected, remounting fs read-only\n",
> > +                                inode->i_ino);
> > +                        sb->s_flags |= SB_RDONLY;
> > +                        return 0;
> > +                } else if (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_CONT) {
> > +                        zonefs_warn(sb,
> > +                                "Zone %lu corruption detected, continuing\n",
> > +                                inode->i_ino);
>
> I'm frankly not sure errors=continue makes sense for a filesystem. It
> exists for ext* as a crutch for the root fs to help users stumble
> towards /sbin/reboot and a full fsck afterwards.

Good point.

> Also wondering if you should have an errors=zone-ro that will set
> S_IMMUTABLE on the zone file? That would enable the intact zones to
> keep operating.

Done. And as noted above, I also added "errors=zone-offline" and
"errors=repair".

> (Or I guess if you really want a "continue" mode you could truncate the
> zone...)

That is the errors=repair option now. It is clearer this way, I think.

> > +                } else if (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_PANIC) {
>
> I don't think it's a good idea to crash the entire kernel on zone
> corruption.

I have dropped this one.

> > +                        zonefs_panic(sb,
> > +                                "Zone %lu corruption detected\n",
> > +                                inode->i_ino);
> > +                }
> > +        }
> > +
> > +        zonefs_update_stats(inode, wp_ofst);
> > +        i_size_write(inode, wp_ofst);
> > +
> > +        return 0;
> > +}
> > +
> > +/*
> > + * When an IO error occurs, check the target zone to see if there is a change
> > + * in the zone condition (e.g. offline or read-only). For a failed write to a
> > + * sequential zone, the zone write pointer position must also be checked to
> > + * eventually correct the file size and zonefs inode write pointer offset
> > + * (which can be out of sync with the drive due to partial write failures).
> > + */
> > +static void zonefs_io_error(struct inode *inode, bool write)
> > +{
> > +        struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +        struct super_block *sb = inode->i_sb;
> > +        struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
> > +        unsigned int noio_flag;
> > +        unsigned int nr_zones =
> > +                zi->i_max_size >> (sbi->s_zone_sectors_shift + SECTOR_SHIFT);
> > +        struct zonefs_ioerr_data ioerr = {
> > +                .inode = inode,
> > +                .write = write
> > +        };
> > +        int ret;
> > +
> > +        mutex_lock(&zi->i_truncate_mutex);
> > +
> > +        /*
> > +         * Memory allocations in blkdev_report_zones() can trigger a memory
> > +         * reclaim which may in turn cause a recursion into zonefs as well as
> > +         * BIO allocations for the same device. The former case may end up in
> > +         * a deadlock on the inode truncate mutex, while the latter may prevent
> > +         * forward progress with BIO allocations as we are potentially still
> > +         * holding the failed BIO. Executing the report zones under GFP_NOIO
> > +         * avoids both problems.
> > +         */
> > +        noio_flag = memalloc_noio_save();
>
> Don't you still need memalloc_nofs_ here too?

noio implies nofs, doesn't it? Or rather, noio is more restrictive than
nofs here, which is safer since we need a struct request to be able to
execute blkdev_report_zones().

> > +        ret = blkdev_report_zones(sb->s_bdev, zi->i_zsector, nr_zones,
> > +                                  zonefs_io_err_cb, &ioerr);
> > +        if (ret != nr_zones)
> > +                zonefs_err(sb, "Get inode %lu zone information failed %d\n",
> > +                           inode->i_ino, ret);
> > +        memalloc_noio_restore(noio_flag);
> > +
> > +        mutex_unlock(&zi->i_truncate_mutex);
> > +}
> > +
> > +static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size,
> > +                                        int error, unsigned int flags)
> > +{
> > +        struct inode *inode = file_inode(iocb->ki_filp);
> > +        struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +
> > +        if (error) {
> > +                zonefs_io_error(inode, true);
> > +                return error;
> > +        }
> > +
> > +        if (size && zi->i_ztype != ZONEFS_ZTYPE_CNV) {
> > +                mutex_lock(&zi->i_truncate_mutex);
> > +                if (i_size_read(inode) < iocb->ki_pos + size) {
> > +                        zonefs_update_stats(inode, iocb->ki_pos + size);
> > +                        i_size_write(inode, iocb->ki_pos + size);
> > +                }
> > +                mutex_unlock(&zi->i_truncate_mutex);
> > +        }
> > +
> > +        return 0;
> > +}
> > +
> > +static const struct iomap_dio_ops zonefs_write_dio_ops = {
> > +        .end_io = zonefs_file_write_dio_end_io,
> > +};

Unrelated to your other comments, I discovered that the end_io operation is
called with the flags argument being dio->flags. Since the flags for that
are the IOMAP_DIO_XXX flags defined in fs/iomap/direct-io.c, the flag
values are not visible to the end_io() implementation and the callback
cannot determine if the dio is a read or a write. This can be worked around
by defining one end_io op for reads and another for writes (which I did
here, see zonefs_file_read_dio_end_io()). But we could simplify the code by
simply adding the IOMAP_XXX flags passed to iomap_begin() into dio->flags
(these two sets of flags do not collide, as mentioned in
fs/iomap/direct-io.c). That would keep the interface in
include/linux/iomap.h clean (no new flags) and give more information to the
end_io() callback. With that, I could get rid of the
zonefs_file_read_dio_end_io() function and change
zonefs_file_write_dio_end_io() into zonefs_file_dio_end_io() for both read
and write operations. Less code. Thoughts?
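To illustrate the idea: assuming iomap_dio_rw() were changed to OR the
flags it passes to iomap_begin() (e.g. IOMAP_WRITE) into dio->flags before
calling ->end_io(), the single callback could look like the sketch below.
This obviously depends on that iomap change being acceptable and is not a
tested patch:

static int zonefs_file_dio_end_io(struct kiocb *iocb, ssize_t size,
                                  int error, unsigned int flags)
{
        struct inode *inode = file_inode(iocb->ki_filp);
        struct zonefs_inode_info *zi = ZONEFS_I(inode);
        /* Assumes IOMAP_WRITE is now visible in the end_io flags */
        bool is_write = flags & IOMAP_WRITE;

        if (error) {
                zonefs_io_error(inode, is_write);
                return error;
        }

        /* Advance the inode size only for successful sequential zone writes */
        if (size && is_write && zi->i_ztype != ZONEFS_ZTYPE_CNV) {
                mutex_lock(&zi->i_truncate_mutex);
                if (i_size_read(inode) < iocb->ki_pos + size) {
                        zonefs_update_stats(inode, iocb->ki_pos + size);
                        i_size_write(inode, iocb->ki_pos + size);
                }
                mutex_unlock(&zi->i_truncate_mutex);
        }

        return 0;
}

/* A single set of dio ops could then serve both the read and write paths */
static const struct iomap_dio_ops zonefs_dio_ops = {
        .end_io = zonefs_file_dio_end_io,
};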
> > +
> > +/*
> > + * Handle direct writes. For sequential zone files, this is the only possible
> > + * write path. For these files, check that the user is issuing writes
> > + * sequentially from the end of the file. This code assumes that the block layer
> > + * delivers write requests to the device in sequential order. This is always the
> > + * case if a block IO scheduler implementing the ELEVATOR_F_ZBD_SEQ_WRITE
>
> Is there any way for zonefs to detect that it's talking to an io
> scheduler that doesn't support ZBD_SEQ_WRITE and react accordingly (log
> message, refuse to mount, etc.)?

Not really. It can be done if zonefs sits directly on the bdev of a real
device, but if the block device comes from a BIO-based device mapper target
(e.g. dm-linear), then there is no scheduler for that device. Scheduling
happens on the backend device(s) in that case, and that is invisible from
the top bdev interface. Not to mention that the target may be using several
devices... Furthermore, I am trying to limit dependencies on the block
layer implementation of "sequential write guarantees" as much as possible,
since we are still trying to evolve that into something that works for any
scheduler.

> > + * elevator feature is being used (e.g. mq-deadline). The block layer always
> > + * automatically select such an elevator for zoned block devices during the
> > + * device initialization.
>
> Or is the case that the block layer knows when it's dealing with a zoned
> block device and will not allow the assignment of an ioscheduler that
> does not support ZBD_SEQ_WRITE?

Currently, for zoned block devices, the block layer will only allow setting
a scheduler that has the ZBD_SEQ_WRITE feature. The only one that does for
now is mq-deadline. Schedulers that do not support this feature will not
even be shown in /sys/block/xxx/queue/scheduler. The only exception to this
is "none", which is always allowed.

> [...]
>
> > +static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +        struct inode *inode = file_inode(iocb->ki_filp);
> > +
> > +        /* Write operations beyond the zone size are not allowed */
> > +        if (iocb->ki_pos >= ZONEFS_I(inode)->i_max_size)
> > +                return -EFBIG;
>
> This needs a check for IS_IMMUTABLE so that userspace can't write to
> zones which zonefs has decided are no longer writable, even if the
> program has a writeable file descriptor.

Done, with additional checks in zonefs_file_read_iter() for offline zones
(immutable + no reads allowed).
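For reference, the added test amounts to something like the sketch below.
The helper name is made up for this email; the conditions simply mirror
what zonefs_update_file_perm() above sets for read-only and offline zones.

/*
 * Illustration only, not the actual patch code: read-only and offline
 * zones are flagged with S_IMMUTABLE by zonefs_update_file_perm(), with
 * all access bits cleared for offline zones. Since the generic VFS write
 * path does not check S_IMMUTABLE, the read and write iter methods must
 * do it themselves.
 */
static inline int zonefs_check_zone_access(struct inode *inode, bool write)
{
        /* Offline zone: neither reads nor writes are allowed */
        if (IS_IMMUTABLE(inode) && !(inode->i_mode & 0777))
                return -EPERM;

        /* Read-only (or offline) zone: reject any write */
        if (write && IS_IMMUTABLE(inode))
                return -EPERM;

        return 0;
}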
> [...]
>
> > +/*
> > + * Check that the device is zoned. If it is, get the list of zones and create
> > + * sub-directories and files according to the device zone configuration and
> > + * format options.
> > + */
> > +static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
> > +{
> > +        struct zonefs_zone_data zd;
> > +        struct zonefs_sb_info *sbi;
> > +        struct inode *inode;
> > +        enum zonefs_ztype t;
> > +        int ret;
> > +
> > +        if (!bdev_is_zoned(sb->s_bdev)) {
> > +                zonefs_err(sb, "Not a zoned block device\n");
> > +                return -EINVAL;
> > +        }
> > +
> > +        /*
> > +         * Initialize super block information: the maximum file size is updated
> > +         * when the zone files are created so that the format option
> > +         * ZONEFS_F_AGGRCNV which increases the maximum file size of a file
> > +         * beyond the zone size is taken into account.
> > +         */
> > +        sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
> > +        if (!sbi)
> > +                return -ENOMEM;
> > +
> > +        spin_lock_init(&sbi->s_lock);
> > +        sb->s_fs_info = sbi;
> > +        sb->s_magic = ZONEFS_MAGIC;
> > +        sb->s_maxbytes = 0;
> > +        sb->s_op = &zonefs_sops;
> > +        sb->s_time_gran = 1;
> > +
> > +        /*
> > +         * The block size is set to the device physical sector size to ensure
> > +         * that write operations on 512e devices (512B logical block and 4KB
> > +         * physical block) are always aligned to the device physical blocks,
> > +         * as mandated by the ZBC/ZAC specifications.
> > +         */
> > +        sb_set_blocksize(sb, bdev_physical_block_size(sb->s_bdev));
> > +        sbi->s_blocksize_mask = sb->s_blocksize - 1;
> > +        sbi->s_zone_sectors_shift = ilog2(bdev_zone_sectors(sb->s_bdev));
> > +        sbi->s_uid = GLOBAL_ROOT_UID;
> > +        sbi->s_gid = GLOBAL_ROOT_GID;
> > +        sbi->s_perm = 0640;
> > +        sbi->s_mount_opts = ZONEFS_MNTOPT_ERRORS_RO;
> > +
> > +        ret = zonefs_read_super(sb);
> > +        if (ret)
> > +                return ret;
> > +
> > +        ret = zonefs_parse_options(sb, data);
> > +        if (ret)
> > +                return ret;
> > +
> > +        memset(&zd, 0, sizeof(struct zonefs_zone_data));
> > +        zd.sb = sb;
> > +        ret = zonefs_get_zone_info(&zd);
> > +        if (ret)
> > +                goto out;
> > +
>
> It might be a good idea to spit out an EXPERIMENTAL warning at mount
> time for the first 6 months while you, uh, seek out advanced bleeding
> edge testers to really give this code a thorough workout.
>
>         zonefs_warn(sb, "EXPERIMENTAL filesystem in use; use at your own risk");

Yes, I thought about this too but I am still wondering if it is the right
thing to do. See below.

> Or something like that to manage peoples' expectations in case you find
> a really nasty data-chomping bug. :)

Well, my view is that since zonefs does not have any run-time changing
on-disk metadata, it is not worse than the raw block device file use case
in terms of reliability. Unmounting zonefs and, ignoring the first zone of
the device that holds the superblock, using the zones directly through the
raw block device file with open/close/read/write/ioctl would give the same
level of confidence about the data in the zones. If anything, zonefs
improves on that with the various checks it adds for writes and IO errors
(fs/block_dev.c does not have anything like that for zoned block devices).

Of course I do not mean that zonefs is bug free. But I still consider the
likelihood of losing data to be equivalent to the raw block device file
case: it will mostly depend on the application doing the right thing. The
value of zonefs is in the file access interface simplification, not in
strong additional guarantees about data loss or corruption detection. So
warning about the experimental status may be too scary and discourage users
from using it, pushing them to keep developing for the raw block device
file access use case instead. I would rather encourage people to start
using zonefs now, especially considering that the upcoming NVMe ZNS will
need some additional zone specific handling (zone resource control for
writes) that is fairly easy to handle with a one-file-per-zone in-kernel FS
interface. That simplifies the application implementation even further. But
I do not have a strong feeling about it either, and I will add the warning
if you or others insist :)

> (Or as a lever to convince people to stop running old code some day...)

I am still trying to convince a lot of SMR users to move away from SG_IO
and use the kernel block layer instead. But a lot of deployments still use
enterprise distros with kernels that do not have SMR support. Once zonefs
is in the kernel, I will definitely push for its use in place of the raw
block device file interface, as that also simplifies support for various
application programming languages (e.g. SMR drive handling directly from
Java or Python).

Thank you for all your comments.

-- 
Damien Le Moal
Western Digital Research