Hi Darrick,

On Tue, 2020-01-28 at 09:46 -0800, Darrick J. Wong wrote:
[...]
> > +/*
> > + * Update a file inode access permissions based on the file zone condition.
> > + */
> > +static void zonefs_update_file_perm(struct inode *inode, struct blk_zone *zone)
> > +{
> > +        if (zone->cond == BLK_ZONE_COND_OFFLINE) {
> > +                /*
> > +                 * Dead zone: make the inode immutable, disable all accesses
> > +                 * and set the file size to 0 (zone wp set to zone start).
> > +                 */
> > +                inode->i_flags |= S_IMMUTABLE;
>
> One annoying nit about setting S_IMMUTABLE: the generic vfs write
> routines do not check S_IMMUTABLE, which means that zonefs will have to
> do that on its own.
>
> I tried to fix it last year, but there were complaints that it could
> break existing workloads (open O_TMPFILE for write, mark it immutable,
> link it into the filesystem, continue to write it since you're the only
> writer...)

OK. Understood. Adding checks where appropriate.

> > +                inode->i_mode &= ~0777;
> > +                zone->wp = zone->start;
> > +        } else if (zone->cond == BLK_ZONE_COND_READONLY) {
> > +                /* Do not allow writes in read-only zones */
> > +                inode->i_flags |= S_IMMUTABLE;
> > +                inode->i_mode &= ~0222;
> > +        }
> > +}
> > +
> > +struct zonefs_ioerr_data {
> > +        struct inode    *inode;
> > +        bool            write;
> > +};
> > +
> > +static int zonefs_io_err_cb(struct blk_zone *zone, unsigned int idx, void *data)
> > +{
> > +        struct zonefs_ioerr_data *ioerr = data;
> > +        struct inode *inode = ioerr->inode;
> > +        struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +        struct super_block *sb = inode->i_sb;
> > +        loff_t isize, wp_ofst;
> > +
> > +        /*
> > +         * The condition of the zone may have change. Fix the file access
> > +         * permissions if necessary.
> > +         */
> > +        zonefs_update_file_perm(inode, zone);
> > +
> > +        /*
> > +         * There is no write pointer on conventional zones and read operations
> > +         * do not change a zone write pointer. So there is nothing more to do
> > +         * for these two cases.
> > +         */
> > +        if (zi->i_ztype == ZONEFS_ZTYPE_CNV || !ioerr->write)
> > +                return 0;
> > +
> > +        /*
> > +         * For sequential zones write, make sure that the zone write pointer
> > +         * position is as expected, that is, in sync with the inode size.
> > +         */
> > +        wp_ofst = (zone->wp - zone->start) << SECTOR_SHIFT;
> > +        zi->i_wpoffset = wp_ofst;
> > +        isize = i_size_read(inode);
> > +
> > +        if (isize == wp_ofst)
> > +                return 0;
> > +
> > +        /*
> > +         * The inode size and the zone write pointer are not in sync.
> > +         * If the inode size is below the zone write pointer, then data was
>
> I'm a little confused about what events these states reflect.
>
> "inode size is below the zone wp" -- let's say we have a partially
> written sequential zone:
>
>     isize
> ----v---------------
> DDDDD
> ----^---------------
>     WP
>
> Then we tried to write to the end of the sequential zone:
>
>     isize
> ----v---------------
> DDDDDWWWW
> --------^-----------
>         WP
>
> Then an error happens so we didn't update the isize, and now we see that
> the write pointer is beyond isize (pretend the write failed to the '?'
> area):
>
>     isize
> ----v---------------
> DDDDDD?DD
> --------^-----------
>         WP

If the write failed at the "?" location, then the zone write pointer points
to that location since nothing after that location can be written unless
that location itself is first written. So with your example, the drive will
give back:

    isize
----v---------------
DDDDDD?XX
------^-------------
      WP

With XX denoting the unwritten part of the issued write.
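Put differently, the write pointer rule means that the amount of readable
data in a sequential zone file can always be derived directly from the
reported zone. A minimal restatement in code (the helper name is mine, for
illustration only; this is what the wp_ofst computation in
zonefs_io_err_cb() above relies on):

/*
 * Illustration only: with the ZBC/ZAC write pointer semantics, all sectors
 * in [zone->start, zone->wp) have been written and are readable, so the
 * recovered file size after a failed write is simply the write pointer
 * offset within the zone, converted from 512B sectors to bytes.
 */
static inline loff_t zonefs_zone_readable_bytes(struct blk_zone *zone)
{
        return (zone->wp - zone->start) << SECTOR_SHIFT;
}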
> So if we increase isize to match the WP, what happens when userspace
> tries to read the question-mark area? Do they get read errors? Stale
> contents?

Nope, see above: the write pointer always points to the sector following
the last sector correctly written. So increasing isize to the write pointer
location only exposes the data that was actually written and is readable.
No stale data.

> Or am I misunderstanding SMR firmware, and the drive only advances the
> write pointer once it has written a block? i.e. if a write fails in
> the middle, the drive ends up in this state, not the one I drew above:
>
>     isize
> ----v---------------
> DDDDDD?
> -----^--------------
>      WP
>
> In which case it would be fine to push isize up to the write pointer?

Exactly. This is how the ZBC & ZAC (and upcoming ZNS) specifications define
the write pointer behavior. That makes error recovery a lot easier and does
not result in stale data accesses. Just note the one-off difference for the
WP position compared to your example: the WP points at the error location,
not at the last written location. Indexing from 0, (wp - zone start) is
thus always equal to isize, with all written and readable data in the
sector range between the zone start and the zone write pointer.

> Aha, you /did/ say exactly this in the v8 thread.
>
> > +         * writen at the end of the file. This can happen in the case of a
> > +         * partial failure of a large multi-bio DIO. No data is lost. Simply fix
> > +         * the inode size to reflect the partial write.

Yes. I further improved this comment to make it, I hope this time, super
easy to understand.

> > +         * On the other hand, if the inode size is over the zone write pointer,
> > +         * then there was an external corruption, e.g. an application reset the
> > +         * file zone directly, or the device has a problem.
> So I guess this case "isize is greater than WP" means we start with
> this appending write to what we think is the end of the zone:
>
>     isize
> ----v---------------
> DDDDDWWWW
> --------------------
>
> (The position of the WP is irrelevant here)
>
> Then we get a disk error, so we query the WP and discover that it's
> actually below isize:
>
>     isize
> ----v---------------
> DDDDDDD
> -^------------------
>  WP
>
> So now we conclude that either the drive is broken or someone is messing
> with the zones behind our back, so we'd rather just shut down and let
> the sysadmin figure it out? Because while we could truncate the zone
> file down to the WP, this is a sign that something could be seriously
> broken?

Yes. Exactly. The figure for the file after such an error would be:

    isize
----v---------------
DDXXX
-^------------------
 WP

With the XXX sectors being garbage data, since read accesses to sectors
after a zone write pointer return zeroes, or the drive format pattern if
one is set. This also means that the "DD" data above cannot be trusted:
since we started with isize beyond the WP, we must have seen WP == isize on
mount, and with the SMR specifications, the only way to get into the
situation above is if the zone was reset and rewritten behind our back.
It is hard to decide on the best action to take here considering the simple
nature of zonefs (i.e. essentially just a simpler interface for raw block
device file accesses).

Including your comments on mount options, I came up with these actions that
the user can choose with mount options:

* repair: Truncate the inode size only, nothing else
* remount-ro (default): Truncate the inode size and remount read-only
* zone-ro: Truncate the inode size and set the inode read-only
* zone-offline: Truncate the inode size to 0 and assume that its zone is
  offline (no reads nor writes possible)

This gives, I think, a good range of possible behaviors that the user may
want, from doing almost nothing (repair) to the extreme of preventing any
access to potentially bad data (zone-offline).
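To make these actions concrete, the option handling in the error callback
boils down to something like the sketch below. This is only an illustration
of the behaviors listed above, not the final patch; the
zonefs_handle_error() name and the ZONEFS_MNTOPT_ERRORS_ZRO/ZOL macro names
are placeholders.

/*
 * Sketch only (not the actual patch): per "errors=" option actions taken
 * when zonefs_io_err_cb() finds the inode size and the zone write pointer
 * out of sync. All options fix the inode size; the stricter ones also
 * restrict further access to the bad zone or to the whole filesystem.
 */
static void zonefs_handle_error(struct inode *inode, loff_t wp_ofst)
{
        struct super_block *sb = inode->i_sb;
        struct zonefs_sb_info *sbi = ZONEFS_SB(sb);

        if (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_RO) {
                /* errors=remount-ro (default): fix isize, remount read-only */
                if (!sb_rdonly(sb))
                        sb->s_flags |= SB_RDONLY;
        } else if (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_ZRO) {
                /* errors=zone-ro: fix isize and forbid writes to this zone */
                inode->i_flags |= S_IMMUTABLE;
                inode->i_mode &= ~0222;
        } else if (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_ZOL) {
                /* errors=zone-offline: drop the zone content altogether */
                inode->i_flags |= S_IMMUTABLE;
                inode->i_mode &= ~0777;
                wp_ofst = 0;
        }
        /* errors=repair: nothing more than the inode size fix below */

        zonefs_update_stats(inode, wp_ofst);
        i_size_write(inode, wp_ofst);
}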
> (Oh, you said this in the v8 thread too.)
>
> > +         */
> > +        zonefs_warn(sb, "inode %lu: size %lld should be %lld\n",
> > +                    inode->i_ino, isize, wp_ofst);
> > +        if (isize > wp_ofst) {
> > +                struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
> > +
> > +                if ((sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_RO) &&
>
> Mount options? Hey, wait a minute, this didn't exist in v8...

Yes, this is an improvement in v9 to better handle all error cases
(indirectly suggested by Dave who pointed out deficiencies in that area).

> > +                    !sb_rdonly(sb)) {
> > +                        zonefs_warn(sb,
> > +                                "Zone %lu corruption detected, remounting fs read-only\n",
> > +                                inode->i_ino);
> > +                        sb->s_flags |= SB_RDONLY;
> > +                        return 0;
> > +                } else if (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_CONT) {
> > +                        zonefs_warn(sb,
> > +                                "Zone %lu corruption detected, continuing\n",
> > +                                inode->i_ino);
>
> I'm frankly not sure errors=continue makes sense for a filesystem. It
> exists for ext* as a crutch for the root fs to help users stumble
> towards /sbin/reboot and a full fsck afterwards.

Good point.

> Also wondering if you should have an errors=zone-ro that will set
> S_IMMUTABLE on the zone file? That would enable the intact zones to
> keep operating.

Done. And as noted above, I also added "errors=zone-offline" and
"errors=repair".

> (Or I guess if you really want a "continue" mode you could truncate the
> zone...)

That is the errors=repair option now. It is clearer this way, I think.

> > +                } else if (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_PANIC) {
>
> I don't think it's a good idea to crash the entire kernel on zone
> corruption.

I have dropped this one.

> > +                        zonefs_panic(sb,
> > +                                "Zone %lu corruption detected\n",
> > +                                inode->i_ino);
> > +                }
> > +        }
> > +
> > +        zonefs_update_stats(inode, wp_ofst);
> > +        i_size_write(inode, wp_ofst);
> > +
> > +        return 0;
> > +}
> > +
> > +/*
> > + * When an IO error occurs, check the target zone to see if there is a change
> > + * in the zone condition (e.g. offline or read-only). For a failed write to a
> > + * sequential zone, the zone write pointer position must also be checked to
> > + * eventually correct the file size and zonefs inode write pointer offset
> > + * (which can be out of sync with the drive due to partial write failures).
> > + */
> > +static void zonefs_io_error(struct inode *inode, bool write)
> > +{
> > +        struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +        struct super_block *sb = inode->i_sb;
> > +        struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
> > +        unsigned int noio_flag;
> > +        unsigned int nr_zones =
> > +                zi->i_max_size >> (sbi->s_zone_sectors_shift + SECTOR_SHIFT);
> > +        struct zonefs_ioerr_data ioerr = {
> > +                .inode = inode,
> > +                .write = write
> > +        };
> > +        int ret;
> > +
> > +        mutex_lock(&zi->i_truncate_mutex);
> > +
> > +        /*
> > +         * Memory allocations in blkdev_report_zones() can trigger a memory
> > +         * reclaim which may in turn cause a recursion into zonefs as well as
> > +         * BIO allocations for the same device. The former case may end up in
> > +         * a deadlock on the inode truncate mutex, while the latter may prevent
> > +         * forward progress with BIO allocations as we are potentially still
> > +         * holding the failed BIO. Executing the report zones under GFP_NOIO
> > +         * avoids both problems.
> > +         */
> > +        noio_flag = memalloc_noio_save();
>
> Don't you still need memalloc_nofs_ here too?

noio implies nofs, doesn't it? Or rather, noio is more restrictive than
nofs here, which is safer since we need a struct request to be able to
execute blkdev_report_zones().

> > +        ret = blkdev_report_zones(sb->s_bdev, zi->i_zsector, nr_zones,
> > +                                  zonefs_io_err_cb, &ioerr);
> > +        if (ret != nr_zones)
> > +                zonefs_err(sb, "Get inode %lu zone information failed %d\n",
> > +                           inode->i_ino, ret);
> > +        memalloc_noio_restore(noio_flag);
> > +
> > +        mutex_unlock(&zi->i_truncate_mutex);
> > +}
> > +
> > +static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size,
> > +                                        int error, unsigned int flags)
> > +{
> > +        struct inode *inode = file_inode(iocb->ki_filp);
> > +        struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +
> > +        if (error) {
> > +                zonefs_io_error(inode, true);
> > +                return error;
> > +        }
> > +
> > +        if (size && zi->i_ztype != ZONEFS_ZTYPE_CNV) {
> > +                mutex_lock(&zi->i_truncate_mutex);
> > +                if (i_size_read(inode) < iocb->ki_pos + size) {
> > +                        zonefs_update_stats(inode, iocb->ki_pos + size);
> > +                        i_size_write(inode, iocb->ki_pos + size);
> > +                }
> > +                mutex_unlock(&zi->i_truncate_mutex);
> > +        }
> > +
> > +        return 0;
> > +}
> > +
> > +static const struct iomap_dio_ops zonefs_write_dio_ops = {
> > +        .end_io = zonefs_file_write_dio_end_io,
> > +};

Unrelated to your other comments, I discovered that the end_io operation is
called with the flags argument being dio->flags. Since the flags for that
are the IOMAP_DIO_XXX flags defined in fs/iomap/direct-io.c, the flag
values are not visible to the end_io() implementation and the callback
cannot determine if the dio is a read or a write. This can be worked around
by defining one end_io op for reads and another for writes (which I did
here, see zonefs_file_read_dio_end_io()). But we could simplify the code by
simply adding the IOMAP_XXX flags passed to iomap_begin() into dio->flags
(these two sets of flags do not collide, as mentioned in
fs/iomap/direct-io.c). That would keep the interface in
include/linux/iomap.h clean (no new flags) and give more information to the
end_io() callback. With that, I could get rid of the
zonefs_file_read_dio_end_io() function and change
zonefs_file_write_dio_end_io() into zonefs_file_dio_end_io() for both read
and write operations. Less code. Thoughts?
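To illustrate the idea: assuming iomap_dio_rw() were changed to OR the
flags it passes to iomap_begin() (e.g. IOMAP_WRITE) into dio->flags before
calling ->end_io(), the single callback could look like the sketch below.
This obviously depends on that iomap change being acceptable and is not a
tested patch:

static int zonefs_file_dio_end_io(struct kiocb *iocb, ssize_t size,
                                  int error, unsigned int flags)
{
        struct inode *inode = file_inode(iocb->ki_filp);
        struct zonefs_inode_info *zi = ZONEFS_I(inode);
        /* Assumes IOMAP_WRITE is now visible in the end_io flags */
        bool is_write = flags & IOMAP_WRITE;

        if (error) {
                zonefs_io_error(inode, is_write);
                return error;
        }

        /* Advance the inode size only for successful sequential zone writes */
        if (size && is_write && zi->i_ztype != ZONEFS_ZTYPE_CNV) {
                mutex_lock(&zi->i_truncate_mutex);
                if (i_size_read(inode) < iocb->ki_pos + size) {
                        zonefs_update_stats(inode, iocb->ki_pos + size);
                        i_size_write(inode, iocb->ki_pos + size);
                }
                mutex_unlock(&zi->i_truncate_mutex);
        }

        return 0;
}

/* A single set of dio ops could then serve both the read and write paths */
static const struct iomap_dio_ops zonefs_dio_ops = {
        .end_io = zonefs_file_dio_end_io,
};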
> > +
> > +/*
> > + * Handle direct writes. For sequential zone files, this is the only possible
> > + * write path. For these files, check that the user is issuing writes
> > + * sequentially from the end of the file. This code assumes that the block layer
> > + * delivers write requests to the device in sequential order. This is always the
> > + * case if a block IO scheduler implementing the ELEVATOR_F_ZBD_SEQ_WRITE
>
> Is there any way for zonefs to detect that it's talking to an io
> scheduler that doesn't support ZBD_SEQ_WRITE and react accordingly (log
> message, refuse to mount, etc.)?

Not really. It can be done if zonefs sits directly on the bdev of a real
device, but if the block device comes from a BIO-based device mapper target
(e.g. dm-linear), then there is no scheduler for that device. Scheduling
happens on the backend device(s) in that case, and that is invisible from
the top bdev interface. Not to mention that the target may be using several
devices... Furthermore, I am trying to limit dependencies on the block
layer implementation of "sequential write guarantees" as much as possible,
since we are still trying to evolve that into something that works for any
scheduler.

> > + * elevator feature is being used (e.g. mq-deadline). The block layer always
> > + * automatically select such an elevator for zoned block devices during the
> > + * device initialization.
>
> Or is the case that the block layer knows when it's dealing with a zoned
> block device and will not allow the assignment of an ioscheduler that
> does not support ZBD_SEQ_WRITE?

Currently, for zoned block devices, the block layer will only allow setting
a scheduler that has the ZBD_SEQ_WRITE feature. The only one that does for
now is mq-deadline. Schedulers that do not support this feature will not
even be shown in /sys/block/xxx/queue/scheduler. The only exception to this
is "none", which is always allowed.

> [...]
>
> > +static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +        struct inode *inode = file_inode(iocb->ki_filp);
> > +
> > +        /* Write operations beyond the zone size are not allowed */
> > +        if (iocb->ki_pos >= ZONEFS_I(inode)->i_max_size)
> > +                return -EFBIG;
>
> This needs a check for IS_IMMUTABLE so that userspace can't write to
> zones which zonefs has decided are no longer writable, even if the
> program has a writeable file descriptor.

Done, with additional checks in zonefs_file_read_iter() for offline zones
(immutable + no reads allowed).
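For reference, the added test amounts to something like the sketch below.
The helper name is made up for this email; the conditions simply mirror
what zonefs_update_file_perm() above sets for read-only and offline zones.

/*
 * Illustration only, not the actual patch code: read-only and offline
 * zones are flagged with S_IMMUTABLE by zonefs_update_file_perm(), with
 * all access bits cleared for offline zones. Since the generic VFS write
 * path does not check S_IMMUTABLE, the read and write iter methods must
 * do it themselves.
 */
static inline int zonefs_check_zone_access(struct inode *inode, bool write)
{
        /* Offline zone: neither reads nor writes are allowed */
        if (IS_IMMUTABLE(inode) && !(inode->i_mode & 0777))
                return -EPERM;

        /* Read-only (or offline) zone: reject any write */
        if (write && IS_IMMUTABLE(inode))
                return -EPERM;

        return 0;
}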
> [...]
>
> > +/*
> > + * Check that the device is zoned. If it is, get the list of zones and create
> > + * sub-directories and files according to the device zone configuration and
> > + * format options.
> > + */
> > +static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
> > +{
> > +        struct zonefs_zone_data zd;
> > +        struct zonefs_sb_info *sbi;
> > +        struct inode *inode;
> > +        enum zonefs_ztype t;
> > +        int ret;
> > +
> > +        if (!bdev_is_zoned(sb->s_bdev)) {
> > +                zonefs_err(sb, "Not a zoned block device\n");
> > +                return -EINVAL;
> > +        }
> > +
> > +        /*
> > +         * Initialize super block information: the maximum file size is updated
> > +         * when the zone files are created so that the format option
> > +         * ZONEFS_F_AGGRCNV which increases the maximum file size of a file
> > +         * beyond the zone size is taken into account.
> > +         */
> > +        sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
> > +        if (!sbi)
> > +                return -ENOMEM;
> > +
> > +        spin_lock_init(&sbi->s_lock);
> > +        sb->s_fs_info = sbi;
> > +        sb->s_magic = ZONEFS_MAGIC;
> > +        sb->s_maxbytes = 0;
> > +        sb->s_op = &zonefs_sops;
> > +        sb->s_time_gran = 1;
> > +
> > +        /*
> > +         * The block size is set to the device physical sector size to ensure
> > +         * that write operations on 512e devices (512B logical block and 4KB
> > +         * physical block) are always aligned to the device physical blocks,
> > +         * as mandated by the ZBC/ZAC specifications.
> > +         */
> > +        sb_set_blocksize(sb, bdev_physical_block_size(sb->s_bdev));
> > +        sbi->s_blocksize_mask = sb->s_blocksize - 1;
> > +        sbi->s_zone_sectors_shift = ilog2(bdev_zone_sectors(sb->s_bdev));
> > +        sbi->s_uid = GLOBAL_ROOT_UID;
> > +        sbi->s_gid = GLOBAL_ROOT_GID;
> > +        sbi->s_perm = 0640;
> > +        sbi->s_mount_opts = ZONEFS_MNTOPT_ERRORS_RO;
> > +
> > +        ret = zonefs_read_super(sb);
> > +        if (ret)
> > +                return ret;
> > +
> > +        ret = zonefs_parse_options(sb, data);
> > +        if (ret)
> > +                return ret;
> > +
> > +        memset(&zd, 0, sizeof(struct zonefs_zone_data));
> > +        zd.sb = sb;
> > +        ret = zonefs_get_zone_info(&zd);
> > +        if (ret)
> > +                goto out;
> > +
>
> It might be a good idea to spit out an EXPERIMENTAL warning at mount
> time for the first 6 months while you, uh, seek out advanced bleeding
> edge testers to really give this code a thorough workout.
>
>         zonefs_warn(sb, "EXPERIMENTAL filesystem in use; use at your own risk");

Yes, I thought about this too but I am still wondering if it is the right
thing to do. See below.

> Or something like that to manage peoples' expectations in case you find
> a really nasty data-chomping bug. :)

Well, my view is that since zonefs does not have any run-time changing
on-disk metadata, it is not worse than the raw block device file use case
in terms of reliability. Unmounting zonefs and, ignoring the first zone of
the device that holds the superblock, using the zones directly through the
raw block device file with open/close/read/write/ioctl would give the same
level of confidence about the data in the zones. If anything, zonefs
improves on that with the various checks it adds for writes and IO errors
(fs/block_dev.c does not have anything like that for zoned block devices).

Of course I do not mean that zonefs is bug free. But I still consider the
likelihood of losing data to be equivalent to the raw block device file
case: it will mostly depend on the application doing the right thing. The
value of zonefs is in the file access interface simplification, not in
strong additional guarantees about data loss or corruption detection. So
warning about the experimental status may be too scary and discourage users
from using it, pushing them to keep developing for the raw block device
file access use case instead. I would rather encourage people to start
using zonefs now, especially considering that the upcoming NVMe ZNS will
need some additional zone specific handling (zone resource control for
writes) that is fairly easy to handle with a one-file-per-zone in-kernel FS
interface. That simplifies the application implementation even further. But
I do not have a strong feeling about it either, and I will add the warning
if you or others insist :)

> (Or as a lever to convince people to stop running old code some day...)

I am still trying to convince a lot of SMR users to move away from SG_IO
and use the kernel block layer instead. But a lot of deployments still use
enterprise distros with kernels that do not have SMR support. Once zonefs
is in the kernel, I will definitely push for its use in place of the raw
block device file interface, as that also simplifies support for various
application programming languages (e.g. SMR drive handling directly from
Java or Python).

Thank you for all your comments.

-- 
Damien Le Moal
Western Digital Research