On Mon, Dec 23, 2019 at 01:33:30AM +0000, Damien Le Moal wrote:
> On 2019/12/21 7:38, Darrick J. Wong wrote:
> > On Fri, Dec 20, 2019 at 03:55:27PM +0900, Damien Le Moal wrote:
> [...]
> >> +static int zonefs_inode_setattr(struct dentry *dentry, struct iattr *iattr)
> >> +{
> >> +        struct inode *inode = d_inode(dentry);
> >> +        int ret;
> >> +
> >> +        ret = setattr_prepare(dentry, iattr);
> >> +        if (ret)
> >> +                return ret;
> >> +
> >> +        if ((iattr->ia_valid & ATTR_UID &&
> >> +             !uid_eq(iattr->ia_uid, inode->i_uid)) ||
> >> +            (iattr->ia_valid & ATTR_GID &&
> >> +             !gid_eq(iattr->ia_gid, inode->i_gid))) {
> >> +                ret = dquot_transfer(inode, iattr);
> >> +                if (ret)
> >> +                        return ret;
> >> +        }
> >> +
> >> +        if (iattr->ia_valid & ATTR_SIZE) {
> >> +                /* The size of conventional zone files cannot be changed */
> >> +                if (ZONEFS_I(inode)->i_ztype == ZONEFS_ZTYPE_CNV)
> >> +                        return -EPERM;
> >> +
> >> +                ret = zonefs_seq_file_truncate(inode, iattr->ia_size);
> >> +                if (ret)
> >> +                        return ret;
> >> +        }
> >
> > /me wonders if you need to filter out ATTR_MODE changes here, at least
> > so you can't make the zone file for a readonly zone writable?
>
> Good point. Will add that to V3.
>
> > I also wonder, does an O_TRUNC open reset the zone's write pointer to
> > zero?
>
> Yes, it does. That does not differ from regular FS behavior. This is
> also consistent with the fact that a truncate(0) does exactly the same
> thing.

Ok, good, just checking. :)

> [...]
> >> +static const struct vm_operations_struct zonefs_file_vm_ops = {
> >> +        .fault          = zonefs_filemap_fault,
> >> +        .map_pages      = filemap_map_pages,
> >> +        .page_mkwrite   = zonefs_filemap_page_mkwrite,
> >> +};
> >> +
> >> +static int zonefs_file_mmap(struct file *file, struct vm_area_struct *vma)
> >> +{
> >> +        /*
> >> +         * Conventional zone files can be mmap-ed READ/WRITE.
> >> +         * For sequential zone files, only readonly mappings are possible.
> >
> > Hmm, but the code below looks like it allows private writable mappings
> > of sequential zones?
>
> It is my understanding that changes made to pages of a MAP_PRIVATE
> mapping are not written back to the underlying file, so a
> mmap(MAP_WRITE|MAP_PRIVATE) is essentially equivalent to a read-only
> mapping for the FS. Am I missing something?
>
> Not sure if it makes any sense at all to allow private writable mappings
> though, but if my assumption is correct, I do not see any reason to
> prevent them either.

<nod> You're correct, I was just checking that this is indeed the correct
behavior for zonefs. :)
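
(For the record, the behavior we are agreeing on here would look roughly
like the untested sketch below for zonefs_file_mmap(). I'm reusing names
from the patch and assuming a ZONEFS_ZTYPE_SEQ zone type exists, so the
actual hunk may well differ. Only shared writable mappings of sequential
zone files are rejected; MAP_PRIVATE writable mappings keep working, as
discussed above.)

static int zonefs_file_mmap(struct file *file, struct vm_area_struct *vma)
{
        /*
         * Conventional zone files can be mmap-ed READ/WRITE.
         * For sequential zone files, only read-only or private mappings
         * are possible, since shared dirty pages could not be written
         * back in write pointer order.
         */
        if (ZONEFS_I(file_inode(file))->i_ztype == ZONEFS_ZTYPE_SEQ &&
            (vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
                return -EINVAL;

        vma->vm_ops = &zonefs_file_vm_ops;

        return 0;
}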

> [...]
> >> +static const struct iomap_dio_ops zonefs_dio_ops = {
> >> +        .end_io         = zonefs_file_dio_write_end,
> >> +};
> >> +
> >> +static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
> >> +{
> >> +        struct inode *inode = file_inode(iocb->ki_filp);
> >> +        struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
> >> +        struct zonefs_inode_info *zi = ZONEFS_I(inode);
> >> +        size_t count;
> >> +        ssize_t ret;
> >> +
> >> +        if (iocb->ki_flags & IOCB_NOWAIT) {
> >> +                if (!inode_trylock(inode))
> >> +                        return -EAGAIN;
> >> +        } else {
> >> +                inode_lock(inode);
> >> +        }
> >> +
> >> +        ret = generic_write_checks(iocb, from);
> >> +        if (ret <= 0)
> >> +                goto out;
> >> +
> >> +        iov_iter_truncate(from, zi->i_max_size - iocb->ki_pos);
> >> +        count = iov_iter_count(from);
> >> +
> >> +        /*
> >> +         * Direct writes must be aligned to the block size, that is, the device
> >> +         * physical sector size, to avoid errors when writing sequential zones
> >> +         * on 512e devices (512B logical sector, 4KB physical sectors).
> >> +         */
> >> +        if ((iocb->ki_pos | count) & sbi->s_blocksize_mask) {
> >> +                ret = -EINVAL;
> >> +                goto out;
> >> +        }
> >> +
> >> +        /*
> >> +         * Enforce sequential writes (append only) in sequential zones.
> >> +         */
> >
> > I wonder, shouldn't zonefs require users to open sequential zones with
> > O_APPEND?  I don't see anything in here that would suggest that it does,
> > though maybe I missed something.
>
> Yes, I thought about this too but decided against it for several reasons:
> 1) Requiring O_APPEND breaks some shell command-line tools such as
> "truncate", which makes scripting (including tests) harder.

Yeah, I realized right after I sent this that you can't usually truncate
an append-only file, so O_APPEND really doesn't apply here.

> 2) Without enforcing O_APPEND, an application doing pwrite() or AIOs to
> an incorrect offset will see an error instead of potential file data
> corruption (due to the application bug, not the FS).
> 3) Since sequential zone file size is updated only on completion of
> direct IOs, O_APPEND would generate an incorrect offset for AIOs at
> queue depth bigger than 1.

ooh, good point. :)

> Thoughts?

"Heh, that was a silly point to make on my part", or "Maybe it's good
that we have these discussions on the mailing lists" :)

> [...]
> >> +static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >> +{
> >> +        struct inode *inode = file_inode(iocb->ki_filp);
> >> +
> >> +        /*
> >> +         * Check that the write operation does not go beyond the zone size.
> >> +         */
> >> +        if (iocb->ki_pos >= ZONEFS_I(inode)->i_max_size)
> >> +                return -EFBIG;
> >> +
> >> +        if (iocb->ki_flags & IOCB_DIRECT)
> >> +                return zonefs_file_dio_write(iocb, from);
> >> +
> >> +        return zonefs_file_buffered_write(iocb, from);
> >> +}
> >> +
> >> +static const struct file_operations zonefs_file_operations = {
> >> +        .open           = generic_file_open,
> >
> > Hmm, ok, so there isn't any explicit O_APPEND requirement, even though
> > it looks like the filesystem enforces one.
>
> Yes, on purpose. See above for the reasons.
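
(To make reason 2 above concrete: the "enforce sequential writes" check in
zonefs_file_dio_write() presumably just compares the write position with
the zone's write pointer, along the lines of the untested sketch below.
zi->i_wpoffset and ZONEFS_ZTYPE_SEQ are taken from the patch context, so
treat the exact form as an assumption on my part.)

        /*
         * Enforce sequential writes (append only) in sequential zones:
         * a direct write that does not start exactly at the current
         * write pointer position fails instead of corrupting data.
         */
        if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
            iocb->ki_pos != zi->i_wpoffset) {
                ret = -EINVAL;
                goto out;
        }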

> [...]
> >> +static void zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone)
> >> +{
> >> +        struct super_block *sb = inode->i_sb;
> >> +        struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
> >> +        struct zonefs_inode_info *zi = ZONEFS_I(inode);
> >> +        umode_t perm = sbi->s_perm;
> >> +
> >> +        zi->i_ztype = zonefs_zone_type(zone);
> >> +        zi->i_zsector = zone->start;
> >> +
> >> +        switch (zone->cond) {
> >> +        case BLK_ZONE_COND_OFFLINE:
> >> +                /*
> >> +                 * Disable all accesses and set the file size to 0 for
> >> +                 * offline zones.
> >> +                 */
> >> +                zi->i_wpoffset = 0;
> >> +                zi->i_max_size = 0;
> >> +                perm = 0;
> >> +                break;
> >> +        case BLK_ZONE_COND_READONLY:
> >> +                /* Do not allow writes in read-only zones */
> >> +                perm &= ~(0222); /* S_IWUGO */
> >> +                /* Fallthrough */
> >
> > You might want to set S_IMMUTABLE in i_flags here, since (I assume)
> > readonly zones are never, ever, going to be modifiable in any way?
>
> Good point. Will do.
>
> > In which case, zonefs probably shouldn't let people run 'chmod a+w' on a
> > readonly zone.  Either that or disallow mode changes via
> > zonefs_inode_setattr.
>
> Yes, will do.
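
(A sketch of what I have in mind for the read-only case, reusing the
variables above -- not the actual V3 hunk. Marking the inode S_IMMUTABLE
lets the VFS itself refuse later mode and ownership changes in
notify_change(), which covers the 'chmod a+w' case on top of dropping the
write permission bits here.)

        case BLK_ZONE_COND_READONLY:
                /*
                 * Read-only zones will never become writable again, so
                 * also mark the inode immutable: notify_change() then
                 * rejects chmod/chown attempts outright.
                 */
                inode->i_flags |= S_IMMUTABLE;
                perm &= ~(0222); /* S_IWUGO */
                /* Fallthrough */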

> [...]
> >> +static int zonefs_create_zgroup(struct zonefs_zone_data *zd,
> >> +                                enum zonefs_ztype type)
> >> +{
> >> +        struct super_block *sb = zd->sb;
> >> +        struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
> >> +        struct blk_zone *zone, *next, *end;
> >> +        char name[ZONEFS_NAME_MAX];
> >> +        struct dentry *dir;
> >> +        unsigned int n = 0;
> >> +
> >> +        /* If the group is empty, there is nothing to do */
> >> +        if (!zd->nr_zones[type])
> >> +                return 0;
> >> +
> >> +        dir = zonefs_create_inode(sb->s_root, zgroups_name[type], NULL);
> >> +        if (!dir)
> >> +                return -ENOMEM;
> >> +
> >> +        /*
> >> +         * The first zone contains the super block: skip it.
> >> +         */
> >> +        end = zd->zones + blkdev_nr_zones(sb->s_bdev->bd_disk);
> >> +        for (zone = &zd->zones[1]; zone < end; zone = next) {
> >> +
> >> +                next = zone + 1;
> >> +                if (zonefs_zone_type(zone) != type)
> >> +                        continue;
> >> +
> >> +                /*
> >> +                 * For conventional zones, contiguous zones can be aggregated
> >> +                 * together to form larger files.
> >> +                 * Note that this overwrites the length of the first zone of
> >> +                 * the set of contiguous zones aggregated together.
> >> +                 * Only zones with the same condition can be aggregated so that
> >> +                 * offline zones are excluded and readonly zones are aggregated
> >> +                 * together into a read only file.
> >> +                 */
> >> +                if (type == ZONEFS_ZTYPE_CNV &&
> >> +                    sbi->s_features & ZONEFS_F_AGGRCNV) {
> >
> > This probably needs parentheses around the flag check, e.g.
> >
> >                 if (type == ZONEFS_ZTYPE_CNV &&
> >                     (sbi->s_features & ZONEFS_F_AGGRCNV)) {
>
> gcc does not complain but I agree. It is cleaner and older gcc versions
> will also probably be happier :)
>
> [...]
> >> +
> >> +static int zonefs_get_zone_info(struct zonefs_zone_data *zd)
> >> +{
> >> +        struct block_device *bdev = zd->sb->s_bdev;
> >> +        int ret;
> >> +
> >> +        zd->zones = kvcalloc(blkdev_nr_zones(bdev->bd_disk),
> >> +                             sizeof(struct blk_zone), GFP_KERNEL);
> >
> > Hmm, so one 64-byte blk_zone structure for each zone on the disk?
> >
> > I have a 14TB SMR disk with ~459,000x 32M zones on it.  That's going to
> > require a contiguous 30MB memory allocation to hold all the zone
> > information.  Even your 15T drive from the commit message will need a
> > contiguous 3.8MB memory allocation for all the zone info.
> >
> > I wonder if each zone should really be allocated separately and then
> > indexed with an xarray or something like that to reduce the chance of
> > failure when memory is fragmented or tight.
> >
> > That could be subsequent work though, since in the meantime that just
> > makes zonefs mounts more likely to run out of memory and fail.  I
> > suppose you don't hang on to the huge allocation for very long.
>
> No, this memory allocation is only for mount. It is dropped as soon as
> all the zone file inodes are created. Furthermore, this allocation is
> done with kvcalloc(), not kmalloc(), so there is no physical memory
> contiguity requirement. It is only an array of structures and is not
> used as the IO buffer for the zone report itself.
>
> I debated optimizing this (that is, reducing the temporary memory use at
> mount time) by processing the mount in small chunks of zones instead of
> all zones in one go. I kept the simple, but rather brutal, approach to
> keep the code simple. This can be rewritten and optimized at any time if
> we see problems appearing.

<nod> vmalloc space is quite limited on 32-bit platforms, so that's the
most likely place you'll get complaints.

> >
> >> +        if (!zd->zones)
> >> +                return -ENOMEM;
> >> +
> >> +        /* Get zones information */
> >> +        ret = blkdev_report_zones(bdev, 0, BLK_ALL_ZONES,
> >> +                                  zonefs_get_zone_info_cb, zd);
> >> +        if (ret < 0) {
> >> +                zonefs_err(zd->sb, "Zone report failed %d\n", ret);
> >> +                return ret;
> >> +        }
> >> +
> >> +        if (ret != blkdev_nr_zones(bdev->bd_disk)) {
> >> +                zonefs_err(zd->sb, "Invalid zone report (%d/%u zones)\n",
> >> +                           ret, blkdev_nr_zones(bdev->bd_disk));
> >> +                return -EIO;
> >> +        }
> >> +
> >> +        return 0;
> >> +}
> >> +
> >> +static inline void zonefs_cleanup_zone_info(struct zonefs_zone_data *zd)
> >> +{
> >> +        kvfree(zd->zones);
> >> +}
> >> +
> >> +/*
> >> + * Read super block information from the device.
> >> + */
> >> +static int zonefs_read_super(struct super_block *sb)
> >> +{
> >> +        struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
> >> +        struct zonefs_super *super;
> >> +        u32 crc, stored_crc;
> >> +        struct page *page;
> >> +        struct bio_vec bio_vec;
> >> +        struct bio bio;
> >> +        int ret;
> >> +
> >> +        page = alloc_page(GFP_KERNEL);
> >> +        if (!page)
> >> +                return -ENOMEM;
> >> +
> >> +        bio_init(&bio, &bio_vec, 1);
> >> +        bio.bi_iter.bi_sector = 0;
> >> +        bio_set_dev(&bio, sb->s_bdev);
> >> +        bio_set_op_attrs(&bio, REQ_OP_READ, 0);
> >> +        bio_add_page(&bio, page, PAGE_SIZE, 0);
> >> +
> >> +        ret = submit_bio_wait(&bio);
> >> +        if (ret)
> >> +                goto out;
> >> +
> >> +        super = page_address(page);
> >> +
> >> +        stored_crc = super->s_crc;
> >> +        super->s_crc = 0;
> >> +        crc = crc32_le(ZONEFS_MAGIC, (unsigned char *)super,
> >> +                       sizeof(struct zonefs_super));
> >
> > Unusual; usually crc32 computations are seeded with ~0U, but <shrug>.
>
> No strong opinion on this one. I will change to ~0U to follow the
> general convention.

Ok.

> > Anyway, this looks to be in decent shape now, modulo other comments.
>
> Thank you for your comments. Sending a V3.

Ok, I'll flip over to that thread now.

--D

> -- 
> Damien Le Moal
> Western Digital Research