On Wed 22-03-23 12:04:01, Ritesh Harjani wrote:
> Jan Kara <jack@xxxxxxx> writes:
> >> +	pos += size;
> >> +	if (pos > i_size_read(inode))
> >> +		i_size_write(inode, pos);
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static const struct iomap_dio_ops ext2_dio_write_ops = {
> >> +	.end_io = ext2_dio_write_end_io,
> >> +};
> >> +
> >> +static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >> +{
> >> +	struct file *file = iocb->ki_filp;
> >> +	struct inode *inode = file->f_mapping->host;
> >> +	ssize_t ret;
> >> +	unsigned int flags;
> >> +	unsigned long blocksize = inode->i_sb->s_blocksize;
> >> +	loff_t offset = iocb->ki_pos;
> >> +	loff_t count = iov_iter_count(from);
> >> +
> >> +
> >> +	inode_lock(inode);
> >> +	ret = generic_write_checks(iocb, from);
> >> +	if (ret <= 0)
> >> +		goto out_unlock;
> >> +	ret = file_remove_privs(file);
> >> +	if (ret)
> >> +		goto out_unlock;
> >> +	ret = file_update_time(file);
> >> +	if (ret)
> >> +		goto out_unlock;
> >> +
> >> +	/*
> >> +	 * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw()
> >> +	 * calls for generic_write_sync in iomap_dio_complete().
> >> +	 * Since ext2_fsync must be called w/o inode lock,
> >> +	 * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync()
> >> +	 * ourselves.
> >> +	 */
> >> +	flags = IOMAP_DIO_NOSYNC;
> >
> > Meh, this is kind of ugly and we should come up with something better for
> > simple filesystems so that they don't have to play these games. Frankly,
> > these days I doubt there's anybody really needing inode_lock in
> > __generic_file_fsync(). Neither sync_mapping_buffers() nor
> > sync_inode_metadata() need inode_lock for their self-consistency. So it is
> > only about flushing more consistent set of metadata to disk when fsync(2)
> > races with other write(2)s to the same file so after a crash we have higher
> > chances of seeing some real state of the file.
> > But I'm not sure it's really
> > worth keeping for filesystems that are still using sync_mapping_buffers().
> > People that care about consistency after a crash have IMHO moved to other
> > filesystems long ago.
> >
>
> One way which hch is suggesting is to use __iomap_dio_rw() -> unlock
> inode -> call generic_write_sync(). I haven't yet worked on this part.

So I see two problems with what Christoph suggests:

a) It is unfortunate API design to require trivial (and low maintenance)
   filesystems to do these relatively complex locking games. But this can
   be solved by providing an appropriate wrapper for them I guess.

b) When you unlock the inode, other stuff can happen with the inode. And
   e.g. the i_size update needs to happen after IO is completed, so
   filesystems would have to be taught to avoid, say, two racing expanding
   writes. That's IMHO really too much to ask.

> Are you suggesting to rip off inode_lock from __generic_file_fsync()?
> Won't it have a much larger implications?

Yes and yes :). But inode writeback already happens from other paths
without inode_lock so there's hardly any surprise there.
sync_mapping_buffers() is impossible to "customize" by filesystems and the
generic code is fine without inode_lock. So I have a hard time imagining how
any filesystem would really depend on inode_lock in this path (famous last
words ;)).

> >> +	if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) ||
> >> +	   (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize)))
> >> +		flags |= IOMAP_DIO_FORCE_WAIT;
> >> +
> >> +	ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops,
> >> +			   flags, NULL, 0);
> >> +
> >> +	if (ret == -ENOTBLK)
> >> +		ret = 0;
> >
> > So iomap_dio_rw() doesn't have the DIO_SKIP_HOLES behavior of
> > blockdev_direct_IO(). Thus you have to implement that in your
> > ext2_iomap_ops, in particular in iomap_begin...
> >
>
> Aah yes.
> Thanks for pointing that out -
> ext2_iomap_begin() should have something like this -
> 	/*
> 	 * We cannot fill holes in indirect tree based inodes as that could
> 	 * expose stale data in the case of a crash. Use the magic error code
> 	 * to fallback to buffered I/O.
> 	 */
>
> Also I think ext2_iomap_end() should also handle a case like in ext4 -
>
> 	/*
> 	 * Check to see whether an error occurred while writing out the data to
> 	 * the allocated blocks. If so, return the magic error code so that we
> 	 * fallback to buffered I/O and attempt to complete the remainder of
> 	 * the I/O. Any blocks that may have been allocated in preparation for
> 	 * the direct I/O will be reused during buffered I/O.
> 	 */
> 	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
> 		return -ENOTBLK;
>
>
> I am wondering if we have testcases in xfstests which really tests these
> functionalities also or not? Let me give it a try...
> ... So I did and somehow couldn't find any testcase which fails w/o
> above changes.

I guess we don't. It isn't that simple (but certainly possible) to test for
stale data exposure...

> Another query -
>
> We have this function ext2_iomap_end() (pasted below)
> which calls ext2_write_failed().
>
> Here IMO two cases are possible -
>
> 1. written is 0. which means an error has occurred.
> In that case calling ext2_write_failed() make sense.
>
> 2. consider a case where written > 0 && written < length.
> (This is possible right?). In that case we still go and call
> ext2_write_failed(). This function will truncate the pagecache and disk
> blocks beyond i_size. Now we haven't yet updated inode->i_size (we do
> that in ->end_io which gets called in the end during completion)
> So that means it just removes everything.
>
> Then in ext2_dax_write_iter(), we might go and update inode->i_size
> to iocb->ki_pos including for short writes.
> This looks like it isn't
> consistent because earlier we had destroyed all the blocks for the short
> writes and we will be returning ret > 0 to the user saying these many
> bytes have been written.
> Again I haven't yet found a test case at least not in xfstests which
> can trigger this short writes. Let me know your thoughts on this.
> All of this lies on the fact that there can be a case where
> written > 0 && written < length. I will read more to see if this even
> happens or not. But I at least wanted to capture this somewhere.

So as far as I remember, direct IO writes as implemented in iomap are
all-or-nothing (see iomap_dio_complete()). But it would be good to assert
that in ext4 code to avoid surprises if the generic code changes.

> Another thing -
> In dax while truncating the inode i_size in ext2_setsize(),
> I think we don't properly call dax_zero_blocks() when we are trying to
> zero the last block beyond EOF. i.e. for e.g. it can be called with len
> as 0 if newsize is page_aligned. It then will call ext2_get_blocks() with
> len = 0 and can bug_on at maxblocks == 0.

How will it call ext2_get_blocks() with len == 0? AFAICS iomap_iter() will
not call iomap_begin() at all if iter.len == 0.

> I think it should be this. I will spend some more time analyzing this
> and also test it once against DAX paths.
>
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 7ff669d0b6d2..cc264b1e288c 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -1243,9 +1243,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
>  	inode_dio_wait(inode);
>
>  	if (IS_DAX(inode))
> -		error = dax_zero_range(inode, newsize,
> -				       PAGE_ALIGN(newsize) - newsize, NULL,
> -				       &ext2_iomap_ops);
> +		error = dax_truncate_page(inode, newsize, NULL,
> +					  &ext2_iomap_ops);
>  	else
>  		error = block_truncate_page(inode->i_mapping,
>  					    newsize, ext2_get_block);

That being said this is indeed a nice cleanup.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
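
The fallback rules discussed in this thread (do not fill holes from a direct
write, fall back to buffered I/O when nothing was written, and assert that
direct writes are all-or-nothing) can be sketched in user space roughly as
below. This is only an illustration under stated assumptions: every sk_ name,
the flag values and struct sk_map_req are hypothetical stand-ins rather than
kernel APIs, and a real ext2_iomap_begin()/ext2_iomap_end() would look
different. Unlike the quoted ext4 check, the sketch treats a request as a
direct write only when both flag bits are set.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef ENOTBLK
#define ENOTBLK 15	/* fallback value for platforms lacking ENOTBLK */
#endif

/* Hypothetical stand-ins for the kernel's IOMAP_* flags; values are arbitrary. */
#define SK_IOMAP_WRITE	(1u << 0)
#define SK_IOMAP_DIRECT	(1u << 1)

/* Hypothetical, simplified description of one mapping request. */
struct sk_map_req {
	unsigned int flags;	/* SK_IOMAP_* bits */
	bool is_hole;		/* no block is allocated at this offset yet */
};

/* Assumption: a "direct write" is a request with both bits set. */
static bool sk_is_dio_write(unsigned int flags)
{
	return (flags & SK_IOMAP_WRITE) && (flags & SK_IOMAP_DIRECT);
}

/*
 * Begin step: a direct write must not allocate blocks for a hole, since a
 * crash could then expose stale data. Return the "magic" error code so the
 * caller falls back to buffered I/O.
 */
static int sk_iomap_begin(const struct sk_map_req *req)
{
	if (sk_is_dio_write(req->flags) && req->is_hole)
		return -ENOTBLK;
	return 0;
}

/*
 * End step: direct writes are expected to be all-or-nothing, so assert that
 * and request the buffered fallback when nothing was written.
 */
static int sk_iomap_end(unsigned int flags, size_t length, size_t written)
{
	if (!sk_is_dio_write(flags))
		return 0;

	/* Catch surprises if the all-or-nothing behavior ever changes. */
	assert(written == 0 || written == length);

	if (written == 0)
		return -ENOTBLK;
	return 0;
}
```

A caller in this sketch would treat -ENOTBLK from either step as "retry the
remainder through the buffered path", mirroring the `if (ret == -ENOTBLK)
ret = 0;` handling in the quoted patch.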