On Wed 28-08-24 10:06:46, yangerkun wrote: > > > 在 2024/8/28 1:08, Jan Kara 写道: > > On Tue 20-08-24 22:06:57, yangerkun wrote: > > > From: yangerkun <yangerkun@xxxxxxxxxx> > > > > Thanks for debugging this. Couple of spelling fixes first: > > Hi, > > Thank you for your patient review! > > > > Any extended write for ext4 requires the inode to be placed on the > > ^^^ extending > > > > > orphan list before the actual write. In addition, the inode can be > > > actually removed from the orphan list only after all writes are > > > completed. Otherwise, those overcommitted blocks (If the allocated > > ^^ I'd phrase this: Otherwise we'd leave allocated blocks > > beyond i_disksize if we could not copy all the data into allocated block > > and e2fsck would complain. > > > > > blocks are not written due to certain reasons, the inode size does not > > > exceed the offset of these blocks) The leak status is always retained, > > > and fsck reports an alarm for this scenario. > > > > > > Currently, the dio and buffer IO comply with this logic. However, the > > ^^ buffered > > > > BTW: The only reason why direct IO doesn't have this problem is because > > we don't do short writes for direct IO. We either submit all or we return > > error. > > Yeah. In fact, the first version in my mind is same as this, don't do > short write for dax too. But thinking deeper, it seems better to keep > the blocks that has been successfully written... > > > > > > > dax write will removed the inode from orphan list since > > ^^^ remove ^ the orphan ... > > > > > ext4_handle_inode_extension is unconditionally called during extend > > ^^ extending > > > > > write. Fix it with this patch. We open the code from > > > ext4_handle_inode_extension since we want to keep the blocks valid > > > has been allocated and write success. > > > > > > Signed-off-by: yangerkun <yangerkun@xxxxxxxxxx> > > > --- > > > fs/ext4/file.c | 35 +++++++++++++++++++++++++++++++---- > > > 1 file changed, 31 insertions(+), 4 deletions(-) > > > > > > diff --git a/fs/ext4/file.c b/fs/ext4/file.c > > > index be061bb64067..fd8597eef75e 100644 > > > --- a/fs/ext4/file.c > > > +++ b/fs/ext4/file.c > > > @@ -628,11 +628,12 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from) > > > static ssize_t > > > ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from) > > > { > > > - ssize_t ret; > > > + ssize_t ret, written; > > > size_t count; > > > loff_t offset; > > > handle_t *handle; > > > bool extend = false; > > > + bool need_trunc = true; > > > struct inode *inode = file_inode(iocb->ki_filp); > > > if (iocb->ki_flags & IOCB_NOWAIT) { > > > @@ -668,10 +669,36 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from) > > > ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops); > > > - if (extend) { > > > - ret = ext4_handle_inode_extension(inode, offset, ret); > > > - ext4_inode_extension_cleanup(inode, ret < (ssize_t)count); > > > + if (!extend) > > > + goto out; > > > + > > > + if (ret <= 0) > > > + goto err_trunc; > > > + > > > + written = ret; > > > + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2); > > > + if (IS_ERR(handle)) { > > > + ret = PTR_ERR(handle); > > > + goto err_trunc; > > > } > > > + > > > + if (ext4_update_inode_size(inode, offset + written)) { > > > + ret = ext4_mark_inode_dirty(handle, inode); > > > + if (unlikely(ret)) { > > > + ext4_journal_stop(handle); > > > + goto err_trunc; > > > + } > > > + } > > > + > > > + if (written == count) > > > + need_trunc = false; > > > + > > > + if (inode->i_nlink) > > > + ext4_orphan_del(handle, inode); > > > > Why did you keep ext4_orphan_del() here? I thought the whole point of this > > Sorry, I make a mistake here, there should be a truncate before. Thanks > for point out this! > > > patch is to avoid it? In fact, rather then opencoding > > ext4_handle_inode_extension() I'd add argument to > > ext4_handle_inode_extension() like: > > > > ext4_handle_inode_extension(inode, pos, written, allocated) > > > > and remove inode from the orphan list only if written == allocated. The > > call site in ext4_dio_write_end_io() would call: > > > > /* > > * For DIO we don't do partial writes so we must have submitted all > > * that was allocated. > > */ > > return ext4_handle_inode_extension(inode, pos, size, size); > > > > and the call site in ext4_dax_write_iter() would call: > > > > ret = ext4_handle_inode_extension(inode, offset, ret, count); > > > > What do you think? > > Great! This seems more clearly and I think it should works too. Whould I > send a v2 patch for this? Yes, please! If you could send a fix, that would be great. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR