Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

Ritesh Harjani <riteshh@xxxxxxxxxxxxx> · Tue, 17 Sep 2019 14:30:15 +0530




Hello,

On 9/17/19 4:07 AM, Matthew Bobrowski wrote:
On Mon, Sep 16, 2019 at 05:12:48AM -0700, Christoph Hellwig wrote:
On Thu, Sep 12, 2019 at 09:04:46PM +1000, Matthew Bobrowski wrote:
@@ -213,12 +214,16 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
  	struct inode *inode = file_inode(iocb->ki_filp);
  	ssize_t ret;
  
+	if (unlikely(IS_IMMUTABLE(inode)))
+		return -EPERM;
+
  	ret = generic_write_checks(iocb, from);
  	if (ret <= 0)
  		return ret;
  
-	if (unlikely(IS_IMMUTABLE(inode)))
-		return -EPERM;
+	ret = file_modified(iocb->ki_filp);
+	if (ret)
+		return 0;
  
  	/*
  	 * If we have encountered a bitmap-format file, the size limit

Independent of the error return issue you probably want to split
modifying ext4_write_checks into a separate preparation patch.

Provided there are no objections to the possible performance change this
separate preparation patch introduces (the overhead of calling
file_remove_privs()/file_update_time() twice), I have no issue with doing so.

+/*
+ * For a write that extends the inode size, ext4_dio_write_iter() will
+ * wait for the write to complete. Consequently, operations performed
+ * within this function are still covered by the inode_lock(). On
+ * success, this function returns 0.
+ */
+static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
+				 unsigned int flags)
+{
+	int ret;
+	loff_t offset = iocb->ki_pos;
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	if (error) {
+		ret = ext4_handle_failed_inode_extension(inode, offset + size);
+		return ret ? ret : error;
+	}

Just a personal opinion, but I find the use of the ternary operator
here a little weird.

A plain old:

	ret = ext4_handle_failed_inode_extension(inode, offset + size);
	if (ret)
		return ret;
	return error;

flows much easier.

Agree, much cleaner.

+	if (!inode_trylock(inode)) {
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+		inode_lock(inode);
+	}
+
+	if (!ext4_dio_checks(inode)) {
+		inode_unlock(inode);
+		/*
+		 * Fallback to buffered IO if the operation on the
+		 * inode is not supported by direct IO.
+		 */
+		return ext4_buffered_write_iter(iocb, from);

I think you want to lift the locking into the caller of this function
so that you don't have to unlock and relock for the buffered write
fallback.

I don't exactly know what you really mean by "lift the locking into the caller
of this function". I'm interpreting that as moving the inode_unlock()
operation into ext4_buffered_write_iter(), but I can't see how that would be
any different from doing it directly here? Wouldn't this also run the risk of
the locks becoming unbalanced as we'd need to add checks around whether the
resource is being contended? Maybe I'm misunderstanding something here...
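
One possible reading of the suggestion, as a userspace sketch rather than
ext4 code: the function that dispatches the write takes the lock once, and
both the direct IO path and the buffered fallback run with it already held,
so the fallback never has to unlock and relock. Every model_* name below is a
hypothetical stand-in (nothing here is from the patch), and the IOCB_NOWAIT
trylock case is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for an inode: lock state plus counters. */
struct model_inode {
	bool locked;
	int lock_acquisitions;	/* how many times the lock was taken */
	bool dio_supported;	/* stand-in for the ext4_dio_checks() result */
};

enum model_path { MODEL_BUFFERED = 1, MODEL_DIRECT = 2 };

static void model_lock(struct model_inode *inode)
{
	assert(!inode->locked);
	inode->locked = true;
	inode->lock_acquisitions++;
}

static void model_unlock(struct model_inode *inode)
{
	assert(inode->locked);
	inode->locked = false;
}

/* Buffered write body; the caller must already hold the lock. */
static enum model_path model_buffered_write_locked(struct model_inode *inode)
{
	assert(inode->locked);
	return MODEL_BUFFERED;
}

/* Direct IO body; falls back to buffered IO without touching the lock. */
static enum model_path model_dio_write_locked(struct model_inode *inode)
{
	assert(inode->locked);
	if (!inode->dio_supported)
		return model_buffered_write_locked(inode);
	return MODEL_DIRECT;
}

/* The caller owns the lock, so lock and unlock stay balanced in one place. */
static enum model_path model_file_write_iter(struct model_inode *inode)
{
	enum model_path path;

	model_lock(inode);
	path = model_dio_write_locked(inode);
	model_unlock(inode);
	return path;
}
```

In this shape the lock and unlock sit in a single function, so the balance
concern raised above would not arise; whether that maps cleanly onto the real
ext4 call chain is a separate question.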

+	if (offset + count > i_size_read(inode) ||
+	    offset + count > EXT4_I(inode)->i_disksize) {
+		ext4_update_i_disksize(inode, inode->i_size);
+		extend = true;

Doesn't the ext4_update_i_disksize need to be under an open journal
handle?

After all, it is a metadata update, which should go through an open journal
handle.

Hmmm, this does look like a race. But I am not sure whether it is caused
only by updating i_disksize without an open journal handle.


Suppose we have a delayed-allocation buffered write to a file. In that
case we first update only inode->i_size, and i_disksize is updated later,
at writeback time (i.e. during block allocation). If ext4_dio_write_iter()
is then called while offset + count > i_disksize, we end up calling
ext4_update_i_disksize().

Now suppose writeback fails for some reason and the system crashes during
the DIO write, after the blocks have been allocated. On reboot we may then
have an inconsistent inode, since we did not add the inode to the orphan
list before updating i_disksize, and journal replay may not succeed.
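
To make the first half of that scenario concrete, here is a small userspace
model of just the two size fields. It is only a sketch under my own
assumptions: the model_* names are hypothetical, and locking, journalling,
the orphan list and any page cache flushing done by the real DIO path are not
modelled. The only thing taken from the patch is the comparison in
model_dio_extends(), which mirrors the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two sizes ext4 tracks for an inode. */
struct model_inode {
	long long i_size;	/* in-memory file size */
	long long i_disksize;	/* size recorded in the on-disk inode */
};

/* Delayed-allocation buffered write: only i_size moves forward. */
static void model_delalloc_buffered_write(struct model_inode *inode,
					  long long offset, long long count)
{
	if (offset + count > inode->i_size)
		inode->i_size = offset + count;
}

/* Writeback allocates the blocks and lets i_disksize catch up. */
static void model_writeback(struct model_inode *inode)
{
	if (inode->i_size > inode->i_disksize)
		inode->i_disksize = inode->i_size;
}

/* Same comparison as the "extend" check in the quoted hunk. */
static bool model_dio_extends(const struct model_inode *inode,
			      long long offset, long long count)
{
	return offset + count > inode->i_size ||
	       offset + count > inode->i_disksize;
}
```

In this model, a DIO write that stays within i_size is still treated as
extending for as long as the earlier buffered write has not been written
back, which is the window I am asking about.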

1. Can the above actually happen? I still cannot work out the
   race/inconsistency completely.
2. Can you please explain in which other cases it is necessary to call
   ext4_update_i_disksize() in the DIO write path?
3. When will i_disksize be out of sync with i_size during DIO writes?


-ritesh