Re: [PATCH 1/4] iomap: Lift blocksize restriction on atomic writes

John Garry <john.g.garry@xxxxxxxxxx> · Thu, 5 Dec 2024 10:52:50 +0000

On 04/12/2024 20:35, Dave Chinner wrote:
On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote:
From: "Ritesh Harjani (IBM)" <ritesh.list@xxxxxxxxx>

Filesystems like ext4 can submit writes in multiples of blocksizes.
But we still can't allow the writes to be split into multiple BIOs. Hence
let's check if the iomap_length() is same as iter->len or not.

It is the responsibility of userspace to ensure that a write does not span
mixed unwritten and mapped extents (which would lead to multiple BIOs).

How is "userspace" supposed to do this?

If an atomic write spans mixed unwritten and mapped extents, then it 
should manually zero the unwritten extents beforehand.


No existing utility in userspace is aware of atomic write limits or
rtextsize configs, so how does "userspace" ensure everything is
laid out in a manner compatible with atomic writes?

e.g. restoring a backup (or other disaster recovery procedures) is
going to have to lay the files out correctly for atomic writes.
backup tools often sparsify the data set and so what gets restored
will not have the same layout as the original data set...

I am happy to support whatever is needed to make atomic writes work over
mixed extents if that is really an expected use case and it is a pain 
for an application writer/admin to deal with this (by manually zeroing 
extents).

JFYI, I did originally support the extent pre-zeroing for this. That was 
to support a real-life scenario which we saw where we were attempting 
atomic writes over mixed extents. The mixed extents were coming from 
userspace punching holes and then attempting an atomic write over that 
space. However that was using an early experimental and buggy 
forcealign; it was buggy as it did not handle punching holes properly - 
it punched out single blocks and not only full alloc units.


Where's the documentation that outlines all the restrictions on
userspace behaviour to prevent this sort of problem being triggered?

I would provide a man page update.

Common operations such as truncate, hole punch,

So how would punch hole be a problem? The atomic write unit max is 
limited by the alloc unit, and we can only punch out full alloc units.

buffered writes,
reflinks, etc will trip over this, so application developers, users
and admins really need to know what they should be doing to avoid
stepping on this landmine...

If this is not a real-life scenario which we expect to see, then I don't 
see why we would add the complexity to the kernel for this.

My motivation for atomic writes support is to support atomically writing 
large database internal page size. If the database only writes at a 
fixed internal page size, then we should not see mixed mappings.

But you see potential problems elsewhere ..


Further to that, what is the correction process for users to get rid
of this error? They'll need some help from an atomic write
constraint aware utility that can resilver the file such that these
failures do not occur again. Can xfs_fsr do this? Or maybe the new
exchangr-range code? Or does none of this infrastructure yet exist?

Nothing exists yet.


Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
jpg: tweak commit message
Signed-off-by: John Garry <john.g.garry@xxxxxxxxxx>
---
  fs/iomap/direct-io.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b521eb15759e..3dd883dd77d2 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -306,7 +306,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
  	size_t copied = 0;
  	size_t orig_count;
  
-	if (atomic && length != fs_block_size)
+	if (atomic && length != iter->len)
  		return -EINVAL;

Given this is now rejecting IOs that are otherwise well formed from
userspace, this situation needs to have a different errno now. The
userspace application has not provided an invalid set of
IO parameters for this IO - the IO has been rejected because
the previously set up persistent file layout was screwed up by
something in the past.

i.e. This is not an application IO submission error anymore, hence
EINVAL is the wrong error to be returning to userspace here.


Understood, but let's come back to this (if needed..).

Thanks,
John