Re: [PATCH v4 02/22] iomap: Allow filesystems set IO block zeroing size

John Garry <john.g.garry@xxxxxxxxxx> · Mon, 24 Jun 2024 14:58:40 +0100




On 21/06/2024 22:18, Darrick J. Wong wrote:
On Thu, Jun 13, 2024 at 11:31:35AM +0100, John Garry wrote:
On 12/06/2024 22:32, Darrick J. Wong wrote:
unsigned int fs_block_size = i_blocksize(inode), pad;
+	u64 io_block_size = iomap->io_block_size;
I wonder, should iomap be nice and not require filesystems to set
io_block_size themselves unless they really need it?

That's what I had in v3, like:

if (iomap->io_block_size)
	io_block_size = iomap->io_block_size;
else
	io_block_size = i_block_size(inode)

but it was suggested to change that (to like what I have here).

oh, ok.  Ignore that comment, then. :)


To be clear, Dave actually suggested that change.

Anyone working on
an iomap port while this patchset is in progress may or may not remember
to add this bit if they get their port merged after atomicwrites is
merged; and you might not remember to prevent the bitrot if the reverse
order happens.

Sure, I get your point.

However, OTOH, if we check xfs_bmbt_to_iomap(), it does set all or close to
all members of struct iomap, so we are just continuing that trend, i.e. it
is the job of the FS callback to set all these members.


	u64 io_block_size = iomap->io_block_size ?: i_blocksize(inode);

   	loff_t length = iomap_length(iter);
   	loff_t pos = iter->pos;
   	blk_opf_t bio_opf;
@@ -287,6 +287,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
   	int nr_pages, ret = 0;
   	size_t copied = 0;
   	size_t orig_count;
+	unsigned int pad;
   	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
   	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
@@ -355,7 +356,14 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
   	if (need_zeroout) {
   		/* zero out from the start of the block to the write offset */
-		pad = pos & (fs_block_size - 1);
+		if (is_power_of_2(io_block_size)) {
+			pad = pos & (io_block_size - 1);
+		} else {
+			loff_t _pos = pos;
+
+			pad = do_div(_pos, io_block_size);
+		}
Please don't opencode this twice.

static unsigned int offset_in_block(loff_t pos, u64 blocksize)
{
	if (likely(is_power_of_2(blocksize)))
		return pos & (blocksize - 1);
	return do_div(pos, blocksize);
}

ok, fine


		pad = offset_in_block(pos, io_block_size);
		if (pad)
			...

Also, what happens if pos-pad points to a byte before the mapping?

It's the job of the FS to map in something aligned to io_block_size. Having
said that, I don't think we are doing that for XFS (which sets io_block_size
i_block_size(inode)), so I need to check that.

<nod>  You can only play with the mapping that the fs gave you.
If xfs doesn't give you a big enough mapping, then that's a programming
bug to WARN_ON_ONCE about and return EIO.


I think that this is pretty easy to solve by just ensuring that for an 
atomic write inode, the call xfs_direct_write_iomap_being() -> 
xfs_bmapi_read(bno, len) is such that bno and len are extent size aligned.


Naming is hard. Essentally we are trying to reuse the sub-fs block zeroing
code for sub-extent granule writes. More below.

Yeah.  For sub-fsblock zeroing we issue (chained) bios to write zeroes
to the sectors surrounding the part we're actually writing, then we're
issuing the write itself, and finally the ioend converts the mapping to
unwritten.

For untorn writes we're doing the same thing, but now on the level of
multiple fsblocks.  I guess this is all going to support a


<nod> "IO granularity" ?  For untorn writes I guess you want mappings
that are aligned to a supported untorn write granularity; for bs > ps
filesystems I guess the IO granularity is

For LBS, it's still going to be inode block size.



<still confused about why we need to do this, maybe i'll figure it out
as I go along>

This zeroing is just really required for atomic writes. The purpose is to
zero the extent granule for any write within a newly allocated granule.

Consider we have uuWu, above. If the user then attempts to write the full
16K as an atomic write, the iomap iter code would generate writes for sizes
8k, 4k, and 4k, i.e. not a single 16K write. This is not acceptable. So the
idea is to zero the full extent granule when allocated, so we have ZZWZ
after the 4k write at offset 8192, above. As such, if we then attempt this
16K atomic write, we get a single 16K BIO, i.e. there is no unwritten extent
conversion.

Wait, are we issuing zeroing writes for 0-8191 and 12288-16383, then
issuing a single atomic write for 0-16383? 

When we have uuuu and attempt the first 4k write @ offset 4k, we also 
issue zeroes for 0-8191 and 12288-16383.

But this is done synchronously.  We are leveraging the existing code to 
issue the write with the exclusive IOLOCK in 
xfs_file_dio_write_unaligned(), so no one else can access that data 
while we do that initial write+zeroing to the extent.

That won't work, because all
the bios attached to an iomap_dio are submitted and execute
asynchronously.  I think you need ->iomap_begin to do XFS_BMAPI_ZERO
allocations if the writes aren't aligned to the minimum untorn write
granularity.

I am not sure if we should be doing this only for atomic writes inodes, or
also forcealign only or RT.

I think it only applies to untorn writes because the default behavior
everywhere is is that writes can tear.


ok, fine.

Thanks,
John