On 25/10/2024 12:19, Ritesh Harjani (IBM) wrote:
John Garry <john.g.garry@xxxxxxxxxx> writes:
On 25/10/2024 11:35, Ritesh Harjani (IBM) wrote:
Same as mentioned above. We can't have atomic writes to get split.
This patch is just lifting the restriction of iomap to allow more than
blocksize but the mapped length should still meet iter->len, as
otherwise the writes can get split.
Sure, I get this. But I wonder why would we be getting multiple
mappings? Why cannot the FS always provide a single mapping?
FS can decide to split the mappings when it couldn't allocate a single
large mapping of the requested length. Could be due to -
- already allocated extent followed by EOF,
- already allocated extent followed by a hole
- already mapped extent followed by an extent of different type (e.g. written followed by unwritten or unwritten followed by written)
This is the sort of scenario which I am concerned with. This issue has
been discussed at length for XFS forcealign support for atomic writes.
extsize and forcealign is being worked for ext4 as well where we can
add such support, sure.
So far, the user can atomic write a single FS block regardless of
whether the extent in which it would be part of is in written or
unwritten state.
Now the rule will be to write multiple FS blocks atomically, all blocks
need to be in same written or unwritten state.
FS needs to ensure that the writes does not get torned. So for whatever reason
FS splits the mapping then we need to return an -EINVAL error to not
allow such writes to get torned. This patch just does that.
But I get your point. More below.
This oddity at least needs to be documented.
Got it. Yes, we can do that.
Better yet would be to not have this restriction.
I haven't thought of a clever way where we don't have to zero out the
rest of the unwritten mapping. With ext4 bigalloc since the entire
cluster is anyway reserved - I was thinking if we can come up with a
clever way for doing atomic writes to the entire user requested size w/o
zeroing out.
This following was main method which was being attempted:
https://lore.kernel.org/linux-fsdevel/20240429174746.2132161-15-john.g.garry@xxxxxxxxxx/
There were other ideas in different versions of the forcelign/xfs block
atomic writes series.
Zeroing out the other unwritten extent is also a cost penalty to the
user anyways.
Sure, unless we have a special inode flag to say "pre-zero the extent".
So user will anyway will have to be made aware of not to
attempt writes of fashion which can cause them such penalties.
As patch-6 mentions this is a base support for bs = ps systems for
enabling atomic writes using bigalloc. For now we return -EINVAL when we
can't allocate a continuous user requested mapping which means it won't
support operations of types 8k followed by 16k.
That's my least-preferred option.
I think better would be reject atomic writes that cover unwritten
extents always - but that boat is about to sail...
Thanks,
John