John Garry <john.g.garry@xxxxxxxxxx> writes: > On 25/10/2024 12:19, Ritesh Harjani (IBM) wrote: >> John Garry <john.g.garry@xxxxxxxxxx> writes: >> >>> On 25/10/2024 11:35, Ritesh Harjani (IBM) wrote: >>>>>> Same as mentioned above. We can't have atomic writes to get split. >>>>>> This patch is just lifting the restriction of iomap to allow more than >>>>>> blocksize but the mapped length should still meet iter->len, as >>>>>> otherwise the writes can get split. >>>>> Sure, I get this. But I wonder why would we be getting multiple >>>>> mappings? Why cannot the FS always provide a single mapping? >>>> FS can decide to split the mappings when it couldn't allocate a single >>>> large mapping of the requested length. Could be due to - >>>> - already allocated extent followed by EOF, >>>> - already allocated extent followed by a hole >>>> - already mapped extent followed by an extent of different type (e.g. written followed by unwritten or unwritten followed by written) >>> >>> This is the sort of scenario which I am concerned with. This issue has >>> been discussed at length for XFS forcealign support for atomic writes. >> >> extsize and forcealign is being worked for ext4 as well where we can >> add such support, sure. >> >>> >>> So far, the user can atomic write a single FS block regardless of >>> whether the extent in which it would be part of is in written or >>> unwritten state. >>> >>> Now the rule will be to write multiple FS blocks atomically, all blocks >>> need to be in same written or unwritten state. >> >> FS needs to ensure that the writes does not get torned. So for whatever reason >> FS splits the mapping then we need to return an -EINVAL error to not >> allow such writes to get torned. This patch just does that. >> >> But I get your point. More below. >> >>> >>> This oddity at least needs to be documented. >> >> Got it. Yes, we can do that. >> >>> >>> Better yet would be to not have this restriction. >>> >> >> I haven't thought of a clever way where we don't have to zero out the >> rest of the unwritten mapping. With ext4 bigalloc since the entire >> cluster is anyway reserved - I was thinking if we can come up with a >> clever way for doing atomic writes to the entire user requested size w/o >> zeroing out. > > This following was main method which was being attempted: > > https://lore.kernel.org/linux-fsdevel/20240429174746.2132161-15-john.g.garry@xxxxxxxxxx/ > > There were other ideas in different versions of the forcelign/xfs block > atomic writes series. > >> >> Zeroing out the other unwritten extent is also a cost penalty to the >> user anyways. > > Sure, unless we have a special inode flag to say "pre-zero the extent". > >> So user will anyway will have to be made aware of not to >> attempt writes of fashion which can cause them such penalties. >> >> As patch-6 mentions this is a base support for bs = ps systems for >> enabling atomic writes using bigalloc. For now we return -EINVAL when we >> can't allocate a continuous user requested mapping which means it won't >> support operations of types 8k followed by 16k. >> > > That's my least-preferred option. > > I think better would be reject atomic writes that cover unwritten > extents always - but that boat is about to sail... That's what this patch does. For whatever reason if we couldn't allocate a single contiguous region of requested size for atomic write, then we reject the request always, isn't it. Or maybe I didn't understand your comment. If others prefer - we can maybe add such a check (e.g. ext4_dio_atomic_write_checks()) for atomic writes in ext4_dio_write_checks(), similar to how we detect overwrites case to decide whether we need a read v/s write semaphore. So this can check if the user has a partially allocated extent for the user requested region and if yes, we can return -EINVAL from ext4_dio_write_iter() itself. I think this maybe better option than waiting until ->iomap_begin(). This might also bring all atomic write constraints to be checked in one place i.e. during ext4_file_write_iter() itself. Thoughts? -ritesh