Re: [PATCH v5 3/7] fs: iomap: Atomic write support

John Garry <john.g.garry@xxxxxxxxxx> · Fri, 30 Aug 2024 16:48:36 +0100

On 22/08/2024 21:30, Darrick J. Wong wrote:
Then, the iomap->type/flag is either IOMAP_UNWRITTEN/IOMAP_F_DIRTY or
IOMAP_MAPPED/IOMAP_F_DIRTY per iter. So the type is not consistent. However
we will set IOMAP_DIO_UNWRITTEN in dio->flags, so call xfs_dio_write_endio()
-> xfs_iomap_write_unwritten() for the complete FSB range.

Do you see a problem with this?

Sorry again for the slow response.

Please see this also for some more background:
https://urldefense.com/v3/__https://lore.kernel.org/linux- 
xfs/20240726171358.GA27612@xxxxxx/__;!!ACWV5N9M2RV99hQ! 
P5jeP96F8wAtRAblbm8NvRo8nlpil03vA26UMMX8qrYa4IzKecAAk7x1l1M45bBshC3Czxn1CkDXypNSAg$ 
Yes -- if you have a mix of written and unwritten blocks for the same
chunk of physical space:

0      7
WUWUWUWU

the directio ioend function will start four separate transactions to
convert blocks 1, 3, 5, and 7 to written status.  If the system crashes
midway through, they will see this afterwards:

WWWWW0W0

IOWs, although the*disk write* was completed successfully, the mapping
updates were torn, and the user program sees a torn write.
> > The most performant/painful way to fix this would be to make the whole
ioend completion a logged operation so that we could commit to updating
all the unwritten mappings and restart it after a crash.

could we make it logged for those special cases which we are interested 
in only?

The least performant of course is to write zeroes at allocation time,
like we do for fsdax.

That idea was already proposed:
https://lore.kernel.org/linux-xfs/ZcGIPlNCkL6EDx3Z@xxxxxxxxxxxxxxxxxxx/

A possible middle ground would be to detect IOMAP_ATOMIC in the
->iomap_begin method, notice that there are mixed mappings under the
proposed untorn IO, and pre-convert the unwritten blocks by writing
zeroes to disk and updating the mappings 

Won't that have the same issue as using XFS_BMAPI_ZERO, above i.e. 
zeroing during allocation?

before handing the one single
mapping back to iomap_dio_rw to stage the untorn writes bio.  At least
you'd only be suffering that penalty for the (probable) corner case of
someone creating mixed mappings.

BTW, one issue I have with the sub-extent(or -alloc unit) zeroing from 
v4 series is how the unwritten conversion has changed, like:

xfs_iomap_write_unwritten()
{
	unsigned int rounding;

	/* when converting anything unwritten, we must be spanning an alloc 
unit, so round up/down */
	if (rounding > 1) {
		offset_fsb = rounddown(rounding);
		count_fsb = roundup(rounding);
	}

	...
	do {
		xfs_bmapi_write();
		...
		xfs_trans_commit();
	} while ();
}

I'm not too happy with it and it seems a bit of a bodge, as I would 
rather we report the complete size written (user data and zeroes); then 
xfs_iomap_write_unwritten() would do proper individual block conversion. 
However, we do something similar for zeroing for sub-FSB writes. I am 
not sure if that is the same thing really, as we only round up to FSB 
size. Opinion?

Thanks,
John