Re: [PATCH v5 3/7] fs: iomap: Atomic write support

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Fri, 30 Aug 2024 16:56:21 -0700

On Fri, Aug 30, 2024 at 04:48:36PM +0100, John Garry wrote:
> On 22/08/2024 21:30, Darrick J. Wong wrote:
> > > Then, the iomap->type/flag is either IOMAP_UNWRITTEN/IOMAP_F_DIRTY or
> > > IOMAP_MAPPED/IOMAP_F_DIRTY per iter. So the type is not consistent. However
> > > we will set IOMAP_DIO_UNWRITTEN in dio->flags, so call xfs_dio_write_endio()
> > > -> xfs_iomap_write_unwritten() for the complete FSB range.
> > > 
> > > Do you see a problem with this?
> 
> Sorry again for the slow response.
> 
> > > 
> > > Please see this also for some more background:
> > > https://urldefense.com/v3/__https://lore.kernel.org/linux-
> > > xfs/20240726171358.GA27612@xxxxxx/__;!!ACWV5N9M2RV99hQ! P5jeP96F8wAtRAblbm8NvRo8nlpil03vA26UMMX8qrYa4IzKecAAk7x1l1M45bBshC3Czxn1CkDXypNSAg$
> > Yes -- if you have a mix of written and unwritten blocks for the same
> > chunk of physical space:
> > 
> > 0      7
> > WUWUWUWU
> > 
> > the directio ioend function will start four separate transactions to
> > convert blocks 1, 3, 5, and 7 to written status.  If the system crashes
> > midway through, they will see this afterwards:
> > 
> > WWWWW0W0
> > 
> > IOWs, although the*disk write* was completed successfully, the mapping
> > updates were torn, and the user program sees a torn write.
> > > The most performant/painful way to fix this would be to make the whole
> > ioend completion a logged operation so that we could commit to updating
> > all the unwritten mappings and restart it after a crash.
> 
> could we make it logged for those special cases which we are interested in
> only?

Yes, though this is the long route -- you get to define a new ondisk log
item, build all the incore structures to process them, and then define a
new high level operation that uses the state encoded in that new log
item to run all the ioend completion transactions within that framework.
Also you get to add a new log incompat feature bit for this.

Perhaps we should analyze the cost of writing and QA'ing all that vs.
the amount of time saved in the handling of this corner case using one
of the less exciting options.

> > The least performant of course is to write zeroes at allocation time,
> > like we do for fsdax.
> 
> That idea was already proposed:
> https://lore.kernel.org/linux-xfs/ZcGIPlNCkL6EDx3Z@xxxxxxxxxxxxxxxxxxx/

Yes, I'm aware.

> > A possible middle ground would be to detect IOMAP_ATOMIC in the
> > ->iomap_begin method, notice that there are mixed mappings under the
> > proposed untorn IO, and pre-convert the unwritten blocks by writing
> > zeroes to disk and updating the mappings
> 
> Won't that have the same issue as using XFS_BMAPI_ZERO, above i.e. zeroing
> during allocation?

Only if you set the forcealign size to > 1fsb and fail to write new
file data in forcealign units, even for non-untorn writes.  If all
writes to the file are aligned to the forcealign size then there's only
one extent conversion to be done, and that cannot be torn.

> > before handing the one single
> > mapping back to iomap_dio_rw to stage the untorn writes bio.  At least
> > you'd only be suffering that penalty for the (probable) corner case of
> > someone creating mixed mappings.
> 
> BTW, one issue I have with the sub-extent(or -alloc unit) zeroing from v4
> series is how the unwritten conversion has changed, like:
> 
> xfs_iomap_write_unwritten()
> {
> 	unsigned int rounding;
> 
> 	/* when converting anything unwritten, we must be spanning an alloc unit,
> so round up/down */
> 	if (rounding > 1) {
> 		offset_fsb = rounddown(rounding);
> 		count_fsb = roundup(rounding);
> 	}
> 
> 	...
> 	do {
> 		xfs_bmapi_write();
> 		...
> 		xfs_trans_commit();
> 	} while ();
> }
> 
> I'm not too happy with it and it seems a bit of a bodge, as I would rather
> we report the complete size written (user data and zeroes); then
> xfs_iomap_write_unwritten() would do proper individual block conversion.
> However, we do something similar for zeroing for sub-FSB writes. I am not
> sure if that is the same thing really, as we only round up to FSB size.
> Opinion?

xfs_iomap_write_unwritten is in the ioend path; that's not what I was
talking about.

I'm talking about a separate change to the xfs_direct_write_iomap_begin
function that would detect the case where the bmapi_read returns an
@imap that doesn't span the whole forcealign region, then repeatedly
calls bmapi_write(BMAPI_ZERO | BMAPI_CONVERT) on any unwritten mappings
within that file range until the original bmapi_read would return a
single written mapping.

--D

> 
> Thanks,
> John
> 
> 
>