Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support

John Garry <john.g.garry@xxxxxxxxxx> · Fri, 9 Feb 2024 12:47:38 +0000

Playing devil's advocate here, at least this behavior should be documented.

That's what man pages are for, yes?

Are you expecting your deployments to be run on highly suboptimal
configurations and so the code needs to be optimised for this
behaviour, or are you expecting them to be run on correctly
configured systems which would never see these issues?

The latter hopefully

The whole reason for rtextsize existing is to optimise the rtextent
allocation to the typical minimum IO size done to that volume. If
all your IO is sub-rtextsize size and alignment, then all that has
been done is forcing the entire rt device IO into a corner it was
never really intended nor optimised for.

Sure, but just because we are optimized for a certain IO write size should
not mean that other writes are disallowed or quite problematic.

Atomic writes are just "other writes". They are writes that are
*expected to fail* if they cannot be done atomically.

Agreed

Application writers will quickly learn how to do sane, fast,
reliable atomic write IO if we reject anything that is going to
requires some complex, sub-optimal workaround in the kernel to make
it work. The simplest solution is to -fail the write-, because
userspace *must* be prepared for *any* atomic write to fail.

Sure, but it needs to be such that the application writer at least knows 
why it failed, which so far had not been documented.

Why should we jump through crazy hoops to try to make filesystems
optimised for large IOs with mismatched, overlapping small atomic
writes?

As mentioned, typically the atomic writes will be the same size, but we may
have other writes of smaller size.

Then we need the tiny write to allocate and zero according to the
maximum sized atomic write bounds. Then we just don't care about
large atomic IO overlapping small IO, because the extent on disk
aligned to the large atomic IO is then always guaranteed to be the
correct size and shape.

I think it's worth mentioning that there is currently a separation 
between how we configure the FS extent size for atomic writes and what 
the bdev can actually support in terms of atomic writes.

Setting the rtvol extsize at mkfs time or enabling atomic writes 
FS_XFLAG_ATOMICWRITES doesn't check for what the underlying bdev can do 
in terms of atomic writes.

This check is not done as it is not fixed what the bdev can do in terms 
of atomic writes - or, more specifically, what they request_queue 
reports is not be fixed. There are things which can change this. For 
example, a FW update could change all the atomic write capabilities of a 
disk. Or even if we swapped a SCSI disk into another host the atomic 
write limits may change, as the atomic write unit max depends on the 
SCSI HBA DMA limits. Changing BIO_MAX_VECS - which could come from a 
kernel update - could also change what we report as atomic write limit 
in the request queue.

With the change in this patch, instead we have something like this after the
first write:

# /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file
wrote 4096 bytes at pos 0 write_size=4096
# filefrag -v mnt/file
Filesystem type is: 58465342
File size of mnt/file is 4096 (1 block of 4096 bytes)
    ext:     logical_offset:        physical_offset: length:   expected:
flags:
      0:        0..       3:         24..        27:      4:
last,eof
mnt/file: 1 extent found
#

So the 16KB extent is in written state and the 2nd 16KB write would iter
once, producing a single BIO.
Sure, I know how it works. My point is that it's a terrible way to
go about allowing that second atomic write to succeed.
I think 'terrible' is a bit too strong a word here.

Doing it anything in a way that a user can DOS the entire filesystem
is *terrible*. No ifs, buts or otherwise.

Understood

Indeed, you suggest to
manually zero the file to solve this problem, below, while this code change
does the same thing automatically.

Yes, but I also outlined a way that it can be done automatically
without being terrible. There are multiple options here, I outlined
two different approaches that are acceptible.

I think that I need to check these alternate solutions in more detail. 
More below.

In this
scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
ensure that the extent is completely written whenever we allocate it. At
least that is my idea.
So return an unaligned extent, and then the IOMAP_ATOMIC checks you
add below say "no" and then the application has to do things the
slow, safe way....
We have been porting atomic write support to some database apps and they
(database developers) have had to do something like manually zero the
complete file to get around this issue, but that's not a good user
experience.
Better the application zeros the file when it is being initialised
and doesn't have performance constraints rather than forcing the
filesystem to do it in the IO fast path when IO performance and
latency actually matters to the application.

Can't we do both? I mean, the well-informed user can still pre-zero the file
just to ensure we aren't doing this zero'ing with the extent allocation.

I never said we can't do zeroing. I just said that it's normally
better when the application controls zeroing directly.

ok

And therein lies the problem.

If you are doing sub-rtextent IO at all, then you are forcing the
filesystem down the path of explicitly using unwritten extents and
requiring O_DSYNC direct IO to do journal flushes in IO completion
context and then performance just goes down hill from them.

The requirement for unwritten extents to track sub-rtextsize written
regions is what you're trying to work around with XFS_BMAPI_ZERO so
that atomic writes will always see "atomic write aligned" allocated
regions.

Do you see the problem here? You've explicitly told the filesystem
that allocation is aligned to 64kB chunks, then because the
filesystem block size is 4kB, it's allowed to track unwritten
regions at 4kB boundaries. Then you do 4kB aligned file IO, which
then changes unwritten extents at 4kB boundaries. Then you do a
overlapping 16kB IO that*requires*  16kB allocation alignment, and
things go BOOM.

Yes, they should go BOOM.

This is a horrible configuration - it is incomaptible with 16kB
aligned and sized atomic IO.

Just because the DB may do 16KB atomic writes most of the time should not
disallow it from any other form of writes.

That's not what I said. I said the using sub-rtextsize atomic writes
with single FSB unwritten extent tracking is horrible and
incompatible with doing 16kB atomic writes.

This setup will not work at all well with your patches and should go
BOOM. Using XFS_BMAPI_ZERO is hacking around the fact that the setup
has uncoordinated extent allocation and unwritten conversion
granularity.

That's the fundamental design problem with your approach - it allows
unwritten conversion at *minimum IO sizes* and that does not work
with atomic IOs with larger alignment requirements.

The fundamental design principle is this: for maximally sized atomic
writes to always succeed we require every allocation, zeroing and
unwritten conversion operation to use alignments and sizes that are
compatible with the maximum atomic write sizes being used.

That sounds fine.

My question then is how we determine this max atomic write size granularity.

We don't explicitly tell the FS what atomic write size we want for a 
file. Rather we mkfs with some extsize value which should match our 
atomic write maximal value and then tell the FS we want to do atomic 
writes on a file, and if this is accepted then we can query the atomic 
write min and max unit size, and this would be [FS block size, min(bdev 
atomic write limit, rtexsize)].

If rtextsize is 16KB, then we have a good idea that we want 16KB atomic 
writes support. So then we could use rtextsize as this max atomic write 
size. But I am not 100% sure that it your idea (apologies if I am wrong 
- I am sincerely trying to follow your idea), but rather it would be 
min(rtextsize, bdev atomic write limit), e.g. if rtextsize was 1MB and 
bdev atomic write limit is 16KB, then there is no much point in dealing 
in 1MB blocks for this unwritten extent conversion alignment. If so, 
then my concern is that the bdev atomic write upper limit is not fixed. 
This can solved, but I would still like to be clear on this max atomic 
write size.

> i.e. atomic writes need to use max write size granularity for all IO
> operations, not filesystem block granularity.

And that also means things like rtextsize and extsize hints need to
match these atomic write requirements, too....

As above, I am not 100% sure if you mean these to be the atomic write 
maximal value.

Allocation is aligned to 64kB, written
region tracking is aligned to 4kB, and there's nothing to tell the
filesystem that it should be maintaining 16kB "written alignment" so
that 16kB atomic writes can always be issued atomically.

Please note that in my previous example the mkfs rtextsize arg should 
really have been 16KB, and that the intention would have been to enable 
16KB atomic writes. I used 64KB casually as I thought it should be 
possible to support sub-rtextsize atomic writes. The point which I was 
trying to make was that the 16KB atomic write and 4KB regular write 
intermixing was problematic.

i.e. if we are going to do 16kB aligned atomic IO, then all the
allocation and unwritten tracking needs to be done in 16kB aligned
chunks, not 4kB. That means a 4KB write into an unwritten region or
a hole actually needs to zero the rest of the 16KB range it sits
within.

The direct IO code can do this, but it needs extension of the
unaligned IO serialisation in XFS (the alignment checks in
xfs_file_dio_write()) and the the sub-block zeroing in
iomap_dio_bio_iter() (the need_zeroing padding has to span the fs
allocation size, not the fsblock size) to do this safely.

Regardless of how we do it, all IO concurrency on this file is shot
if we have sub-rtextent sized IOs being done. That is true even with
this patch set - XFS_BMAPI_ZERO is done whilst holding the
XFS_ILOCK_EXCL, and so no other DIO can map extents whilst the
zeroing is being done.

IOWs, anything to do with sub-rtextent IO really has to be treated
like sub-fsblock DIO - i.e. exclusive inode access until the
sub-rtextent zeroing has been completed.

I do understand that this is not perfect that we may have mixed block sizes
being written, but I don't think that we should disallow it and throw an
error.

Ummmm, did you read what you quoted?

The above is an outline of the IO path modifications that will allow
mixed IO sizes to be used with atomic writes without requiring the
XFS_BMAPI_ZERO hack. It pushes the sub-atomic write alignment
zeroing out to the existing DIO sub-block zeroing, hence ensuring
that we only ever convert unwritten extents on max sized atomic
write boundaries for atomic write enabled inodes.

ok, I get this idea. And, indeed, it does sound better than the 
XFS_BMAPI_ZERO proposal.

At no point have I said "no mixed writes".

For sure

I've said no to the
XFS_BMAPI_ZERO hack, but then I've explained the fundamental issue
that it works around and given you a decent amount of detail on how
to sanely implementing mixed write support that will work (slowly)
with those configurations and IO patterns.

So it's your choice - you can continue to beleive I don't mixed
writes to work at all, or you can go back and try to understand the
IO path changes I've suggested that will allow mixed atomic writes
to work as well as they possibly can....

Ack

Much appreciated,
John