Re: [PATCH v6 00/10] block atomic writes

John Garry <john.g.garry@xxxxxxxxxx> · Wed, 10 Apr 2024 09:34:36 +0100

On 08/04/2024 18:50, Luis Chamberlain wrote:
I agree that when you don't set the sector size to 16k you are not forcing the
filesystem to use 16k IOs; the metadata can still be 4k. But when you
use a 16k sector size, the 16k IOs should be respected by the
filesystem.

Do we break BIOs to below a min order if the sector size is also set to
16k?  I haven't seen that and it's unclear when or how that could happen.

AFAICS, the only guarantee is to not split below LBS.

At least for NVMe we don't need to yell to a device to inform it we want
a 16k IO issued to it to be atomic; if we read that it has the
capability for it, it just does it. The IO verification can be done with
blkalgn [0].

Does SCSI *require* any 16k atomic prep work, or can it be done implicitly?
Does it need WRITE_ATOMIC_16?

The physical block size is what we can implicitly write atomically. So if
you have a 4K PBS and 512B LBS, then WRITE_ATOMIC_16 would be required
to write 16KB atomically.
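
As a rough illustration of the rule above, here is a minimal sketch of the
decision, not kernel code. The function name and parameters are hypothetical,
and it assumes that a write is implicitly atomic only when it fits entirely
inside one physical block (the boundary-crossing check is my assumption, not
something stated in this thread).

```python
def needs_explicit_atomic(offset_bytes: int, length_bytes: int,
                          pbs_bytes: int) -> bool:
    """Return True if the write cannot rely on implicit PBS atomicity.

    Hypothetical helper for illustration only. Assumption: a write is
    implicitly atomic only if it lies within a single physical block.
    """
    if offset_bytes < 0 or length_bytes <= 0 or pbs_bytes <= 0:
        raise ValueError("offset must be >= 0; length and PBS must be > 0")

    # Index of the physical block holding the first and last byte written.
    first_block = offset_bytes // pbs_bytes
    last_block = (offset_bytes + length_bytes - 1) // pbs_bytes

    # Spanning more than one physical block means an explicit atomic
    # command (e.g. WRITE_ATOMIC_16 on SCSI) would be needed.
    return first_block != last_block


# The 4K PBS case from above: a 16KB write spans four physical blocks.
print(needs_explicit_atomic(0, 16 * 1024, 4 * 1024))
# With a 16K PBS the same write fits in one physical block.
print(needs_explicit_atomic(0, 16 * 1024, 16 * 1024))
```

The LBS does not appear in the sketch because, per the above, it only bounds
the addressing granularity; the PBS is what determines implicit atomicity.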

[0] https://github.com/dagmcr/bcc/tree/blkalgn

So just increasing the inode block size / FS block size does not
really change anything, in itself.
If we're breaking up IOs when a min order is set for an inode, that
would need to be looked into, but we're not seeing that.

In practice you won't see it, but I am talking about guarantees not to 
see it.

Do untorn writes actually exist in SCSI?  I was under the impression
nobody had actually implemented them in SCSI hardware.
I know that some SCSI targets actually atomically write data in chunks >
LBS. Obviously atomic vs non-atomic performance is a moot point there, as
data is implicitly always atomically written.

We actually have a MySQL/InnoDB port of this API working on such a SCSI
target.
I suspect IO verification with the above tool should show the
same if you use a filesystem with a larger sector size set too, and you
would not have to make any changes to userspace other than the
filesystem creation with, say, mkfs.xfs params of -b size=16k -s size=16k.

Ok, I see

However I am not sure about atomic write support for other SCSI targets.
Good to know!

We saw untorn writes as not being a property of the file or even the inode
itself, but rather an attribute of the specific IO being issued from the
userspace application.
The problem is that keeping track of that is expensive for buffered
writes.  It's a model that only works for direct IO.  Arguably we
could make it work for O_SYNC buffered IO, but that'll require some
surgery.
To me, O_ATOMIC would be required for buffered atomic write IO, as we want
a fixed-size IO, so that would mean no mixing of atomic and non-atomic IO.
Would using the same min and max order for the inode work instead?

Maybe, I would need to check further.

Thanks,
John