Re: [PATCH v2 0/7] large atomic writes for xfs

John Garry <john.g.garry@xxxxxxxxxx> · Fri, 13 Dec 2024 17:15:55 +0000

On 13/12/2024 14:38, Christoph Hellwig wrote:
> On Tue, Dec 10, 2024 at 12:57:30PM +0000, John Garry wrote:
>> Currently the atomic write unit min and max is fixed at the FS blocksize
>> for xfs and ext4.
>>
>> This series expands support to allow multiple FS blocks to be written
>> atomically.

> Can you explain the workload you're interested in a bit more?

Sure. Some background: we are using atomic writes for InnoDB in MySQL
so that we can stop relying on the double-write buffer for crash
protection. MySQL uses an internal 16K page size (so we want 16K
atomic writes).
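
For illustration only, here is a minimal Python sketch of the
constraints such a 16K write would need to satisfy. It assumes the
usual Linux rules for RWF_ATOMIC writes (power-of-two length, within
the reported unit min/max, offset naturally aligned to the length).
The helper name and the UNIT_MIN/UNIT_MAX values are hypothetical; on
a real system the limits would come from statx().

```python
PAGE = 16 * 1024       # InnoDB page size mentioned above
UNIT_MIN = 4 * 1024    # assumed: one 4K FS block
UNIT_MAX = 16 * 1024   # assumed: limit once multi-block atomic writes work


def atomic_write_ok(offset, length, unit_min, unit_max):
    """Check one write against the assumed atomic-write rules (hypothetical helper)."""
    if length <= 0:
        return False
    if length & (length - 1):          # length must be a power of two
        return False
    if not (unit_min <= length <= unit_max):
        return False
    return offset % length == 0        # offset naturally aligned to length


print(atomic_write_ok(3 * PAGE, PAGE, UNIT_MIN, UNIT_MAX))         # True
print(atomic_write_ok(3 * PAGE + 4096, PAGE, UNIT_MIN, UNIT_MAX))  # False
```

A page-aligned 16K write passes; the same write shifted by one 4K
block does not, which is why the file layout has to keep 16K pages
aligned.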

MySQL has what is known as a REDO log - see 
https://dev.mysql.com/doc/dev/mysql-server/9.0.1/PAGE_INNODB_REDO_LOG.html

Essentially it means that for any data page we write, ahead of time we
do a buffered 512B log update followed by a periodic fsync. I think that
this kind of write-ahead logging is common to many apps.

> I'm still very scared of expanding use of the large allocation sizes.

Yes

> IIRC you showed some numbers where increasing the FSB size to something
> larger did not look good in your benchmarks, but I'd like to understand
> why.  Do you have a link to these numbers just to refresh everyones minds
> why that wasn't a good idea.

I don't think that I can share numbers, but I will summarize the findings.

When we tried just using 16K FS blocksize, we found for low thread count
testing that performance was poor - even worse than the baseline of 4K FS
blocksize and double-write buffer. We put this down to high write
latency for the REDO log. As you can imagine, mostly writing 16K for only
a 512B update is not efficient in terms of traffic generated and increased
latency (versus 4K FS block size). At higher thread count, performance
was better. We put that down to more log data being written to REDO
per FS block write.
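
To put rough numbers on that reasoning, here is a small Python sketch.
It is a simplified model, not a measurement: it assumes each log flush
writes exactly one whole FS block, and the function name and the
batching parameter are hypothetical.

```python
LOG_UPDATE = 512  # size of one REDO log update, as described above


def write_amplification(fs_block, updates_per_flush=1):
    """Bytes written to disk per byte of log payload (simplified model)."""
    payload = LOG_UPDATE * updates_per_flush
    if payload > fs_block:
        raise ValueError("payload does not fit in one FS block")
    return fs_block / payload


print(write_amplification(16 * 1024))      # 32.0 (16K block, single update)
print(write_amplification(4 * 1024))       # 8.0  (4K block, single update)
print(write_amplification(16 * 1024, 32))  # 1.0  (16K block, 32 updates batched)
```

Under this model a lone 512B update costs 4x more traffic on a 16K
block than on a 4K block, while batching many updates per flush (as
happens at higher thread counts) closes the gap.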

For 4K FS blocksize and 16K atomic writes configs - supported via
forcealign or RTvol - performance was generally good across the board.
forcealign was a bit better.

We also tried a hybrid solution with 2x partitions - 1x partition with
16K FS block size for data and 1x partition with 4K FS block size for
REDO. Performance here was also good. Unfortunately, though, this config
is not fit for production, because we have a requirement to do FS
snapshots and that is not possible over 2x FS instances. We did also
consider block device snapshots, but there is reluctance to try those
as well.

> Did that also include supporting atomic
> writes in the sector size <= write size <= FS block size range, which
> aren't currently supported, but very useful?

I have no use for that so far.

Thanks,
John