On 13/12/2024 14:38, Christoph Hellwig wrote:
On Tue, Dec 10, 2024 at 12:57:30PM +0000, John Garry wrote:
Currently the atomic write unit min and max are fixed at the FS blocksize
for xfs and ext4.
This series expands support to allow multiple FS blocks to be written
atomically.
Can you explain the workload you're interested in a bit more?
Sure, so some background is that we are using atomic writes for innodb
MySQL so that we can stop relying on the double-write buffer for crash
protection. MySQL is using an internal 16K page size (so we want 16K
atomic writes).
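
As a rough illustration of what a 16K atomic write needs, here is a
minimal sketch of the constraints documented for RWF_ATOMIC in
pwritev2(2): the length must be a power of two, lie between the
advertised unit min and max, and the offset must be naturally aligned to
the length. The helper names and the unit_max value of 16K are
assumptions for illustration only, not code or limits taken from the
series.

```python
def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def atomic_write_ok(offset: int, length: int, unit_min: int, unit_max: int) -> bool:
    """Check the documented RWF_ATOMIC constraints for a single write.

    unit_min/unit_max stand in for the values reported by statx()
    (stx_atomic_write_unit_min/max); here they are plain parameters.
    """
    return (
        is_power_of_two(length)
        and unit_min <= length <= unit_max
        and offset % length == 0  # naturally aligned to the write length
    )


FSB = 4096      # FS block size
PAGE = 16384    # InnoDB page size

# Unit min and max both fixed at the FS block size: 16K is rejected.
print(atomic_write_ok(0, PAGE, FSB, FSB))

# Assumed unit_max of 16K (hypothetical): an aligned 16K write is accepted.
print(atomic_write_ok(3 * PAGE, PAGE, FSB, PAGE))

# Same limits, but the offset is only 4K aligned: rejected.
print(atomic_write_ok(FSB, PAGE, FSB, PAGE))
```

With unit max stuck at 4K the first check fails, which is the
limitation this series is aimed at.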
MySQL has what is known as a REDO log - see
https://dev.mysql.com/doc/dev/mysql-server/9.0.1/PAGE_INNODB_REDO_LOG.html
Essentially it means that for any data page we write, ahead of time we
do a buffered 512B log update followed by a periodic fsync. I think that
such a thing is common to many apps.
I'm still very scared of expanding use of the large allocation sizes.
Yes
IIRC you showed some numbers where increasing the FSB size to something
larger did not look good in your benchmarks, but I'd like to understand
why. Do you have a link to these numbers just to refresh everyone's minds
on why that wasn't a good idea?
I don't think that I can share numbers, but I will summarize the findings.
When we tried just using 16K FS blocksize, we found for low thread count
testing that performance was poor - even worse than the baseline of 4K FS
blocksize and double-write buffer. We put this down to high write
latency for the REDO log. As you can imagine, mostly writing 16K for only a
512B update is not efficient in terms of traffic generated and increased
latency (versus 4K FS block size). At higher thread count, performance
was better. We put that down to bigger portions of log data being written
to REDO per FS block write.
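
To put rough numbers on the traffic argument, here is a back-of-envelope
sketch. It assumes a simplified model in which every log flush writes
whole FS blocks; that model and the function names are mine, and these
are not the benchmark numbers.

```python
def bytes_written(update_bytes: int, fs_block: int) -> int:
    """Bytes sent to disk if the update is rounded up to whole FS blocks."""
    blocks = -(-update_bytes // fs_block)  # ceiling division
    return blocks * fs_block


def amplification(update_bytes: int, fs_block: int) -> float:
    """Ratio of bytes written to bytes of log data actually updated."""
    return bytes_written(update_bytes, fs_block) / update_bytes


# A single 512B REDO update, as in the low thread count case.
for fsb in (4096, 16384):
    print(fsb, bytes_written(512, fsb), amplification(512, fsb))

# Eight 512B updates batched into one flush, as a stand-in for the
# higher thread count case.
print(16384, bytes_written(8 * 512, 16384), amplification(8 * 512, 16384))
```

Under this model a lone 512B update costs 8x its size on a 4K FS block
and 32x on a 16K FS block, while batching eight updates into one 16K
block brings that down to 4x.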
For 4K FS blocksize and 16K atomic writes configs - supported via
forcealign or RTvol - performance was generally good across the board.
forcealign was a bit better.
We also tried a hybrid solution with 2x partitions - 1x partition with
16K FS block size for data and 1x partition with 4K FS block size for
REDO. Performance here was good also. Unfortunately, though, this config
is not fit for production - that is because we have a requirement to do
FS snapshot and that is not possible over 2x FS instances. We did also
consider block device snapshot, but there is reluctance to try that.
Did that also include supporting atomic
writes in the sector size <= write size <= FS block size range, which
aren't currently supported, but very useful?
I have no use for that so far.
Thanks,
John