Re: [PATCH v2 0/7] large atomic writes for xfs

John Garry <john.g.garry@xxxxxxxxxx> · Fri, 13 Dec 2024 17:43:09 +0000

On 13/12/2024 17:22, Christoph Hellwig wrote:
On Fri, Dec 13, 2024 at 05:15:55PM +0000, John Garry wrote:
Sure, so some background is that we are using atomic writes for innodb
MySQL so that we can stop relying on the double-write buffer for crash
protection. MySQL is using an internal 16K page size (so we want 16K atomic
writes).

Makes perfect sense so far.

MySQL has what is known as a REDO log - see
https://dev.mysql.com/doc/dev/mysql-server/9.0.1/PAGE_INNODB_REDO_LOG.html

Essentially it means that for any data page we write, ahead of time we do a
buffered 512B log update followed by a periodic fsync. I think that such a
thing is common to many apps.
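
For illustration only, the write pattern described above (small buffered log appends, then a periodic fsync) can be sketched roughly as follows. This is not MySQL's code: the function names, the zero padding, and the count-based sync interval are all assumptions made for the sketch, and a real redo log would likely sync on a timer or at commit.

```python
import os
import tempfile

LOG_RECORD_SIZE = 512  # size of one log update, as described above


def append_log_records(fd, records, sync_every):
    """Append 512B records with buffered writes and fsync every N records.

    Hypothetical helper: assumes each payload is at most 512 bytes and that
    "periodic" means "every sync_every records". Returns the fsync count.
    """
    syncs = 0
    for i, payload in enumerate(records, start=1):
        # Pad each record to a full 512B log block (padding is an assumption).
        os.write(fd, payload.ljust(LOG_RECORD_SIZE, b"\0"))
        if i % sync_every == 0:
            os.fsync(fd)  # data only becomes durable here
            syncs += 1
    if len(records) % sync_every != 0:
        os.fsync(fd)  # flush the tail that did not fill a whole batch
        syncs += 1
    return syncs


def demo():
    """Write 10 records to a temp file, syncing every 4 records."""
    fd, path = tempfile.mkstemp()
    try:
        records = [b"rec%d" % n for n in range(10)]
        syncs = append_log_records(fd, records, sync_every=4)
        size = os.fstat(fd).st_size
    finally:
        os.close(fd)
        os.unlink(path)
    return syncs, size
```

With these assumed parameters, demo() issues three fsync calls for 5120 bytes of log data.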

So it's actually using buffered I/O for that and not direct I/O?

Right

When we tried just using a 16K FS blocksize, we found in low thread count
testing that performance was poor - even worse than the baseline of 4K FS
blocksize
and double-write buffer. We put this down to high write latency for REDO
log. As you can imagine, mostly writing 16K for only a 512B update is not
efficient in terms of traffic generated and increased latency (versus 4K FS
block size). At higher thread count, performance was better. We put that
down to bigger log data portions to be written to REDO per FS block write.
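
As a rough back-of-the-envelope model of the effect described above: assume each sync writes back whole FS blocks, and that the 512B log updates batched into one sync are contiguous and block-aligned. Both are simplifying assumptions for this sketch, not measurements, and the function names are hypothetical.

```python
LOG_UPDATE_BYTES = 512  # one REDO log update, as described above


def bytes_written_per_sync(fs_block_bytes, updates_per_sync):
    """Bytes sent to disk for one sync, rounded up to whole FS blocks."""
    payload = LOG_UPDATE_BYTES * updates_per_sync
    blocks = -(-payload // fs_block_bytes)  # ceiling division
    return blocks * fs_block_bytes


def amplification(fs_block_bytes, updates_per_sync):
    """Ratio of bytes written to bytes of log data actually updated."""
    payload = LOG_UPDATE_BYTES * updates_per_sync
    return bytes_written_per_sync(fs_block_bytes, updates_per_sync) / payload


if __name__ == "__main__":
    for block in (4096, 16384):
        for updates in (1, 8, 32):
            print(block, updates, amplification(block, updates))
```

Under this model a single 512B update costs 8x its size with 4K blocks but 32x with 16K blocks, while 32 updates batched into one sync fill a 16K block exactly, so the overhead disappears. That is consistent with the low versus high thread count behaviour described above, though the real numbers depend on how updates are batched.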

So if the redo log uses buffered I/O I can see how that would bloat writes.
But then again using buffered I/O for a REDO log seems pretty silly
to start with.

Yeah, at the low end, it may make sense to do the 512B write via DIO. 
But OTOH sync'ing many redo log FS blocks at once at the high end can be 
more efficient.

From what I have heard, this was attempted before (using DIO) by some 
vendor, but did not come to much.

So it seems that we are stuck with this redo log limitation.

Let me know if you have any other ideas to avoid large atomic writes...

Cheers,
John