On Fri, Dec 13, 2024 at 05:43:09PM +0000, John Garry wrote:
> On 13/12/2024 17:22, Christoph Hellwig wrote:
> > On Fri, Dec 13, 2024 at 05:15:55PM +0000, John Garry wrote:
> > > Sure, so some background is that we are using atomic writes for innodb
> > > MySQL so that we can stop relying on the double-write buffer for crash
> > > protection. MySQL is using an internal 16K page size (so we want 16K atomic
> > > writes).
> >
> > Make perfect sense so far.
> >
> > >
> > > MySQL has what is known as a REDO log - see
> > > https://dev.mysql.com/doc/dev/mysql-server/9.0.1/PAGE_INNODB_REDO_LOG.html
> > >
> > > Essentially it means that for any data page we write, ahead of time we do a
> > > buffered 512B log update followed by a periodic fsync. I think that such a
> > > thing is common to many apps.
> >
> > So it's actually using buffered I/O for that and not direct I/O?
>
> Right
>
> > > When we tried just using 16K FS blocksize, we found for low thread count
> > > testing that performance was poor - even worse baseline of 4K FS blocksize
> > > and double-write buffer. We put this down to high write latency for REDO
> > > log. As you can imagine, mostly writing 16K for only a 512B update is not
> > > efficient in terms of traffic generated and increased latency (versus 4K FS
> > > block size). At higher thread count, performance was better. We put that
> > > down to bigger log data portions to be written to REDO per FS block write.
> >
> > So if the redo log uses buffered I/O I can see how that would bloat writes.
> > But then again using buffered I/O for a REDO log seems pretty silly
> > to start with.
> >
>
> Yeah, at the low end, it may make sense to do the 512B write via DIO. But
> OTOH sync'ing many redo log FS blocks at once at the high end can be more
> efficient.
>
> From what I have heard, this was attempted before (using DIO) by some
> vendor, but did not come to much.
>
> So it seems that we are stuck with this redo log limitation.
>
> Let me know if you have any other ideas to avoid large atomic writes...
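
For a rough sense of the numbers in the quoted redo log discussion, here is a minimal sketch of the write amplification arithmetic. It assumes each buffered log update starts at an FS block boundary and that a sync writes back whole FS blocks; the function names are hypothetical and not taken from MySQL or the kernel.

```python
# Hypothetical helpers for illustration only; not MySQL or kernel code.
# Assumption: the update starts at an FS block boundary and writeback
# happens in whole FS blocks.

def bytes_written(update_bytes: int, fs_block_bytes: int) -> int:
    """Bytes sent to disk when the update is synced, rounded up to whole FS blocks."""
    blocks = -(-update_bytes // fs_block_bytes)  # ceiling division
    return blocks * fs_block_bytes


def write_amplification(update_bytes: int, fs_block_bytes: int) -> float:
    """Ratio of bytes written to bytes of log data actually updated."""
    return bytes_written(update_bytes, fs_block_bytes) / update_bytes


if __name__ == "__main__":
    # Low thread count: a single 512B redo log update per sync.
    for fs_block in (4096, 16384):
        print(f"{fs_block // 1024}K FS block, 512B update: "
              f"{write_amplification(512, fs_block):.0f}x")

    # Higher thread count: more log data batched per FS block write
    # (8K per sync is an assumed figure, chosen only for illustration).
    print(f"16K FS block, 8K of batched log data: "
          f"{write_amplification(8192, 16384):.0f}x")
```

Under these assumptions a lone 512B update costs 8x on a 4K FS block and 32x on a 16K one, while batching more log data per sync shrinks the gap, which is in line with the low versus high thread count behaviour described above.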