Re: [PATCH v2 0/7] large atomic writes for xfs

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Fri, 13 Dec 2024 16:42:56 -0800



On Fri, Dec 13, 2024 at 05:43:09PM +0000, John Garry wrote:
> On 13/12/2024 17:22, Christoph Hellwig wrote:
> > On Fri, Dec 13, 2024 at 05:15:55PM +0000, John Garry wrote:
> > > Sure, so some background is that we are using atomic writes for innodb
> > > MySQL so that we can stop relying on the double-write buffer for crash
> > > protection. MySQL is using an internal 16K page size (so we want 16K atomic
> > > writes).
> > 
> > Make perfect sense so far.
> > 
> > > 
> > > MySQL has what is known as a REDO log - see
> > > https://dev.mysql.com/doc/dev/mysql-server/9.0.1/PAGE_INNODB_REDO_LOG.html
> > > 
> > > Essentially it means that for any data page we write, ahead of time we do a
> > > buffered 512B log update followed by a periodic fsync. I think that such a
> > > thing is common to many apps.
> > 
> > So it's actually using buffered I/O for that and not direct I/O?
> 
> Right
> 
> > >> When we tried just using 16K FS blocksize, we found for low thread
> count
> > > testing that performance was poor - even worse baseline of 4K FS blocksize
> > > and double-write buffer. We put this down to high write latency for REDO
> > > log. As you can imagine, mostly writing 16K for only a 512B update is not
> > > efficient in terms of traffic generated and increased latency (versus 4K FS
> > > block size). At higher thread count, performance was better. We put that
> > > down to bigger log data portions to be written to REDO per FS block write.
> > 
> > So if the redo log uses buffered I/O I can see how that would bloat writes.
> > But then again using buffered I/O for a REDO log seems pretty silly
> > to start with.
> > 
> 
> Yeah, at the low end, it may make sense to do the 512B write via DIO. But
> OTOH sync'ing many redo log FS blocks at once at the high end can be more
> efficient.
> 
> From what I have heard, this was attempted before (using DIO) by some
> vendor, but did not come to much.
> 
> So it seems that we are stuck with this redo log limitation.
> 
> Let me know if you have any other ideas to avoid large atomic writes...