Yeah, at the low end, it may make sense to do the 512B write via DIO. But
OTOH sync'ing many redo log FS blocks at once at the high end can be more
efficient.
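To be clear, the low-end DIO path I mean would look roughly like this
(just a sketch, assuming the device/FS allows 512B-aligned direct IO;
the file name and offset handling are made up):

/* Sketch only: write a single 512B redo record via DIO. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int write_redo_record_dio(const char *path, const void *rec, off_t off)
{
        void *buf;
        int fd, ret = -1;

        fd = open(path, O_WRONLY | O_DIRECT);
        if (fd < 0)
                return -1;

        /* DIO wants the buffer, length and offset aligned to the
         * logical block size (512B assumed here). */
        if (posix_memalign(&buf, 512, 512))
                goto out;
        memcpy(buf, rec, 512);

        if (pwrite(fd, buf, 512, off) == 512)
                ret = 0;

        free(buf);
out:
        close(fd);
        return ret;
}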
From what I have heard, this was attempted before (using DIO) by some
vendor, but did not come to much.
So it seems that we are stuck with this redo log limitation.
Let me know if you have any other ideas to avoid large atomic writes...
From the description it sounds like the redo log consists of 512b blocks
that describe small changes to the 16k table file pages. If they're
issuing 16k atomic writes to get each of those 512b redo log records to
disk, it's no wonder that cranks up the overhead substantially.
They are not issuing the redo log atomically. They do 512B buffered
writes and then periodically fsync.
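i.e. the pattern is roughly this (a minimal sketch only; the batch size
and sync trigger below are arbitrary, not what MySQL actually does):

/* Sketch: append 512B redo records through the pagecache and only
 * fsync every so often, so one sync covers a batch of records. */
#include <unistd.h>

#define REDO_REC_SIZE   512
#define RECS_PER_SYNC   64      /* arbitrary, for illustration */

static int recs_since_sync;

int append_redo_record(int fd, const char rec[REDO_REC_SIZE])
{
        if (write(fd, rec, REDO_REC_SIZE) != REDO_REC_SIZE)
                return -1;

        /* Many small records dirty the same FS blocks, so a single
         * fsync flushes a whole batch of them at once. */
        if (++recs_since_sync >= RECS_PER_SYNC) {
                recs_since_sync = 0;
                if (fsync(fd))
                        return -1;
        }
        return 0;
}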
Also, replaying those tiny updates through the pagecache beats issuing a
bunch of tiny nonlocalized writes.
For the first case I don't know why they need atomic writes -- 512b redo
log records can't be torn because they're single-sector writes. The
second case might be better done with exchange-range.
As for exchange-range, FS support would have to very much pre-date any
MySQL port to use it.
Furthermore, I can't imagine that exchange-range support is portable to
other FSes, which is probably quite important. Anyway, they are not
issuing the redo log atomically, so I don't know if mentioning
exchange-range is relevant.
Regardless of what MySQL is specifically doing here, there are going to
be other users/applications which want to keep a 4K FS blocksize and do
larger atomic writes.
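For that case, a user could query the advertised untorn write limits via
statx() and then issue e.g. a 16K write with pwritev2(RWF_ATOMIC).
Rough, untested sketch below; it assumes a kernel/FS with atomic write
support and uapi headers new enough for the stx_atomic_write_unit_*
fields, and the names and sizes are just for illustration:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC                      /* values from recent uapi headers */
#define RWF_ATOMIC              0x00000040
#endif
#ifndef STATX_WRITE_ATOMIC
#define STATX_WRITE_ATOMIC      0x00010000U
#endif

int write_page_atomically(const char *path, const void *page, off_t off)
{
        struct statx stx;
        struct iovec iov;
        void *buf;
        int fd, ret = -1;

        fd = open(path, O_WRONLY | O_DIRECT);
        if (fd < 0)
                return -1;

        if (statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx) ||
            stx.stx_atomic_write_unit_max < 16384) {
                fprintf(stderr, "16K atomic writes not supported here\n");
                goto out;
        }

        /* The write must be a power-of-2 size within the advertised
         * unit limits, with the buffer and file offset naturally
         * aligned to that size. */
        if (posix_memalign(&buf, 16384, 16384))
                goto out;
        memcpy(buf, page, 16384);

        iov.iov_base = buf;
        iov.iov_len = 16384;
        if (pwritev2(fd, &iov, 1, off, RWF_ATOMIC) == 16384)
                ret = 0;

        free(buf);
out:
        close(fd);
        return ret;
}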
Thanks,
John