On Thu, Feb 22, 2024 at 2:23 PM Siddharth Jain <siddhsql@xxxxxxxxx> wrote:
> I understand the storage layer in databases goes to great lengths to ensure:
> - a row does not cross a block boundary
> - read/writes/allocation happen in units of blocks
> etc. The motivation is that at the OS level, it reads and writes pages
> (blocks), not individual bytes. I am only concerned about SSDs but I think
> the principle applies to HDD as well.
>
> but how can we do all this when we are not even guaranteed that the
> beginning of a file will be aligned with a block boundary? refer this.

Interesting question. I was aware of FFS/UFS fragments because I've poked at that code a bit (one of my learn-to-hack-kernels side projects was to add O_DSYNC support), but not some of the others mentioned. I don't think they are a problem for anything PostgreSQL does.

Even with O_DIRECT, FFS (at least in its FreeBSD descendant) just quietly skips its raw fast path and falls back to the usual buffered path if the I/O is not *sector* aligned; tables of any interesting size wouldn't use the fragment feature anyway, so really it's just an internal edge-case space optimisation that we don't have to worry about. Generally, that family of systems interprets O_DIRECT to mean "*try* to bypass buffers" anyway, so there is no way for it to bite us.

On the other hand, it does perform pretty badly if you use logical blocks that are too large: that's why in src/tools/ci/gcp_freebsd_repartition.sh I set up a new partition with 8KB blocks for CI testing (I originally did that in a branch doing a lot more with O_DIRECT, where the slowdown due to block mismatch is worse). I think we just have to call that an administrator's problem to worry about.

I have idly wondered before about some way to set a preferred "grain" size on a tablespace, so that, for example, a parallel seq scan over a very large read-only compressed ZFS/BTRFS/whatever file with 128KB blocks could try to align its allocation of block ranges with those boundaries in mind, just as a sort of performance hint. Or perhaps we could read it directly from st_blksize[1]. Combined with the I/O merging work I'm doing right now, which can build up single 128KB preadv() calls, that would allow parallel workers to line those calls up neatly and avoid accessing the same underlying block from two processes, which might have advantages in terms of ZFS's decompression threads and other kinds of contention (see the PS below for a sketch of what I mean). I haven't looked into the potential benefits of that, though; it's just speculation for now.

For direct I/O, which we're still working on bringing to PostgreSQL in useful form, we had to do this:

commit faeedbcefd40bfdf314e048c425b6d9208896d90
Author: Thomas Munro <tmunro@xxxxxxxxxxxxxx>
Date:   Sat Apr 8 10:38:09 2023 +1200

    Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.

... to avoid EINVAL errors, silent fallback to buffered mode, or pathologically bad performance (depending on the system). It probably also helps buffered I/O performance a tiny bit when not using huge pages, by touching the minimum number of memory pages.
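To make the alignment rules concrete, here is a stand-alone sketch of a direct I/O read from user space (typical Linux; the file path is made up, and the macros just mirror the values discussed here, so don't mistake this for code from the tree):

    /*
     * Stand-alone sketch of direct I/O alignment (typical Linux; other
     * systems vary).  Not PostgreSQL code: the file name is made up and
     * the macros just mirror the values discussed in this thread.
     */
    #define _GNU_SOURCE             /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define PG_IO_ALIGN_SIZE 4096   /* buffer/offset/length alignment */
    #define BLCKSZ 8192             /* PostgreSQL's default block size */

    int
    main(void)
    {
        void       *buffer;
        int         fd;
        ssize_t     nread;

        /* The buffer's *address* must be aligned... */
        if (posix_memalign(&buffer, PG_IO_ALIGN_SIZE, BLCKSZ) != 0)
        {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }

        fd = open("base/12345/16384", O_RDONLY | O_DIRECT);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        /*
         * ... and so must the offset and the length, or you get EINVAL
         * (or a silent fall back to buffered I/O, depending on the system).
         */
        nread = pread(fd, buffer, BLCKSZ, (off_t) 3 * BLCKSZ);
        if (nread < 0)
            perror("pread");
        else
            printf("read %zd bytes\n", nread);

        close(fd);
        free(buffer);
        return 0;
    }

Swap the posix_memalign() for a plain malloc() and the same pread() will typically fail with EINVAL on systems that enforce buffer alignment, which is exactly the failure mode that commit avoids.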
There is no industry standard telling us exactly what O_DIRECT requires, but we make sure that addresses, offsets and lengths are all aligned to our macro PG_IO_ALIGN_SIZE, which is 4KB, because that is enough to satisfy all the systems we are aware of today (on some systems the true offset/length requirement is based on real or historical sectors, either 4KB or 512 bytes, which may be smaller than the file system block size, while on others it is the file system block size, usually some power-of-two multiple of those numbers). Since you can theoretically set PostgreSQL's data or WAL block size to less than that at compile time, we just don't let you turn direct I/O on in that case (perhaps we should, in case it does work, and just let it error out if it's going to? It's early days...). If the file system's block size is larger than that, you may have some issues, though, depending on the file system.

Nobody standardised this stuff, and we certainly can't guarantee that anything we come up with will work on future systems; if someone invents 8KB sectors, for example, it may need some work. That's OK, because direct I/O will likely always be a non-default optional mode, and many systems don't support it at all today.

> Further, I don't see any APIs exposing I/O operations in terms of blocks.
> All File I/O APIs I see expose a file as a randomly accessible contiguous
> byte buffer. Would it not have been easier if there were APIs that exposed
> I/O operations in terms of blocks?

There are/were operating systems with block- or record-oriented I/O (eg IBM ones), but Unix was about minimalism and simplicity, and explicitly rejected that idea (see the first bullet point in [2]). Buffered, byte-oriented, synchronous I/O suited most applications. But yeah, not so much databases. Dr Stonebraker, who started this project, wrote a bunch of papers and articles about the mismatch between Unix file systems and database needs (back in those days there wasn't even an fsync(), and the system buffer cache was basically the enemy).

[1] https://pubs.opengroup.org/onlinepubs/009696699/basedefs/sys/stat.h.html
[2] https://www.bell-labs.com/usr/dmr/www/retro.pdf
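PS: For what it's worth, here is a rough stand-alone illustration of the st_blksize/preadv() idea I was hand-waving about above. It's a sketch under made-up assumptions (the file name, the 128KB grain), not a proposal for how it would actually be wired into the tree:

    /*
     * Illustration only: ask the file system for its preferred I/O size
     * with fstat(), then merge a run of 8KB block reads into a single
     * preadv() of that size.  The file name is made up; a real version
     * would live somewhere in the buffer manager / read stream code.
     */
    #define _DEFAULT_SOURCE         /* for preadv() on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/stat.h>
    #include <sys/uio.h>

    #define BLCKSZ 8192
    #define MAX_MERGE 16            /* 16 * 8KB = 128KB */

    int
    main(void)
    {
        struct stat st;
        struct iovec iov[MAX_MERGE];
        char       *buffers;
        int         fd,
                    nblocks,
                    i;
        ssize_t     nread;

        fd = open("base/12345/16384", O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0)
        {
            perror("open/fstat");
            return 1;
        }

        /* e.g. 131072 on a ZFS dataset with recordsize=128k */
        printf("st_blksize = %ld\n", (long) st.st_blksize);

        /* Read one underlying block's worth of 8KB blocks in one call. */
        nblocks = (int) (st.st_blksize / BLCKSZ);
        if (nblocks < 1)
            nblocks = 1;
        if (nblocks > MAX_MERGE)
            nblocks = MAX_MERGE;
        buffers = malloc((size_t) nblocks * BLCKSZ);
        for (i = 0; i < nblocks; i++)
        {
            iov[i].iov_base = buffers + i * BLCKSZ;
            iov[i].iov_len = BLCKSZ;
        }

        /*
         * Starting at offset 0 keeps the call aligned with the grain;
         * parallel workers would each take a different grain-sized range.
         */
        nread = preadv(fd, iov, nblocks, 0);
        printf("read %zd bytes in one system call\n", nread);

        close(fd);
        free(buffers);
        return 0;
    }

The interesting part for parallel query would be handing out those grain-sized ranges so that no two workers ever touch the same underlying compressed block, but as I said, that's just speculation at this point.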