On Thu, Feb 22, 2024 at 2:23 PM Siddharth Jain <siddhsql@xxxxxxxxx> wrote:
> I understand the storage layer in databases goes to great lengths to ensure:
> - a row does not cross a block boundary
> - read/writes/allocation happen in units of blocks
> etc. The motivation is that at the OS level, it reads and writes pages
> (blocks), not individual bytes. I am only concerned about SSDs but I think
> the principle applies to HDD as well.
>
> but how can we do all this when we are not even guaranteed that the
> beginning of a file will be aligned with a block boundary? refer this.

Interesting question. I was aware of FFS/UFS fragments because I've poked at that code a bit (one of my learn-to-hack-kernels side projects was to add O_DSYNC support), but not some of the others mentioned. I don't think they are a problem for anything PostgreSQL does.

Even with O_DIRECT, FFS (at least in its FreeBSD descendant) just quietly skips its raw fast path and falls back to the usual buffered path if the I/O is not *sector* aligned; tables of any interesting size wouldn't use the fragment feature anyway, so really it's just an internal edge-case space optimisation that we don't have to worry about. Generally, that family of systems interprets O_DIRECT to mean "*try* to bypass buffers" anyway, so there is no way for it to bite us.

On the other hand, it does perform pretty badly if you use logical blocks that are too large: that's why in src/tools/ci/gcp_freebsd_repartition.sh I set up a new partition with 8KB blocks for CI testing (I originally did that in a branch doing a lot more with O_DIRECT, where the slowdown due to block mismatch is worse). I think we just have to call that an administrator's problem to worry about.

I have idly wondered before about some way to set a preferred "grain" size on a tablespace, so that, for example, a parallel seq scan over a very large read-only compressed ZFS/BTRFS/whatever file with 128KB blocks could try to align its allocation of block ranges with those boundaries in mind, just as a sort of performance hint. Or perhaps we could read it directly from st_blksize[1]. Combined with the I/O merging work I'm doing right now, which can build up single 128KB preadv() calls, that would allow parallel workers to line those calls up neatly and avoid accessing the same underlying block from two processes, which might have advantages in terms of ZFS's decompression threads and other kinds of contention (see the PS below for a sketch of what I mean). I haven't looked into the potential benefits of that, though; it's just speculation for now.

For direct I/O, which we're still working on bringing to PostgreSQL in useful form, we had to do this:

commit faeedbcefd40bfdf314e048c425b6d9208896d90
Author: Thomas Munro <tmunro@xxxxxxxxxxxxxx>
Date:   Sat Apr 8 10:38:09 2023 +1200

    Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.

... to avoid EINVAL errors, silent fallback to buffered mode, or pathologically bad performance (depending on the system). It probably also helps buffered I/O performance a tiny bit when not using huge pages, by touching the minimum number of memory pages.
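To make the alignment rules concrete, here is a stand-alone sketch of a direct I/O read from user space (typical Linux; the file path is made up, and the macros just mirror the values discussed here, so don't mistake this for code from the tree):

    /*
     * Stand-alone sketch of direct I/O alignment (typical Linux; other
     * systems vary).  Not PostgreSQL code: the file name is made up and
     * the macros just mirror the values discussed in this thread.
     */
    #define _GNU_SOURCE             /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define PG_IO_ALIGN_SIZE 4096   /* buffer/offset/length alignment */
    #define BLCKSZ 8192             /* PostgreSQL's default block size */

    int
    main(void)
    {
        void       *buffer;
        int         fd;
        ssize_t     nread;

        /* The buffer's *address* must be aligned... */
        if (posix_memalign(&buffer, PG_IO_ALIGN_SIZE, BLCKSZ) != 0)
        {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }

        fd = open("base/12345/16384", O_RDONLY | O_DIRECT);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        /*
         * ... and so must the offset and the length, or you get EINVAL
         * (or a silent fall back to buffered I/O, depending on the system).
         */
        nread = pread(fd, buffer, BLCKSZ, (off_t) 3 * BLCKSZ);
        if (nread < 0)
            perror("pread");
        else
            printf("read %zd bytes\n", nread);

        close(fd);
        free(buffer);
        return 0;
    }

Swap the posix_memalign() for a plain malloc() and the same pread() will typically fail with EINVAL on systems that enforce buffer alignment, which is exactly the failure mode that commit avoids.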
There is no industry standard telling us exactly what O_DIRECT requires, but we make sure that addresses, offsets and lengths are all aligned to our macro PG_IO_ALIGN_SIZE, which is 4KB, because that is enough to satisfy all the systems we are aware of today (on some systems the true offset/length requirement is based on real or historical sectors, either 4KB or 512 bytes, which may be smaller than the file system block size, while on others it is the file system block size, usually some power-of-two multiple of those numbers). Since you can theoretically set PostgreSQL's data or WAL block size to less than that at compile time, we just don't let you turn direct I/O on in that case (perhaps we should, in case it does work, and just let it error out if it's going to? It's early days...). If the file system's block size is larger than that, you may have some issues, though, depending on the file system.

Nobody standardised this stuff, and we certainly can't guarantee that anything we come up with will work on future systems; if someone invents 8KB sectors, for example, it may need some work. That's OK, because direct I/O will likely always be a non-default optional mode, and many systems don't support it at all today.

> Further, I don't see any APIs exposing I/O operations in terms of blocks.
> All File I/O APIs I see expose a file as a randomly accessible contiguous
> byte buffer. Would it not have been easier if there were APIs that exposed
> I/O operations in terms of blocks?

There are/were operating systems with block- or record-oriented I/O (eg IBM ones), but Unix was about minimalism and simplicity, and explicitly rejected that idea (see the first bullet point in [2]). Buffered, byte-oriented, synchronous I/O suited most applications. But yeah, not so much databases. Dr Stonebraker, who started this project, wrote a bunch of papers and articles about the mismatch between Unix file systems and database needs (back in those days there wasn't even an fsync(), and the system buffer cache was basically the enemy).

[1] https://pubs.opengroup.org/onlinepubs/009696699/basedefs/sys/stat.h.html
[2] https://www.bell-labs.com/usr/dmr/www/retro.pdf
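PS: For what it's worth, here is a rough stand-alone illustration of the st_blksize/preadv() idea I was hand-waving about above. It's a sketch under made-up assumptions (the file name, the 128KB grain), not a proposal for how it would actually be wired into the tree:

    /*
     * Illustration only: ask the file system for its preferred I/O size
     * with fstat(), then merge a run of 8KB block reads into a single
     * preadv() of that size.  The file name is made up; a real version
     * would live somewhere in the buffer manager / read stream code.
     */
    #define _DEFAULT_SOURCE         /* for preadv() on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/stat.h>
    #include <sys/uio.h>

    #define BLCKSZ 8192
    #define MAX_MERGE 16            /* 16 * 8KB = 128KB */

    int
    main(void)
    {
        struct stat st;
        struct iovec iov[MAX_MERGE];
        char       *buffers;
        int         fd,
                    nblocks,
                    i;
        ssize_t     nread;

        fd = open("base/12345/16384", O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0)
        {
            perror("open/fstat");
            return 1;
        }

        /* e.g. 131072 on a ZFS dataset with recordsize=128k */
        printf("st_blksize = %ld\n", (long) st.st_blksize);

        /* Read one underlying block's worth of 8KB blocks in one call. */
        nblocks = (int) (st.st_blksize / BLCKSZ);
        if (nblocks < 1)
            nblocks = 1;
        if (nblocks > MAX_MERGE)
            nblocks = MAX_MERGE;
        buffers = malloc((size_t) nblocks * BLCKSZ);
        for (i = 0; i < nblocks; i++)
        {
            iov[i].iov_base = buffers + i * BLCKSZ;
            iov[i].iov_len = BLCKSZ;
        }

        /*
         * Starting at offset 0 keeps the call aligned with the grain;
         * parallel workers would each take a different grain-sized range.
         */
        nread = preadv(fd, iov, nblocks, 0);
        printf("read %zd bytes in one system call\n", nread);

        close(fd);
        free(buffers);
        return 0;
    }

The interesting part for parallel query would be handing out those grain-sized ranges so that no two workers ever touch the same underlying compressed block, but as I said, that's just speculation at this point.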