On Tue, Jul 11, 2023 at 03:49:11PM -0700, Andres Freund wrote:
> Hi,
>
> On 2023-06-27 11:12:01 -0500, Eric Sandeen wrote:
> > On 6/27/23 10:50 AM, Masahiko Sawada wrote:
> > > On Tue, Jun 27, 2023 at 12:32 AM Eric Sandeen <sandeen@xxxxxxxxxxx> wrote:
> > > >
> > > > On 6/25/23 10:17 PM, Masahiko Sawada wrote:
> > > > > FYI, to share the background of what PostgreSQL does, when
> > > > > bulk-insertions into one table are running concurrently, one process
> > > > > extends the underlying files depending on how many concurrent
> > > > > processes are waiting to extend. The more processes wait, the more 8kB
> > > > > blocks are appended. In the current implementation, if the process
> > > > > needs to extend the table by more than 8 blocks (i.e. 64kB) it uses
> > > > > posix_fallocate(), otherwise it uses pwrite() (see the code[1] for
> > > > > details). We don't use fallocate() for small extensions as it's slow
> > > > > on some filesystems. Therefore, if a bulk-insertion process tries to
> > > > > extend the table by say 5~10 blocks many times, it could use
> > > > > posix_fallocate() and pwrite() alternately, which led to the slow
> > > > > performance as I reported.
> > > >
> > > > To what end? What problem is PostgreSQL trying to solve with this
> > > > scheme? I might be missing something but it seems like you've described
> > > > the "what" in detail, but no "why."
> > >
> > > It's for better scalability. Since the process that wants to extend the
> > > table needs to hold an exclusive lock on the table, we need to
> > > minimize the work while holding the lock.
> >
> > Ok, but what is the reason for zeroing out the blocks prior to them being
> > written with real data? I'm wondering what the core requirement here is for
> > the zeroing, either via fallocate (which btw posix_fallocate does not
> > guarantee) or pwrites of zeros.
>
> The goal is to avoid ENOSPC at a later time. We do this before filling our own
> in-memory buffer pool with pages containing new contents. If we have dirty
> pages in our buffer that we can't write out due to ENOSPC, we're in trouble,
> because we can't checkpoint. Which typically will make the ENOSPC situation
> worse, because we also can't remove WAL / journal files without the checkpoint
> having succeeded. Of course a successful fallocate() / pwrite() doesn't
> guarantee that much on a COW filesystem, but there's not much we can do about
> that, to my knowledge.

Yup, which means you're screwed on XFS, ZFS and btrfs right now, and
also bcachefs when people start using it.

> Using fallocate() for small extensions is problematic because it a) causes
> fragmentation b) disables delayed allocation, using pwrite() is also bad
> because the kernel will have to write out those dirty pages full of zeroes -
> very often we won't write out the page with "real content" before the kernel
> decides to do so.

Yes, that's why we allow fallocate() to preallocate space that extends
beyond the current EOF, i.e. for optimising layouts on append-based
workloads.

posix_fallocate() does not allow that - it forces file size extension,
whilst a raw fallocate(FALLOC_FL_KEEP_SIZE) call will allow preallocation
anywhere beyond EOF without changing the file size.

IOWs, with FALLOC_FL_KEEP_SIZE you don't have to initialise buffer space
in memory to cover the preallocated space until you actually need to
extend the file and write to it. i.e. use fallocate(FALLOC_FL_KEEP_SIZE)
to preallocate chunks megabytes beyond the current EOF and then grow into
them with normal extending pwrite() calls. When that preallocated space
is used up, preallocate another large chunk beyond EOF and continue
onwards extending the file with your small write()s...
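A minimal sketch of that pattern in C (assuming Linux/glibc; the
append_block() helper, the 16MB chunk size and the single-writer
bookkeeping are illustrative only, not taken from this thread):

    #define _GNU_SOURCE             /* fallocate(), FALLOC_FL_KEEP_SIZE */
    #include <fcntl.h>
    #include <unistd.h>
    #include <errno.h>

    #define PREALLOC_CHUNK  (16 * 1024 * 1024)  /* illustrative: 16MB beyond EOF */
    #define BLOCK_SIZE      8192                /* PostgreSQL block size */

    static off_t prealloc_end;  /* end of the region already preallocated */

    /* Append one block at 'eof', keeping a large reserved region beyond EOF.
     * Simplified: assumes a single writer tracking prealloc_end. */
    static int append_block(int fd, off_t eof, const char *block)
    {
            /* Top up the preallocation once the file has grown into it. */
            if (eof + BLOCK_SIZE > prealloc_end) {
                    /* KEEP_SIZE: reserve space beyond EOF, don't change i_size. */
                    if (fallocate(fd, FALLOC_FL_KEEP_SIZE, eof, PREALLOC_CHUNK) == 0)
                            prealloc_end = eof + PREALLOC_CHUNK;
                    else if (errno != EOPNOTSUPP && errno != ENOSYS)
                            return -1;  /* e.g. ENOSPC: report it now, not later */
            }

            /* Normal extending write; the file grows into the reserved space. */
            if (pwrite(fd, block, BLOCK_SIZE, eof) != BLOCK_SIZE)
                    return -1;
            return 0;
    }

On filesystems that don't support fallocate() the pwrite() path above
still works; it just loses the early-ENOSPC guarantee.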
> Hence using a heuristic to choose between the two. I think all that's needed
> here is a bit of tuning of the heuristic, possibly adding some "history"
> awareness.

No heuristics needed: just use FALLOC_FL_KEEP_SIZE and preallocate large
chunks beyond EOF each time. It works for both cases equally well, which
results in less code and is easier to understand.

AFAIC, nobody should ever use posix_fallocate() - it's impossible to know
what it is doing under the covers, or even know when it fails to provide
you with any guarantee at all (e.g. COW files).

> If we could opt into delayed allocation while avoiding ENOSPC for a certain
> length, it'd be perfect, but I don't think that's possible today?

Nope. Not desirable, either, because we currently need to have dirty data
in the page cache over delalloc regions.

> We're also working on using DIO FWIW, where using fallocate() is just about
> mandatory...

No, no it isn't. fallocate() is even more important to avoid with DIO than
with buffered IO because fallocate() completely serialises *all* IO to the
file. That's the last thing you want with DIO, given that the only reason
for using DIO is to maximise IO concurrency and minimise IO latency to
individual files.

If you want to minimise fragmentation with DIO workloads, then you should
be using extent size hints of an appropriate size. That will align and
size extents to the hint regardless of fallocate/write ranges, hence this
controls worst case fragmentation effectively.

If you want ENOSPC guarantees for future writes, then large, infrequent
fallocate(FALLOC_FL_KEEP_SIZE) calls should be used. Do not use this as an
anti-fragmentation mechanism - that's what extent size hints are for. Use
fallocate() as *little as possible*.

In my experience, fine grained management of file space by userspace
applications via fallocate() is nothing but a recipe for awful
performance, highly variable IO latency, bad file fragmentation, and poor
filesystem aging characteristics. Just don't do it.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
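To illustrate the extent size hint suggestion above: a minimal sketch
using the generic FS_IOC_FSSETXATTR interface (assuming Linux; the 16MB
value and the set_extsize_hint() helper name are illustrative, not
anything recommended in the mail):

    #include <sys/ioctl.h>
    #include <linux/fs.h>   /* struct fsxattr, FS_IOC_FS[GS]ETXATTR, FS_XFLAG_EXTSIZE */

    /*
     * Set an extent size hint on a file, roughly what
     * "xfs_io -c 'extsize 16m' <file>" does. Allocations are then aligned
     * and sized in extsize_bytes units regardless of how small the
     * individual writes are. Set it on a newly created file, before any
     * extents have been allocated.
     */
    static int set_extsize_hint(int fd, unsigned int extsize_bytes)
    {
            struct fsxattr fsx;

            if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
                    return -1;
            fsx.fsx_extsize = extsize_bytes;  /* bytes, multiple of fs block size */
            fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
            return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
    }

e.g. set_extsize_hint(fd, 16 * 1024 * 1024) right after creating the file,
instead of trying to control layout with per-write fallocate() calls.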