Hi,

On 2023-06-27 11:12:01 -0500, Eric Sandeen wrote:
> On 6/27/23 10:50 AM, Masahiko Sawada wrote:
> > On Tue, Jun 27, 2023 at 12:32 AM Eric Sandeen <sandeen@xxxxxxxxxxx> wrote:
> > >
> > > On 6/25/23 10:17 PM, Masahiko Sawada wrote:
> > > > FYI, to share the background of what PostgreSQL does, when
> > > > bulk-insertions into one table are running concurrently, one process
> > > > extends the underlying files depending on how many concurrent
> > > > processes are waiting to extend. The more processes wait, the more 8kB
> > > > blocks are appended. As the current implementation, if the process
> > > > needs to extend the table by more than 8 blocks (i.e. 64kB) it uses
> > > > posix_fallocate(), otherwise it uses pwrites() (see the code[1] for
> > > > details). We don't use fallocate() for small extensions as it's slow
> > > > on some filesystems. Therefore, if a bulk-insertion process tries to
> > > > extend the table by say 5~10 blocks many times, it could use
> > > > posix_fallocate() and pwrite() alternatively, which led to the slow
> > > > performance as I reported.
> > >
> > > To what end? What problem is PostgreSQL trying to solve with this
> > > scheme? I might be missing something but it seems like you've described
> > > the "what" in detail, but no "why."
> >
> > It's for better scalability. Since the process who wants to extend the
> > table needs to hold an exclusive lock on the table, we need to
> > minimize the work while holding the lock.
>
> Ok, but what is the reason for zeroing out the blocks prior to them being
> written with real data? I'm wondering what the core requirement here is for
> the zeroing, either via fallocate (which btw posix_fallocate does not
> guarantee) or pwrites of zeros.

The goal is to avoid ENOSPC at a later time. We do this before filling our own
in-memory buffer pool with pages containing new contents.
If we have dirty pages in our buffer pool that we can't write out due to
ENOSPC, we're in trouble, because we can't checkpoint. That typically makes
the ENOSPC situation worse, because we also can't remove WAL / journal files
without the checkpoint having succeeded. Of course a successful fallocate() /
pwrite() doesn't guarantee that much on a COW filesystem, but there's not much
we can do about that, to my knowledge.

Using fallocate() for small extensions is problematic because it a) causes
fragmentation b) disables delayed allocation. Using pwrite() is also bad,
because the kernel will have to write out those dirty pages full of zeroes -
very often we won't write out the page with "real content" before the kernel
decides to do so.

Hence using a heuristic to choose between the two. I think all that's needed
here is a bit of tuning of the heuristic, possibly adding some "history"
awareness.

If we could opt into delayed allocation while avoiding ENOSPC for a certain
length, it'd be perfect, but I don't think that's possible today?

We're also working on using DIO FWIW, where using fallocate() is just about
mandatory...

Greetings,

Andres Freund
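
For readers following the thread, the extension heuristic described in the
quoted text (more than 8 blocks: posix_fallocate(), otherwise pwrite() of
zeroes) might look roughly like the sketch below. It is an illustration under
stated assumptions, not PostgreSQL's actual code: extend_file(), BLOCK_SIZE and
FALLOCATE_THRESHOLD are hypothetical names, and the threshold of 8 is simply
the value mentioned above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192        /* 8kB blocks, as described in the thread */
#define FALLOCATE_THRESHOLD 8  /* assumption: "more than 8 blocks" from the thread */

/*
 * Hypothetical helper: reserve space for nblocks new blocks starting at
 * old_size, so that a later write of real data cannot fail with ENOSPC
 * (modulo COW filesystems). Returns 0 on success or an errno value.
 */
static int
extend_file(int fd, off_t old_size, int nblocks)
{
    static const char zeroes[BLOCK_SIZE];  /* zero-initialized buffer */

    assert(nblocks > 0);

    if (nblocks > FALLOCATE_THRESHOLD)
    {
        /* Large extension: one allocation call, no dirty zero pages. */
        int rc;

        do
        {
            /* posix_fallocate() returns the error number; errno is not set. */
            rc = posix_fallocate(fd, old_size, (off_t) nblocks * BLOCK_SIZE);
        } while (rc == EINTR);
        return rc;
    }

    /* Small extension: write zero-filled blocks, keeping delayed allocation. */
    for (int i = 0; i < nblocks; i++)
    {
        off_t  offset = old_size + (off_t) i * BLOCK_SIZE;
        size_t done = 0;

        while (done < BLOCK_SIZE)
        {
            ssize_t n = pwrite(fd, zeroes + done, BLOCK_SIZE - done,
                               offset + (off_t) done);

            if (n < 0)
            {
                if (errno == EINTR)
                    continue;   /* interrupted before writing; retry */
                return errno;   /* e.g. ENOSPC */
            }
            done += (size_t) n;
        }
    }
    return 0;
}
```

With a fixed threshold like this, a workload that keeps asking for 5~10 blocks
flips between the two branches from call to call, which is the alternating
pattern reported earlier in the thread. Any "history" awareness would have to
be layered on top of this sketch.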