Hi,

On 2019-06-18 17:13:20 +0100, Fabio Ugo Venchiarutti wrote:
> Does the backend mmap() data files when that's possible?

No. That doesn't allow us to control when data is written back to disk,
which is crucial for durability/consistency.


> I've heard the "use the page cache" suggestion before, from users and
> hackers alike, but I never quite heard a solid argument dismissing potential
> overhead-related ill effects of the seek() & read() syscalls if they're
> needed, especially on many random page fetches.

We don't issue seek() for reads anymore in 12; instead we do a pread()
(but that's not a particularly meaningful performance improvement). The
read obviously has a cost, especially with syscalls getting more and more
expensive due to the mitigations for Intel CPU vulnerabilities.

I'd say bigger factors than the overhead of the read itself are that for
many workloads we'll e.g. incur additional writes when s_b (shared_buffers)
is smaller, that the kernel has less information about when to discard
data, that the kernel pagecache has some scalability issues (partially due
to its generality), and double buffering.


> Given that shmem-based shared_buffers are bound to be mapped into the
> backend's address space anyway, why isn't that considered always
> preferable/cheaper?

See e.g. my point in my previous email in this thread about drop/truncate.


> I'm aware that there are other benefits in counting on the page cache (eg:
> staying hot in the face of a backend restart), however I'm considering
> performance in steady state here.

There's also the issue that a large shared_buffers setting means that each
process' page table gets bigger, unless you configure huge_pages. Which one
definitely should - but that's an additional configuration step that
requires superuser access on most operating systems.

Greetings,

Andres Freund
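
PS: For illustration, here's a rough sketch of the syscall pattern being
compared above. This is not PostgreSQL source and the function names are
made up; it just contrasts the pre-12 lseek()+read() sequence with the
single pread() call used from 12 on:

    #include <unistd.h>

    #define BLCKSZ 8192                 /* PostgreSQL's default block size */

    /* pre-12 style: two syscalls per random block read */
    static ssize_t
    read_block_lseek(int fd, off_t blocknum, char *buf)
    {
        if (lseek(fd, blocknum * BLCKSZ, SEEK_SET) < 0)
            return -1;
        return read(fd, buf, BLCKSZ);
    }

    /* 12 and later style: one syscall, offset passed directly */
    static ssize_t
    read_block_pread(int fd, off_t blocknum, char *buf)
    {
        return pread(fd, buf, BLCKSZ, blocknum * BLCKSZ);
    }

Either way, a buffer miss still pays for the syscall plus a copy out of the
kernel page cache, which is part of why the savings from dropping the
separate seek aren't dramatic on their own.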
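
PS: Since the huge page step came up: on Linux with 2MB huge pages the
configuration looks roughly like the below. The numbers are purely
illustrative - the reserved page count has to cover shared_buffers plus the
rest of the postmaster's shared memory, and other operating systems have
their own knobs:

    # as root: reserve enough 2MB huge pages for ~8GB of shared memory
    sysctl -w vm.nr_hugepages=4200

    # postgresql.conf
    shared_buffers = 8GB
    huge_pages = on    # refuse to start if huge pages can't be allocated

The sysctl (or its equivalent) is the part that needs superuser access,
which is what makes this an extra deployment step rather than something
PostgreSQL can just do for you.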