On Sat, Jul 24, 2021 at 11:45:26AM -0700, Andres Freund wrote:
> On Sat, Jul 24, 2021, at 11:23, James Bottomley wrote:
> > Well, I cut the previous question deliberately, but if you're going to
> > force me to answer, my experience with storage tells me that one test
> > being 10x different from all the others usually indicates a problem
> > with the benchmark test itself rather than a baseline improvement, so
> > I'd wait for more data.
>
> I have a similar reaction - the large improvements are for a read/write pgbench benchmark at a scale that fits in memory. That's typically purely bound by the speed at which the WAL can be synced to disk. As far as I recall mariadb also uses buffered IO for WAL (but there was recent work in the area).
>
> Is there a reason fdatasync() of 16MB files to have got a lot faster? Or a chance that could be broken?
>
> Some improvement for read-only wouldn't surprise me, particularly if the os/pg weren't configured for explicit huge pages. Pgbench has a uniform distribution so it's *very* tlb miss heavy with 4k pages.

It's going to depend substantially on the access pattern. If the 16MB
file (oof, that's tiny!) was read in large chunks, or even in small
chunks but consecutively, the folio changes will allocate larger pages
(16k, 64k, 256k, ...). Theoretically it might get up to 2MB pages and
start using PMDs, but I've never seen that in my testing.

fdatasync() could indeed have got much faster. If we're writing back a
256kB page as a unit, we're handling 64 times less metadata than writing
back 64x4kB pages. We'll track 64x fewer dirty bits. We'll find only
64 dirty pages per 16MB instead of 4096 dirty pages.

It's always possible I just broke something. The xfstests aren't
exhaustive, and no regressions doesn't mean no problems.

Can you guide Michael towards parameters for pgbench that might give an
indication of performance on a more realistic workload that doesn't
entirely fit in memory?
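To make the "larger pages for consecutive reads" point above concrete, here is a
toy model in plain C. It is not the kernel's readahead code; the +2 order ramp,
the 2MB PMD cap (order 9, as on x86-64 with 4k pages) and the 16MB file size are
assumptions for illustration only. It shows how a reader hitting consecutive
offsets ends up cached in the 16k/64k/256k sizes mentioned above rather than in
4k pages:

	/*
	 * Toy model only -- not the actual mm/readahead.c logic.  A reader
	 * that keeps hitting consecutive offsets gets its allocation order
	 * ramped up (4k -> 16k -> 64k -> 256k -> ...), capped at an assumed
	 * 2MB PMD size.  The last allocation may overshoot EOF; a real
	 * implementation clamps to the end of the file.
	 */
	#include <stdio.h>

	#define PAGE_SHIFT	12	/* 4kB base pages */
	#define PMD_ORDER	9	/* 2MB PMD -- assumption for this sketch */

	int main(void)
	{
		unsigned long file_size = 16UL << 20;	/* the 16MB file from above */
		unsigned long pos = 0;
		unsigned int order = 0;

		while (pos < file_size) {
			unsigned long bytes = 1UL << (PAGE_SHIFT + order);

			printf("offset %6lukB: order-%u folio (%4lukB)\n",
			       pos >> 10, order, bytes >> 10);
			pos += bytes;

			/* each consecutive allocation bumps the order, capped at PMD */
			order = (order + 2 < PMD_ORDER) ? order + 2 : PMD_ORDER;
		}
		return 0;
	}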
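And a back-of-the-envelope check of the writeback arithmetic above; the 16MB
segment size comes from the thread, and 4kB vs 256kB is just the example split
used in the paragraph on fdatasync():

	/* Sanity-check the dirty-tracking numbers: a 16MB file cached as
	 * 4kB pages vs 256kB folios. */
	#include <stdio.h>

	int main(void)
	{
		unsigned long file_size = 16UL << 20;	/* 16MB */
		unsigned long page = 4UL << 10;		/* 4kB page */
		unsigned long folio = 256UL << 10;	/* 256kB folio */

		printf("4kB pages per file:    %lu\n", file_size / page);	/* 4096 */
		printf("256kB folios per file: %lu\n", file_size / folio);	/* 64 */
		printf("reduction in dirty-tracking entries: %lux\n",
		       (file_size / page) / (file_size / folio));		/* 64x */
		return 0;
	}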