Hi,

On Sat, Jul 24, 2021, at 12:01, Matthew Wilcox wrote:
> On Sat, Jul 24, 2021 at 11:45:26AM -0700, Andres Freund wrote:
> > On Sat, Jul 24, 2021, at 11:23, James Bottomley wrote:
> > > Well, I cut the previous question deliberately, but if you're going to
> > > force me to answer, my experience with storage tells me that one test
> > > being 10x different from all the others usually indicates a problem
> > > with the benchmark test itself rather than a baseline improvement, so
> > > I'd wait for more data.
> >
> > I have a similar reaction - the large improvements are for a read/write
> > pgbench benchmark at a scale that fits in memory. That's typically
> > purely bound by the speed at which the WAL can be synced to disk. As
> > far as I recall, MariaDB also uses buffered IO for its WAL (but there
> > was recent work in that area).
> >
> > Is there a reason fdatasync() of 16MB files would have gotten a lot
> > faster? Or a chance that it could be broken?
> >
> > Some improvement for read-only wouldn't surprise me, particularly if
> > the OS/PG weren't configured for explicit huge pages. Pgbench uses a
> > uniform distribution, so it's *very* TLB-miss heavy with 4k pages.
>
> It's going to depend substantially on the access pattern. If the 16MB
> file (oof, that's tiny!) was read in in large chunks or even in small
> chunks, but consecutively, the folio changes will allocate larger pages
> (16k, 64k, 256k, ...). Theoretically it might get up to 2MB pages and
> start using PMDs, but I've never seen that in my testing.

The 16MB files are just for the WAL/journal, and are write-only in a
benchmark like this. With pgbench it'll be written in small consecutive
chunks (a few pages at a time, for each group commit). Each page is
written only once, until after a checkpoint the entire file is "recycled"
(renamed into the future of the WAL stream) and reused from the start.

The data files are 1GB.

> fdatasync() could indeed have got much faster. If we're writing back a
> 256kB page as a unit, we're handling 64 times less metadata than writing
> back 64x4kB pages. We'll track 64x less dirty bits. We'll find only
> 64 dirty pages per 16MB instead of 4096 dirty pages.

The dirty writes will be 8-32kB or so in this workload - the constant
commits require the WAL to be flushed constantly.

> It's always possible I just broke something. The xfstests aren't
> exhaustive, and no regressions doesn't mean no problems.
>
> Can you guide Michael towards parameters for pgbench that might give
> an indication of performance on a more realistic workload that doesn't
> entirely fit in memory?

Fitting in memory isn't bad - that covers a large part of real workloads.
It just makes the performance improvement hard to believe, given that we
expect to be bound by disk sync speed...

Michael, where do I find more details about the configuration used during
the run?

Regards,

Andres
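
P.S. To make the fdatasync() question testable independently of pgbench,
here's a rough standalone sketch of the WAL commit I/O pattern described
above: sequential small overwrites of a preallocated 16MB segment, each
followed by fdatasync(). This is not PostgreSQL code; the segment size,
write size, and filename are just illustrative.

/*
 * Minimal sketch of the WAL commit I/O pattern: sequential ~16kB
 * overwrites of a preallocated 16MB segment, fdatasync() after each.
 * Not PostgreSQL code; sizes and the filename are arbitrary.
 *
 * Build: gcc -O2 -o walsync walsync.c
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define SEGMENT_SIZE (16 * 1024 * 1024) /* one WAL segment */
#define WRITE_SIZE   (16 * 1024)        /* rough group-commit write size */

int main(void)
{
    char *buf;
    int fd;
    off_t off;
    long flushes = 0;
    struct timespec t0, t1;

    if (posix_memalign((void **) &buf, 4096, WRITE_SIZE) != 0)
        return 1;
    memset(buf, 'x', WRITE_SIZE);

    fd = open("walsync.seg", O_CREAT | O_WRONLY, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* Pre-fill and sync the segment so the timed loop overwrites blocks
     * that are already allocated, mimicking a recycled WAL segment. */
    for (off = 0; off + WRITE_SIZE <= SEGMENT_SIZE; off += WRITE_SIZE)
        if (pwrite(fd, buf, WRITE_SIZE, off) != WRITE_SIZE)
        { perror("pwrite"); return 1; }
    if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (off = 0; off + WRITE_SIZE <= SEGMENT_SIZE; off += WRITE_SIZE)
    {
        if (pwrite(fd, buf, WRITE_SIZE, off) != WRITE_SIZE)
        { perror("pwrite"); return 1; }
        if (fdatasync(fd) < 0)
        { perror("fdatasync"); return 1; }
        flushes++;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("%ld flushes in %.3f s\n", flushes,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}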