Hi,

On Sat, Jul 24, 2021, at 11:23, James Bottomley wrote:
> On Sat, 2021-07-24 at 19:14 +0100, Matthew Wilcox wrote:
> > On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> > > On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > > > What blows me away is the 80% performance improvement for
> > > > PostgreSQL. I know they use the page cache extensively, so it's
> > > > plausibly real. I'm a bit surprised that it has such good
> > > > locality, and the size of the win far exceeds my
> > > > expectations. We should probably dive into it and figure out
> > > > exactly what's going on.
> > >
> > > Since none of the other tested databases showed more than a 3%
> > > improvement, this looks like an anomalous result specific to
> > > something in postgres ... although the next biggest db: mariadb
> > > wasn't part of the tests so I'm not sure that's
> > > definitive. Perhaps the next step should be to test mariadb?
> > > Since they're fairly similar in domain (both full
> > > SQL) if mariadb shows this type of improvement, you can
> > > safely assume it's something in the way SQL databases handle paging
> > > and if it doesn't, it's likely fixing a postgres inefficiency.
> >
> > I think the thing that's specific to PostgreSQL is that it's a heavy
> > user of the page cache. My understanding is that most databases use
> > direct IO and manage their own page cache, while PostgreSQL trusts
> > the kernel to get it right.
>
> That's testable with mariadb, at least for the innodb engine since the
> flush_method is settable.
>
> > Regardless of whether postgres is "doing something wrong" or not,
> > do you not think that an 80% performance win would exert a certain
> > amount of pressure on distros to do the backport?
>
> Well, I cut the previous question deliberately, but if you're going to
> force me to answer, my experience with storage tells me that one test
> being 10x different from all the others usually indicates a problem
> with the benchmark test itself rather than a baseline improvement, so
> I'd wait for more data.

I have a similar reaction - the large improvements are for a read/write
pgbench benchmark at a scale that fits in memory. That's typically
purely bound by the speed at which the WAL can be synced to disk. As far
as I recall mariadb also uses buffered IO for WAL (but there was recent
work in the area).

Is there a reason for fdatasync() of 16MB files to have gotten a lot
faster? Or a chance that it could be broken?

Some improvement for read-only wouldn't surprise me, particularly if the
os/pg weren't configured for explicit huge pages. Pgbench has a uniform
access distribution, so it's *very* TLB-miss heavy with 4k pages.

Regards,

Andres
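
One rough way to check the fdatasync() question in isolation would be to time small writes followed by fdatasync() on a preallocated 16MB file, on both kernels. The sketch below is only an illustration under assumptions: the 8kB record size, the commit count, and the function name `time_fdatasync` are hypothetical choices, and it does not reproduce how PostgreSQL actually writes its WAL.

```python
import os
import tempfile
import time

SEGMENT_SIZE = 16 * 1024 * 1024  # 16MB, the file size mentioned above
RECORD_SIZE = 8192               # assumed write size per simulated commit


def time_fdatasync(path, commits=200, segment_size=SEGMENT_SIZE,
                   record_size=RECORD_SIZE):
    """Return the latency in seconds of each write + fdatasync pair."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Preallocate the file with zeros so the timed loop only
        # overwrites existing blocks and never extends the file.
        os.ftruncate(fd, 0)
        chunk = b"\0" * (1024 * 1024)
        written = 0
        while written < segment_size:
            n = min(len(chunk), segment_size - written)
            written += os.write(fd, chunk[:n])
        os.fsync(fd)

        record = b"x" * record_size
        latencies = []
        offset = 0
        for _ in range(commits):
            if offset + record_size > segment_size:
                offset = 0  # wrap around instead of growing the file
            start = time.perf_counter()
            os.pwrite(fd, record, offset)
            os.fdatasync(fd)
            latencies.append(time.perf_counter() - start)
            offset += record_size
        return latencies
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        lat = sorted(time_fdatasync(os.path.join(tmpdir, "segment")))
        print(f"median {lat[len(lat) // 2] * 1e6:.0f} us, "
              f"max {lat[-1] * 1e6:.0f} us")
```

The numbers only mean something if the file lives on the storage under test. On tmpfs, fdatasync() has essentially nothing to flush, so point the path at the real data directory's filesystem. If the patched kernel reports dramatically lower latencies, it would be worth checking that the data is really reaching stable storage.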