Eliot Gable wrote:
Just curious if this would apply to PostgreSQL: http://queue.acm.org/detail.cfm?id=1814327
It's hard to take this seriously at all when it's so ignorant of actual research in this area. Take a look at http://www.cc.gatech.edu/~bader/COURSES/UNM/ece637-Fall2003/papers/BFJ01.pdf for a second, specifically page 9. See the "van Emde Boas" layout? That's basically the same as what this article is calling a B-heap, and the idea goes back to at least 1977. As you can see from that paper, the idea of using it to optimize for multi-level caches goes back to at least 2001. Based on the performance numbers, it seems a particularly helpful optimization for the type of in-memory caching that his Varnish tool is good at, so kudos for reinventing the right wheel. But that's an environment with one level of cache: you're delivering something from memory, or you're not. You can't extrapolate very far from what works there.
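To make the layout concrete, here's a small sketch (mine, not from the paper) of the recursive van Emde Boas ordering for a complete binary tree with the usual BFS numbering (root=1, children 2i and 2i+1). The B-heap in the article applies the same "split the tree height in half, lay out the top subtree, then each bottom subtree contiguously" trick, just at page granularity:

```python
def veb(root, height):
    """List the BFS-numbered nodes of a complete binary tree of the
    given height in van Emde Boas (cache-oblivious) memory order."""
    if height == 1:
        return [root]
    top_h = height // 2          # top subtree gets the upper levels
    bot_h = height - top_h       # each bottom subtree gets the rest
    order = veb(root, top_h)     # lay out the top subtree first...
    frontier = [root]
    for _ in range(top_h):       # descend to the bottom-subtree roots
        frontier = [c for n in frontier for c in (2 * n, 2 * n + 1)]
    for sub_root in frontier:    # ...then each bottom subtree contiguously
        order += veb(sub_root, bot_h)
    return order

# 7-node tree: root, then its left subtree, then its right subtree
print(veb(1, 3))  # → [1, 2, 4, 5, 3, 6, 7]
```

With a taller tree the recursion nests, so any root-to-leaf search touches O(log_B N) contiguous blocks for any block size B, without the layout knowing B in advance — that's the cache-oblivious property.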
So, how does PostgreSQL deal with the different latencies involved in accessing data on disk for searches / sorts vs. accessing data in memory? Is it allocated in a similar way as described in the article such that disk access is reduced to a minimum?
PostgreSQL is modeling a much more complicated situation, where there are many levels of caches from CPU to disk. When executing a query, the database tries to manage that by estimating the relative costs of CPU operations, row operations, sequential disk reads, and random disk reads. Those fundamental operations are then added up to build more complicated machinery like sorting. To minimize query execution cost, various query plans are considered, the cost is computed for each one, and the cheapest one gets executed. This has to take into account a wide variety of subtle tradeoffs related to whether memory should be used for things that would otherwise happen on disk. There are three primary ways to search for a row, three main ways to join tables, and two ways to sort, and they all need cost estimates made for them that balance CPU time, memory, and disk access.
The problem Varnish is solving is most like how PostgreSQL decides which disk pages to keep in memory, specifically the shared_buffers structure. Even there, the problem the database is trying to solve is quite a bit more complicated than what an HTTP cache has to deal with. For details about what the database does there, see "Inside the PostgreSQL Buffer Cache" at http://projects.2ndquadrant.com/talks
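For a flavor of what shared_buffers does, here's a simplified sketch of its clock-sweep replacement strategy. The usage-count mechanism is real (PostgreSQL caps the count at 5), but this ignores buffer pins, the free list, and ring buffers for bulk operations:

```python
# Sketch of PostgreSQL's clock-sweep buffer replacement (simplified:
# no pins, no free list). Each buffer carries a usage count, capped
# at 5 as in PostgreSQL; the sweep hand decrements counts and evicts
# the first buffer it finds at zero.
MAX_USAGE_COUNT = 5

class BufferPool:
    def __init__(self, nbuffers):
        self.usage = [0] * nbuffers
        self.hand = 0

    def touch(self, buf):
        # A cache hit bumps the buffer's usage count (capped).
        self.usage[buf] = min(self.usage[buf] + 1, MAX_USAGE_COUNT)

    def evict(self):
        # Sweep until some buffer's count reaches zero; frequently
        # touched buffers survive several sweeps.
        while True:
            if self.usage[self.hand] == 0:
                victim = self.hand
                self.hand = (self.hand + 1) % len(self.usage)
                return victim
            self.usage[self.hand] -= 1
            self.hand = (self.hand + 1) % len(self.usage)

pool = BufferPool(4)
pool.touch(0); pool.touch(0); pool.touch(0)
print(pool.evict())  # → 1 (buffer 0 survives, its count only drops to 2)
```

The appeal over plain LRU is that it approximates frequency as well as recency while needing no list reordering on every hit, just an atomic counter bump.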
--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@xxxxxxxxxxxxxxx  www.2ndQuadrant.us

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance