Re: Need help with 8.4 Performance Testing

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



> ________________________________________
> From: pgsql-performance-owner@xxxxxxxxxxxxxx [pgsql-performance-owner@xxxxxxxxxxxxxx] On Behalf Of > Jean-David Beyer [jeandavid8@xxxxxxxxxxx]
> Sent: Tuesday, December 09, 2008 5:08 AM
> To: pgsql-performance@xxxxxxxxxxxxxx
> Subject: Re:  Need help with 8.4 Performance Testing
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1

> But one thing that can matter is whether you want to improve just the
> performance of the dbms, or the performance of the entire system, on which
> the dbms runs. Because if you want to improve the whole system, you would
> want as much of the caching to take place in the system's buffers so the use
> of the memory could be optimized over the entire workload, not just the load
> of the dbms itself. I suppose on a dedicated system with only one dbms
> running with only one database open (at a time, anyway), this might be moot,
> but not otherwise.

Yes, the OS is in better position to arbitrate between multiple things.  Of course, we aren't talking about the highest performance databases if we are talking about mixed use systems though.
Additionally, the OS can never really get it right, with a DB or other apps.  Any app can behave badly and grab too much RAM and access it regularly enough for it to not be 'idle' much but give the OS VM fits trying to figure out if its important or not versus other processes.

> Now I agree that it would be good to get the entire index (or at least the
> working set of the index) into the memory of the computer. But does it
> really matter if it is in the system's cache, or the postgres cache? Is it
> any more likely to be in postgres's cache than in the system cache if the
> system is hurting for memory? I would think the system would be equally
> likely to page out "idle" pages no matter where they are unless they are
> locked to memory, and I do not know if all operating systems can do this,
> and even if they can, I do not know if postgres uses that ability. I doubt
> it, since I believe (at least in Linux) a process can do that only if run as
> root, which I imagine few (if any) users do.

The problem, is when none of them are really 'idle'.  When the OS has to decide which pages, all of which have been accessed recently, to evict.  Most OS's will make bad choices if the load is mixed random and sequential access, as they treat all pages equally with respect to freshness versus eviction.
Another problem is that there IS a difference between being in postgres' cache and the OS cache.  One is more expensive to retrieve than the other.  Significantly.

Aaccessing buffers in shared_buffers, in process, uses a good chunk less CPU (and data copy and shared buffer eviction overhead) than going over the sys call to the OS.

And as far as I can tell, even after the 8.4 fadvise patch, all I/O is in block_size chunks.  (hopefully I am wrong)
My system is now CPU bound, the I/O can do sequential reads of more than 1.2GB/sec but Postgres can't do a seqscan 30% as fast because it eats up CPU like crazy just reading and identifying tuples.  It does seqscans ~ 25% faster if its from shared_buffers than from the OS's page cache though.   Seqscans are between 250MB/sec and 400MB/sec peak, from mem or disk, typically showing no more than 35% iostat utilization of the array if off disk -- so we run a few concurrently where we can.

In addition to the fadvise patch, postgres needs to merge adjacent I/O's into larger ones to reduce the overhead.  It only really needs to merge up to sizes of about 128k or 256k, and gain a 8x to 16x drop in syscall overhead, and additionally potentially save code trips down the shared buffer management code paths.  At lest, thats my guess I haven't looked at any code and could be wrong.


Additionally, the "If your operating system has any reasonable caching itself" comment earlier in this conversation ---  Linux (2.6.18, Centos 5.2) does NOT.  I can easily make it spend 100% CPU in system time trying to figure out what to do with the system cache for an hour.  Just do large seqscans with memory pressure from work_mem or other forces that the OS will not deem 'idle'.  Once the requested memory is ~75% of the system total, it will freak out.  Linux simply will not give up that last 25% or so of the RAM for anything but page cache, even though the disk subsustem is very fast and most of the access is sequential, marginalizing the benefit of the cache.  Depending on how you tune it, it will either spin system cpu or swap storm, but the system cpu spin times for the same work load are a lot shorter than an equivalent swap storm.
Mount the data drive in O_DIRECT and the problem vanishes.  I've been told that this problem may be gone in some of the latest kernels.  I have seriously considered bumping shared_buffers up a lot and mounting the thing direct -- but then we lose the useful scheduler and readahead algorithms.  The other way around (small shared_buffers, let the OS do it) hurts performance overall quite a bit -- randomly accessed pages get pushed out to the OS cache more often, and the OS tosses thouse out when a big seqscan occurs, resulting in a lot more random access from disk and more disk bound periods of time. Great wonder, this operating system caching, eh?

In any event, don't hold up these OS page cache things as if they're the best thing in the world for a database, they have serious flaws themselves and typically are difficult or impossible to tune to be ideal for a database.

Its one thing to propose that a database build its own file system (hard, and why bother?) versus have a database manage its own page cache intelligently and access the OS file system as optimally as it can.  In both of the latter, the DB knows much more about what data is really important than the OS (and could for example, prioritize cache versus work_mem intelligently while the OS can get that one horribly wrong in my experience, and knows when a huge seqscan occurs to make caching those results low priority).  No matter how you do it using the OS cache, you cache twice and copy twice.  O_DIRECT isn't usually an option for other reasons, the OS disk scheduler, readahead, and other benefits of a file system are real and substantial.  If you are caching twice, you might as well have the "closer" copy of that data be the larger, more efficient pool.

As for tipping points and pg_bench -- It doesn't seem to reflect the kind of workload we use postgres for at all, though my workload does a lot of big hashes and seqscans, and I'm curious how much improved those may be due to the hash improvements.  32GB RAM and 3TB data (about 250GB scanned regularly) here.  And yes, we are almost completely CPU bound now except for a few tasks.  Iostat only reports above 65% disk utilization for about 5% of the workload duty-cycle, and is regularly < 20%.  COPY doesn't get anywhere near platter speeds, on indexless bulk transfer.  The highest disk usage spikes occur when some of our radom-access data/indexes get shoved out of cache.  These aren't too large, but high enough seqscan load will cause postgres and the OS to dump them from cache.  If we put these on some SSD's the disk utilization % would drop a lot further.

I feel confident in saying that in about a year, I could spec out a medium sized budget for hardware ($25k) for almost any postgres setup and make it almost pure CPU bound.
SSDs and hybrid tech such as ZFS L2ARC make this possible with easy access to 10k+ iops, and it it will take no more than 12 SATA drives in raid 10 next year (and a good controller or software raid) to get 1GB/sec sequential reads.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux