Hi Scott, let me now finish my report and regroup all data together, and then we'll continue discussion as it'll come more in debug/profile phase.. - I'll be not polite from my part to send some tons of attachments to the mail list :-) Rgds, -Dimitri On 5/13/09, Scott Carey <scott@xxxxxxxxxxxxxxxxx> wrote: > > On 5/13/09 3:22 AM, "Dimitri" <dimitrik.fr@xxxxxxxxx> wrote: > >> Hi Scott, >> >> On 5/12/09, Scott Carey <scott@xxxxxxxxxxxxxxxxx> wrote: >>> Although nobody wants to support it, he should try the patch that Jignesh >>> K. >>> Shah (from Sun) proposed that makes ProcArrayLock lighter-weight. If it >>> makes 32 cores much faster, then we have a smoking gun. >>> >>> Although everyone here is talking about this as an 'unoptimal' solution, >>> the >>> fact is there is no evidence that a connection pooler will fix the >>> scalability from 16 > 32 cores. >>> Certainly a connection pooler will help most results, but it may not fix >>> the >>> scalability problem. >>> >>> A question for Dimitri: >>> What is the scalability from 16 > 32 cores at the 'peak' load that occurs >>> near 2x the CPU count? Is it also poor? If this is also poor, IMO the >>> community here should not be complaining about this unopimal case -- a >>> connection pooler at that stage does little and prepared statements will >>> increase throughput but not likely alter scalability. >> >> I'm attaching a small graph showing a TPS level on PG 8.4 depending on >> number of cores (X-axis is a number of concurrent users, Y-axis is the >> TPS number). As you may see TPS increase is near linear while moving >> from 8 to 16 cores, while on 32cores even it's growing slightly >> differently, what is unclear is why TPS level is staying limited to >> 11.000 TPS on 32cores. And it's pure read-only workload. >> > > Interesting. What hardware is this, btw? Looks like the 32 core system > probably has 2x the CPU and a bit less interconnect efficiency versus the 16 > core one (which would be typical). > Is the 16 core case the same, but with fewer cores per processor active? Or > fewer processors total? > Understanding the scaling difference may require a better understanding of > the other differences besides core count. > >>> >>> If that result scales, then the short term answer is a connection pooler. >>> >>> In the tests that Jingesh ran -- making the ProcArrayLock faster helped >>> the >>> case where connections = 2x the CPU core count quite a bit. >>> >>> The thread about the CPU scalability is "Proposal of tunable fix for >>> scalability of 8.4", originally posted by "Jignesh K. Shah" >>> <J.K.Shah@xxxxxxx>, March 11 2009. >>> >>> It would be very useful to see results of this benchmark with: >>> 1. A Connection Pooler >> >> will not help, as each client is *not* disconnecting/reconnecting >> during the test, as well PG is keeping well even 256 users. And TPS >> limit is reached already on 64 users, don't think pooler will help >> here. >> > > Actually, it might help a little. Postgres has a flaw that makes backends > block on a lock briefly based on the number of total backends -- active or > completely passive. Your tool has some (very small) user-side delay and a > connection pooler would probably allow 64 of your users to efficiently 'fit' > in 48 or so connection pooler slots. > > It is not about connecting and disconnecting in this case, its about > minimizing Postgres' process count. If this does help, it would hint at > certain bottlenecks. If it doesn't it would point elsewhere (and quiet some > critics). > > However, its unrealistic for any process-per-connection system to have less > backends than about 2x the core count -- else any waiting on I/O or network > will just starve CPU. So this would just be done for research, not a real > answer to making it scale better. > > For those who say "but, what if its I/O bound! You don't need more > backends then!": Well you don't need more CPU either if you're I/O bound. > By definition, CPU scaling tests imply the I/O can keep up. > > >>> 2. Jignesh's patch >> >> I've already tested it and it did not help in my case because the real >> problem is elsewhere.. (however, I did not test it yet with my latest >> config params) >> > > Great to hear that! -- That means this case is probably not ProcArrayLock. > If its Solaris, could we get: > 1. What is the CPU stats when it is in the inefficient state near 64 or 128 > concurrent users (vmstat, etc. I'm interested in CPU in > user/system/idle/wait time, and context switches/sec mostly). > 2. A Dtrace probe on the postgres locks -- we might be able to identify > something here. > > The results here would be useful -- if its an expected condition in the > planner or parser, it would be useful confirmation. If its something > unexpected and easy to fix -- it might be changed relatively soon. > > If its not easy to detect, it could be many other things -- but the process > above at least rules some things out and better characterizes the state. > >>> 3. Prepared statements >>> >> >> yes, I'm preparing this test. >> >>> #3 is important, because prepared statements are ideal for queries that >>> perform well with low statistics_targets, and not ideal for those that >>> require high statistics targets. Realistically, an app won't have more >>> than >>> a couple dozen statement forms to prepare. Setting the default >>> statistics >>> target to 5 is just a way to make some other query perform bad. >> >> Agree, but as you may have a different statistic target *per* table it >> should not be a problem. What is sure - all time spent on parse and >> planner will be removed here, and the final time should be a pure >> execution. >> > > I'm definitely interested here because although pure execution will > certainly be faster, it may not scale any better. > > >> Rgds, >> -Dimitri >> > > -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance