Re: Any better plan for this query?..

Scott Carey <scott@xxxxxxxxxxxxxxxxx> · Tue, 19 May 2009 10:13:25 -0700

On 5/19/09 3:46 AM, "Dimitri" <dimitrik.fr@xxxxxxxxx> wrote:

> On 5/19/09, Scott Carey <scott@xxxxxxxxxxxxxxxxx> wrote:
>> 
>> On 5/18/09 3:32 PM, "Dimitri" <dimitrik.fr@xxxxxxxxx> wrote:
>> 
>>> On 5/18/09, Scott Carey <scott@xxxxxxxxxxxxxxxxx> wrote:
>>>> Great data Dimitri!'
>>> 
>>> Thank you! :-)
>>> 
>>>> 
>>>> I see a few key trends in the poor scalability:
>>>> 
>>>> The throughput scales roughly with %CPU fairly well.  But CPU used
>>>> doesn't
>>>> go past ~50% on the 32 core tests.  This indicates lock contention.
>>>> 
>>> 
>>> You should not look on #1 STATs, but on #2 - they are all with the
>>> latest "fixes"  - on all of them CPU is used well (90% in pic on
>>> 32cores).
>>> Also, keep in mind these cores are having 2 threads, and from Solaris
>>> point of view they are seen as CPU (so 64 CPU) and %busy is accounted
>>> as for 64 CPU
>>> 
>> 
>> Well, if the CPU usage is actually higher, then it might not be lock waiting
>> -- it could be spin locks or context switches or cache coherency overhead.
>> Postgres may also not be very SMT friendly, at least on the hardware tested
>> here.
> 
> do you mean SMP or CMT? ;-)
> however both should work well with PostgreSQL. I also think about CPU
> affinity - probably it may help to avoid CPU cache misses - but makes
> sense mostly if pooler will be added as a part of PG.

Symmetric Multi Threading (HyperThreading in Intels marketing terms, other
marketing terms for Sun or IBM).  One CPU core that can handle more than one
concurrently executing thread.
Technically, 'SMT' allows instructions in flight from multiple threads at
once in a superscalar Cpu core while some implementations differ and might
technically CMT (one thread or the other, but can switch fast, or a
non-superscalar core).

For many implementations of  'multiple threads on one CPU core' many of the
processor resources are reduced per thread when it is active -- caches get
split, instruction re-order buffers are split, etc.  That is rather hardware
implementation dependant.

For Intel's SMT (and other similar), spin-locks hurt scalability if they
aren't using new special instructions for the spin to yield pipeline slots
to the other thread.

Generally, code that stresses common processor resources more than CPU
execution will scale poorly with SMT/CMT etc.

So I'm not sure about the Postgres details, but the general case of an
application that doesn't benefit from these technologies exists, and there
is a non-zero chance that Postgres has some characteristics of such an app.

>> 
>> (what was the context switch rate?  I didn't see that in the data, just
>> mutex spins).
> 
> increasing with a load, as this ex.:
> http://dimitrik.free.fr/Report_20090505/5539_dim_STAT_100.html#bmk_CPU_CtxSwit
> ch_100
> 

Well, on most systems over 100K context switches/sec is a lot.  And those
reach 180000 /sec. 
However, this is 'only' 10 context switches per transaction and less than
20% system CPU, so maybe those numbers aren't quite as big as they seem.

Out of curiosity, what was the context switch rate for MySql at its peak
throughput?
> 
>> 
>> The scalability curve is definitely showing something.  Prepared statements
>> were tried, as were most of the other suggestions other than one:
>> 
>> What happens if the queries are more complicated (say, they take 15ms server
>> side with a more complicated plan required)?  That is a harder question to
>> answer
> 
> What I observed is: if planner takes more long time (like initially
> with 8.3.7 and analyze target 1000) the scalability problem is
> appearing more strange -
> http://dimitrik.free.fr/Report_20090505/5521_dim_STAT_18.html - as you
> see CPU even not used more than 60% , and as you may see spin locks
> are lowering - CPUs are not spinning for locks, there is something
> else..
> I'm supposing a problem of some kind of synchronization - background
> processes are not waking up on time or something like this...
> Then, if more time spent on the query execution itself and not planner:
>  - if it'll be I/O time - I/O will hide everything else until you
> increase a storage performance and/or add more RAM, but then you come
> back to the initial issue :-)
>  - if it'll be a CPU time it may be interesting! :-)
> 
> Rgds,
> -Dimitri
> 

Ok, so that's good info that the planner or parser side seems to scale less
effectively than the execution (as the results show), but I'm wondering
about queries with longer execution times not longer planner times.  I'm
wondering that, because its my opinion that most applications that will use
larger scale hardware will have more complicated queries than your test.
Its also greedy on my part since most queries in my applications are
significantly more complicated.
Regardless of my opinions -- this test is on one extreme (small fast
queries) of the spectrum.  Its useful to know some data points on other
parts of the spectrum.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance