Slow nested loop execution on larger server

Greg Smith <greg@xxxxxxxxxxxxxxx> · Fri, 16 Dec 2011 13:27:57 -0500

Forwarding this on from someone who may pop up on the list too.  This is 
a new server that's running a query slowly.  The rate at which new 
semops happen may be related.  OS is standard RHEL 5.5, kernel 
2.6.18-194.32.1.el5.  One possibly interesting part is that this is 
leading edge hardware:  four 8-core processes with HyperThreading, for 
64 threads total:

processor       : 63
model name      : Intel(R) Xeon(R) CPU           X7550  @ 2.00GHz
physical id     : 3
siblings        : 16
cpu cores       : 8

My only vague memory of issues around systems like this that comes to 
mind here is Scott Marlowe talking about some kernel tweaks he had to do 
on his 48 core AMD boxes to get them work right.  That seems like a 
sketchy association, and I don't seem to have the details handy.

I do have the full query text and associated plans, but it's mostly 
noise.  Here's the fun part:

->  Nested Loop  (cost=581.52..3930.81 rows=2 width=51) (actual 
time=25.681..17562.198 rows=21574 loops=1)
      ->  Merge Join  (cost=581.52..783.07 rows=2 width=37) (actual 
time=25.650..98.268 rows=21574 loops=1)
            Merge Cond: (rim.instrumentid = ri.instrumentid)
            ->  Index Scan using reportinstruments_part_125_pkey on 
reportinstruments_part_125 rim  (cost=0.00..199.83 rows=110 width=8) 
(actual time=0.033..27.180 rows=20555 loops=1)
                  Index Cond: (reportid = 105668)
            ->  Sort  (cost=581.52..582.31 rows=316 width=33) (actual 
time=25.608..34.931 rows=21574 loops=1)
                  Sort Key: ri.instrumentid
                  Sort Method: quicksort  Memory: 2454kB
                  ->  Index Scan using riskbreakdown_part_51_pkey on 
riskbreakdown_part_51 ri  (cost=0.00..568.40 rows=316 width=33) (actual 
time=0.019..11.599 rows=21574 loops=1)
                        Index Cond: (reportid = 105668)
      ->  Index Scan using energymarketlist_pkey on energymarketlist 
ip  (cost=0.00..1573.86 rows=1 width=18) (actual time=0.408..0.808 
rows=1 loops=21574)
            Index Cond: ((reportid = 105668) AND (instrumentid = 
ri.instrumentid))
...
Total runtime: 21250.377 ms

The stats are terrible and the estimates off by many orders of 
magnitude.  But that's not the point.  It expected 2 rows and 21574 came 
out; fine.  Why is it taking this server 17 seconds to process 21K rows 
of tiny width through a Nested Loop?  Is it bouncing to a new CPU every 
time the thing processes a row or something?  I'm finding it hard to 
imagine how this could be a PostgreSQL problem; seems more like a kernel 
bug aggrevated on this system.  I wonder if we could produce a 
standalone test case with a similar plan from what the bad query looks 
like, and ask the person with this strange system to try it.  See if 
it's possible to make it misbehave in the same way with something 
simpler others can try, too.

The only slow semops thread I found in the archives turned out to be I/O 
bound.  This query is all CPU; combining a few examples here since this 
is repeatable and I'm told acts the same each time:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12634 xxx       25   0  131g 1.4g 1.4g R 100.1  0.6  11:24.20 postgres: 
xxx xxx [local] SELECT

$ ./pidstat -d -T ALL -p 27683 1 100
15:28:46          PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
15:28:47        27683      0.00      0.00      0.00  postgres

The PostgreSQL 9.1.1 is a custom build, but looking at pg_config I don't 
see anything special that might be related; it was optimized like this:

CFLAGS=-march=core2 -O2 -pipe

Here's a chunk of the observed and seemingly strange semop rate:

$ strace -fttp 12634
Process 12634 attached - interrupt to quit
15:38:55.966047 semop(16646247, 0x7fffd09dac20, 1) = 0
15:39:17.490133 semop(16810092, 0x7fffd09dac20, 1) = 0
15:39:17.532522 semop(16810092, 0x7fffd09dac20, 1) = 0
15:39:17.534874 semop(16777323, 0x7fffd09dac00, 1) = 0
15:39:17.603637 semop(16777323, 0x7fffd09dac00, 1) = 0
15:39:17.640646 semop(16810092, 0x7fffd09dac20, 1) = 0
15:39:17.658230 semop(16810092, 0x7fffd09dac20, 1) = 0
15:39:18.905137 semop(16646247, 0x7fffd09dac20, 1) = 0
15:39:33.396657 semop(16810092, 0x7fffd09dac20, 1) = 0
15:39:50.208516 semop(16777323, 0x7fffd09dac00, 1) = 0
15:39:54.640712 semop(16646247, 0x7fffd09dac20, 1) = 0
15:39:55.468458 semop(16777323, 0x7fffd09dac00, 1) = 0
15:39:55.488364 semop(16777323, 0x7fffd09dac00, 1) = 0
15:39:55.489344 semop(16777323, 0x7fffd09dac00, 1) = 0
Process 12634 detached

pg_locks for this 12634 shows all granted ones, nothing exciting there.  
I asked how well this executes with enable_nestloop turned off, hoping 
to see that next.

This all seems odd, and I get interested and concerned when that start 
showing up specifically on newer hardware.

--
Greg Smith   2ndQuadrant US    greg@xxxxxxxxxxxxxxx   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance