At 6:01 PM -0500 2/26/08, Tom Lane wrote:
Hmm, that says that it's not a matter of locking on pg_listener,
but of actually applying the row update(s) and/or signaling the
recipient(s). If you're not seeing performance issues for ordinary
table-update operations it's hard to see why pg_listener updates would
be any worse, so that seems to point the finger at the signaling.
Which is just a matter of a kill(2) and shouldn't be that expensive.
It might be interesting to try strace'ing the whole PG process tree
while these notifies are going on, and seeing if you can identify
any specific kernel calls that seem to take a long time.
I ran PG via strace and then ran the test script at 25 consumers
looping 25 times each. There were no other connections to the
database for this strace run and test.
Digging through the strace file is a bit mind-numbing but here's some
signs that semop and send are the culprits:
3 misc examples coming from near LISTEN or NOTIFIES:
- - - - - -
7495 18:10:40.251855 <... semop resumed> ) = 0 <1.006149>
7495 18:10:41.325442 <... semop resumed> ) = -1 EINTR (Interrupted
system call) <0.299135>
7495 18:10:41.998219 <... semop resumed> ) = 0 <0.603566>
A chunk of log following the action on fd 7 (slow recv on
ERESTARTSYS) and then the slow semop that follows:
- - - - - - - -
7495 18:10:42.576401 send(7, "C\0\0\0\vNOTIFY\0Z\0\0\0\5I", 18, 0
<unfinished ...>
7495 18:10:42.576503 <... send resumed> ) = 18 <0.000070>
7495 18:10:42.576620 recv(7, <unfinished ...>
7495 18:10:42.873796 <... recv resumed> 0x8331d40, 8192, 0) = ?
ERESTARTSYS (To be restarted) <0.297158>
7495 18:10:42.873911 --- SIGUSR2 (User defined signal 2) @ 0 (0) ---
7495 18:10:42.874079 gettimeofday( <unfinished ...>
7495 18:10:42.874198 <... gettimeofday resumed> {1204078242,
874097}, NULL) = 0 <0.000101>
7495 18:10:42.874324 setitimer(ITIMER_REAL, {it_interval={0, 0},
it_value={1, 0}}, <unfinished ...>
7495 18:10:42.874470 <... setitimer resumed> NULL) = 0 <0.000121>
7495 18:10:42.874604 semop(50495522, 0xbfff9764, 1 <unfinished ...>
7495 18:10:43.678431 <... semop resumed> ) = 0 <0.803809>
A little further on:
- - - - - - - -
7495 18:10:44.905320 <... semop resumed> ) = -1 EINTR (Interrupted
system call) <0.998192>
I'm not sure what exactly that means, in terms of next steps. I'll
dig more through the strace file and see if I can find anything else
but those look to be definite bottlenecks for some reason.
At 2:24 PM -0800 2/26/08, Maurice Aubrey wrote:
What's the OS/Dist?
Red Hat Enterprise Linux ES release 3 (Taroon Update 8)
Thx,
-Joel
---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster