Re: plperlu stored procedure seems to freeze for a minute

"Peter J. Holzer" <hjp-pgsql@xxxxxx> · Tue, 1 Dec 2015 18:58:31 +0100

On 2015-12-01 07:16:04 -0800, Adrian Klaver wrote:
> On 12/01/2015 06:51 AM, Peter J. Holzer wrote:
> >A rather weird observation from the log files of our server (9.5 beta1):
> >
> >2015-12-01 09:23:37 CET [26265]: [42-1] user=fragner,db=wds,pid=26265 WARNING:  Use of uninitialized value $success in concatenation (.) or string at /usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
> >2015-12-01 09:23:37 CET [26265]: [43-1] user=fragner,db=wds,pid=26265 CONTEXT:  PL/Perl function "mb_timeseriesdata_zmq"
> >[lots of other stuff from different connections]
> >2015-12-01 09:24:45 CET [26265]: [44-1] user=fragner,db=wds,pid=26265 ERROR:  impossible result '' (payload=) at /usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
> >                 WDS::Macrobond::Utils::decode_result("") called at line 30
> >                 main::__ANON__("gen_wqehur") called at -e line 0
> >                 eval {...} called at -e line 0
> >
> >Two messages from the same line of the same plperlu stored procedure, 68
> >seconds apart. So what is this line 36?
> >
> >     confess "impossible result '$success' (payload=$payload)";
> >
> >What? The first message clearly comes from interpolating $success
> >(which is undef at that point) into the argument. The second from
> >confess itself. What could cause a plperlu procedure to freeze for 68
> >seconds between the call to confess and its output?
> >
> >Is it possible that only the writing of the log entry is delayed?
> >
> >Another weird thing: $success is undef because a ØMQ rpc call[1] timed
> 
> And the call is?

The sequence is:

my $req_sck = zmq_socket($context, ZMQ_REQ);
zmq_connect($req_sck, $url);
my $qry_msg = join(...);
zmq_send($req_sck, $qry_msg);
my $res_msg = zmq_msg_init();
my $rv = zmq_msg_recv($res_msg, $req_sck);
# check rv here.
my $data = zmq_msg_data($res_msg); # $data is "" here
my $result = WDS::Macrobond::Utils::decode_result($data); # the error messages are from this function

(Yeah, ØMQ is quite low-level. There is a higher level Perl Module, but
I'm not using it). 

I omitted that because I don't think it's terribly relevant here.
Details of the usage of ØMQ are better discussed on the ØMQ mailing
list.

But there is something else which may be relevant: ØMQ uses threads
internally, and I don't actually know whether zmq_msg_recv returning
means that the read(2) call (or whatever) on the socket terminates.
It may actually continue in another thread. But I still don't see how
that could block the main thread (or wake it up again in a place which
has nothing to do with ØMQ (confess is a standard Perl function to print
a stack trace and die)). Or - just thinking aloud here - I fear I'm
abusing you guys as support teddy bears[1] - maybe it's the other way
round: confess dies, so maybe it frees some lock during cleanup which
allows the message which should have been sent by zmq_send to finally go
out on the wire. But that still doesn't explain the 68 seconds spent in
confess ...

Postgres worker processes are single-threaded, are they? Is there
something else which could interact badly with a moderately complex
multithreaded I/O library used from a stored procedure? 

I suspect such an interaction because I cannot reproduce the problem
outside of a stored procedure. A standalone Perl script doing the same
requests doesn't get a timeout.

I guess Alvaro is right: I should strace the postgres worker process
while it executes the stored procedure. The problem of course is that
it happens often enough be annoying, but rarely enough that it's not
easily reproducible.

> >out (after 60 seconds, which is also a mystery, because ØMQ doesn't seem
> >to have a default timeout of 60 seconds, and I don't set one). But at
> 
> Network timeout?

That was my first guess, but I don't see where it would come from. Or
why it only is there if I call the code from a stored procedure, not
from a standalone script.

> >09:24:45 (i.e. the time of the error message) the answer for that RPC
> >call arrived. So it kind of looks like confess waited for the message to
> >arrive (which makes no sense at all) or maybe that confess waited for
> >something which also blocked the sending of the request (because
> >according to the server logs, the RPC request only arrived there at
> >09:24:45 and was answered within 1 second), but that doesn't make any
> 
> So if the request timed out how did you get a reply, a second request?

Nope. I don't really "get" the reply. I just see in the logs of the
other server that it sent a reply at that time. 

The time line is like this

time    postgres process        mb_dal process
T       zmq_send()
        zmq_msg_recv()

T+60    zmq_msg_recv returns
        an empty message;
        decode() is called
        which notices that the
        message is empty and 
        calls confess();

T+128   confess() prints a      receives the message sent at T;
        stacktrace;             sends a reply;

        hp

[1] For those who don't know the story: Someone at a help desk noticed
that people often started to explain their problem and somewhere in the
middle said "never mind, I think I know what the problem is" and went
away. So he put a big teddy bear on the desk and asked people to explain
their problems to the bear. So a support teddy bear is someone who you
talk to mostly to clear your own mind, not because you expect a
solution.

-- 
   _  | Peter J. Holzer    | I want to forget all about both belts and
|_|_) |                    | suspenders; instead, I want to buy pants 
| |   | hjp@xxxxxx         | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/
Attachment:
signature.asc

Description: Digital signature