On 2015-12-01 07:16:04 -0800, Adrian Klaver wrote: > On 12/01/2015 06:51 AM, Peter J. Holzer wrote: > >A rather weird observation from the log files of our server (9.5 beta1): > > > >2015-12-01 09:23:37 CET [26265]: [42-1] user=fragner,db=wds,pid=26265 WARNING: Use of uninitialized value $success in concatenation (.) or string at /usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36. > >2015-12-01 09:23:37 CET [26265]: [43-1] user=fragner,db=wds,pid=26265 CONTEXT: PL/Perl function "mb_timeseriesdata_zmq" > >[lots of other stuff from different connections] > >2015-12-01 09:24:45 CET [26265]: [44-1] user=fragner,db=wds,pid=26265 ERROR: impossible result '' (payload=) at /usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36. > > WDS::Macrobond::Utils::decode_result("") called at line 30 > > main::__ANON__("gen_wqehur") called at -e line 0 > > eval {...} called at -e line 0 > > > >Two messages from the same line of the same plperlu stored procedure, 68 > >seconds apart. So what is this line 36? > > > > confess "impossible result '$success' (payload=$payload)"; > > > >What? The first message clearly comes from interpolating $success > >(which is undef at that point) into the argument. The second from > >confess itself. What could cause a plperlu procedure to freeze for 68 > >seconds between the call to confess and its output? > > > >Is it possible that only the writing of the log entry is delayed? > > > >Another weird thing: $success is undef because a ØMQ rpc call[1] timed > > And the call is? The sequence is: my $req_sck = zmq_socket($context, ZMQ_REQ); zmq_connect($req_sck, $url); my $qry_msg = join(...); zmq_send($req_sck, $qry_msg); my $res_msg = zmq_msg_init(); my $rv = zmq_msg_recv($res_msg, $req_sck); # check rv here. my $data = zmq_msg_data($res_msg); # $data is "" here my $result = WDS::Macrobond::Utils::decode_result($data); # the error messages are from this function (Yeah, ØMQ is quite low-level. There is a higher level Perl Module, but I'm not using it). I omitted that because I don't think it's terribly relevant here. Details of the usage of ØMQ are better discussed on the ØMQ mailing list. But there is something else which may be relevant: ØMQ uses threads internally, and I don't actually know whether zmq_msg_recv returning means that the read(2) call (or whatever) on the socket terminates. It may actually continue in another thread. But I still don't see how that could block the main thread (or wake it up again in a place which has nothing to do with ØMQ (confess is a standard Perl function to print a stack trace and die)). Or - just thinking aloud here - I fear I'm abusing you guys as support teddy bears[1] - maybe it's the other way round: confess dies, so maybe it frees some lock during cleanup which allows the message which should have been sent by zmq_send to finally go out on the wire. But that still doesn't explain the 68 seconds spent in confess ... Postgres worker processes are single-threaded, are they? Is there something else which could interact badly with a moderately complex multithreaded I/O library used from a stored procedure? I suspect such an interaction because I cannot reproduce the problem outside of a stored procedure. A standalone Perl script doing the same requests doesn't get a timeout. I guess Alvaro is right: I should strace the postgres worker process while it executes the stored procedure. The problem of course is that it happens often enough be annoying, but rarely enough that it's not easily reproducible. > >out (after 60 seconds, which is also a mystery, because ØMQ doesn't seem > >to have a default timeout of 60 seconds, and I don't set one). But at > > Network timeout? That was my first guess, but I don't see where it would come from. Or why it only is there if I call the code from a stored procedure, not from a standalone script. > >09:24:45 (i.e. the time of the error message) the answer for that RPC > >call arrived. So it kind of looks like confess waited for the message to > >arrive (which makes no sense at all) or maybe that confess waited for > >something which also blocked the sending of the request (because > >according to the server logs, the RPC request only arrived there at > >09:24:45 and was answered within 1 second), but that doesn't make any > > So if the request timed out how did you get a reply, a second request? Nope. I don't really "get" the reply. I just see in the logs of the other server that it sent a reply at that time. The time line is like this time postgres process mb_dal process T zmq_send() zmq_msg_recv() T+60 zmq_msg_recv returns an empty message; decode() is called which notices that the message is empty and calls confess(); T+128 confess() prints a receives the message sent at T; stacktrace; sends a reply; hp [1] For those who don't know the story: Someone at a help desk noticed that people often started to explain their problem and somewhere in the middle said "never mind, I think I know what the problem is" and went away. So he put a big teddy bear on the desk and asked people to explain their problems to the bear. So a support teddy bear is someone who you talk to mostly to clear your own mind, not because you expect a solution. -- _ | Peter J. Holzer | I want to forget all about both belts and |_|_) | | suspenders; instead, I want to buy pants | | | hjp@xxxxxx | that actually fit. __/ | http://www.hjp.at/ | -- http://noncombatant.org/
Attachment:
signature.asc
Description: Digital signature