On Sat, Mar 29, 2014 at 6:17 PM, Adrian Klaver <adrian.klaver@xxxxxxxxxxx> wrote:
On 03/29/2014 08:19 AM, Willy-Bas Loos wrote:
The error that shows up is a Bus error.
That's on the replication slave.
Here's the log about it:
2014-03-29 12:41:33 CET db: ip: us: FATAL: could not receive data from
WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
cp: cannot stat
`/data/postgresql/9.1/main/wal_archive/00000001000000720000000A': No
such file or directory
2014-03-29 12:41:33 CET db: ip: us: LOG: unexpected pageaddr
71/E9DA0000 in log file 114, segment 10, offset 14286848
cp: cannot stat
`/data/postgresql/9.1/main/wal_archive/00000001000000720000000A': No
such file or directory
2014-03-29 12:41:33 CET db: ip: us: LOG: streaming replication
successfully connected to primary
2014-03-29 12:41:48 CET db: ip: us: LOG: startup process (PID 17452)
was terminated by signal 7: Bus error
2014-03-29 12:41:48 CET db: ip: us: LOG: terminating any other active
server processes
2014-03-29 12:41:48 CET db:wbloos ip:[local] us:wbloos WARNING:
terminating connection because of crash of another server process
2014-03-29 12:41:48 CET db:wbloos ip:[local] us:wbloos DETAIL: The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally
and possibly corrupted shared memory.
2014-03-29 12:41:48 CET db:wbloos ip:[local] us:wbloos HINT: In a
moment you should be able to reconnect to the database and repeat your
command.
Well what I am seeing are WAL log errors. One saying no file is present, the other pointing at a possible file corruption.
Those are normal notices, nothing to worry about.
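To spell out why: the "cp: cannot stat" lines are just the standby's restore_command failing to find that segment in the archive, after which it reconnects for streaming (which the log shows it did). A recovery.conf roughly of this shape produces exactly those lines when the segment is not archived yet; the connection string below is illustrative, not the real one:

    standby_mode = 'on'
    # the cp in the log output comes from a restore_command along these lines
    restore_command = 'cp /data/postgresql/9.1/main/wal_archive/%f %p'
    # host/user here are placeholders, not the actual settings
    primary_conninfo = 'host=primary.example port=5432 user=replication sslmode=require'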
Shared memory problems are offered as a possible cause only. Right now I would say we are seeing only half the picture. The Postgres logs from the same time period for the primary server, as well as the system logs for the openvz container would help fill in the other half of the picture.
Here's the log from the primary postgres server:
2014-03-29 12:41:29 CET db:wbloos ip:[local] us:wbloos NOTICE: ALTER TABLE will create implicit sequence "test_x_seq" for serial column "test.x"
2014-03-29 12:41:33 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: SSL renegotiation failure
2014-03-29 12:41:33 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: SSL error: unexpected record
2014-03-29 12:41:33 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: could not send data to client: Connection reset by peer
2014-03-29 12:41:48 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: could not receive data from client: Connection reset by peer
2014-03-29 12:41:48 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: unexpected EOF on standby connection
(the SSL renegotiation failure happens all the time, even when there is no crash)
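If the renegotiation failures ever do turn out to be related, the relevant knob on 9.1 is ssl_renegotiation_limit; turning renegotiation off entirely is the usual way to rule it out (a sketch only, not something confirmed to help here):

    # postgresql.conf on the primary, followed by a reload
    ssl_renegotiation_limit = 0    # 0 disables SSL renegotiation completely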
And here's the syslog from the container:
Mar 29 12:41:01 mycontainer snmpd[8819]: Connection from UDP: [xxx.xxx.xxx.xxx]:59090->[xxx.xxx.xxx.xxx]
Mar 29 12:42:30 mycontainer snmpd[8819]: Connection from UDP: [xxx.xxx.xxx.xxx]:35949->[xxx.xxx.xxx.xxx]
The log on the host doesn't say anything interesting either.
Adrian Klaver also wrote:
A cursory look at memory management in openvz shows it is different from other virtualization software and physical machines. Whether that is a problem would seem to be dependent on where you are on the learning curve :)

That sounds like "there is a solution to the problem, all you have to do is find out what it is". There doesn't seem to be a variable in the beancounters or anywhere else that can prevent the bus error from happening.
There seems to be no separate way of guaranteeing shared memory. There's no OOM killer active either, and neither the host nor the server is running short of memory.
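For reference, the kind of checks I mean are along these lines (run inside the container; exact limit names and values vary per OpenVZ setup):

    cat /proc/user_beancounters          # per-container limits; the failcnt column shows any hits
    sysctl kernel.shmmax kernel.shmall   # SysV shared memory ceilings as seen inside the container
    dmesg | grep -i -e oom -e killed     # would show OOM killer activity, if there were any
    free -m                              # overall memory headroom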
I'm still worried that it's like Tom Lane said in another discussion: "So basically, you've got a broken kernel here: it claimed to give PG circa (135MB) of memory, but what's actually there is only about (128MB). I don't see any connection between those numbers and the shmmax/shmall settings, either --- so I think this must be some busted implementation of a VM-level limitation."
And it makes me wonder what other issues may arise from that. But especially, what I can do about it.
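The closest thing I can think of to testing Tom's theory is comparing what Postgres asked for with what is actually there, something like this (keys and sizes will of course differ per system):

    ipcs -m                           # actual size of the shared memory segment the postmaster created
    psql -c "SHOW shared_buffers"     # what the server was configured to allocate
    psql -c "SHOW max_connections"    # the other big contributor to the segment size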
Cheers,
WBL
--
"Quality comes from focus and clarity of purpose" -- Mark Shuttleworth