On 03/31/2014 04:12 AM, Willy-Bas Loos wrote:
On Sat, Mar 29, 2014 at 6:17 PM, Adrian Klaver
<adrian.klaver@xxxxxxxxxxx> wrote:
On 03/29/2014 08:19 AM, Willy-Bas Loos wrote:
The error that shows up is a Bus error.
That's on the replication slave.
Here's the log about it:
2014-03-29 12:41:33 CET db: ip: us: FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
cp: cannot stat `/data/postgresql/9.1/main/__wal_archive/__00000001000000720000000A': No such file or directory
2014-03-29 12:41:33 CET db: ip: us: LOG: unexpected pageaddr 71/E9DA0000 in log file 114, segment 10, offset 14286848
cp: cannot stat `/data/postgresql/9.1/main/__wal_archive/__00000001000000720000000A': No such file or directory
2014-03-29 12:41:33 CET db: ip: us: LOG: streaming replication successfully connected to primary
2014-03-29 12:41:48 CET db: ip: us: LOG: startup process (PID 17452) was terminated by signal 7: Bus error
2014-03-29 12:41:48 CET db: ip: us: LOG: terminating any other active server processes
2014-03-29 12:41:48 CET db:wbloos ip:[local] us:wbloos WARNING: terminating connection because of crash of another server process
2014-03-29 12:41:48 CET db:wbloos ip:[local] us:wbloos DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2014-03-29 12:41:48 CET db:wbloos ip:[local] us:wbloos HINT: In a moment you should be able to reconnect to the database and repeat your command.
Well what I am seeing are WAL log errors. One saying no file is
present, the other pointing at a possible file corruption.
Those are normal notices, nothing to worry about.
Well, other than that they cause the standby to reconnect to the primary,
during which a crash occurs.
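One way to check that the two WAL messages refer to the same place is to decode the numbers in the log. A minimal sketch in Python, assuming the pre-9.3 segment naming (8 hex digits each for timeline, log file and segment) and the default 16 MB segment size; the function names are made up for illustration:

```python
def decode_wal_name(name):
    """Split a 24-hex-digit WAL segment file name into (timeline, log, segment)."""
    if len(name) != 24:
        raise ValueError("expected 24 hex digits")
    return tuple(int(name[i:i + 8], 16) for i in (0, 8, 16))


def decode_pageaddr(addr, seg_size=16 * 1024 * 1024):
    """Split an 'X/Y' WAL address into (log, segment, offset within segment)."""
    log_hex, off_hex = addr.split("/")
    off = int(off_hex, 16)
    return int(log_hex, 16), off // seg_size, off % seg_size


# Segment name taken from the "cp: cannot stat" lines in the standby log.
print(decode_wal_name("00000001000000720000000A"))  # (1, 114, 10)
# Page address taken from the "unexpected pageaddr" line.
print(decode_pageaddr("71/E9DA0000"))  # (113, 233, 14286848)
```

The file name maps to log file 114, segment 10, which matches the LOG line. The page found at offset 14286848 carries the address of the same offset in an older segment (log file 113, segment 233), which would be consistent with a recycled segment at the end of valid WAL rather than with corruption.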
Shared memory problems are offered as a possible cause only. Right
now I would say we are seeing only half the picture. The Postgres
logs from the same time period for the primary server, as well as
the system logs for the openvz container would help fill in the
other half of the picture.
Here's the log from the primary postgres server:
2014-03-29 12:41:29 CET db:wbloos ip:[local] us:wbloos NOTICE: ALTER TABLE will create implicit sequence "test_x_seq" for serial column "test.x"
2014-03-29 12:41:33 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: SSL renegotiation failure
2014-03-29 12:41:33 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: SSL error: unexpected record
2014-03-29 12:41:33 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: could not send data to client: Connection reset by peer
2014-03-29 12:41:48 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: could not receive data from client: Connection reset by peer
2014-03-29 12:41:48 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: unexpected EOF on standby connection
(the SSL renegotiation failure happens all the time, without the crash)
And here's the syslog from the container:
Mar 29 12:41:01 mycontainer snmpd[8819]: Connection from UDP: [xxx.xxx.xxx.xxx]:59090->[xxx.xxx.xxx.xxx]
Mar 29 12:42:30 mycontainer snmpd[8819]: Connection from UDP: [xxx.xxx.xxx.xxx]:35949->[xxx.xxx.xxx.xxx]
The log on the host doesn't say anything interesting either.
A cursory look at memory management in openvz shows it is different
from other virtualization software and physical machines. Whether
that is a problem would seem to be dependent on where you are on the
learning curve:)
That sounds like "there is a solution to the problem, all you have to do
is find out what it is". There doesn't seem to be a variable in the
beancounters or anywhere else that can prevent the bus error from happening.
There seems to be no separate way of guaranteeing shared memory.
There's no OOM killer active either, nor is host or server running short
of memory.
At this point I am not sure it is even obvious what is causing the
error, so finding a solution would be a hit or miss affair at best.
I'm still worried that it's like Tom Lane said in another discussion: "So
basically, you've got a broken kernel here: it claimed to give PG circa
(135MB) of memory, but what's actually there is only about (128MB). I
don't see any connection between those numbers and the shmmax/shmall
settings, either --- so I think this must be some busted implementation
of a VM-level limitation."
(here:
http://www.postgresql.org/message-id/CAK3UJREBcyVBtr8D7vMfU=uDdkjXkrPnGcuy8erYB0tMfKe1LA@xxxxxxxxxxxxxx)
And it makes me wonder what other issues may arise from that. But
especially, what I can do about it.
I do not use openvz so I do not have a test bed to try out, but this
page seems to be related to your problem:
http://openvz.org/Resource_shortage
or if you want more detail and a link to what looks to be a replacement for
beancounters:
http://openvz.org/Setting_UBC_parameters
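As far as I know, the usual check for a resource shortage in a container is the failcnt column of /proc/user_beancounters. A minimal sketch in Python for picking out the non-zero rows, assuming the usual column layout (uid, resource, held, maxheld, barrier, limit, failcnt); the function name and the sample values are made up for illustration and are not taken from the server in question:

```python
def failed_counters(text):
    """Return {resource: failcnt} for rows whose failcnt is non-zero."""
    failures = {}
    for line in text.splitlines():
        fields = line.split()
        # Drop the leading "uid:" token on the first row of each container.
        if fields and fields[0].endswith(":"):
            fields = fields[1:]
        # Data rows have six fields: the resource name plus five integers.
        if len(fields) != 6 or not all(f.isdigit() for f in fields[1:]):
            continue
        resource, failcnt = fields[0], int(fields[5])
        if failcnt:
            failures[resource] = failcnt
    return failures


# Hypothetical sample in the /proc/user_beancounters layout.
SAMPLE = """\
Version: 2.5
       uid  resource       held   maxheld   barrier     limit  failcnt
      101:  kmemsize    2686465   2949120  14372700  14790164        0
            shmpages       5000     21504     21504     21504       12
            privvmpages   40000     65000     65536     69632        0
"""

print(failed_counters(SAMPLE))  # {'shmpages': 12}
```

On a real system the input would be the contents of /proc/user_beancounters, read as root inside the container or on the host. An empty result would suggest that no beancounter limit was hit, which points away from the UBC settings as the cause.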
Cheers,
WBL
--
"Quality comes from focus and clarity of purpose" -- Mark Shuttleworth
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx