Re: Critical failure of standby

James Sewell <james.sewell@xxxxxxxxxxxx> · Sat, 13 Aug 2016 07:54:48 +1000

And a diagram of how it hangs together.
Cheers,

James Sewell,
PostgreSQL Team Lead / Solutions Architect 

Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009
P (+61) 2 8099 9000  W www.jirotech.com  F (+61) 2 8099 9099

On Sat, Aug 13, 2016 at 7:54 AM, James Sewell <james.sewell@xxxxxxxxxxxx> wrote:
(from other thread)

- PostgreSQL 9.5.3
- Red Hat 7.2 on VMware
- A single PostgreSQL instance on each machine
- Every machine in DR became corrupt, so interestingly the corruption must have been sent to the two cascading nodes via WAL before the crash on the hub DR node
- No OS logs indicating anything abnormal

I think the key is the (legitimate) loss of network to the Prod master, followed by:

(0:XX000)FATAL:  invalid memory alloc request size 3445219328

Everything seems to go wrong from there. Are WAL segments checked for integrity once they are received?

On Sat, Aug 13, 2016 at 7:43 AM, James Sewell <james.sewell@jirotech.com> wrote:
It's on 9.5.3.
I've actually sent this mail twice (the first time from an unregistered address, which I assumed would be dropped). I attached a diagram to the other copy; I'll forward that mail here now for completeness.

Cheers,
James Sewell

On Sat, Aug 13, 2016 at 5:20 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
James Sewell wrote:

> 2016-08-12 04:43:53 GMT [23614]: [5-1] user=,db=,client=  (0:00000)LOG:  consistent recovery state reached at 3/8811DFF0
> 2016-08-12 04:43:53 GMT [23614]: [6-1] user=,db=,client=  (0:XX000)FATAL:  invalid memory alloc request size 3445219328
> 2016-08-12 04:43:53 GMT [23612]: [3-1] user=,db=,client=  (0:00000)LOG:  database system is ready to accept read only connections
> 2016-08-12 04:43:53 GMT [23612]: [4-1] user=,db=,client=  (0:00000)LOG:  startup process (PID 23614) exited with exit code 1
> 2016-08-12 04:43:53 GMT [23612]: [5-1] user=,db=,client=  (0:00000)LOG:  terminating any other active server processes
> 2016-08-12 04:43:53 GMT [23612]: [6-1] user=,db=,client=  (0:00000)LOG:  archiver process (PID 23627) exited with exit code 1

What version is this?

Hm, so the startup process finds the consistent point (which signals
postmaster, hence line [3-1] from PID 23612 saying "ready to accept read
only connections") and immediately dies because of the invalid memory
alloc error.  I suppose that error must occur while trying to process
some xlog record, but without an xlog address it's difficult to say
anything.  I suppose you could try to pg_xlogdump WAL starting at the
last known good address 3/8811DFF0 but I wouldn't know what to look for.
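
To find which WAL file holds the address 3/8811DFF0, the location can be mapped to a segment file name. The following is a minimal sketch, not something from the original thread: it assumes the default 16 MB segment size and timeline 1, neither of which is stated in the mail, and the function names are made up for illustration.

```python
# Sketch: map a WAL location such as "3/8811DFF0" to its segment file name.
# Assumptions (not stated in the thread): 16 MB segments, timeline 1.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert 'hi/lo' hexadecimal notation into a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_segment_name(lsn: str, timeline: int = 1,
                     seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the 24-character WAL file name containing the given location."""
    segno = parse_lsn(lsn) // seg_size
    segs_per_xlogid = 0x100000000 // seg_size
    return "%08X%08X%08X" % (timeline,
                             segno // segs_per_xlogid,
                             segno % segs_per_xlogid)


def offset_in_segment(lsn: str, seg_size: int = WAL_SEGMENT_SIZE) -> int:
    """Byte offset of the location inside its segment file."""
    return parse_lsn(lsn) % seg_size


print(wal_segment_name("3/8811DFF0"))
print(hex(offset_in_segment("3/8811DFF0")))
```

Under those assumptions the location falls in segment 000000010000000300000088. pg_xlogdump accepts a start location via its --start option, so the file name is mainly useful for checking that the segment is actually present in pg_xlog or the archive.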

One strange thing is that xlog replay sets up an error context, so you
would have had a line like "xlog redo HEAP" etc, but there's nothing
here.  So maybe the allocation is not exactly in xlog replay, but
something different.  We'd need to see a backtrace in order to see what.
Since this occurs in the startup process, probably the easiest way is to
patch the source to turn that error into PANIC, then re-run and examine
the resulting core file.
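
For context on why this request is rejected at all: PostgreSQL refuses ordinary allocations larger than MaxAllocSize, which is defined as 0x3fffffff (1 GB minus one byte) in memutils.h. The sketch below only checks the reported size against that limit; the helper name is hypothetical, and it says nothing about where the bad length came from.

```python
# Sketch: compare the size from the FATAL message with PostgreSQL's MaxAllocSize.
MAX_ALLOC_SIZE = 0x3FFFFFFF  # 1 GB - 1, the limit for ordinary palloc requests


def describe_alloc_request(size: int) -> dict:
    """Summarise an allocation request size relative to MaxAllocSize."""
    return {
        "hex": hex(size),
        "exceeds_max_alloc": size > MAX_ALLOC_SIZE,
        "times_over_limit": round(size / MAX_ALLOC_SIZE, 2),
    }


print(describe_alloc_request(3445219328))
```

A request of 3445219328 bytes (0xcd59e000) is more than three times the limit, which is consistent with a length field having been read from damaged or misinterpreted data, though that remains a guess until a backtrace is available.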

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: diagram.png (PNG image)

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general