Hot-standby 9.6 server stopped after losing master, won't start nor be promoted

David Guyot <david.guyot@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx> · Tue, 17 Oct 2017 14:29:24 +0200

Hello, there.

I just ran into a problem with a 9.6 hot-standby setup: by mistake, we
stopped the master and destroyed its data. Some tens of minutes later,
when we noticed the problem, we tried to promote the standby server, but
it had shut itself down after the master failed. We could not start it
again, so we could not promote it either. I don't understand why, but I
assume I'm missing something about the replication process that would
explain it. The standby's logs at the time of the master failure are the
following:
2017-10-17 10:51:39.182 CEST,,,32236,,59dcdafa.7dec,2,,2017-10-10
16:36:42 CEST,,0,FATAL,XX000,"could not receive data from WAL stream:
SSL SYSCALL error: EOF detected",,,,,,,,,""
2017-10-17 10:51:39.540 CEST,,,32142,,59dcdaf4.7d8e,13,,2017-10-10
16:36:36 CEST,,0,LOG,00000,"invalid resource manager ID 32 at
28/B0163E8",,,,,,,,,""
2017-10-17 10:51:39.642 CEST,,,8532,,59e5c49b.2154,1,,2017-10-17
10:51:39 CEST,,0,LOG,00000,"started streaming WAL from primary at
28/B000000 on timeline 1",,,,,,,,,""
2017-10-17 10:51:53.760 CEST,,,8532,,59e5c49b.2154,2,,2017-10-17
10:51:39 CEST,,0,LOG,00000,"replication terminated by primary
server","End of WAL reached on timeline 1 at 28/C000098",,,,,,,,""
2017-10-17 10:51:53.760 CEST,,,8532,,59e5c49b.2154,3,,2017-10-17
10:51:39 CEST,,0,FATAL,XX000,"could not send end-of-streaming message
to primary: no COPY in progress",,,,,,,,,""
2017-10-17 10:51:54.088 CEST,,,32142,,59dcdaf4.7d8e,14,,2017-10-10
16:36:36 CEST,,0,LOG,00000,"record with incorrect prev-link 3F136/36 at
28/C000098",,,,,,,,,""
2017-10-17 10:51:54.113 CEST,,,8607,,59e5c4aa.219f,1,,2017-10-17
10:51:54 CEST,,0,FATAL,XX000,"could not connect to the primary server:
could not connect to server: Connection refused
        Is the server running on host « xxxxxx »
(2001:41d0:xxxx:xxxx::1) and accepting
        TCP/IP connections on port 5433?
could not connect to server: Connection refused
        Is the server running on host « xxxxxx » (137.xx.xx.xx) and
accepting
        TCP/IP connections on port 5433?",,,,,,,,,""
2017-10-17 10:51:59.133 CEST,,,8610,,59e5c4af.21a2,1,,2017-10-17
10:51:59 CEST,,0,FATAL,XX000,"could not connect to the primary server:
could not connect to server: Connection refused
        Is the server running on host « xxxxxx »
(2001:41d0:xxxx:xxxx::1) and accepting
        TCP/IP connections on port 5433 ?
could not connect to server: Connection refused
        Is the server running on host « xxxxxx » (137.xx.xx.xx) and
accepting
        TCP/IP connections on port 5433?",,,,,,,,,""
2017-10-17 10:52:03.969 CEST,,,32142,,59dcdaf4.7d8e,15,,2017-10-10
16:36:36 CEST,,0,FATAL,XX000,"could not restore file «
00000001000000280000000C » from archive: child process exited with exit
code 255",,,,,,,,,""
2017-10-17 10:52:03.977 CEST,,,32139,,59dcdaf4.7d8b,2,,2017-10-10
16:36:36 CEST,,0,LOG,00000,"startup process (PID 32142) exited with
exit code 1",,,,,,,,,""
2017-10-17 10:52:03.977 CEST,,,32139,,59dcdaf4.7d8b,3,,2017-10-10
16:36:36 CEST,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""
2017-10-17 10:52:03.990 CEST,,,32139,,59dcdaf4.7d8b,4,,2017-10-10
16:36:36 CEST,,0,LOG,00000,"database system is shut down",,,,,,,,,""

As I understand these lines, the standby server lost the WAL stream to
the master and tried to reconnect; when that failed, it tried to retrieve
the last archive and, failing again, stopped. First problem: in such a
case, I thought the standby would stay online, waiting for instructions,
instead of shutting itself down. Am I wrong about that? If not, what did
I miss that prompted the standby to shut itself down?
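
The sequence is easier to follow once the wrapped CSV log lines are
re-joined and parsed. Here is a minimal sketch in Python, assuming the
standard 23-column csvlog layout of PostgreSQL 9.6. The two sample lines
are copied from the log above with the mail line wraps removed, and the
function names are only illustrative:

```python
import csv
import io

# Column order of the csvlog format in PostgreSQL 9.6 (23 fields).
CSVLOG_COLUMNS = [
    "log_time", "user_name", "database_name", "process_id",
    "connection_from", "session_id", "session_line_num", "command_tag",
    "session_start_time", "virtual_transaction_id", "transaction_id",
    "error_severity", "sql_state_code", "message", "detail", "hint",
    "internal_query", "internal_query_pos", "context", "query",
    "query_pos", "location", "application_name",
]

# Two lines from the standby log, re-joined after mail wrapping.
SAMPLE = '''
2017-10-17 10:52:03.969 CEST,,,32142,,59dcdaf4.7d8e,15,,2017-10-10 16:36:36 CEST,,0,FATAL,XX000,"could not restore file « 00000001000000280000000C » from archive: child process exited with exit code 255",,,,,,,,,""
2017-10-17 10:52:03.977 CEST,,,32139,,59dcdaf4.7d8b,2,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"startup process (PID 32142) exited with exit code 1",,,,,,,,,""
'''


def parse_csvlog(text):
    """Return one dict per log entry, keyed by csvlog column name."""
    reader = csv.reader(io.StringIO(text))
    # csv.reader yields an empty list for blank lines; skip those.
    return [dict(zip(CSVLOG_COLUMNS, row)) for row in reader if row]


def fatal_events(text):
    """Return (process_id, message) for every FATAL or PANIC entry."""
    return [
        (entry["process_id"], entry["message"])
        for entry in parse_csvlog(text)
        if entry["error_severity"] in ("FATAL", "PANIC")
    ]


if __name__ == "__main__":
    for pid, message in fatal_events(SAMPLE):
        print(pid, message)
```

Run against the full log, this would list which process hit each FATAL
entry, which makes it clear that PID 32142 (the startup process) is the
one that died on the archive restore.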

Then, when we tried to restart it to promote it, we got these lines:
2017-10-17 10:51:39.182 CEST,,,32236,,59dcdafa.7dec,2,,2017-10-10
16:36:42 CEST,,0,FATAL,XX000,"could not receive data from WAL stream:
SSL SYSCALL error: EOF detected",,,,,,,,,""
2017-10-17 10:51:39.540 CEST,,,32142,,59dcdaf4.7d8e,13,,2017-10-10
16:36:36 CEST,,0,LOG,00000,"invalid resource manager ID 32 at
28/B0163E8",,,,,,,,,""
2017-10-17 10:51:39.642 CEST,,,8532,,59e5c49b.2154,1,,2017-10-17
10:51:39 CEST,,0,LOG,00000,"started streaming WAL from primary at
28/B000000 on timeline 1",,,,,,,,,""
2017-10-17 10:51:53.760 CEST,,,8532,,59e5c49b.2154,2,,2017-10-17
10:51:39 CEST,,0,LOG,00000,"replication terminated by primary
server","End of WAL reached on timeline 1 at 28/C000098",,,,,,,,""
2017-10-17 10:51:53.760 CEST,,,8532,,59e5c49b.2154,3,,2017-10-17
10:51:39 CEST,,0,FATAL,XX000,"could not send end-of-streaming message
to primary: no COPY in progress",,,,,,,,,""
2017-10-17 10:51:54.088 CEST,,,32142,,59dcdaf4.7d8e,14,,2017-10-10
16:36:36 CEST,,0,LOG,00000,"record with incorrect prev-link 3F136/36 at
28/C000098",,,,,,,,,""
2017-10-17 10:51:54.113 CEST,,,8607,,59e5c4aa.219f,1,,2017-10-17
10:51:54 CEST,,0,FATAL,XX000,"could not connect to the primary server:
could not connect to server: Connection refused
        Is the server running on host « xxxxxx »
(2001:41d0:xxxx:xxxx::1) and accepting
        TCP/IP connections on port 5433?
could not connect to server: Connection refused
        Is the server running on host « xxxxxx » (137.xx.xx.xx) and
accepting
        TCP/IP connections on port 5433?",,,,,,,,,""
2017-10-17 10:51:59.133 CEST,,,8610,,59e5c4af.21a2,1,,2017-10-17
10:51:59 CEST,,0,FATAL,XX000,"could not connect to the primary server:
could not connect to server: Connection refused
        Is the server running on host « xxxxxx »
(2001:41d0:xxxx:xxxx::1) and accepting
        TCP/IP connections on port 5433 ?
could not connect to server: Connection refused
        Is the server running on host « xxxxxx » (137.xx.xx.xx) and
accepting
        TCP/IP connections on port 5433?",,,,,,,,,""
2017-10-17 10:52:03.969 CEST,,,32142,,59dcdaf4.7d8e,15,,2017-10-10
16:36:36 CEST,,0,FATAL,XX000,"could not restore file «
00000001000000280000000C » from archive: child process exited with exit
code 255",,,,,,,,,""
2017-10-17 10:52:03.977 CEST,,,32139,,59dcdaf4.7d8b,2,,2017-10-10
16:36:36 CEST,,0,LOG,00000,"startup process (PID 32142) exited with
exit code 1",,,,,,,,,""
2017-10-17 10:52:03.977 CEST,,,32139,,59dcdaf4.7d8b,3,,2017-10-10
16:36:36 CEST,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""
2017-10-17 10:52:03.990 CEST,,,32139,,59dcdaf4.7d8b,4,,2017-10-10
16:36:36 CEST,,0,LOG,00000,"database system is shut down",,,,,,,,,""

It seems that the server complains that the WAL stream was abruptly
stopped and that, as it fails to reconnect to the master, it can't
check whether its own (the standby's) data are up to date, so it refuses
to start. If so, is that the expected behaviour? I would not expect
something as important as replication to fail to start on master
failure, as its goal is to mitigate exactly that failure, so I assume I
did something wrong, but what? Otherwise, what should I do to allow the
standby to restart and be promoted?
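
To narrow this down, the archive restore step can be run by hand to see
which exit status it returns for the segment named in the log. Below is a
minimal sketch in Python. The real restore_command from recovery.conf is
not shown in this message, so the template passed in is a placeholder,
and the helper name is hypothetical. Only the %f and %p placeholders are
expanded here, which is a simplification of what PostgreSQL does:

```python
import shlex
import subprocess


def run_restore_command(template, wal_file, dest_path):
    """Expand %f and %p in a restore_command template, run it through
    the shell, and return the exit status of the command."""
    command = (
        template
        .replace("%f", shlex.quote(wal_file))
        .replace("%p", shlex.quote(dest_path))
    )
    return subprocess.run(command, shell=True).returncode


if __name__ == "__main__":
    # Placeholder template: substitute the actual restore_command here.
    status = run_restore_command(
        "cp /path/to/archive/%f %p",
        "00000001000000280000000C",
        "/tmp/restore_test_segment",
    )
    print("restore command exit status:", status)
```

Comparing the status printed here with the "exit code 255" in the log
above should show whether the command itself fails the same way outside
of PostgreSQL.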

Awaiting your answers,

Regards.
-- 
David Guyot
Administrateur système / Sysadmin
Europe Camions Interactive / Stockway
Moulin Collot F-88500 Ambacourt
Tél : +33 (0)3 29 30 47 85