Hello,

I just ran into a problem with a PostgreSQL 9.6 hot standby setup. By mistake, we stopped the master and destroyed its data. Some tens of minutes later, when we noticed the problem, we tried to promote the standby server. However, the standby had shut itself down after the master failed, and we were not able to start it again, so we could not promote it either. I don't understand why; I assume I'm missing something about the replication process that would explain this.

The standby's logs at the time of the master failure are the following:

2017-10-17 10:51:39.182 CEST,,,32236,,59dcdafa.7dec,2,,2017-10-10 16:36:42 CEST,,0,FATAL,XX000,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,""
2017-10-17 10:51:39.540 CEST,,,32142,,59dcdaf4.7d8e,13,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"invalid resource manager ID 32 at 28/B0163E8",,,,,,,,,""
2017-10-17 10:51:39.642 CEST,,,8532,,59e5c49b.2154,1,,2017-10-17 10:51:39 CEST,,0,LOG,00000,"started streaming WAL from primary at 28/B000000 on timeline 1",,,,,,,,,""
2017-10-17 10:51:53.760 CEST,,,8532,,59e5c49b.2154,2,,2017-10-17 10:51:39 CEST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 1 at 28/C000098",,,,,,,,""
2017-10-17 10:51:53.760 CEST,,,8532,,59e5c49b.2154,3,,2017-10-17 10:51:39 CEST,,0,FATAL,XX000,"could not send end-of-streaming message to primary: no COPY in progress",,,,,,,,,""
2017-10-17 10:51:54.088 CEST,,,32142,,59dcdaf4.7d8e,14,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"record with incorrect prev-link 3F136/36 at 28/C000098",,,,,,,,,""
2017-10-17 10:51:54.113 CEST,,,8607,,59e5c4aa.219f,1,,2017-10-17 10:51:54 CEST,,0,FATAL,XX000,"could not connect to the primary server: could not connect to server: Connection refused Is the server running on host « xxxxxx » (2001:41d0:xxxx:xxxx::1) and accepting TCP/IP connections on port 5433? could not connect to server: Connection refused Is the server running on host « xxxxxx » (137.xx.xx.xx) and accepting TCP/IP connections on port 5433?",,,,,,,,,""
2017-10-17 10:51:59.133 CEST,,,8610,,59e5c4af.21a2,1,,2017-10-17 10:51:59 CEST,,0,FATAL,XX000,"could not connect to the primary server: could not connect to server: Connection refused Is the server running on host « xxxxxx » (2001:41d0:xxxx:xxxx::1) and accepting TCP/IP connections on port 5433 ? could not connect to server: Connection refused Is the server running on host « xxxxxx » (137.xx.xx.xx) and accepting TCP/IP connections on port 5433?",,,,,,,,,""
2017-10-17 10:52:03.969 CEST,,,32142,,59dcdaf4.7d8e,15,,2017-10-10 16:36:36 CEST,,0,FATAL,XX000,"could not restore file « 00000001000000280000000C » from archive: child process exited with exit code 255",,,,,,,,,""
2017-10-17 10:52:03.977 CEST,,,32139,,59dcdaf4.7d8b,2,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"startup process (PID 32142) exited with exit code 1",,,,,,,,,""
2017-10-17 10:52:03.977 CEST,,,32139,,59dcdaf4.7d8b,3,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""
2017-10-17 10:52:03.990 CEST,,,32139,,59dcdaf4.7d8b,4,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"database system is shut down",,,,,,,,,""

As I understand these lines, the standby lost its WAL stream from the master and tried to reconnect; when that failed, it tried to retrieve the latest WAL segment from the archive and, when that failed too, it shut down.

First problem: in such a case, I expected the standby to stay online and wait for instructions instead of shutting itself down. Am I wrong about that? If not, what did I miss that caused the standby to shut itself down?

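To double-check my reading that the file requested from the archive is the segment holding the last position received, here is a minimal sketch in Python. It assumes the default 16 MB WAL segment size and the usual segment naming (timeline, then the two halves of the segment number, each as 8 hex digits); the helper names are my own, not anything provided by PostgreSQL.

```python
# Sketch only: assumes the default 16 MB WAL segment size of a stock 9.6 build.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
SEGMENTS_PER_LOG = 0x100000000 // WAL_SEGMENT_SIZE  # 256 segments per 4 GB "log"


def parse_lsn(lsn: str) -> int:
    """Turn an LSN such as '28/C000098' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(timeline: int, lsn: str) -> str:
    """Name of the WAL segment file that contains the given LSN."""
    segno = parse_lsn(lsn) // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (
        timeline,
        segno // SEGMENTS_PER_LOG,
        segno % SEGMENTS_PER_LOG,
    )


if __name__ == "__main__":
    # LSN reported as "End of WAL reached on timeline 1" in the log above.
    print(wal_file_name(1, "28/C000098"))
```

Under those assumptions, the end-of-WAL position 28/C000098 on timeline 1 falls in segment 00000001000000280000000C, which is the file named in the "could not restore file" message.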
Then, when we tried to restart it to promote it, we got these lines:

2017-10-17 10:51:39.182 CEST,,,32236,,59dcdafa.7dec,2,,2017-10-10 16:36:42 CEST,,0,FATAL,XX000,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,""
2017-10-17 10:51:39.540 CEST,,,32142,,59dcdaf4.7d8e,13,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"invalid resource manager ID 32 at 28/B0163E8",,,,,,,,,""
2017-10-17 10:51:39.642 CEST,,,8532,,59e5c49b.2154,1,,2017-10-17 10:51:39 CEST,,0,LOG,00000,"started streaming WAL from primary at 28/B000000 on timeline 1",,,,,,,,,""
2017-10-17 10:51:53.760 CEST,,,8532,,59e5c49b.2154,2,,2017-10-17 10:51:39 CEST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 1 at 28/C000098",,,,,,,,""
2017-10-17 10:51:53.760 CEST,,,8532,,59e5c49b.2154,3,,2017-10-17 10:51:39 CEST,,0,FATAL,XX000,"could not send end-of-streaming message to primary: no COPY in progress",,,,,,,,,""
2017-10-17 10:51:54.088 CEST,,,32142,,59dcdaf4.7d8e,14,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"record with incorrect prev-link 3F136/36 at 28/C000098",,,,,,,,,""
2017-10-17 10:51:54.113 CEST,,,8607,,59e5c4aa.219f,1,,2017-10-17 10:51:54 CEST,,0,FATAL,XX000,"could not connect to the primary server: could not connect to server: Connection refused Is the server running on host « xxxxxx » (2001:41d0:xxxx:xxxx::1) and accepting TCP/IP connections on port 5433? could not connect to server: Connection refused Is the server running on host « xxxxxx » (137.xx.xx.xx) and accepting TCP/IP connections on port 5433?",,,,,,,,,""
2017-10-17 10:51:59.133 CEST,,,8610,,59e5c4af.21a2,1,,2017-10-17 10:51:59 CEST,,0,FATAL,XX000,"could not connect to the primary server: could not connect to server: Connection refused Is the server running on host « xxxxxx » (2001:41d0:xxxx:xxxx::1) and accepting TCP/IP connections on port 5433 ? could not connect to server: Connection refused Is the server running on host « xxxxxx » (137.xx.xx.xx) and accepting TCP/IP connections on port 5433?",,,,,,,,,""
2017-10-17 10:52:03.969 CEST,,,32142,,59dcdaf4.7d8e,15,,2017-10-10 16:36:36 CEST,,0,FATAL,XX000,"could not restore file « 00000001000000280000000C » from archive: child process exited with exit code 255",,,,,,,,,""
2017-10-17 10:52:03.977 CEST,,,32139,,59dcdaf4.7d8b,2,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"startup process (PID 32142) exited with exit code 1",,,,,,,,,""
2017-10-17 10:52:03.977 CEST,,,32139,,59dcdaf4.7d8b,3,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""
2017-10-17 10:52:03.990 CEST,,,32139,,59dcdaf4.7d8b,4,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"database system is shut down",,,,,,,,,""

It seems that the server complains that the WAL stream was abruptly interrupted and that, since it cannot reconnect to the master, it cannot check whether its own (the standby's) data is up to date, so it refuses to start. If so, is that consistent with the expected behavior? I would not expect something as important as replication to fail to start after a master failure, since its purpose is to mitigate exactly that kind of failure, so I assume I did something wrong, but what? Otherwise, what should I do to allow the standby to restart and be promoted?

Awaiting your answers,
Regards.

--
David Guyot
Administrateur système / Sysadmin
Europe Camions Interactive / Stockway
Moulin Collot
F-88500 Ambacourt
Tél : +33 (0)3 29 30 47 85