>>On Wed, Mar 18, 2020 at 10:57:22AM -0300, Norberto Dellê wrote:
>> This setup worked very well for a long time, but since we upgraded
>> PostgreSQL to newer versions (10.x), sometimes it just skips archiving some
>> wal files. This seems to happen mainly when the server is shut down.
>> There's no messages about errors in the logs, and also no corresponding
>> .done file in archive_status, it just goes to the next wal file.
>>
>> I would like to know if any of you had or has a problem like this, and if
>> you had, how you solved it.
>
>Hmm. I don't recall seeing any of that. This would mean that we are
>either missing the generation of some .ready file, or that some .done
>file gets generated when they should not in archive_status/. What
>kind of server shutdown are you doing? Immediate so as recovery
>happens at the follow-up startup. Or is that a clean service
>shutdown?
>--
>Michael
>> This setup worked very well for a long time, but since we upgraded
>> PostgreSQL to newer versions (10.x), sometimes it just skips archiving some
>> wal files. This seems to happen mainly when the server is shut down.
>> There's no messages about errors in the logs, and also no corresponding
>> .done file in archive_status, it just goes to the next wal file.
>>
>> I would like to know if any of you had or has a problem like this, and if
>> you had, how you solved it.
>
>Hmm. I don't recall seeing any of that. This would mean that we are
>either missing the generation of some .ready file, or that some .done
>file gets generated when they should not in archive_status/. What
>kind of server shutdown are you doing? Immediate so as recovery
>happens at the follow-up startup. Or is that a clean service
>shutdown?
>--
>Michael
Almost daily the server is shutdown, because my client wants to save energy.
In the logs, it appears that the service was not shut down properly. Here's the last lines of one log file:
2020-03-16 18:10:03.130 -04 [1964] LOG: received fast shutdown request
2020-03-16 18:10:03.132 -04 [2948] ERROR: canceling statement due to user request
2020-03-16 18:10:03.192 -04 [1964] LOG: aborting any active transactions
2020-03-16 18:10:03.194 -04 [1964] LOG: worker process: logical replication launcher (PID 2948) exited with exit code 1
2020-03-16 18:10:03.210 -04 [2188] FATAL: terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [6688] FATAL: terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [3348] FATAL: terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [6356] FATAL: terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [5736] FATAL: terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [4028] FATAL: terminating connection due to administrator command
2020-03-16 18:10:03.211 -04 [2552] FATAL: terminating connection due to administrator command
2020-03-16 18:10:03.212 -04 [4280] FATAL: terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [2336] FATAL: terminating connection due to administrator command
2020-03-16 18:10:03.227 -04 [6564] FATAL: terminating connection due to administrator command
2020-03-16 18:10:03.255 -04 [2884] LOG: shutting down
When the service is started the next day, this is what Postgres logs:
2020-03-17 07:47:26.721 -04 [2272] LOG: database system was interrupted; last known up at 2020-03-16 17:56:20 -04
2020-03-17 07:47:34.974 -04 [2884] FATAL: the database system is starting up
2020-03-17 07:47:39.378 -04 [2272] LOG: database system was not properly shut down; automatic recovery in progress
2020-03-17 07:47:39.631 -04 [2272] LOG: redo starts at 38/F30F5108
2020-03-17 07:47:40.038 -04 [3052] FATAL: the database system is starting up
2020-03-17 07:47:40.126 -04 [2272] LOG: redo done at 38/F4001738
2020-03-17 07:47:40.207 -04 [2272] LOG: last completed transaction was at log time 2020-03-16 18:06:53.269798-04
2020-03-17 07:47:42.590 -04 [2080] LOG: database system is ready to accept connections
In this shutdown/startup process, this PostgreSQL instance skipped archiving a wal file.
Another thing that I perceived today is that in another instance, days after (about a week) skipping a wal file, the file was finally archived.
Both of these instances that I talked about also have streaming replication configured. But I don't think it is related because there other instances skipping archiving wal files that don't have streaming replication configured.
Norberto