Re: .history': No such file or directory - a symptom of ?

Stephen Frost <sfrost@xxxxxxxxxxx> · Sun, 13 Aug 2023 19:22:25 -0400

Greetings,

* lejeczek (peljasz@xxxxxxxxxxx) wrote:
> On 10/08/2023 17:41, Stephen Frost wrote:
> > * lejeczek (peljasz@xxxxxxxxxxx) wrote:
> > > What is that - as per subject - a symptom of exactly?
> > Arguably, a poor restore command being used ...
> > 
> > PostgreSQL will request .history files when doing recovery and will keep
> > requesting them, by default, until it finds that one isn't there- it
> > will then target the last timeline that it found to perform replay to
> > (target timeline = latest).  We do also look for .history files when
> > going through the promotion process and similarly will keep checking
> > for timeline files until we don't find one and then that's the timeline
> > we will move to for the promotion and we'll then immediately push a new
> > .history file into the archive to 'claim' that timeline.
> > 
> > Basically, it's not an error and it's entirely intentional that it works
> > that way and your restore command probably shouldn't be complaining
> > about it really (and that's where the actual 'No such file or directory'
> > bit is coming from- not from PG itself).
> > 
> > > I get that there was an issue, but with more details explained. Perhaps
> > > there are docs which explain that?
> > Much more likely that there wasn't actually any issue...
> > 
> > > When it happens I take slave down and do _pg_basebackup_ off the master -
> > > but is there a more "civilized" way to "push" the slave back in sync, maybe
> > > without taking slave off-line?
> > Not following this bit at all.  There being a message about PG not
> > finding a .history file during restore or promotion isn't actually an
> > indication of anything having gone wrong or that the replica is out of
> > sync.  In other words, I don't know that you needed to actually do
> > anything.  Is there some reason you think you did need to do something
> > beside that message being in the log..?

> Perhaps there is not an actual, real issue with synchronization, however the
> logs make me - I'd imagine anybody who is a novice like me - uncomfortable.

You *really* shouldn't be using simple 'cp' commands for your archive or
restore commands.
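
To give an idea of what a plain 'cp' leaves out, here's a rough sketch
in Python.  It is only an illustration under assumptions, not a
replacement for a real tool: the function names archive_wal and
restore_wal are made up for this example, and it assumes a POSIX
filesystem with the archive in a local directory.  The archive side
refuses to overwrite an existing file and syncs to disk before the file
appears under its final name; the restore side returns nonzero quietly
when the requested file isn't there, since PG asking for a file that
doesn't exist is expected and not an error.

```python
import os
import shutil
import tempfile


def archive_wal(src_path: str, archive_dir: str) -> int:
    """Copy one WAL file into archive_dir. Returns 0 on success, 1 on failure."""
    final_path = os.path.join(archive_dir, os.path.basename(src_path))
    if os.path.exists(final_path):
        # Never overwrite something already archived; report failure instead.
        return 1
    # Write under a temporary name so a crash cannot leave a partial file
    # sitting under the real segment name.
    fd, tmp_path = tempfile.mkstemp(dir=archive_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as out, open(src_path, "rb") as src:
            shutil.copyfileobj(src, out)
            out.flush()
            os.fsync(out.fileno())
        # link() fails if final_path appeared in the meantime (no overwrite).
        os.link(tmp_path, final_path)
    except OSError:
        return 1
    finally:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
    # Sync the directory so the new name itself survives a crash.
    dir_fd = os.open(archive_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return 0


def restore_wal(archive_dir: str, file_name: str, dest_path: str) -> int:
    """Fetch one file from the archive. Returns 0 if copied, 1 otherwise."""
    src_path = os.path.join(archive_dir, file_name)
    if not os.path.isfile(src_path):
        # Expected during recovery and promotion: stay quiet, return nonzero.
        return 1
    try:
        shutil.copyfile(src_path, dest_path)
    except OSError:
        return 1
    return 0
```

The return values are meant to be used as the exit status of whatever
small wrapper script calls these functions from archive_command and
restore_command.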

> These logs, the errors never quiet down - I've been waiting a few days.
> 
> from master:
> ....
> 2023-08-11 10:12:18.908 CEST [776006] STATEMENT:  START_REPLICATION
> 0/4E000000 TIMELINE 1
> 2023-08-11 10:12:23.909 CEST [777443] ERROR:  requested WAL segment
> 00000001000000000000004E has already been removed
> 2023-08-11 10:12:23.909 CEST [777443] STATEMENT:  START_REPLICATION
> 0/4E000000 TIMELINE 1
> 2023-08-11 10:12:28.911 CEST [778491] ERROR:  requested WAL segment
> 00000001000000000000004E has already been removed
> 2023-08-11 10:12:28.911 CEST [778491] STATEMENT:  START_REPLICATION
> 0/4E000000 TIMELINE 1
> ...

This is saying that the replica is asking the primary for a WAL segment
that the primary has already removed from its own WAL directory.  That's
not a problem if you've got a functioning archive repository where the
replica can pull that WAL from.

> from slave:
> ...
> cp: cannot stat '/var/lib/pgsql/pg_archive/00000002.history': No such file
> or directory

As mentioned, this can happen without there being an issue.
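
A simplified model of that probing, as a Python sketch (the names
history_file_name and find_newest_timeline and the 'exists' callback
are made up for illustration; the real logic lives inside PG): starting
from the current timeline, each next timeline's .history file is
requested until one is missing, so the last request in the sequence is
always for a file that doesn't exist.

```python
from typing import Callable, List


def history_file_name(timeline: int) -> str:
    """Timeline history files are named with the timeline ID as 8 hex digits."""
    return f"{timeline:08X}.history"


def find_newest_timeline(start_timeline: int, exists: Callable[[str], bool]) -> int:
    """Probe for successive .history files until one is not found."""
    newest = start_timeline
    probe = start_timeline + 1
    while exists(history_file_name(probe)):
        newest = probe
        probe += 1
    return newest


# Example: an archive with no history files at all, as on timeline 1.
requested: List[str] = []


def empty_archive(name: str) -> bool:
    requested.append(name)  # record what was asked for
    return False            # nothing is there


newest = find_newest_timeline(1, empty_archive)
```

With an empty archive and a start on timeline 1, the only file requested
is 00000002.history, which is exactly the name in the 'cp' message above.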

> 2023-08-11 10:12:38.919 CEST [773947] LOG:  waiting for WAL to become
> available at 0/4E002000
> cp: cannot stat '/var/lib/pgsql/pg_archive/00000001000000000000004E': No
> such file or directory
> 2023-08-11 10:12:43.916 CEST [1050527] LOG:  started streaming WAL from
> primary at 0/4E000000 on timeline 1
> 2023-08-11 10:12:43.916 CEST [1050527] FATAL:  could not receive data from
> WAL stream: ERROR:  requested WAL segment 00000001000000000000004E has
> already been removed

This is a problem though- the primary doesn't have the WAL and neither
does the archive.  Without that WAL, the replica can't play forward and
therefore isn't able to ever catch up to where the primary is.  There's
clearly something going wrong if you're properly archiving the WAL on
your primary to some location and then the replica isn't able to fetch
that WAL.
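
If it helps to see how the position in the log lines up with the file
name, here's a small Python sketch.  It assumes the default 16MB
wal_segment_size, and the helper names (parse_lsn, wal_file_name,
segment_in_archive) are made up for this example.

```python
import os

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16MB segments


def parse_lsn(lsn_text: str) -> int:
    """Turn an LSN printed as 'hi/lo' (both hex) into a 64-bit byte position."""
    hi, lo = lsn_text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(timeline: int, lsn_text: str,
                  seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Name of the WAL segment file that contains the given LSN."""
    segno = parse_lsn(lsn_text) // seg_size
    segs_per_xlogid = 0x100000000 // seg_size
    return (f"{timeline:08X}"
            f"{segno // segs_per_xlogid:08X}"
            f"{segno % segs_per_xlogid:08X}")


def segment_in_archive(archive_dir: str, timeline: int, lsn_text: str) -> bool:
    """True if the segment holding this LSN is present in a local archive dir."""
    return os.path.isfile(os.path.join(archive_dir,
                                       wal_file_name(timeline, lsn_text)))
```

For the position in your log, wal_file_name(1, "0/4E002000") gives
00000001000000000000004E, the same segment the primary says it has
removed and that 'cp' can't find in the archive.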

> So, seeing logs flooded that way.... I don't like it (even if I could be
> sure everything is in sync) particularly for master shows:
> 
> -> $ sudo -u postgres psql --port=5432 -x -c 'select client_addr,sync_state
> from pg_stat_replication;'
> could not change directory to "/root": Permission denied

This is just from psql starting up with /root as its current directory,
which the postgres user can't access, because you ran it through sudo
from there.  That's not actually an issue.

> so I do 'pg_basebackup' then.

Do you have an archive_command configured on your primary..?  I'd
strongly recommend that you set that up and, ideally, use a well written
tool like pgbackrest to handle your backup, recovery, archiving, et al.
Without a WAL archive, you'll have this risk that the WAL which the
replica needs isn't available any more, or otherwise risk running the
primary out of disk space if the replica is offline for a long time.

Thanks,

Stephen