Re: 'replication checkpoint has wrong magic' on the newly cloned replicas

Andres Freund <andres@xxxxxxxxxxx> · Wed, 29 Nov 2017 16:41:07 -0800

Hi,

On 2017-11-29 20:22:43 -0300, Alvaro Herrera wrote:
> Alex Kliukin wrote:
>
> > 2017-11-15 13:15:46.673 CET,,,15154,,5a0c2ff1.3b32,5,,2017-11-15
> > 13:15:45 CET,,0,PANIC,XX000,"replication checkpoint has wrong magic
> > 5714534 instead of 307747550",,,,,,,,,""
>
> Uhh ... I had never heard of this "replication checkpoint" thing.

It contains information about how far logical-replication-like solutions
have replayed from other systems.

> It is part of replication origins feature, which is fairly new stuff
> (see src/backend/replication/logical/origin.c).  I'd bet this problem
> is related to a bug in logical replication "origins" code rather than
> any procedural problems in your base-backup taking setup ...

Possible.

What's the max_replication_origins setting? Is the system receiving
logical replication data? Could you describe the setup a bit? Any chance
the system's partially been running without fsync? Could you attach both
a corrupt and a non-corrupt state file?

It's a bit weird to see such an error because normally the state file's
just written to a temporary file and then renamed into place,
overwriting the old file.
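
To illustrate the write-then-rename pattern, here is a minimal Python
sketch. It is not the PostgreSQL code: the function names, the ".tmp"
suffix and the little-endian 4-byte magic header are assumptions made
for the example, and the directory fsync assumes a POSIX system.

```python
import os
import struct

MAGIC = 0x1257DADE  # magic value quoted in this thread


def write_state_atomically(path, payload):
    """Write magic + payload to a temp file, then rename it over `path`."""
    tmp_path = path + ".tmp"  # hypothetical temp-file naming
    with open(tmp_path, "wb") as f:
        f.write(struct.pack("<I", MAGIC))  # assumed little-endian layout
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # make the new contents durable before renaming
    os.replace(tmp_path, path)  # atomic replacement on POSIX filesystems
    # fsync the containing directory so the rename itself is durable
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def read_magic(path):
    """Return the first four bytes of `path` as an unsigned integer."""
    with open(path, "rb") as f:
        return struct.unpack("<I", f.read(4))[0]
```

With this pattern a reader should see either the complete old file or
the complete new one, which is why a wrong magic is surprising unless
fsync was skipped or the file was copied by something that does not go
through the rename.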

> I wonder if there is some truncation of the 0x1257DADE value that
> produces the 5714534 value you're seeing -- something related to a
> pg_logical/replorigin_checkpoint file being written partially while the
> backup is being taken.
>
> Another point towards not including pg_logical/ contents when taking a
> base backup, I guess ...

You'd cause corruption if logical replication is in use, so no, please
don't.
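
As for the truncation idea quoted above, a quick way to look at it is
to compare the two values from the log line byte by byte. This is only
a rough check; the little-endian layout is an assumption (it matches
x86), and the helper name is made up for the example.

```python
import struct

EXPECTED = 0x1257DADE  # 307747550, the magic the PANIC message expected
FOUND = 5714534        # the value the PANIC message reported


def le_bytes(value):
    """Pack a 32-bit unsigned value as little-endian bytes (assumed layout)."""
    return struct.pack("<I", value)


print(hex(EXPECTED), le_bytes(EXPECTED).hex())  # 0x1257dade deda5712
print(hex(FOUND), le_bytes(FOUND).hex())        # 0x573266 66325700

# Byte positions at which the two 4-byte encodings agree
matching = [i for i in range(4) if le_bytes(EXPECTED)[i] == le_bytes(FOUND)[i]]
print(matching)  # [2]
```

Only one byte (0x57) lines up, so this does not look like a plain
prefix truncation of the magic, though that alone does not rule out a
partially written file.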

- Andres