Greetings,

* Abhishek Bhola (abhishek.bhola@xxxxxxxxxxxxxxx) wrote:
> I recently set up a Postgres 15 master-slave (primary-secondary) cluster on
> 2 nodes. At the time of starting, I ensured that the data files were exactly
> the same on both nodes. The size of the DB is 1.5TB.
> The directory structure on both nodes looks as follows:
> ```
> - /PROD/datadg/
>   |
>   |-> /PROD/datadg/tablespace
>   |     |-> /PROD/datadg/tablespace/tablespace1
>   |     |-> /PROD/datadg/tablespace/tablespace2
>   |
>   |-> /PROD/datadg/data
>   |     |-> /PROD/datadg/data/pg_tblspc
>   |           |-> /PROD/datadg/data/pg_tblspc/16432 -> /PROD/datadg/tablespace/tablespace1
>   |           |-> /PROD/datadg/data/pg_tblspc/16433 -> /PROD/datadg/tablespace/tablespace2
> ```
> Almost a week later, I see almost a 2GB size difference between the
> tablespace folders on the two nodes. I also see a difference in file counts
> on the two nodes.

Not really enough to go on here.

> `autovacuum` is on on both nodes and there aren't any `idle in
> transaction` queries on the slave node. There also hasn't been any
> disruption to the streaming replication. I did not get any error like `WAL
> segment already removed` on the slave node. `pg_stat_replication` on the
> master node doesn't show anything out of the ordinary either, and the
> `sent_lsn`, `write_lsn` and `flush_lsn` are regularly updated. I cannot see
> a difference in the counts of most tables either, though I haven't verified
> all of them.
> So my **first question** is:
>
> * Why is there a difference in the files in the tablespace folder? I can
> understand the difference in the modification timestamps, but some files
> are just missing on the slave node.

Unlogged tables would certainly be a pretty easy explanation of size
differences between the two. There are also temporary files that might be
created on one system but not the other for in-progress queries. These
aren't the only possibilities, just a couple of likely candidates.
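Both of those candidates are easy to check for. A quick sketch, assuming `psql` access on the primary — the database name `mydb` is illustrative, and the `find` path follows the directory layout from the original post:

```shell
# Unlogged relations (relpersistence = 'u') are replicated as empty on a
# standby, so their on-disk files will differ in size between the nodes.
psql -d mydb -c "SELECT n.nspname, c.relname
                 FROM pg_class c
                 JOIN pg_namespace n ON n.oid = c.relnamespace
                 WHERE c.relpersistence = 'u';"

# Temporary files for in-progress queries live in pgsql_tmp directories
# under the data/tablespace trees and need not match across nodes.
find /PROD/datadg -type d -name 'pgsql_tmp' -exec du -sh {} +
```

If either turns up non-trivial amounts of data, that alone can account for a couple of GB of difference without anything being wrong.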
> Now if I were to run `vacuumdb` on the master node, there are chances that
> the slave node will break and give an error like this:
> ```
> PANIC,XX000,"WAL contains references to invalid pages",,,,,"WAL redo at
> 875E/21A70BD0 for Heap2/VISIBLE: cutoff xid 60350476 flags 0x01;
> blkref #0: rel 16405/16419/533716933, fork 2, blk 26;
> blkref #1: rel 16405/16419/533716933, blk 853758",,,,"","startup
> ```

Did you actually run vacuumdb and actually see this? Are you able to
reproduce it?

> In the case when the slave node breaks, these are the steps I usually take
> to bring it back:
> 1. Run `pg_backup_start('backup')` on the master node
> 2. rsync the files from master to slave by running the following on the
>    slave node:
>    ```
>    rsync -av --delete master_node:/PROD/datadg/data/ /PROD/datadg/data
>        --exclude 'pg_log' --exclude 'pg_replslot'
>    ```
> 3. Run `pg_backup_stop()` on the master node

This really isn't good enough: rsync will compare by timestamp/size by
default, and this also doesn't grab and restore the absolutely required
backup_label file that's returned from pg_backup_stop(). Basically, this is
not a valid way to perform a backup/restore of PG. Please go read the
documentation for how to use the low-level API ... or, even better, use an
existing, well-maintained backup/restore tool for this.

> 4. Start the slave node again and it usually works, even though the
>    tablespace files might still not be the same.

If you didn't grab the backup_label from pg_backup_stop() and drop it into
place on the replica, then you've almost certainly got a corrupt system.

> **Second question**:
>
> * What is the best way to bring the slave node back? Is the `rsync` between
> tablespaces required? And if yes, what is the best method to do it for very
> large databases, maybe as big as 30TB or more? I don't want to `rsync` all
> the files even if the timestamp on them is different. So is a command like
> this safe to do?
> Or should an option like `--checksum` be used?
> ```
> rsync -av --delete master_node:/PROD/datadg/tablespace/
>     /PROD/datadg/tablespace --size-only
> ```

Using --size-only is an absolutely horrid idea and will miss lots of
changes. What you're doing isn't working because you're just creating
corrupt replicas by not following the process correctly for performing a
backup and restore of PG. This stuff isn't easy. Use a tool for it,
especially for larger systems.

> **Third question:**
>
> * Is it advisable to run `vacuumdb` before or after bringing the slave node
> back?

No, this isn't necessary.

Thanks,

Stephen
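For reference, the low-level-API flow that the quoted steps are missing looks roughly like this. This is a sketch, not a drop-in script: the hostname and paths follow the original post, and the psql session that runs pg_backup_start() must be the same session that later runs pg_backup_stop():

```shell
# In ONE psql session on the primary, held open for the whole copy (PG 15):
#   SELECT pg_backup_start('backup', fast => true);
#   ... run the rsync commands below from another shell ...
#   SELECT * FROM pg_backup_stop();  -- returns lsn, labelfile, spcmapfile

# Copy the cluster, comparing file contents rather than timestamp/size:
rsync -av --delete --checksum \
    master_node:/PROD/datadg/data/ /PROD/datadg/data \
    --exclude 'pg_log' --exclude 'pg_replslot'

# The tablespace trees must be copied too; the pg_tblspc entries are only
# symlinks pointing at them:
rsync -av --delete --checksum \
    master_node:/PROD/datadg/tablespace/ /PROD/datadg/tablespace

# Finally, write the labelfile column returned by pg_backup_stop() into
# /PROD/datadg/data/backup_label before starting the standby.
```

Note that --checksum still reads every byte on both sides, so for a 30TB system an incremental-capable backup tool will be far faster than any rsync variant.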