Re: Difference in the tablespace folders on primary and secondary nodes

Abhishek Bhola <abhishek.bhola@xxxxxxxxxxxxxxx> · Wed, 26 Jul 2023 15:12:06 +0900

Hi Stephen
I got the latest documentation and understood that I was supposed to run`select * from pg_backup_stop();`  and store the labelfile output to a file in the data directory of the secondary node.

I also understood that this Low-level API backup is not really a recommended way and I need to move to another method.
However, till I fully set up a tool like pgBackRest (of which you are one of the contributors - very impressive!) for PROD, can you please tell me:
1) Are there any other options that are safe to use yet fast? Like you said rsync --size-only would miss a lot of changes, but anything that would also not start syncing every data file with only different timestamp, even though it is exactly the same, including the checksum.
2) While rsyncing from the master node, do I need to exclude `pg_wal` folder?
3) The replica/standby node should have access to the WAL segments archived during time between start and stop backup, right?
4) What kind of data corruption would have been introduced due to the absence of labelfile and if there is a way to identify the corruption? So for example, if I am able to reindex all the tables, select all the data and vacuum the db, do I still need to pg_dump and pg_restore those DBs?

Thanks

On Wed, Jul 26, 2023 at 11:00 AM Abhishek Bhola <abhishek.bhola@xxxxxxxxxxxxxxx> wrote:
Hi Stephen

Thank you for your reply.

Basically, this is not a valid way to perform a backup/restore of PG. 
Is it not valid only for PG 15 or even for earlier versions? I have always referred to this https://www.postgresql.org/docs/8.1/backup-online.html 

better, use an existing well maintained backup/restore tool for this
Is there any tool you could recommend? A tool to bring back the standby node when it was down for a day or so for some OS/firmware upgrade.
I have tried pgBackRest, but it requires a control node. So anything that doesn't involve a 3rd server for this situation would be helpful.

 this also doesn't grab and restore the absolutely required
backup_label file that's returned from pg_backup_stop()
I tried running pg_backup_start('backup') and pg_backup_stop() on my DB, but I did not see any backup_label_file being created.
psql (15.1)
Type "help" for help.

postgres=# select pg_backup_start('backup');
 pg_backup_start
-----------------
 68/32000028
(1 row)

postgres=# select pg_backup_stop();
NOTICE:  all required WAL segments have been archived
                                pg_backup_stop
-------------------------------------------------------------------------------
 (68/32000100,"START WAL LOCATION: 68/32000028 (file 000000010000006800000032)+
 CHECKPOINT LOCATION: 68/32000060                                             +
 BACKUP METHOD: streamed                                                      +
 BACKUP FROM: primary                                                         +
 START TIME: 2023-07-26 08:51:28 JST                                          +
 LABEL: backup                                                                +
 START TIMELINE: 1                                                            +
 ","16724 /PGDATA/datadg/tbs1                                                 +
 16725 /PGDATA/datadg/tbs2                                                    +
 ")
(1 row)

I read the documentation on this page https://www.postgresql.org/docs/current/functions-admin.html 
The desired contents of the backup label file and the tablespace map file are returned as part of the result of the function and must be written to files in the backup area.
I don't understand, "must be written to files in the backup area". Does it mean we need to manually create a file first on the master node and then rsync it to the backup node?

Thanks

On Tue, Jul 25, 2023 at 9:22 PM Stephen Frost <sfrost@xxxxxxxxxxx> wrote:
Greetings,

* Abhishek Bhola (abhishek.bhola@xxxxxxxxxxxxxxx) wrote:

> I recently set up a Postgres15 master-slave (Primary-secondary) cluster on

> 2 nodes. At the time of starting, I ensured that data files are exactly the

> same on both the nodes. The size of the DB is 1.5TB.

> The directory structure on both nodes looks as follows:

> ```

>   - /PROD/datadg/

>    |

>    |-> /PROD/datadg/tablespace

>    |   |-> /PROD/datadg/tablespace/tablespace1

>    |   |-> /PROD/datadg/tablespace/tablespace2

>    |

>    |-> /PROD/datadg/data

>    |   |-> /PROD/datadg/data/pg_tblspc

>    |   |   | -> /PROD/datadg/data/pg_tblspc/16432 ->

> /PROD/datadg/tablespace/tablespace1

>    |   |   | -> /PROD/datadg/data/pg_tblspc/16433 ->

> /PROD/datadg/tablespace/tablespace2

> 

> ```

> Almost a week later now, I see almost a 2GB size difference in the

> tablespace folders on the 2 nodes. I also see some file count difference on

> both the nodes.

Not really enough to go on here.

> `autovacuum` is on on both the nodes and there aren't any `idle in

> transaction` queries on the slave node. Also there hasn't been any

> disruption on the streaming replication. I did not get any error like `WAL

> segment already removed` or so on the slave node. `pg_stat_replication` on

> the master node also doesn't show anything out of the ordinary and the

> `sent_lsn`, `write_lsn` and `flush_lsn` are regularly updated. I can not

> see a difference in counts of most tables either, haven't verified for all

> of them.

> So my **first question** is:

> 

> * Why is there a difference in the files in the tablespace folder? I can

> understand the difference in the modification timestamps, but some files

> are just missing on the slave node.

Unlogged tables would certainly be a pretty easy explanation of size

differences between the two.  There's also temporary files that might be

created on one system but not the other for in-progress queries.  These

aren't the only possibilities but just a couple of likely candidates.

> Now if I were to run `vacuumdb` on the master node, there are chances that

> the slave node will break and give an error like this

> ```

> PANIC,XX000,"WAL contains references to invalid pages",,,,,"WAL redo at

> 875E/21A70BD0 for

>  Heap2/VISIBLE: cutoff xid 60350476 flags 0x01; blkref #0: rel

> 16405/16419/533716933, fork 2, blk 26; blkref #1: rel

> 16405/16419/533716933, blk 853758",,,,"","startup

> ```

Did you actually run vacuumdb and actually see this?  Are you able to

reproduce it?

> In the case when slave node breaks, these are the steps I usually do to

> bring the slave node back:

> 1) Start `pg_backup_start('backup')` on the master node

> 2) rsync the files from master to slave by running the following on the

> slave node:

> ```

> rsync -av --delete master_node:/PROD/datadg/data/ /PROD/datadg/data

> --exclude 'pg_log' --exclude 'pg_replslot'

> ```

> 3. Stop `pg_backup_stop()` on  master node

This really isn't good enough as rsync will use timestamp/size by

default and this also doesn't grab and restore the absolutely required

backup_label file that's returned from pg_backup_stop().  Basically,

this is not a valid way to perform a backup/restore of PG.  Please go

read the documentation for how to use the low-level API ... or, even

better, use an existing well maintained backup/restore tool for this.

> 4. Start the slave node again and it usually works, even though the

> tablespace files might not still be the same.

If you didn't grab the backup_label from pg_backup_stop() and drop it

into place on the replica, then you've almost certainly got a corrupt

system.

> **Second question**:

> 

> * What is the best way to bring the slave node back? Is the `rsync` between

> tablespaces required? And if yes, what is the best method to do it for very

> large databases, something maybe as big as 30TB or more. I don't want to

> `rsync` all the files even if the timestamp on them is different. So is a

> command like this safe to do? Or should an option like `--checksum` be

> used?

> ```

> rsync -av --delete master_node:/PROD/datadg/tablespace/

> /PROD/datadg/tablespace --size-only

> ```

Using --size-only is an absolutely horrid idea and will miss lots of

changes.  What you're doing isn't working because you're just creating

corrupt replicas by not following the process correctly for performing a

backup and restore of PG.

This stuff isn't easy.  Use a tool for it, especially for larger

systems.

> **Third question:**

> 

> * Is it advised to run `vacuumdb` before or after bringing the slave node

> back again?

No, this isn't necessary.

Thanks,

Stephen

This correspondence (including any attachments) is for the intended recipient(s) only. It may contain confidential or privileged information or both. No confidentiality or privilege is waived or lost by any mis-transmission. If you receive this correspondence by mistake, please contact the sender immediately, delete this correspondence (and all attachments) and destroy any hard copies. You must not use, disclose, copy, distribute or rely on any part of this correspondence (including any attachments) if you are not the intended recipient(s).本メッセージに記載および添付されている情報（以下、総称して「本情報」といいます。）は、本来の受信者による使用のみを意図しています。誤送信等により本情報を取得された場合でも、本情報に係る秘密、または法律上の秘匿特権が失われるものではありません。本電子メールを受取られた方が、本来の受信者ではない場合には、本情報及びそのコピーすべてを削除・破棄し、本電子メールが誤って届いた旨を発信者宛てにご通知下さいますようお願いします。本情報の閲覧、発信または本情報に基づくいかなる行為も明確に禁止されていることをご了承ください。