Re: Standby Mechanics: WAL vs Streaming

Guillaume Lelarge <guillaume@xxxxxxxxxxxx> · Sun, 27 Aug 2017 08:40:44 +0200

2017-08-27 5:16 GMT+02:00 Don Seiler <don@xxxxxxxxx>:
On Sat, Aug 26, 2017 at 9:42 PM, MichaelDBA <MichaelDBA@xxxxxxxxxxx> wrote:

pgbackrest archives WALs 
faster than a custom archive_command command.  I use pgbackrest to 
automate archiving of WALs from primary to pgbackrest server.  From 
there, my restore_command on the slave pulls WALs if needed to catch up 
to SR (streaming replication). 

Interesting, I'll look into pgbackrest. Thanks!

The standby will recover from the streaming xlog data if local WAL files
 are not available. (So the standby prefers local WAL files to 
streaming?)

The other way around: The standby
 prefers streaming replication.  It only pulls WALs (restore_command) if
 it is too far behind.

OK. That's what I had though initially but for some reason convinced myself that it would think it's easier to get it locally when available.

The way PostgreSQL does it is a lot easier.

When you start the standby, it first looks for archived WAL to replay. When there's none left, it tries to connect to the primary. Once connected, it asks for a specific WAL location. If the primary has it, the streaming starts. It stays that way (using the streaming) as long as it can. If there's an error, for example the primary had a lot of writing to do, and recycle the WAL too quickly, the WAL information may have been gone before being sent to the WAL. In such a case, the primary and the standby will start displaying on the logs that kind of messages: FATAL:  requested WAL segment XXXX has already been removed. When this happens, the streaming connection between the primary and the standby is dropped, and the standby tries to replay the available archived WAL, according the ones it needs. When it finally replays them all, it tries back to connect to the primary (which hopefully will now have the WAL it needs).

So, that's kind of simple:
* first replaying archived WAL
* when no more archived WAL, streaming
* when error, back to WAL replay

I saw that the WAL archives were being removed from the archive 
directory. If the standby was already ahead of the WAL files because of 
streaming replication, would it still use the restore_command to restore
 WAL files from archive to px_log directory? Or would it just delete the
 archive files?

The archived files in your case 
(stored on the standby) would only be removed if you had  
pg_archivecleanup command also in your recovery.conf file for WAL 
cleanup (archive_cleanup_command).

Yes we do have that, sorry I didn't specify. I just wanted to confirm that it would be used to clean up archives that arrive well after their usefulness with streaming replication.

As far as the lag goes, what can we do to 
increase the throughput for streaming replication? We tried to give it 
almost all of our 100Mb pipe between data centers and there was still 
significant lag until it recovered past the point of my manual vacuums.

What do you expect?  Manual 
vacuum full on big tables will generate a lot of WAL thereby causing SR 
lag. 

As I've said in other recent emails to this list, I'm relatively new to these waters, having spent the last 15+ years as an Oracle DBA. I figured that manual vacuum would generate a lot of changes for replication. That's why I'm asking where I can improve the throughput (if possible), given the additional activity spike.

I'm
 going to assume that having both streaming replication and WAL rsyncing
 are going to compete for bandwidth. Would it be advisable to only do 
streaming most of the day and then maybe rsync WAL files in a nightly 
batch? Now that I think of it, we do hourly snapmirrors of that WAL 
storage already (also using the same bandwidth), perhaps we just use 
that for the WAL replication.

How about doing your manual 
vacuums at night time during low load times?

You're assuming that night time is our off-peak time (although I did say "nightly batch", it doesn't mean it's a better time for maintenance work in our case). There's a reason I was running the vacuum when I was this morning, and it isn't something we regularly do. We had a table with 3 corrupt indexes that had prevented autovacuum from running on that table for many months. Today I recreated those indexes and ran a manual vac freeze (it was well past the min freeze age), during an off-peak time negotiated with the business side of the shop.

Thanks again for the pg_backrest information. I'll check it out. I'm thinking at the very least we could just present that NFS snapmirror as a read-only volume to the standby and point restore_command at it. That snapmirror is taking place regardless, so the additional rsync we're doing in our archive_command script on the primary is just piling on and eating up more bandwidth.

-- 
Guillaume.