Hello,
I posted here a couple of months ago about high replication lag on PG 9.2. (I've since switched to another mail server, so I can't reply to the original thread.)
Over the past few months I have made a number of improvements and would like to share them and ask for your suggestions, as the problem is still there.
Yes, I know I'm running a very old PG version, but a migration plan to PG 13 is in place. I wish I had finished that project already, but it's taking longer than expected.
As a reminder, my original setup was 1 master and 1 slave (serving 100% of the read-only traffic), both running on AWS EC2 instances, and the replication lag was reaching up to 10 minutes in some cases.
Since then, I have made the following improvements:
- Deployed more slaves, replacing the original slave with 4 new ones on AWS EBS GP3 volumes.
- Upgraded the Master's volumes from GP2 to GP3.
- Deployed Zabbix with its PostgreSQL plugin for better visibility into the databases and their hosts.
- Decreased max_standby_streaming_delay from 300s to 30s (a query to check whether this is now cancelling queries on the slaves is sketched right after the config below).
- Since the new slaves run on different instance types, I've re-tuned postgresql.conf following pgtune and pgconfig. This is my postgresql.conf for an r4.4xlarge (16 vCPUs, 122 GB RAM):
data_directory = '/pgsql/9.2/main'
hba_file = '/etc/postgresql/9.2/main/pg_hba.conf'
ident_file = '/etc/postgresql/9.2/main/pg_ident.conf'
external_pid_file = '/var/run/postgresql/9.2-main.pid'
hot_standby = on
listen_addresses = '*'
port = 5432
random_page_cost = 1.1
max_connections = 500
unix_socket_directory = '/var/run/postgresql'
shared_buffers = 31232MB
statement_timeout = 0
work_mem = 63963kB
maintenance_work_mem = 2GB
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
track_activity_query_size = 102400
wal_level = hot_standby
fsync = on
synchronous_commit = on
wal_buffers = 16MB
checkpoint_segments = 32
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = '/pgsql/pg-archive-wal-to-slaves.sh "%p"'
archive_timeout = 1800
max_wal_senders = 20
wal_keep_segments = 1024
effective_cache_size = 93696MB
logging_collector = on
log_directory = '/data/postgresql/log'
log_filename = 'postgresql-9.2-main.log.%a'
log_rotation_age = 1440
log_rotation_size = 0
log_truncate_on_rotation = on
log_min_duration_statement = 1000
log_lock_waits = on
log_statement = 'ddl'
log_timezone = 'UTC'
stats_temp_directory = '/var/run/postgresql/9.2-main.pg_stat_tmp'
autovacuum = on
log_autovacuum_min_duration = 1000
autovacuum_max_workers = 5
autovacuum_naptime = 40s
autovacuum_vacuum_threshold = 200
autovacuum_analyze_threshold = 150
autovacuum_vacuum_scale_factor = 0.02
autovacuum_analyze_scale_factor = 0.005
deadlock_timeout = 2s
max_files_per_process = 4096
effective_io_concurrency = 200
hot_standby_feedback = on    # https://dba.stackexchange.com/a/280727
max_standby_streaming_delay = 30s
default_statistics_target = 100
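With hot_standby_feedback = on and the shorter max_standby_streaming_delay, this is the kind of check that can confirm whether the 30s limit is cancelling queries on the slaves (a minimal sketch, run on a standby; the view and its columns exist on 9.2):

-- Run on a standby: cumulative recovery conflicts per database since the last stats reset.
-- Rising confl_snapshot / confl_lock counters would mean queries are being cancelled
-- once the 30s max_standby_streaming_delay expires.
SELECT datname,
       confl_tablespace,
       confl_lock,
       confl_snapshot,
       confl_bufferpin,
       confl_deadlock
FROM pg_stat_database_conflicts
ORDER BY datname;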
After all those changes, our replication lag now peaks at about 3 minutes, with an average of around 1:30 minutes. Even though it has improved a lot, it is still not great, and I was hoping to get a few suggestions from you guys.
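For reference, these are the kinds of queries the lag numbers above can be cross-checked with on 9.2 (a sketch using the pre-10 xlog-era function names):

-- On the master: byte lag of each standby's replay position.
SELECT client_addr,
       state,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
FROM pg_stat_replication;

-- On a standby: seconds since the last replayed commit, or 0 when fully caught up.
SELECT CASE
         WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
           THEN 0
         ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
       END AS replay_lag_seconds;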
Any suggestions/comments will be much appreciated.
Cheers!