Hi all. We have already described [0] our problem on pgsql-general@ and written to pgsql-performance@, but the description was rather confusing, so let me ask for help one more time in a clearer way.

We have a database where one table contains 3+ billion records. The definition of the table is quite simple:

rpopdb04d/rpopdb M # \dS+ rpop.rpop_uidl
                              Table "rpop.rpop_uidl"
 Column |          Type          | Modifiers | Storage  | Stats target | Description
--------+------------------------+-----------+----------+--------------+-------------
 popid  | bigint                 | not null  | plain    |              |
 uidl   | character varying(200) | not null  | extended |              |
Indexes:
    "pk_rpop_uidl" PRIMARY KEY, btree (popid, uidl)

rpopdb04d/rpopdb M #

When vacuuming of this table finishes on the master, we see replay_location on all replicas hang at a specific WAL position for up to 15 minutes. The other positions (sent_location, write_location, flush_location) in the pg_stat_replication view keep increasing as expected.

rpopdb04d/postgres M # select client_hostname, sent_location, write_location, flush_location, replay_location from pg_stat_replication; \watch 1
<...>
Watch every 1s  Fri Mar 20 14:15:16 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com      | 413F/CD8223D8 | 413F/CD81EDF8  | 413F/CD81EDF8  | 413F/CD81D4C0
 rpopdb04e.domain.com      | 413F/CD8223D8 | 413F/CD81D4C0  | 413F/CD81D4C0  | 413F/CD81D3E0
(2 rows)

Watch every 1s  Fri Mar 20 14:15:17 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com      | 413F/CD9887C0 | 413F/CD9884C0  | 413F/CD9881F0  | 413F/CD988058
 rpopdb04e.domain.com      | 413F/CD9887C0 | 413F/CD988570  | 413F/CD9884C0  | 413F/CD9884C0
(2 rows)

Watch every 1s  Fri Mar 20 14:15:18 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com      | 413F/CDA89F38 | 413F/CDA89F38  | 413F/CDA89EA0  | 413F/CDA131C8
 rpopdb04e.domain.com      | 413F/CDA89F38 | 413F/CDA89EA0  | 413F/CDA89EA0  | 413F/CDA131C8
(2 rows)

Watch every 1s  Fri Mar 20 14:15:19 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com      | 413F/CDB8DA20 | 413F/CDB8D7C0  | 413F/CDB8D630  | 413F/CDA131C8
 rpopdb04e.domain.com      | 413F/CDB8DA20 | 413F/CDB8D630  | 413F/CDB8D630  | 413F/CDA131C8

<...>
<9 minutes of stuck replay_location>
<...>

Watch every 1s  Fri Mar 20 14:24:11 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com      | 4140/1F7E28C8 | 4140/1F7E2838  | 4140/1F7E2838  | 413F/CDA131C8
 rpopdb04e.domain.com      | 4140/1F7E28C8 | 4140/1F7E2550  | 4140/1F7E2550  | 413F/CDA131C8
(2 rows)

Watch every 1s  Fri Mar 20 14:24:12 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com      | 4140/1F9F6A28 | 4140/1F9F6508  | 4140/1F9F6508  | 413F/CDA131C8
 rpopdb04e.domain.com      | 4140/1F9F6A28 | 4140/1F9F6508  | 4140/1F9F6508  | 413F/CE258D48
(2 rows)

Watch every 1s  Fri Mar 20 14:24:13 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com      | 4140/203BA108 | 4140/203BA108  | 4140/203BA108  | 413F/CDA131C8
 rpopdb04e.domain.com      | 4140/203BA108 | 4140/203BA108  | 4140/203BA108  | 413F/CE853C38
(2 rows)
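(As an aside: the lag is easier to eyeball as a byte count than by comparing LSNs by hand. We did not capture such output above, but a convenience query using pg_xlog_location_diff(), available since 9.2, would look something like this:)

select client_hostname,
       pg_xlog_location_diff(sent_location, replay_location) as replay_lag_bytes,
       pg_xlog_location_diff(sent_location, flush_location)  as flush_lag_bytes
from pg_stat_replication;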
The content of this WAL position according to pg_xlogdump is the following (actually this is the output from a different occurrence of the problem, so the WAL positions differ, but the essence is the same):

rmgr: Btree       len (rec/tot):     20/    52, tx:          0, lsn: 4115/56126DC0, prev 4115/56126D90, bkp: 0000, desc: vacuum: rel 1663/16420/16796; blk 31222118, lastBlockVacuumed 0

This OID (16420/16796) belongs to the PK of this table.
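(For reference, the mapping from the numbers in the WAL record back to the relation can be double-checked with a query like the one below; this is just a sketch against pg_class, assuming the relfilenode has not changed since the dump was taken. 16796 is the relfilenode from the record above, and the query has to be run in the database with OID 16420:)

select oid, oid::regclass as relation
from pg_class
where relfilenode = 16796;
-- expected to return rpop.pk_rpop_uidl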
We have also noticed that while replay_location is stuck, the startup process on the replica consumes a lot of I/O, reading lots of data from the array holding $PGDATA (we keep xlogs on a separate array). Most backtraces of the startup process during this time look like this:

0x00007f54a71444c0 in __read_nocancel () from /lib64/libc.so.6
#0  0x00007f54a71444c0 in __read_nocancel () from /lib64/libc.so.6
#1  0x000000000065d2f5 in FileRead (file=<value optimized out>, buffer=0x7f53ac0dba20 ";9", amount=8192) at fd.c:1286
#2  0x000000000067acad in mdread (reln=<value optimized out>, forknum=<value optimized out>, blocknum=12063036, buffer=0x7f53ac0dba20 ";9") at md.c:679
#3  0x0000000000659b4e in ReadBuffer_common (smgr=<value optimized out>, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=12063036, mode=RBM_NORMAL_NO_LOG, strategy=0x0, hit=0x7fff898a912f "") at bufmgr.c:476
#4  0x000000000065a61b in ReadBufferWithoutRelcache (rnode=..., forkNum=MAIN_FORKNUM, blockNum=12063036, mode=<value optimized out>, strategy=<value optimized out>) at bufmgr.c:287
#5  0x00000000004cfb78 in XLogReadBufferExtended (rnode=..., forknum=MAIN_FORKNUM, blkno=12063036, mode=RBM_NORMAL_NO_LOG) at xlogutils.c:324
#6  0x00000000004a3651 in btree_xlog_vacuum (lsn=71744181579864, record=0x1e49dc0) at nbtxlog.c:522
#7  btree_redo (lsn=71744181579864, record=0x1e49dc0) at nbtxlog.c:1144
#8  0x00000000004c903a in StartupXLOG () at xlog.c:6827
#9  0x000000000062f8bf in StartupProcessMain () at startup.c:224
#10 0x00000000004d3e9a in AuxiliaryProcessMain (argc=2, argv=0x7fff898a98a0) at bootstrap.c:416
#11 0x000000000062a99c in StartChildProcess (type=StartupProcess) at postmaster.c:5146
#12 0x000000000062e9e2 in PostmasterMain (argc=3, argv=<value optimized out>) at postmaster.c:1237
#13 0x00000000005c7d68 in main (argc=3, argv=0x1e22910) at main.c:228

Investigating the source code of the btree_xlog_vacuum function, we found that it calls XLogReadBufferExtended with mode=RBM_NORMAL_NO_LOG in only one case:

	/*
	 * If queries might be active then we need to ensure every leaf page is
	 * unpinned between the lastBlockVacuumed and the current block, if there
	 * are any.  This prevents replay of the VACUUM from reaching the stage of
	 * removing heap tuples while there could still be indexscans "in flight"
	 * to those particular tuples (see nbtree/README).
	 *
	 * It might be worth checking if there are actually any backends running;
	 * if not, we could just skip this.
	 *
	 * Since VACUUM can visit leaf pages out-of-order, it might issue records
	 * with lastBlockVacuumed >= block; that's not an error, it just means
	 * nothing to do now.
	 *
	 * Note: since we touch all pages in the range, we will lock non-leaf
	 * pages, and also any empty (all-zero) pages that may be in the index. It
	 * doesn't seem worth the complexity to avoid that.  But it's important
	 * that HotStandbyActiveInReplay() will not return true if the database
	 * isn't yet consistent; so we need not fear reading still-corrupt blocks
	 * here during crash recovery.
	 */
	if (HotStandbyActiveInReplay())
	{
		BlockNumber blkno;

		for (blkno = xlrec->lastBlockVacuumed + 1; blkno < xlrec->block; blkno++)
		{
			/*
			 * We use RBM_NORMAL_NO_LOG mode because it's not an error
			 * condition to see all-zero pages.  The original btvacuumpage
			 * scan would have skipped over all-zero pages, noting them in FSM
			 * but not bothering to initialize them just yet; so we mustn't
			 * throw an error here.  (We could skip acquiring the cleanup lock
			 * if PageIsNew, but it's probably not worth the cycles to test.)
			 *
			 * XXX we don't actually need to read the block, we just need to
			 * confirm it is unpinned. If we had a special call into the
			 * buffer manager we could optimise this so that if the block is
			 * not in shared_buffers we confirm it as unpinned.
			 */
			buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno,
											RBM_NORMAL_NO_LOG);

And if I understand everything right, while replaying this WAL record the startup process reads 31222118 * 8 / 1024 / 1024 ~ 238 GB of data from disk (which seems to be the full size of the index).
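(To put that number in context, the current size of the index can of course be checked directly; a minimal query, assuming the default 8 KB block size, would be:)

select relpages as blocks_as_of_last_vacuum,
       pg_relation_size('rpop.pk_rpop_uidl') / 8192             as blocks_now,
       pg_size_pretty(pg_relation_size('rpop.pk_rpop_uidl'))    as on_disk_size
from pg_class
where oid = 'rpop.pk_rpop_uidl'::regclass;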
Below are some details about our environment, but my question is: what can we do to keep the replicas from falling behind? Any help would be really appreciated.

All hosts (master and two replicas) run RHEL 6.6 with PostgreSQL 9.4.1 (upgraded from 9.3.6, where we had the same problem). We use built-in streaming replication and the same configuration on all hosts:

rpopdb04g/postgres R # select name, setting from pg_settings where setting != boot_val and source = 'configuration file';
              name               |  setting
---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------
 archive_command                 | rsync -a --contimeout=1 --timeout=5 --password-file=/etc/barman.passwd %p rsync://barman@xxxxxxxxxxxxxxxxxxxxxxx:/barman/rpopdb04/incoming/%f
 archive_mode                    | on
 autovacuum_analyze_scale_factor | 0.001
 autovacuum_max_workers          | 10
 autovacuum_naptime              | 15
 autovacuum_vacuum_cost_delay    | 5
 autovacuum_vacuum_cost_limit    | 2000
 autovacuum_vacuum_scale_factor  | 1e-05
 autovacuum_vacuum_threshold     | 200
 bgwriter_delay                  | 50
 bgwriter_lru_maxpages           | 250
 checkpoint_completion_target    | 0.8
 checkpoint_segments             | 384
 checkpoint_timeout              | 600
 checkpoint_warning              | 420
 DateStyle                       | ISO, DMY
 default_text_search_config      | pg_catalog.english
 effective_cache_size            | 13107200
 hot_standby                     | on
 hot_standby_feedback            | on
 lc_messages                     | en_US.UTF-8
 lc_monetary                     | en_US.UTF-8
 lc_numeric                      | en_US.UTF-8
 lc_time                         | en_US.UTF-8
 listen_addresses                | *
 log_autovacuum_min_duration     | 0
 log_checkpoints                 | on
 log_hostname                    | on
 log_line_prefix                 | < %m >
 log_lock_waits                  | on
 log_rotation_size               | 0
 log_statement                   | ddl
 log_timezone                    | Europe/Moscow
 log_truncate_on_rotation        | on
 logging_collector               | on
 maintenance_work_mem            | 2097152
 max_connections                 | 4000
 max_wal_senders                 | 10
 pg_stat_statements.max          | 10000
 pg_stat_statements.track        | all
 shared_buffers                  | 524288
 shared_preload_libraries        | pg_stat_statements
 TimeZone                        | Europe/Moscow
 track_functions                 | all
 track_io_timing                 | on
 update_process_title            | off
 vacuum_defer_cleanup_age        | 200000
 wal_buffers                     | 8192
 wal_keep_segments               | 64
 wal_level                       | hot_standby
 wal_receiver_status_interval    | 1
 wal_sender_timeout              | 3000
 work_mem                        | 16384
(53 rows)

Time: 1.955 ms
rpopdb04g/postgres R #

On the replicas we issue SELECT queries that do index-only scans on this table. We would probably partition this table to reduce the size of the index, so that the btree_xlog_vacuum function would work faster. But perhaps there is something else we can do right now?

Thanks in advance.
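P.S. In case it helps the discussion, here is a rough sketch of the kind of partitioning we have in mind. Since 9.4 has no declarative partitioning, it would be the usual inheritance scheme with CHECK constraints and a routing trigger; the child names and the two-way split by popid are made up for illustration only, and migrating the existing 3+ billion rows is a separate exercise:

-- hypothetical children of rpop.rpop_uidl, split by popid
create table rpop.rpop_uidl_p0 (
    check (popid % 2 = 0)
) inherits (rpop.rpop_uidl);

create table rpop.rpop_uidl_p1 (
    check (popid % 2 = 1)
) inherits (rpop.rpop_uidl);

-- each child gets its own, much smaller, primary key index
alter table rpop.rpop_uidl_p0 add primary key (popid, uidl);
alter table rpop.rpop_uidl_p1 add primary key (popid, uidl);

-- route inserts against the parent into the matching child
create or replace function rpop.rpop_uidl_route() returns trigger as $$
begin
    if new.popid % 2 = 0 then
        insert into rpop.rpop_uidl_p0 values (new.*);
    else
        insert into rpop.rpop_uidl_p1 values (new.*);
    end if;
    return null;   -- nothing gets stored in the parent itself
end;
$$ language plpgsql;

create trigger rpop_uidl_route
    before insert on rpop.rpop_uidl
    for each row execute procedure rpop.rpop_uidl_route();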