Hanging startup process on the replica after vacuuming on master

Hi all.

We have already described [0] our problem on pgsql-general@ and have written to pgsql-performance@, but the problem was described in a rather confusing way, so let me ask for help one more time, more clearly.

We have a database where one table contains 3+ billion records. The definition of the table is quite simple:

rpopdb04d/rpopdb M # \dS+ rpop.rpop_uidl
                               Table "rpop.rpop_uidl"
 Column |          Type          | Modifiers | Storage  | Stats target | Description
--------+------------------------+-----------+----------+--------------+-------------
 popid  | bigint                 | not null  | plain    |              |
 uidl   | character varying(200) | not null  | extended |              |
Indexes:
    "pk_rpop_uidl" PRIMARY KEY, btree (popid, uidl)

rpopdb04d/rpopdb M #

When vacuuming of this table finishes on the master, we see replay_location on all replicas hang at a specific WAL position for up to 15 minutes. The other positions (sent_location, write_location, flush_location) in the pg_stat_replication view keep increasing as expected.

rpopdb04d/postgres M # select client_hostname, sent_location, write_location, flush_location, replay_location from pg_stat_replication; \watch 1
<...>
                           Watch every 1s       Fri Mar 20 14:15:16 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com | 413F/CD8223D8 | 413F/CD81EDF8  | 413F/CD81EDF8  | 413F/CD81D4C0
 rpopdb04e.domain.com | 413F/CD8223D8 | 413F/CD81D4C0  | 413F/CD81D4C0  | 413F/CD81D3E0
(2 rows)

                           Watch every 1s       Fri Mar 20 14:15:17 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com | 413F/CD9887C0 | 413F/CD9884C0  | 413F/CD9881F0  | 413F/CD988058
 rpopdb04e.domain.com | 413F/CD9887C0 | 413F/CD988570  | 413F/CD9884C0  | 413F/CD9884C0
(2 rows)

                           Watch every 1s       Fri Mar 20 14:15:18 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com | 413F/CDA89F38 | 413F/CDA89F38  | 413F/CDA89EA0  | 413F/CDA131C8
 rpopdb04e.domain.com | 413F/CDA89F38 | 413F/CDA89EA0  | 413F/CDA89EA0  | 413F/CDA131C8
(2 rows)

                           Watch every 1s       Fri Mar 20 14:15:19 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com | 413F/CDB8DA20 | 413F/CDB8D7C0  | 413F/CDB8D630  | 413F/CDA131C8
 rpopdb04e.domain.com | 413F/CDB8DA20 | 413F/CDB8D630  | 413F/CDB8D630  | 413F/CDA131C8

<...>
<9 minutes of stuck replay_location>
<...>
                           Watch every 1s       Fri Mar 20 14:24:11 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com | 4140/1F7E28C8 | 4140/1F7E2838  | 4140/1F7E2838  | 413F/CDA131C8
 rpopdb04e.domain.com | 4140/1F7E28C8 | 4140/1F7E2550  | 4140/1F7E2550  | 413F/CDA131C8
(2 rows)

                           Watch every 1s       Fri Mar 20 14:24:12 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com | 4140/1F9F6A28 | 4140/1F9F6508  | 4140/1F9F6508  | 413F/CDA131C8
 rpopdb04e.domain.com | 4140/1F9F6A28 | 4140/1F9F6508  | 4140/1F9F6508  | 413F/CE258D48
(2 rows)

                           Watch every 1s       Fri Mar 20 14:24:13 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com | 4140/203BA108 | 4140/203BA108  | 4140/203BA108  | 413F/CDA131C8
 rpopdb04e.domain.com | 4140/203BA108 | 4140/203BA108  | 4140/203BA108  | 413F/CE853C38
(2 rows)
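
To quantify how far replay lags behind, a query like the following can be used (a sketch; pg_xlog_location_diff() and pg_last_xlog_replay_location() are the 9.4 function names):

-- On the master: per-standby replay lag in bytes
select client_hostname,
       pg_xlog_location_diff(sent_location, replay_location) as replay_lag_bytes
from pg_stat_replication;

-- On a replica: current replay position and the timestamp of the last replayed transaction
select pg_last_xlog_replay_location(), pg_last_xact_replay_timestamp();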

The content of this WAL position according to pg_xlogdump is the following (this output actually comes from a different occurrence of the problem, so the WAL positions differ, but the essence is the same):
rmgr: Btree       len (rec/tot):     20/    52, tx:          0, lsn: 4115/56126DC0, prev 4115/56126D90, bkp: 0000, desc: vacuum: rel 1663/16420/16796; blk 31222118, lastBlockVacuumed 0
These OIDs (16420/16796) correspond to the primary key of this table.
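
For completeness, the relfilenode from the pg_xlogdump output can be mapped back to a relation name roughly like this (a sketch; 16796 is the filenode from the record above):

-- Look up the relation behind relfilenode 16796
select oid, relname, relkind from pg_class where relfilenode = 16796;

-- Or, if pg_filenode_relation() is available (9.4), map (tablespace, filenode) directly;
-- 0 stands for the database's default tablespace here
select pg_filenode_relation(0, 16796);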

We have also noticed that while replay_location is stuck, the startup process on the replica consumes a lot of I/O, reading lots of data from the array holding $PGDATA (we keep xlogs on a separate array). Most backtraces of the startup process during this time look like this:

0x00007f54a71444c0 in __read_nocancel () from /lib64/libc.so.6
#0  0x00007f54a71444c0 in __read_nocancel () from /lib64/libc.so.6
#1  0x000000000065d2f5 in FileRead (file=<value optimized out>, buffer=0x7f53ac0dba20 ";9", amount=8192) at fd.c:1286
#2  0x000000000067acad in mdread (reln=<value optimized out>, forknum=<value optimized out>, blocknum=12063036, buffer=0x7f53ac0dba20 ";9") at md.c:679
#3  0x0000000000659b4e in ReadBuffer_common (smgr=<value optimized out>, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=12063036, mode=RBM_NORMAL_NO_LOG, strategy=0x0, hit=0x7fff898a912f "") at bufmgr.c:476
#4  0x000000000065a61b in ReadBufferWithoutRelcache (rnode=..., forkNum=MAIN_FORKNUM, blockNum=12063036, mode=<value optimized out>, strategy=<value optimized out>) at bufmgr.c:287
#5  0x00000000004cfb78 in XLogReadBufferExtended (rnode=..., forknum=MAIN_FORKNUM, blkno=12063036, mode=RBM_NORMAL_NO_LOG) at xlogutils.c:324
#6  0x00000000004a3651 in btree_xlog_vacuum (lsn=71744181579864, record=0x1e49dc0) at nbtxlog.c:522
#7  btree_redo (lsn=71744181579864, record=0x1e49dc0) at nbtxlog.c:1144
#8  0x00000000004c903a in StartupXLOG () at xlog.c:6827
#9  0x000000000062f8bf in StartupProcessMain () at startup.c:224
#10 0x00000000004d3e9a in AuxiliaryProcessMain (argc=2, argv=0x7fff898a98a0) at bootstrap.c:416
#11 0x000000000062a99c in StartChildProcess (type=StartupProcess) at postmaster.c:5146
#12 0x000000000062e9e2 in PostmasterMain (argc=3, argv=<value optimized out>) at postmaster.c:1237
#13 0x00000000005c7d68 in main (argc=3, argv=0x1e22910) at main.c:228

Investigating the source code of the btree_xlog_vacuum function, we found that it calls XLogReadBufferExtended with mode=RBM_NORMAL_NO_LOG in only one case:

    /*
     * If queries might be active then we need to ensure every leaf page is
     * unpinned between the lastBlockVacuumed and the current block, if there
     * are any.  This prevents replay of the VACUUM from reaching the stage of
     * removing heap tuples while there could still be indexscans "in flight"
     * to those particular tuples (see nbtree/README).
     *
     * It might be worth checking if there are actually any backends running;
     * if not, we could just skip this.
     *
     * Since VACUUM can visit leaf pages out-of-order, it might issue records
     * with lastBlockVacuumed >= block; that's not an error, it just means
     * nothing to do now.
     *
     * Note: since we touch all pages in the range, we will lock non-leaf
     * pages, and also any empty (all-zero) pages that may be in the index. It
     * doesn't seem worth the complexity to avoid that.  But it's important
     * that HotStandbyActiveInReplay() will not return true if the database
     * isn't yet consistent; so we need not fear reading still-corrupt blocks
     * here during crash recovery.
     */
    if (HotStandbyActiveInReplay())
    {
        BlockNumber blkno;

        for (blkno = xlrec->lastBlockVacuumed + 1; blkno < xlrec->block; blkno++)
        {
            /*
             * We use RBM_NORMAL_NO_LOG mode because it's not an error
             * condition to see all-zero pages.  The original btvacuumpage
             * scan would have skipped over all-zero pages, noting them in FSM
             * but not bothering to initialize them just yet; so we mustn't
             * throw an error here.  (We could skip acquiring the cleanup lock
             * if PageIsNew, but it's probably not worth the cycles to test.)
             *
             * XXX we don't actually need to read the block, we just need to
             * confirm it is unpinned. If we had a special call into the
             * buffer manager we could optimise this so that if the block is
             * not in shared_buffers we confirm it as unpinned.
             */
            buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno,
                                            RBM_NORMAL_NO_LOG);

And if I understand everything correctly, while replaying this WAL record the startup process reads 31222118 blocks * 8 KB / 1024 / 1024 ≈ 238 GB of data from disk (which seems to be the full size of the index).
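
For comparison, the actual on-disk size of that index can be checked with something like this (a sketch; relpages is only the estimate maintained by VACUUM/ANALYZE):

-- Exact on-disk size of the primary key index
select pg_size_pretty(pg_relation_size('rpop.pk_rpop_uidl'));

-- Approximate number of 8 KB blocks in the index, as seen by the planner
select relpages from pg_class where relname = 'pk_rpop_uidl';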

Below are some details about our environment, but my question is: what can we do to keep the replicas from lagging? Any help would be really appreciated.



All hosts (the master and two replicas) run RHEL 6.6 with PostgreSQL 9.4.1 (upgraded from 9.3.6, where we had the same problem). We use built-in streaming replication and the same configuration on all hosts:

rpopdb04g/postgres R # select name, setting from pg_settings where setting != boot_val and source = 'configuration file';
              name               |                                                                      setting
---------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------
 archive_command                 | rsync -a --contimeout=1 --timeout=5 --password-file=/etc/barman.passwd %p rsync://barman@xxxxxxxxxxxxxxxxxxxxxxx:/barman/rpopdb04/incoming/%f
 archive_mode                    | on
 autovacuum_analyze_scale_factor | 0.001
 autovacuum_max_workers          | 10
 autovacuum_naptime              | 15
 autovacuum_vacuum_cost_delay    | 5
 autovacuum_vacuum_cost_limit    | 2000
 autovacuum_vacuum_scale_factor  | 1e-05
 autovacuum_vacuum_threshold     | 200
 bgwriter_delay                  | 50
 bgwriter_lru_maxpages           | 250
 checkpoint_completion_target    | 0.8
 checkpoint_segments             | 384
 checkpoint_timeout              | 600
 checkpoint_warning              | 420
 DateStyle                       | ISO, DMY
 default_text_search_config      | pg_catalog.english
 effective_cache_size            | 13107200
 hot_standby                     | on
 hot_standby_feedback            | on
 lc_messages                     | en_US.UTF-8
 lc_monetary                     | en_US.UTF-8
 lc_numeric                      | en_US.UTF-8
 lc_time                         | en_US.UTF-8
 listen_addresses                | *
 log_autovacuum_min_duration     | 0
 log_checkpoints                 | on
 log_hostname                    | on
 log_line_prefix                 | < %m >
 log_lock_waits                  | on
 log_rotation_size               | 0
 log_statement                   | ddl
 log_timezone                    | Europe/Moscow
 log_truncate_on_rotation        | on
 logging_collector               | on
 maintenance_work_mem            | 2097152
 max_connections                 | 4000
 max_wal_senders                 | 10
 pg_stat_statements.max          | 10000
 pg_stat_statements.track        | all
 shared_buffers                  | 524288
 shared_preload_libraries        | pg_stat_statements
 TimeZone                        | Europe/Moscow
 track_functions                 | all
 track_io_timing                 | on
 update_process_title            | off
 vacuum_defer_cleanup_age        | 200000
 wal_buffers                     | 8192
 wal_keep_segments               | 64
 wal_level                       | hot_standby
 wal_receiver_status_interval    | 1
 wal_sender_timeout              | 3000
 work_mem                        | 16384
(53 rows)

Time: 1.955 ms
rpopdb04g/postgres R #

On the replicas we issue SELECT queries that do index-only scans on this table.
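
For illustration, these queries are roughly of the following shape (the popid value is hypothetical); since both columns are part of the primary key, they can be answered with index-only scans on pk_rpop_uidl:

-- Typical replica lookup (hypothetical popid value), covered by the (popid, uidl) primary key
select uidl from rpop.rpop_uidl where popid = 1234567890;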

We would probably partition this table to reduce the size of the index so that the btree_xlog_vacuum function works faster. But perhaps there is something else we can do right now?
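
For reference, on 9.4 that would mean inheritance-based partitioning, roughly like the sketch below (the child table names and the popid range boundaries are hypothetical):

-- Each child inherits the columns of rpop.rpop_uidl and gets its own, much smaller PK index
create table rpop.rpop_uidl_0 (
    check (popid >= 0 and popid < 1000000000)
) inherits (rpop.rpop_uidl);
alter table rpop.rpop_uidl_0 add constraint pk_rpop_uidl_0 primary key (popid, uidl);

create table rpop.rpop_uidl_1 (
    check (popid >= 1000000000 and popid < 2000000000)
) inherits (rpop.rpop_uidl);
alter table rpop.rpop_uidl_1 add constraint pk_rpop_uidl_1 primary key (popid, uidl);

-- ... further partitions, plus a BEFORE INSERT trigger on rpop.rpop_uidl
-- (or routing in the application) so that new rows land in the right child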

Thanks in advance.


--
May the force be with you…

