
Transaction id wraparound and autovacuum

Late last week we experienced stoppages on two of our high-volume production database servers due to imminent transaction id wraparound. The first one caught us unawares and caused a bit of a panic during lunch 8^). We caught the second one only after scrutinizing all of the servers and seeing that one was throwing the warning that fewer than 10 million txids remained before shutdown. Unfortunately, due to the high transaction volume on these servers, we didn't have much time to react (in this case < 30 minutes once we saw the warning).

Both of the servers are identically configured, and all of them run the autovacuum daemon. At first I thought I had naively assumed that the autovac daemon took care of txid maintenance when in fact it doesn't. However, after re-reviewing the docs, it clearly is supposed to address this; in our case it just as clearly failed. I know the daemon is running, since I see it working as expected fairly often when monitoring the boxes. I find it frustrating that the daemon does not log anything to give you a history of its activity; based on previous discussions on the lists, the prevailing opinion seems to be that such activity is not interesting enough to record. Given the above, I disagree, since I now have a severe lack of forensic evidence to figure out if and when the daemon ever attempted a database-wide vacuum and what became of it.
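For reference, the quickest check I know of is the per-database age query from the wraparound section of the maintenance docs (as I read them); it at least shows which database is closest to the limit, even if it says nothing about what the daemon did or didn't attempt:

SELECT datname, age(datfrozenxid) AS xids_since_freeze_cutoff
FROM pg_database
ORDER BY 2 DESC;

As I understand it, values creeping up toward 2 billion mean the warnings (and eventually the shutdown) are getting close.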

Here is my autovac config (this is pg 8.1.3 btw, and yes we are planning to upgrade to 8.1.5):

stats_start_collector = on
stats_command_string = on
stats_block_level = on
stats_row_level = on

autovacuum = on         # enable autovacuum subprocess?
autovacuum_naptime = 300        # time between autovacuum runs, in secs
autovacuum_vacuum_threshold = 500       # min # of tuple updates before vacuum
autovacuum_analyze_threshold = 250      # min # of tuple updates before analyze
autovacuum_vacuum_scale_factor = 0.2    # fraction of rel size before vacuum
autovacuum_analyze_scale_factor = 0.1   # fraction of rel size before analyze
#autovacuum_vacuum_cost_delay = -1      # default vacuum cost delay for autovac,
                                        # -1 means use vacuum_cost_delay
#autovacuum_vacuum_cost_limit = -1      # default vacuum cost limit for autovac,
                                        # -1 means use vacuum_cost_limit

vacuum_cost_delay = 50          # 0-1000 milliseconds
#vacuum_cost_page_hit = 1       # 0-10000 credits
#vacuum_cost_page_miss = 10     # 0-10000 credits
#vacuum_cost_page_dirty = 20    # 0-10000 credits
vacuum_cost_limit = 200         # 0-10000 credits

The cost options have been tuned for minimal impact under heavy load.
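For what it's worth, my understanding of the per-table trigger under these settings is that a table gets vacuumed once its obsolete tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples. A rough way to see what that works out to per table (purely illustrative; it just redoes the arithmetic against pg_class, whose reltuples estimate is only as fresh as the last ANALYZE):

SELECT relname,
       reltuples::bigint AS estimated_rows,
       (500 + 0.2 * reltuples)::bigint AS obsolete_tuples_before_vacuum
FROM pg_class
WHERE relkind = 'r'
ORDER BY 2 DESC;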

So I wonder if anyone has any insights into why the autovac daemon failed to head off the imminent txid wraparound? Looking at the code, I suspect it may have detected the issue but not had enough time to finish a vacuum before the server ran aground. OTOH, I do know that the database at issue was the "postgres" scratch database, which has no data in it and would therefore vacuum nearly instantaneously. I'm fairly certain that this database was never vacuumed, because it was still the database referenced in the txid warnings after the server stopped; once I vacuumed that db manually, the warnings referenced a different database.
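In case anyone else ends up against the wall like we did: as far as I can tell, on 8.1 only a plain database-wide VACUUM (no table name) pushes datfrozenxid forward, so the recovery is essentially to connect to each database in turn and run it (vacuumdb -a from the shell amounts to the same loop):

-- Plain, unqualified VACUUM in each database; table-by-table vacuums
-- don't advance pg_database.datfrozenxid on 8.1 (as far as I know).
VACUUM;

-- Then re-check that the ages actually dropped:
SELECT datname, age(datfrozenxid) FROM pg_database;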

Here are some random thoughts/suggestions:

- The autovac daemon doesn't initiate a database-wide vacuum until 100,000 txns before the txid warning hits. In my case that leaves only a few minutes of headroom under peak load (rough numbers below). According to the docs, this maintenance is recommended every 500M transactions, so it seems like it might be beneficial to multiply that 100,000 margin by 100, or maybe more.

- My tuning to make vacuum less impactful seems detrimental to database-wide vacuums: it takes many hours to perform one (far longer than it takes me to use up 10M txids, from what I have seen). Perhaps this warning threshold should also be increased (or made configurable).

On modern hardware it is possible to chew through txids at more than 1000/sec. At that rate, even 10M ids isn't all that many.
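To put some rough numbers on that (back-of-envelope only, assuming a sustained ~1000 txns/sec):

    100,000 txids / 1000 txids/sec    = ~100 seconds, i.e. under two minutes of headroom
    10,000,000 txids / 1000 txids/sec = ~10,000 seconds, i.e. a bit under three hours of warning

which roughly lines up with the "only a few minutes" of margin I mentioned above.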

Thanks.

-Casey



