error: database is not accepting commands to avoid wraparound data loss in database ....

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,
I would very much appreciate any help / advise with this problem which includes logical replication and transaction IDs wraparound error:

We have production database collecting metrics data from web widgets - hundreds of millions of records per day, most of them inserted in separate transactions. Data are partitioned by day, we have 10 different parent tables for different types of metrics.

Collected data are also copied into our main data warehouse database. Previously we synchronized daily partitions 1x per day but this became very heavy task because partitions have like 20 - 40 GB each. In some days even more.... So even with parallel run of dump-restore it took several hours to synchronize data causing quite a big load on both replica dedicated just for sync and data warehouse.

Therefore I implemented logical replication to shift data continuously from collection database into our main data warehouse database + one other which contains only some data. Publication and subscription contained always partitions for tomorrow, today + 2 days backwards because sometimes we experience quite big delays in delivering of data.

It worked absolutely great for some time but in last several days we had really heavy spikes in usage of widgets so we collected much more records and we ended up with both data warehouse database being blocked by transaction IDs wraparound error.

It all happened very quickly during the night and once we got first alerts databases were already in troubles. We so far did not have alerts for messages "database "xxxxxx" must be vacuumed within xxxxxxx transactions" because we so far never encountered them. It would help of course - I will implement them once I solve main problem....

We have so far never seen this wraparound problem before. We also never had it on collecting database even with this enormous amount of transactions.

But it looks like subscriptions on data warehouse databases conflicted with autovacuum of partitions. Logical replication most likely transfers all transactions 1:1 so suddenly we had incredible number of separate transactions on data warehouse - which never happened before.

So we can now only select data from both data warehouse databases but we cannot do anything else - we always get message:

error: database is not accepting commands to avoid wraparound data loss in database "xxxxx"

Disabling and dropping subscription does not work either - same message.

Because logical replication was not running we have seen huge accumulation of unarchived WAL segments on collecting database - they have been blocked by logical replication slots. Because I needed to avoid any problems on collecting database I decided to drop all publications and logical replication slots there. This immediately solved the problem with WAL segments.

I checked solution for wraparound error and since all sources I found suggested to do VACUUM under single-user mode I decided to test it on second smaller data warehouse database (~16TB of data) which is not so crucial for us to have it still running since I knew it will be inaccessible during that time. Vacuum is running for 24 hours already on this database and I can only wait for it to finish....

But our main data warehouse database is even bigger ~22TB of data and I cannot shut it down for several days. Therefore I started to prepare replacement for it - since we can still read data from it I can copy them somewhere else. I needed it also for some daily aggregations - so new database currently contains only several days of data. Question is what to do next....

Data warehouse database uses standard settings for vacuuming:

autovacuum    on
autovacuum_analyze_scale_factor    0.1
autovacuum_analyze_threshold    50
autovacuum_freeze_max_age    200000000
autovacuum_max_workers    3
autovacuum_multixact_freeze_max_age    400000000
autovacuum_naptime    60
autovacuum_vacuum_cost_delay    20
autovacuum_vacuum_cost_limit    -1
autovacuum_vacuum_scale_factor    0.2
autovacuum_vacuum_threshold    50
autovacuum_work_mem    -1
log_autovacuum_min_duration    -1
vacuum_cleanup_index_scale_factor    0.1
vacuum_cost_delay    0
vacuum_cost_limit    200
vacuum_cost_page_dirty    20
vacuum_cost_page_hit    1
vacuum_cost_page_miss    10
vacuum_defer_cleanup_age    0
vacuum_freeze_min_age    50000000
vacuum_freeze_table_age    150000000
vacuum_multixact_freeze_min_age    5000000
vacuum_multixact_freeze_table_age    150000000

Can I do something else with main data warehouse database than just this painfully slow VACUUM? If I would change some settings for freeze would it help in way that database would accept commands again so I would be able to drop subscription and vacuum it in normal way?

Thank you very much in advance for any help / advice you might have.

Josef Machytka
Berlin

[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux