Re: Freeze break request: koji autovacuum_freeze_max_age

Adam Williamson <adamwill@xxxxxxxxxxxxxxxxx> · Mon, 02 Mar 2020 08:17:38 -0800



On Sun, 2020-03-01 at 11:26 -0800, Kevin Fenzi wrote:
> Greetings. 
> 
> Last night koji alerted due to slowness. It was not backups or anything,
> but rather the database hitting the limit I raised in commit c678f73b:
> 
> -autovacuum_freeze_max_age = 200000000   # maximum XID age before forced vacuum 
> +autovacuum_freeze_max_age = 300000000   # maximum XID age before forced vacuum 
> 
> What this means is basicially: postgres records the xid (transaction id)
> that can 'see' other transactions in the table rows. However, xid is a
> 32 bit value, meaning there can only be about 2.1billion transactions
> before it 'wraps around'. When it does so, all the 'old' XID's need to
> be gone or it will confuse it. It removes the old xids by marking old
> transactions as 'frozen' (so any other transaction should see them). 
> 
> So, this value tells the autovacuumer to start processing the table for
> old xids and frezzing them, so by the time the wrap around happens
> everything will be set. 
> 
> Unfortunately, it's doing this on the buildroot_listing table, which is:
>  public | buildroot_listing         | table | koji  | 219 GB     | 
> 
> So, the i/o load is heavy and koji is slow to respond to real requests. 
> 
> There's (at least) tree things we could do: 
> 
> 1. Bump the autovacuum_freeze_max_age up to 600million. The 100million
> bump I did in january gave us about 1.5 months, so if we do 600, we
> might last until june, when we will be migrating to the new datacenter.
> 600million is still a long way from 2.1 billion, so it should be fine. 
> At that point I hope to move db-koji01 to a rhel8 instance and much
> newer postgresql. We could also run the vacuum duing downtime and let it
> finish. 
> 
> 2. Just let it finish now. Things will be slow, I don't know for how
> long. Users will complain and it will take longer for people to get
> things done, but at the end we should be in better shape and there's
> basically no action we need to take (other than handling complaints)
> 
> 3. Schedule an outage and take the db offline and run the vacuum. This
> might be quicker than letting the autovac finish, I am not sure. 
> 
> Thoughts? please +1 the freeze break of the option you thing is best, or
> feel free to ask more info or suggest other options. 

I'm fine with 1 or 2, but think we should definitely *not* do 3 unless
a) we know quite precisely how long it will take and b) that is
substantially faster than the online autovac.
-- 
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
_______________________________________________
infrastructure mailing list -- infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to infrastructure-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/infrastructure@xxxxxxxxxxxxxxxxxxxxxxx