On Thu, May 23, 2024 at 1:22 PM Laurenz Albe <laurenz.albe@xxxxxxxxxxx> wrote:
> On Thu, 2024-05-23 at 02:46 +0530, sud wrote:
>> It's RDS Postgres version 15.4. We suddenly saw "MaximumUsedTransactionIDs"
>> reach ~1.5 billion and got alerted by team members who mentioned that the
>> database will shut down/hang if this value reaches ~2 billion and won't be
>> able to serve any incoming transactions. It was a panic situation.
>>
>> I have heard of this before: because of the way Postgres works, the XID being
>> a 32-bit integer can only represent (2^32)/2 = ~2 billion transactions.
>> However, as RDS performs autovacuum, we thought that we need not worry about
>> this issue. But it seems we were wrong. We found one ad hoc "SELECT ''" query
>> that had been running on the reader instance for the last couple of days, and
>> when it was killed, the max XID (MaximumUsedTransactionIDs) dropped to
>> ~50 million immediately.
> This has nothing to do with autovacuum running.
> PostgreSQL won't freeze any rows above the xmin horizon (see the "backend_xmin"
> column in "pg_stat_activity").
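For reference, a minimal check along these lines (a sketch, assuming we can
query pg_stat_activity on the instance in question):

  -- oldest backend_xmin first; that session is what blocks freezing
  SELECT pid, state, age(backend_xmin) AS xmin_age, query
  FROM pg_stat_activity
  WHERE backend_xmin IS NOT NULL
  ORDER BY age(backend_xmin) DESC
  LIMIT 5;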
>> So I have a few questions:
>>
>> 1) This system is going to run 24/7 and will process ~500 million business
>> transactions/day in the future, i.e. ~4-5 billion rows/day inserted across
>> multiple tables each day. As I understand it, each row will have an XID
>> allocated. Does that mean we will need (5 billion / 24) = ~200 million
>> XIDs/hour, and thus, if any such legitimate application "SELECT" query keeps
>> running for ~10 hours (and thereby keeps the historical XIDs alive), it can
>> saturate "MaximumUsedTransactionIDs" and bring the database to a standstill
>> in 2 billion / 200 million = ~10 hrs? Is this understanding correct? It seems
>> we are prone to hitting this limit sooner rather than later.
> Yes, that is correct. You cannot run such long-running queries with a
> transaction rate like that.
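On the SQL side, a sketch of how one could watch how far the oldest unfrozen
XID has advanced (as far as I understand, this is roughly what the RDS
"MaximumUsedTransactionIDs" metric reflects):

  -- per-database age of the oldest unfrozen XID; the maximum of these
  -- is the value that must stay well below ~2 billion
  SELECT datname, age(datfrozenxid) AS xid_age
  FROM pg_database
  ORDER BY age(datfrozenxid) DESC;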
When you say transaction, does that mean one commit? For example, if we insert
and commit ~1000 rows in one batch, will all 1000 rows be marked with one XID
rather than 1000 different XIDs, so that we should look at batch processing
rather than row-by-row processing? Is that understanding correct?
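For example, this is the kind of check I have in mind (a rough sketch;
my_table is just a hypothetical test table):

  CREATE TABLE my_table (id int);          -- hypothetical test table

  BEGIN;
  SELECT pg_current_xact_id();             -- this transaction's XID
  INSERT INTO my_table (id) SELECT g FROM generate_series(1, 1000) AS g;
  SELECT pg_current_xact_id();             -- same value: still one single XID
  COMMIT;

  -- every row of the batch carries the same xmin, i.e. one XID was consumed
  SELECT count(DISTINCT xmin::text) AS distinct_xids, count(*) AS row_count
  FROM my_table;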
> One thing you could consider is running the long-running queries on a standby
> server. Replication will get delayed, and you have to keep all the WAL
> around for the standby to catch up once the query is done, but it should work.
> You'd set "max_standby_streaming_delay" to -1 on the standby.
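If I understand the suggestion correctly, on such a dedicated standby that
would look roughly like this (a sketch; keeping hot_standby_feedback off is my
assumption, so the standby's long queries don't hold back the primary's xmin
horizon):

  # postgresql.conf on the dedicated standby (sketch)
  max_standby_streaming_delay = -1   # never cancel standby queries due to replay conflicts
  hot_standby_feedback = off         # assumption: don't hold back the primary's xmin horizon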
We had the "SELECT" query running on a reader instance, but the writer instance
was still showing "MaximumUsedTransactionIDs" reaching 1.5 billion. Does that
mean both instances, being part of the same cluster, share the same XIDs, and
that per your suggestion we should run such queries on a separate standby
cluster altogether which does not share the same XIDs? Is this understanding
correct? Or can it be handled even with another reader instance by just
tweaking some other parameter so that they won't share the same XID space?