Re: Long running query causing XID limit breach

David HJ <chuxiongzhong@xxxxxxxxx> · Sun, 26 May 2024 13:56:16 +0800

anyone know how to describe from this mailing list? 

On Thu, May 23, 2024 at 5:16 AM sud <suds1434@xxxxxxxxx> wrote:
Hello , 
It's RDS postgres version 15.4. We suddenly saw the "MaximumUsedTransactionIDs" reach to ~1.5billion and got alerted by team members who mentioned the database is going to be in shutdown/hung if this value reaches to ~2billion and won't be able to serve any incoming transactions. It was a panic situation.

I have heard of it before , because of the way postgres works and the XID being a datatype of length 32 bit integer can only represent (2^32)/2=~2 billion transactions. However, as RDS performs the auto vacuum , we thought that we need not worry about this issue. But it seems we were wrong. And we found one adhoc "SELECT '' query was running on the reader instance since the last couple of days and when that was killed, the max xid (MaximumUsedTransactionIDs) dropped to 50million immediately. 

So I have few questions,

1)This system is going to be a 24/7 up and running system which will process ~500million business transactions/day in future i.e. ~4-5billion rows/day inserted across multiple tables each day. And as I understand each row will have XID allocated. So in that case , does it mean that, we will need (5billion/24)=~200million XID/hour and thus , if any such legitimate application "SELECT" query keeps running for ~10 hours (and thus keep the historical XID alive) , then it can saturate the "MaximumUsedTransactionIDs" and make the database standstill in 2billion/200million=~10hrs. Is this understanding correct? Seems we are prone to hit this limit sooner going forward.

2)We have some legitimate cases where the reporting queries can run for 5-6hrs. So in such cases if the start of this SELECT query happen at 100th XID on table TAB1, then whatever transactions happen after that time, across all other tables(table2, table3 etc) in the database won't get vacuum until that SELECT query on table1 get vacuumed(as database will try to keep that same 100th XID image) and the XID will just keep incrementing for new transaction, eventually reaching the max limit. Is my understanding correct here?

3)Although RDS does the auto vacuum by default. but should we also consider doing manual vacuum without impacting ongoing transactions? Something as below options
vacuum freeze tab1;
vacuum freeze; 
vacuum;
vacuum analyze tab1;
vacuum tab1;

4)Had worked in past in oracle database where the similar transaction identifier is called as "system change number" , but never encountered that being exhausted and also there it used to have UNDO record and if a SELECT query needs anything beyond certain limit(set undo_retention parameter) the select query used to fail with snapshot too old error but not impacting any write transactions. But in postgres it seems nothing like that happens and every "Select query" will try to run till its completion without any such failure, until it gets skilled by someone. Is my understanding correct?

 And in that case, It seems we have to mandatorily set "statement_timeout" to some value e.g. 4hrs(also i am not seeing a way to set it for any specific user level, so it will be set for all queries including application level) and also "idle_in_transaction_session_timeout" to 5minutes, even on all the prod and non prod databases, to restrict the long running transactions/queries and avoid such issues in future. Correct me if I'm wrong.

Regards
Sud