Re: Guideline To Resolve LWLock:SubtransControlLock

Jeremy Schneider <schnjere@xxxxxxxxxx> · Thu, 6 Sep 2018 08:16:29 -0700

On 8/22/18 13:07, Fd Habash wrote:
> In your statement, what constitutes ‘heavy use of exception blocks’?  
> 
> "I have seen
> applications grind to a halt on SubtransControlLock when they make heavy
> use of exception blocks in plpgsql code; in fact it's pretty
> straightforward to demonstrate this behavior with pgbench on community
> PostgreSQL."

In one of the most dramatic cases I saw, the customer was migrating from
another database system and had a very large workload running on the
largest instance class we currently offer. They were quite savvy and had
already gone through all of the procedural code they migrated and
removed all of the exception blocks. Nonetheless, when they hit their
peak workload, we observed this wait event.

It was finally discovered that the framework/ORM they were using had a
capability to automatically use savepoints for partial rollback. They
had not explicitly configured it (afaik) - but their framework was using
savepoints. In some complex code paths we were seeing several hundred
subtransactions within one master transaction.

I haven't thoroughly tested yet, but anecdotally I don't think that
you'll have a problem with contention on this lock until you get to a
sufficiently large database server.  The machine I described above was a
32-core box; I suspect that a box with 2 cores is going to be waiting on
something else before it gets stuck here.  If you want to see a system
choke on this lock, just spin up a 32-core box and run two separate
pgbenchs in parallel (needs to be two)... the first as select-only and
the second modified to create some savepoints while updating
pgbench_accounts.

To directly answer the question "what constitutes heavy use": if folks
are building high-throughput applications that they expect to scale
nicely on PostgreSQL up to 32-core boxes and beyond, I'd suggest
avoiding savepoints in any key codepaths that are part of the primary
transactional workload (low-latency operations that are executed many
times per second).

On 8/22/18 08:48, Fd Habash wrote:
> “ That said... FWIW, Aurora PostgreSQL version 9.6.3 uses parent/child
> transaction relationships pretty much the same way that community
> PostgreSQL 9.6.3 does …”
>
> This is why I posted here first. This particular wait state did not
> appear to be Aurora specific and was not listed as part of
>
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/AuroraPostgreSQL.Reference.html#AuroraPostgreSQL.Reference.Waitevents//
>
> I go back and forth posting issues between the two forums depending on
> the nature of it.

Just added it to the aforementioned Aurora docs, hopefully heading off a
few future questions.

-Jeremy

-- 
Jeremy Schneider
Database Engineer
Amazon Web Services