On 2/6/25 12:08, Robert Leach wrote:
Alright, I am trying to reconcile this with the statement from below, 'The
largest studies take just under a minute'.
The context of the 'The largest studies take just under a minute'
statement is that it's not loading the hefty, time-consuming raw data;
it's only validating the metadata. That's fast (5-60s), and that
metadata is a portion of the transaction in the back-end load. There are
errors that validation can miss because it doesn't touch the raw data,
and in fact those errors are addressed by curators editing the Excel
sheets. That's why it's all in the load transaction instead of
As a scientist, that makes me start to twitch.
Is there an audit trail for that?
I'm unfamiliar with retry functionality, but those options sound logical
to me as a good path forward, particularly using Celery to spread out
validations and doing the back-end loads at night (or using some sort of
fast dump/load). The thing that bothers me about the Celery solution is
that most of the time, two users validating different data will not block,
so I would be making users wait for no reason. Ideally, I could
anticipate the block and only at that point separate those validations.
Aah, time travel.
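For what it's worth, one way to "anticipate the block" instead of queueing
every validation behind every load would be a per-study lock, e.g. a
transaction-scoped Postgres advisory lock keyed on the study. This is only a
sketch of that idea (psycopg2, with a hypothetical validate_study() and
study_name key), not something proposed in this thread:

# Sketch only: serialize work per study with a transaction-scoped advisory
# lock so that validations of *different* studies never wait on each other.
# Assumes psycopg2; validate_study() and the study-name key are hypothetical.
import psycopg2

def validate_with_study_lock(dsn, study_name):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # Blocks only if another transaction (another validation, or the
            # back-end load of the same study) already holds this study's lock.
            cur.execute("SELECT pg_advisory_xact_lock(hashtext(%s))",
                        (study_name,))
            validate_study(cur, study_name)   # hypothetical validation work
    finally:
        conn.rollback()   # validation never commits; this also releases the lock
        conn.close()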
For fast dump/load on validated data see:
https://www.postgresql.org/docs/current/sql-copy.html
Though note that in Postgres 16 and earlier, COPY is all or nothing, so if
there is an error nothing will be loaded. With version 17 you get ON_ERROR
and LOG_VERBOSITY. One way to deal with that is to load into a staging table,
do your validation there, and then move the data to the final table.
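A minimal sketch of that staging-table pattern, assuming psycopg2; the table
and column names (tracer_data, tracer_data_staging, sample) are made up for
illustration:

# Sketch of the staging-table pattern: COPY into staging, validate with SQL,
# then move the rows. Assumes psycopg2; table and column names are made up.
import psycopg2

def load_via_staging(dsn, csv_path):
    conn = psycopg2.connect(dsn)
    try:
        with conn:                       # commit on success, rollback on error
            with conn.cursor() as cur:
                # COPY is all-or-nothing in Postgres 16 and earlier, so a bad
                # row aborts only this attempt, not previously committed data.
                with open(csv_path) as f:
                    cur.copy_expert(
                        "COPY tracer_data_staging FROM STDIN "
                        "WITH (FORMAT csv, HEADER)",
                        f,
                    )
                # Example validation: staged rows referencing an unknown sample.
                cur.execute("""
                    SELECT count(*)
                    FROM tracer_data_staging s
                    LEFT JOIN sample smp ON smp.id = s.sample_id
                    WHERE smp.id IS NULL
                """)
                bad = cur.fetchone()[0]
                if bad:
                    raise ValueError(f"{bad} staged rows reference unknown samples")
                # Move validated rows to the final table and clear staging.
                cur.execute("INSERT INTO tracer_data SELECT * FROM tracer_data_staging")
                cur.execute("TRUNCATE tracer_data_staging")
    finally:
        conn.close()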
As to retry, that depends on where you want to do it. For subtransactions
(SAVEPOINT) see:
https://www.postgresql.org/docs/current/sql-savepoint.html
https://www.cybertec-postgresql.com/en/subtransactions-and-performance-in-postgresql/
In Python there is try/except.
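In a Django app this maps naturally onto nested atomic() blocks, which Django
implements with SAVEPOINTs; a rough sketch (the model and rows argument are
illustrative):

# Sketch of per-row retry inside one load transaction. Django implements a
# nested atomic() block with a SAVEPOINT, so a failing row is rolled back to
# its savepoint without aborting the outer load. Model/row names illustrative.
from django.db import transaction, IntegrityError

def load_rows(Model, rows):
    failures = []
    with transaction.atomic():               # the outer back-end load transaction
        for row in rows:
            try:
                with transaction.atomic():   # SAVEPOINT for just this row
                    Model.objects.create(**row)
            except IntegrityError as exc:
                # ROLLBACK TO SAVEPOINT has happened; record the row and continue.
                failures.append((row, str(exc)))
    return failures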
This brings up a question, though, about a possibility I suspect is not
practical. My initial read of the isolation levels documentation found
this section really promising:
> The Repeatable Read isolation level only sees data committed before
> the transaction began; it never sees either uncommitted data or changes
> committed during transaction execution by concurrent transactions.
This was before I realized that the actions of the previously started
transaction would include "locks" that would block validation even
though the load transaction hasn't committed yet:
> a target row might have already been updated (or deleted or *locked*)
> by another concurrent transaction by the time it is found. In this case,
> the repeatable read transaction will wait for the first updating
> transaction to commit or roll back
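To make that concrete: plain reads under REPEATABLE READ don't block, but any
write the validation issues against a row the uncommitted load has already
locked will wait. A toy illustration, assuming psycopg2, a made-up "compound"
table, and placeholder connection strings:

# Toy illustration of the wait described above. Assumes psycopg2; the
# "compound" table and dbname are placeholders.
import psycopg2
from psycopg2.extensions import ISOLATION_LEVEL_REPEATABLE_READ

load = psycopg2.connect("dbname=mydb")       # back-end load session
check = psycopg2.connect("dbname=mydb")      # validation session
check.set_session(isolation_level=ISOLATION_LEVEL_REPEATABLE_READ)

with load.cursor() as cur:
    # The load updates (and so row-locks) a record; it has not committed yet.
    cur.execute("UPDATE compound SET formula = 'C6H12O6' WHERE id = 1")

with check.cursor() as cur:
    # Plain reads still see the pre-load snapshot and never block...
    cur.execute("SELECT formula FROM compound WHERE id = 1")
    # ...but a validation write that needs the same row lock waits right here
    # until the load session commits or rolls back.
    cur.execute("UPDATE compound SET formula = 'C6H12O6' WHERE id = 1")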
Other documentation I read referred to the state of the DB (when a
transaction starts) as a "snapshot", and I thought... what if I could
save such a snapshot automatically just *before* a back-end load starts,
and have my validation processes validate against that snapshot, so they
never encounter any locks? The validation will never commit, so there's
no risk.
Hmm. I don't think so.
I know Django's ORM wouldn't support that, but I kind of hoped that
someone on this mailing list might suggest snapshot functionality as a
possible solution. Since the validations never commit, the only
downside would be if the back-end load changed something that introduced
a problem with the validated data that would not be fixed until we
actually attempt to load it.
Is that too science-fictiony of an idea?
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx