👍🏻
No, not really. Sample metadata tends to be an afterthought for researchers. They have it in their notebooks, and getting them to enter it at all is like pulling teeth. The validation interface actually has a bunch of other features I haven't mentioned that streamline the process for them. Before it gets to actually validating the data, it tries to lighten the manual burden on the researchers (and help with consistent nomenclature) by pulling sample names out of the raw files, massaging them, and filling those in, along with a mass of common data that is used to populate drop-downs in the Excel columns to avoid researcher typos and value variants (sketched below). Having everything work with Excel actually made the site more attractive to the researchers, because they're comfortable with it and use it already, so it lowered the bar for using our software.

Besides, we don't trust the users enough to enter data unsupervised. There are a lot of aspects of the data that cannot be automatically validated and that involve experimental parameters adjacent to the purpose of our site. We have curators who need to look at everything to ensure consistency, and looking at all the data in context is necessary before any of it is entered.

That said, back in the aughts, I wrote a Perl CGI site for a toxin and virulence factor database that used a web interface for data entry and achieved the curation goal by saving a form of all inter-related data. The submit button sent that form to a list of curators who could approve the insert/update and make it actually happen. I think I had suggested that form of data entry when this current project first started, but I was overruled. In this project, however, the equivalent procedure would be per-sample, and you'd lose the overall context. It's an interesting challenge, but I think we're pretty committed now to this file load path.
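For the curious, populating those drop-downs is conceptually simple. Here's a minimal sketch assuming openpyxl (the column layout and the controlled vocabulary are made-up placeholders, not our actual code):

```python
# Minimal sketch: pre-fill sample names and restrict a metadata column
# to a controlled vocabulary via an Excel drop-down. Assumes openpyxl;
# the sheet layout and vocabulary here are hypothetical.
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

wb = Workbook()
ws = wb.active
ws.append(["Sample Name", "Tissue"])

# Pre-fill sample names parsed out of the raw files (hypothetical values).
for name in ["mouse_01_liver", "mouse_02_liver"]:
    ws.append([name, None])

# Restrict the Tissue column to a fixed list to avoid typos and
# value variants.
dv = DataValidation(type="list", formula1='"liver,plasma,serum"', allow_blank=True)
ws.add_data_validation(dv)
dv.add("B2:B100")

wb.save("sample_metadata.xlsx")
```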
It's the same code that executes in both cases (with or without the `--validate` flag). All the flag effectively does is raise the dry run exception before it leaves the transaction block, so the data is always validated whether the flag is supplied or not. The downside is that the load doesn't fail until the end of the run, which is inefficient from a maintenance perspective. I've been thinking of adding a `--failfast` option for use on the back end, but haven't done it yet. In fact, I started a load yesterday that ran for 2 hours before it buffered an exception related to a newly introduced bug. I fixed the bug and ran the load again, and it finished sometime between COB yesterday and this morning (successfully!).
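In case it helps to see the shape of it, here's a minimal sketch of that dry-run pattern, assuming Django's `transaction.atomic` (`DryRun`, `load_study`, and `insert_record` are hypothetical names, and the `failfast` flag shows the proposed, not-yet-implemented behavior):

```python
# Minimal sketch of the dry-run load pattern, assuming Django's
# transaction.atomic. All names here are hypothetical.
from django.db import transaction

class DryRun(Exception):
    """Raised at the end of a run to force a rollback."""

def insert_record(record):
    """Placeholder for the real per-record model saves."""

def load_study(records, validate=False, failfast=False):
    errors = []
    try:
        with transaction.atomic():
            for record in records:
                try:
                    insert_record(record)
                except Exception as exc:
                    if failfast:
                        raise  # proposed --failfast: abort on the first error
                    errors.append(exc)  # current behavior: buffer until the end
            if validate or errors:
                # Same code path either way; --validate just forces the
                # rollback even when the data is clean.
                raise DryRun()
    except DryRun:
        pass  # everything inside the atomic block was rolled back
    return errors
```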
I have considered a queuing system, though when I previously floated a proof of concept using Celery, I was informed it was too much. At the time, though, all I was trying to do was add a progress bar for a query stats feature, so proposing Celery in this instance may get more traction with the rest of the team. Most of the small validation processes finish in under a dozen seconds, and the largest studies take just under a minute. I have plans to optimize the loading scripts that could hopefully get the largest studies down to a dozen seconds as well. If I could do that, and run the back-end loads in off-peak hours, then I'd be willing to suffer the rare timeouts from concurrent validations. The raw data loads will still likely take much longer.
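If Celery does get traction this time, the validation entry point could be quite small. A minimal sketch assuming a Redis broker (the app name, broker URL, and the `load_study_file` stand-in are all hypothetical):

```python
# Minimal sketch of queuing a validation run, assuming Celery with a
# Redis broker. load_study_file is a hypothetical stand-in for the
# existing load code.
from celery import Celery

app = Celery("dataload", broker="redis://localhost:6379/0")

def load_study_file(path, validate=False):
    """Placeholder for the existing load script entry point."""

@app.task
def validate_study(path):
    # Reuse the same load code in validate mode; the web request can
    # return immediately and poll for the result instead of timing out.
    return load_study_file(path, validate=True)
```

The web view would then kick off `validate_study.delay(path)` and poll for the result rather than blocking the request.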