The software is still in beta for a new major version, so we're adding features and fixing bugs. It's not unusual to encounter a new bug, fix it on dev to get the load to work, and then deploy a point release on prod, which means repeated load attempts that interfere with the validation interface. Beyond this, we're planning a separate staging database to take over the role that dev effectively fills now. Sometimes a curator only finds a technical data issue after the initial load, while browsing the newly loaded data on the dev site.
The context of the 'the largest studies take just under a minute' statement is that validation isn't loading the hefty, time-consuming raw data. It only validates the metadata, which is fast (5-60s), and that metadata is a portion of the transaction in the back-end load. There are errors that validation can miss because it doesn't touch the raw data, and in fact, those errors are addressed by curators editing the Excel sheets. That's why it's all in the load transaction instead of loaded separately, but those problems are somewhat rare (and we currently have a new feature in the design phase that should almost completely eliminate them).
The validation skips the time-consuming raw data load. That raw data is collectively hundreds of gigs in size and could not be uploaded on the validation page anyway. The feature I alluded to above, which would almost completely eliminate the errors associated with the raw data, is one where the researcher can drop the raw data folder into the form and it just walks the directory to get all the raw data file names and relative paths. It's those data relationships whose validations are currently skipped.
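The directory walk itself is simple; here's a rough sketch of the kind of thing I have in mind (the function and its name are hypothetical, since the feature is still in design):

```python
from pathlib import Path

def collect_raw_file_paths(dropped_dir):
    """Walk a dropped raw-data folder and return every file's path
    relative to the folder root (hypothetical sketch for the planned
    validation-form feature; no raw data is read, only names/paths)."""
    root = Path(dropped_dir)
    return sorted(
        str(path.relative_to(root))
        for path in root.rglob("*")
        if path.is_file()
    )
```

Those names and relative paths are all the validation would need to check the file relationships that are currently skipped.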
I'm unfamiliar with retry functionality, but those options sound logical to me as a good path forward, particularly using Celery to spread out validations and doing the back-end loads at night (or using some sort of fast dump/load). The thing that bothers me about the Celery solution is that most of the time, 2 users validating different data will not block, so I would be making users wait for no reason. Ideally, I could anticipate the block and only at that point separate those validations.

This brings up a question, though, about a possibility I suspect is not practical. My initial read of the isolation levels documentation found this section really promising:

> The Repeatable Read isolation level only sees data committed before the transaction began; it never sees either uncommitted data or changes committed during transaction execution by concurrent transactions.

That was before I realized that the actions of the previously started transaction would include "locks" that would block validation even though the load transaction hasn't committed yet:

> a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the repeatable read transaction will wait for the first updating transaction to commit or roll back

Other documentation I read referred to the state of the DB when a transaction starts as a "snapshot", and I thought: what if I could save such a snapshot automatically just before a back-end load starts, and have my validation processes validate against that snapshot so they never encounter any locks? The validation will never commit, so there's no risk. I know Django's ORM wouldn't support that, but I kind of hoped that someone on this email list might suggest a snapshot functionality as a possible solution. Since the validations never commit, the only downside would be if the back-end load changed something that introduces a problem with the validated data, and that problem would not be caught (and fixed) until we actually attempt the load. Is that too science-fictiony of an idea?
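To make the "validation never commits" part concrete, this is roughly what I picture on the Django side: a minimal sketch assuming PostgreSQL, with a hypothetical validate_study() standing in for our real validation code. It doesn't solve the blocking problem (the repeatable read transaction can still wait on rows the load has locked); it just shows the always-rolled-back transaction that a snapshot would plug into.

```python
from django.db import connection, transaction

def validate_without_committing(study_data):
    """Run validation inside a transaction that is always rolled back,
    so nothing it inserts or updates ever persists.  (Sketch only;
    validate_study() is a hypothetical stand-in for our validation code.)"""
    with transaction.atomic():
        with connection.cursor() as cursor:
            # Must be the first statement in the transaction; after this,
            # the transaction only sees data committed before it began.
            cursor.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
        errors = validate_study(study_data)  # hypothetical
        # Never commit: mark the atomic block for rollback on exit.
        transaction.set_rollback(True)
    return errors
```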