👍🏻
No, not really. Sample metadata tends to be an afterthought for researchers. They have it in their notebooks, and getting them to enter it at all is like pulling teeth. The validation interface actually has a bunch of other features I haven't mentioned that streamline the process for them. Before it gets to actually validating the data, it tries to lighten the manual burden on the researchers (and help with consistent nomenclature) by pulling sample names out of the raw files, massaging them, and filling those in, along with a mass of common data that is used to populate drop-downs in the Excel columns to avoid researcher typos and value variants (sketched below). Having everything work with Excel actually made the site more attractive to the researchers, because they're comfortable with it and use it already, so it lowered the bar for using our software.

Besides, we don't trust the users enough to enter data unsupervised. There are a lot of aspects of the data that cannot be automatically validated and that involve experimental parameters adjacent to the purpose of our site. We have curators who need to look at everything to ensure consistency, and looking at all the data in context is necessary before any of it is entered.

That said, back in the aughts, I wrote a Perl CGI site for a toxin and virulence factor database that used a web interface for data entry and achieved the curation goal by saving a form of all inter-related data. The submit button sent that form to a list of curators who could approve the insert/update and make it actually happen. I think I had suggested that form of data entry when this current project first started, but I was overruled. In this project, however, the equivalent procedure would be per-sample, and you'd lose the overall context. It's an interesting challenge, but I think we're pretty committed now to this file load path.
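For the curious, populating those drop-downs is conceptually simple. Here's a minimal sketch assuming openpyxl (the column layout and the controlled vocabulary are made-up placeholders, not our actual code):

```python
# Minimal sketch: pre-fill sample names and restrict a metadata column
# to a controlled vocabulary via an Excel drop-down. Assumes openpyxl;
# the sheet layout and vocabulary here are hypothetical.
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

wb = Workbook()
ws = wb.active
ws.append(["Sample Name", "Tissue"])

# Pre-fill sample names parsed out of the raw files (hypothetical values).
for name in ["mouse_01_liver", "mouse_02_liver"]:
    ws.append([name, None])

# Restrict the Tissue column to a fixed list to avoid typos and
# value variants.
dv = DataValidation(type="list", formula1='"liver,plasma,serum"', allow_blank=True)
ws.add_data_validation(dv)
dv.add("B2:B100")

wb.save("sample_metadata.xlsx")
```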
It's the same code that executes in both cases (with or without the `--validate` flag). All the flag effectively does is raise the dry run exception before it leaves the transaction block, so the data is always validated whether the flag is supplied or not. The downside is that the load doesn't fail until the end of the run, which is inefficient from a maintenance perspective. I've been thinking of adding a `--failfast` option for use on the back end, but haven't done it yet. In fact, I started a load yesterday that ran for 2 hours before it buffered an exception related to a newly introduced bug. I fixed the bug and ran the load again, and it finished sometime between COB yesterday and this morning (successfully!).
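In case it helps to see the shape of it, here's a minimal sketch of that dry-run pattern, assuming Django's `transaction.atomic` (`DryRun`, `load_study`, and `insert_record` are hypothetical names, and the `failfast` flag shows the proposed, not-yet-implemented behavior):

```python
# Minimal sketch of the dry-run load pattern, assuming Django's
# transaction.atomic. All names here are hypothetical.
from django.db import transaction

class DryRun(Exception):
    """Raised at the end of a run to force a rollback."""

def insert_record(record):
    """Placeholder for the real per-record model saves."""

def load_study(records, validate=False, failfast=False):
    errors = []
    try:
        with transaction.atomic():
            for record in records:
                try:
                    insert_record(record)
                except Exception as exc:
                    if failfast:
                        raise  # proposed --failfast: abort on the first error
                    errors.append(exc)  # current behavior: buffer until the end
            if validate or errors:
                # Same code path either way; --validate just forces the
                # rollback even when the data is clean.
                raise DryRun()
    except DryRun:
        pass  # everything inside the atomic block was rolled back
    return errors
```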
I have considered a queuing system, though when I previously floated a proof of concept using Celery, I was informed it was too much. At the time, though, all I was trying to do was add a progress bar for a query stats feature, so proposing Celery in this instance may get more traction with the rest of the team. Most of the small validation processes finish in under a dozen seconds, and the largest studies take just under a minute. I have plans to optimize the loading scripts that could hopefully get the largest studies down to a dozen seconds as well. If I could do that, and run the back-end loads in off-peak hours, then I'd be willing to suffer the rare timeouts from concurrent validations. The raw data loads will still likely take much longer.
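If Celery does get traction this time, the validation entry point could be quite small. A minimal sketch assuming a Redis broker (the app name, broker URL, and the `load_study_file` stand-in are all hypothetical):

```python
# Minimal sketch of queuing a validation run, assuming Celery with a
# Redis broker. load_study_file is a hypothetical stand-in for the
# existing load code.
from celery import Celery

app = Celery("dataload", broker="redis://localhost:6379/0")

def load_study_file(path, validate=False):
    """Placeholder for the existing load script entry point."""

@app.task
def validate_study(path):
    # Reuse the same load code in validate mode; the web request can
    # return immediately and poll for the result instead of timing out.
    return load_study_file(path, validate=True)
```

The web view would then kick off `validate_study.delay(path)` and poll for the result rather than blocking the request.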