Re: Using the database to validate data

Jeff Janes <jeff.janes@xxxxxxxxx> · Fri, 24 Jul 2015 09:47:16 -0700

On Thu, Jul 23, 2015 at 5:55 AM, JPLapham <lapham@xxxxxxxxx> wrote:
Hello,

I have an application that occasionally performs large batch inserts of user

hand-generated data. Input is a tab delimited file with typically hundreds

to a thousand lines of data.

Because the data is generated by hand, there are always many

transaction-stopping errors in a typical input run. For example, missing

datum in a NOT NULL column, a duplicate value in a UNIQUE column, data type

mismatch, FOREIGN KEY reference to something non-existing, etc. Of course,

we chose PostgreSQL exactly because of these problems, because of the robust

transactional control, rollback on errors, etc.

My question is the following. I would like to *test* the data input for

integrity in such a way that I can create a report to the user informing

them of exactly where in their input file to correct the problems.

IDEA 1: My first attempt at this was to simply slurp the data into the

database, collect the errors, and then rollback. Of course (as I now know),

this doesn't work because after the first problem, the database reports,

"current transaction is aborted, commands ignored until end of transaction

block". This means that I can only report to the user the location of the

first problem, and then they run the data again, and keep looping through

the process until the data is good, a huge waste of time.

This is how I usually do it, until it become unbearable or an order comes down from on high to do it differently.  If the errors are due to people being sloppy, then it ought to be annoying for them to be sloppy.  Why make it convenient for them?  If the errors are more excusable than just sloppiness, or if the annoyance is more to you than to the people creating the errors, then you have to go on to other methods.

IDEA 2: My second idea on how to do this was to ROLLBACK after each INSERT.

This allows me to check for things like NOT NULL and data type issues, but

not violations of UNIQUE within the new data.

Issue a savepoint before each insert, and then issue a "release savepoint" if the insert succeeds or a "rollback savepoint" if it does not.  If you release a savepoint, remember that fact so that at the end you rollback the entire transaction instead of committing it.  I rarely actually resort to this.  It might miss some errors in which one failed row failed for multiple reasons, or where one row would have failed had another row not already failed for a different reason.

IDEA 3: Write my own data pre-conditioner. Ugh, what a nightmare, I feel

like I'm writing my own database! Checking for FKEY constraints, UNIQUE, etc

is not trivial. It seems ridiculous to do this when I have the *actual*

database available to test against!

I do this one a lot when I can't get away with method 1, and I don't see why it is a nightmare.  Writing queries against the existing database to see if the new proposed keys exist *is* trivial.  You write them in SQL, not a low level language.  

Testing for internal duplication within the new dataset is a bit harder, I usually do that in Perl with a hash.  There might be cases where Perl and PostgreSQL disagree about when two values are equal, but I've almost never run into them in practise.

Check constraints or character encoding issues or typing issues can be harder to deal with.  If those are likely to be a problem, create a temp or unlogged table with the same check constraints as the real table but without the unique or foreign key constraints and see if each row inserts.

Cheers,

Jeff