Re: Using the database to validate data

Allan Kamau <kamauallan@xxxxxxxxx> · Mon, 27 Jul 2015 08:28:55 +0300

To add onto what others a have said. I would use a bash script (and awk) to prepared each record of the raw CSV file with a dataset name, name of the file, timestamp and a serial number and place the newly generated data into a new file. In this bash script, the value of the dataset name, name of file and timestamp will come from bash variables declared and initialized at the start ensuring that all the records of a given file will have the same values for these three fields.
The dataset name is to help identify that dataset to which these data belongs when we load it into the database. The file name would indicate the name of the file from which these data was obtained. The timestamp is mainly used for versioning purposes. The sequential numbers (one unique number per row) is for auditing purposes.

I would then create a staging schema inside the postgresql database to which you intend to load the data. In this staging schema I would create a table containing all the fields as present in the newly generated CSV file and use TEXT datatype (and NULL constraint) for all these fields with the exception of the field that would hold the serial number which would be set to INTEGER.

Now load the data using COPY staging_schema.sometable(field1, field2, ...) FROM STDIN WITH (FORMAT csv, ...)

Now that we have the data loaded into the "staging_schema.sometable" table, we may now create another table ("staging_schema.sometable_good") also in the staging schema having the same structure as the previous table but this time having all the constrains as well as the "correct" datatypes you plan to have. Create yet another table identical to the first table and call is something like "staging_schema.sometable_bad" to this table we shall write the records that fail insertion into the "staging_schema.sometable_good" table.

Create a PLPGSQL script (with exception handling) to read the staging table one record at a time (perhaps using CURSORS), populate the "staging_schema.sometable_good" table with the data you have just read (from "staging_schema.sometable"). On failure of insert, write the record into the "staging_schema.sometable_bad" table.

After processing of these data, check to see if "staging_schema.sometable_bad" table has no records matching the specific dataset name, file name and timestamp (the first three fields). If any records are found, have a look at them to determine why the insertions to the "staging_schema.sometable_good" table failed.

If no matching records are found in the "staging_schema.sometable_bad" table, write another table ("schema.sometable") in the main schema of the same database. This table should have the same structure as "staging_schema.sometable_good". Simply insert into this table the records returned by doing a filtered query on the "staging_schema.sometable_good". This query should be filtered on the given dataset name, file name and timestamp (the first three fields).

Allan.

On Sat, Jul 25, 2015 at 4:19 PM, rob stone <floriparob@xxxxxxxxx> wrote:
On Thu, 2015-07-23 at 15:34 -0700, JPLapham wrote:

> Tim Clarke wrote

> > Shouldn't be too difficult to import those new rows into one table,

> > write a procedure that inserts them into the real table one by one

> > and

> > logs the validation failure if any - committing good rows and

> > rolling

> > back bad. In fact if you could then write the failures to a third

> > table

> > with a completely relaxed (or no) validation?

>

> Tim-

>

> Thanks for your response. Yes, you are right, it shouldn't be too

> difficult,

> and in fact I have already implemented basically what you suggest,

> see "Idea

> 2" from my original message.

>

> The problem with this approach is that it fails to find violations

> such as

> UNIQUE (there are probably others) from within the input data until

> after

> the first has been committed to the database. But, the error may have

> been

> with that earlier row, not the later.

>

> I want my users to fix all the problems with their data and then load

> it in

> an "all or none" fashion.

>

> -Jon

>

If you have multiple users loading (possibly) multiple files and

(possibly) concurrently, then the only solution is to write some code

to process the data.

You also need to consider load sequence. If user A creates a file that

contains data that will end up creating a new primary key and a file

from user B refers to that, then user B needs to wait until user A's

file has been processed successfully.

Without knowing all the details I can envisage a scenario where data

being loaded could reference "good" data already in the DB as well as

referencing data that exists within that file load, possibly giving a

rejection or an update of the "good" data.

My 2 cents worth!

Cheers,

Rob (in Floripa)

--

Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)

To make changes to your subscription:

http://www.postgresql.org/mailpref/pgsql-general