Re: Immutable datastore library?

"Mike Sofen" <msofen@xxxxxxxxxx> · Tue, 18 Oct 2016 05:50:06 -0700

From: Guyren Howe   Sent: Monday, October 17, 2016 1:40 PM
I would like to use Postgres as an immutable data store. A subject table
would have a timestamp column, and that would be added to what would
otherwise be the primary key.  Trap updates and turn them into inserts. Have
an associated _deleted table. Trap deletes and turn them into inserts of the
primary key into that table.  Create a view that only shows the non-deleted
records with the most recent timestamps.

Stored procedure to do all that to a table. Event trigger to drop and
re-create the view on changes to the table columns.
----------------------
A couple years ago at another firm, I designed and built a real time ODS
(operational data store) for a large healthcare company, on SQL Server 2012
that supported your exact requirements, since the ODS was the primary data
feed for their data warehouse.

My solution leveraged the Merge tsql function ("Upsert" in PG) to detect
inserts/updates/deletes.  We don't allow physical row deletes in medical
data, so these were logical deletes aka an update to an
"InactivatedDatetime" column making it not null.  I used a checksum function
in the Update branch to detect if the inbound data had changed at all, to
avoid creating dry updates (no change in the data but a new Update row would
be written otherwise).

Ok that's the internals for the write to the ODS.  I wrapped the entire
Merge statement inside of another insert statement using the equivalent of
the PG "Into" function, which took every column from the ODS write and wrote
the same data to a "History" table of the same name - those writes were
always inserts, creating a persistent, complete picture of every write to
the ODS.  Each row going into the History tables was marked with a "D"
(delete), "I" (insert) or "U" (update).  The History data was used for both
auditing and for nightly batches feeding the data warehouse, where row type
(D, I, or U) drove their Type 2 processing.  As you can imagine, the table
design was crucial to the success of this model.

This was ultra-efficient on the real time data flowing in from the hospital
and clinical EMRs (different systems) - in one transaction/one query, I was
able to double-write the data and ensure both writes completed or fail both
and error.  The "batches" were small - up to 100k rows or less, and
processed in under 100ms.  But even when there was a network outage and we
had to do a catch up load with millions of rows, it ran very quickly.  IOWs,
the double write overhead was very modest, especially with modern disk
performance.

Mike Sofen (Synthetic Genomics)

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general