Re: table versioning approach (not auditing)

Adam Brusselback <adambrusselback@xxxxxxxxx> · Thu, 2 Oct 2014 10:38:20 -0400

Ended up running for 28 min, but it did work as expected.

On Thu, Oct 2, 2014 at 10:27 AM, Adam Brusselback <adambrusselback@xxxxxxxxx> wrote:
Testing that now.  Initial results are not looking too performant.  
I have one single table which had 234575 updates done to it.  I am rolling back 13093 of them.  It's been running 20 min now, using 100% of a single core, and almost 0 disk.  No idea how long it'll run at this point.

This is on an i5 desktop with 16 gigs of ram and an ssd.

This is a pretty good test though, as it's a real world use case (even if the data was generated with PGBench).  We now know that area needs some work before it can be used for anything more than a toy database.

Thanks,
-Adam

On Thu, Oct 2, 2014 at 7:52 AM, Felix Kunde <felix-kunde@xxxxxx> wrote:

Hey there

Thanks again for the fix. I was able to merge it into my repo.

Also thanks for benchmarking audit. Very interesting results.

I wonder how the recreation of former database states scales when processing many deltas.

Haven’t done a lot of testing in that direction.

I will transfer the code soon to a more public repo on GitHub. As far as I see I have to create an organization for that.

Cheers

Felix

Gesendet: Mittwoch, 01. Oktober 2014 um 17:09 Uhr

Von: "Adam Brusselback" <adambrusselback@xxxxxxxxx>

An: "Felix Kunde" <felix-kunde@xxxxxx>

Cc: "pgsql-general@xxxxxxxxxxxxxx" <pgsql-general@xxxxxxxxxxxxxx>

Betreff: Re:  table versioning approach (not auditing)

I know we're kinda hijacking this thread, so sorry for that.  If you'd like to do that, i'd be more than happy to use it and push any fixes / changes upstream.  I don't have much of a preference on the name either, as long as it's something that makes sense. 

I would consider myself far from an expert though! Either way, more people using a single solution is a good thing.

As a side note, I did some benchmarking this morning and wanted to share the results:

pgbench -i -s 140 -U postgres pgbench

pgbench -c 4 -j 4 -T 600 -U postgres pgbench

no auditing tps: 2854

NOTE: Accounts are audited

auditing tps: 1278

pgbench -c 2 -j 2 -N -T 300 -U postgres pgbench

no auditing tps: 2504

NOTE: Accounts are audited

auditing tps: 822

pgbench -c 2 -j 2 -T 300 -U postgres pgbench

no auditing tps: 1836

NOTE: branches and tellers are audited, accounts are not

auditing tps: 505

I'd love to see if there are some easy wins to boost the performance.

On Wed, Oct 1, 2014 at 5:19 AM, Felix Kunde <felix-kunde@xxxxxx> wrote:

Hey there. Thank you very much for that fix! Thats why I'd like to have a joint development and joint testing. It's way more convincing for users to go for a solution that is tested by some experts than just by a random developer :)

I'm open to create a new project and push the code there. Don't care about the name. Then we might figure out which parts are already good, which parts could be improved and where to go next. I think switching to JSONB for example will be easy, as it offers the same functions than JSON afaik.

Gesendet: Dienstag, 30. September 2014 um 21:16 Uhr

Von: "Adam Brusselback" <adambrusselback@xxxxxxxxx>

An: "Felix Kunde" <felix-kunde@xxxxxx>

Cc: "pgsql-general@xxxxxxxxxxxxxx" <pgsql-general@xxxxxxxxxxxxxx>

Betreff: Re:  table versioning approach (not auditing)

Felix, I'd love to see a single, well maintained project. For example, I just found yours, and gave it a shot today after seeing this post.  I found a bug when an update command is issued, but the old and new values are all the same.  The trigger will blow up.  I've got a fix for that, but if we had one project that more than a handful of people used, stuff like that would be quashed very quickly.

I love the design of it by the way. Any idea what it will take to move to JSONB for 9.4? 

On Tue, Sep 30, 2014 at 7:22 AM, Felix Kunde <felix-kunde@xxxxxx> wrote:Hey

yes i'm adding an additional key to each of my tables. First i wanted to use the primary key as one column in my audit_log table, but in some of my tables the PK consists of more than one column. Plus it's nice to have one key that is called the same over all tables.

To get a former state for one row at date x I need to join the latest delta BEFORE date x with each delta AFTER date x. If I would log complete rows, this joining part would not be neccessary, but as I usually work with spatial databases that have complex geometries and also image files, this strategy is too harddisk consuming.

If there are more users following a similar approach, I wonder why we not throw all the good ideas together, to have one solution that is tested, maintained and improved by more developpers. This would be great.

Felix

Gesendet: Montag, 29. September 2014 um 23:25 Uhr

Von: "Abelard Hoffman" <abelardhoffman@xxxxxxxxx[abelardhoffman@xxxxxxxxx]>

An: "Felix Kunde" <felix-kunde@xxxxxx[felix-kunde@xxxxxx]>

Cc: "pgsql-general@xxxxxxxxxxxxxx[pgsql-general@xxxxxxxxxxxxxx]" <pgsql-general@xxxxxxxxxxxxxx[pgsql-general@xxxxxxxxxxxxxx]>

Betreff: Re:  table versioning approach (not auditing)

Thank you Felix, Gavin, and Jonathan for your responses.

Felix & Jonathan: both of you mention just storing deltas. But if you do that, how do you associate the delta record with the original row? Where's the PK stored, if it wasn't part of the delta?

Felix, thank you very much for the example code. I took a look at your table schemas. I need to study it more, but it looks like the way you're handling the PK, is you're adding a separate synthethic key (audit_id) to each table that's being versioned. And then storing that key along with the delta.

So then to find all the versions of a given row, you just need to join the audit row with the schema_name.table_name.audit_id column. Is that right? The only potential drawback there is there's no referential integrity between the audit_log.audit_id and the actual table.

I do like that approach very much though, in that it eliminates the need to interrogate the json data in order to perform most queries.

AH

On Mon, Sep 29, 2014 at 12:26 AM, Felix Kunde <felix-kunde@xxxxxx[felix-kunde@xxxxxx]> wrote:Hey

i've also tried to implement a database versioning using JSON to log changes in tables. Here it is: https://github.com/fxku/audit[https://github.com/fxku/audit][https://github.com/fxku/audit[https://github.com/fxku/audit]]

I've got two versioning tables, one storing information about all transactions that happened and one where i put the JSON logs of row changes of each table. I'm only logging old values and not complete rows.

Then I got a function that recreates a database state at a given time into a separate schema - either to VIEWs, MVIEWs or TABLES. This database state could then be indexed in order to work with it. You can also reset the production state to the recreated past state.

Unfortunately I've got no time to further work on it at the moment + I have not done tests with many changes in the database so I can't say if the recreation process scales well. On downside I've realised is that using the json_agg function has limits when I've got binary data. It gets too long. So I'm really looking forward using JSONB.

There are more plans in my mind. By having a Transaction_Log table it should be possible to revert only certain transactions. I'm also thinking of parallel versioning, e.g. different users are all working with their version of the database and commit their changes to the production state. As I've got a unique history ID for each table and each row, I should be able to map the affected records.

Have a look and tell me what you think of it.

Cheers

Felix

Gesendet: Montag, 29. September 2014 um 04:00 Uhr

Von: "Abelard Hoffman" <abelardhoffman@xxxxxxxxx[abelardhoffman@xxxxxxxxx]>

An: "pgsql-general@xxxxxxxxxxxxxx[pgsql-general@xxxxxxxxxxxxxx]" <pgsql-general@xxxxxxxxxxxxxx[pgsql-general@xxxxxxxxxxxxxx]>

Betreff:  table versioning approach (not auditing)

Hi. I need to maintain a record of all changes to certain tables so assist in viewing history and reverting changes when necessary (customer service makes an incorrect edit, etc.).

I have studied these two audit trigger examples:

https://wiki.postgresql.org/wiki/Audit_trigger[https://wiki.postgresql.org/wiki/Audit_trigger][https://wiki.postgresql.org/wiki/Audit_trigger[https://wiki.postgresql.org/wiki/Audit_trigger]]

https://wiki.postgresql.org/wiki/Audit_trigger_91plus

I've also read about two other approaches to versioning:

1. maintain all versions in one table, with a flag to indicate which is the current version

2. have a separate versions table for each real table, and insert into the associated version table whenever an update or insert is done.

My current implementation is based on the wiki trigger examples, using a single table, and a json column to record the row changes (rather than hstore). What I like about that, in particular, is I can have a "global," chronological view of all versioned changes very easily.

But there are two types of queries I need to run.

1. Find all changes made by a specific user

2. Find all changes related to a specific record

#1 is simple to do. The versioning table has a user_id column of who made the change, so I can query on that.

#2 is more difficult. I may want to fetch all changes to a group of tables that are all related by foreign keys (e.g., find all changes to "user" record 849, along with any changes to their "articles," "photos," etc.). All of the data is in the json column, of course, but it seems like a pain to try and build a query on the json column that can fetch all those relationships (and if I mess it up, I probably won't generate any errors, since the json is so free-form).

So my question is, do you think using the json approach is wrong for this case? Does it seem better to have separate versioning tables associated with each real table? Or another approach?

Thanks

--

Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx[pgsql-general@xxxxxxxxxxxxxx])

To make changes to your subscription:

http://www.postgresql.org/mailpref/pgsql-general[http://www.postgresql.org/mailpref/pgsql-general]