On Tue, 2006-05-23 at 13:01 -0400, Kenneth Downs wrote:
> Rafal Pietrak wrote:
> > some other INSERT, return immediately if so, but turn into background for
> > a long-lasting job if not".
>
> Rafal, I'm wondering why you want to do this. You may be "fighting the
> framework".

Yes, most probably. I'm afraid of that :( Still...

> If you are trying to do something that is totally unsupported, it is
> probably for a pretty good reason, usually dealing with security or data
> loss. You can probably get what you want by supported methods, but it
> may require looking at the problem in a different way.
>
> What is it you are trying to accomplish? Is it just performance?

OK, here is a 'real life' example. It works much like a post office. Now and then a 'delivery man' arrives with a bunch of post to deliver. The post office takes the batch, checks and stamps each and every item, and hands over a receipt. But the actual binning (into delivery channels) and routing (dispatch) are handled without the delivery man standing there waiting for the process to end.

In my *real-life* case I have a file with hundreds of thousands of 'transactions', uploaded by an operator (several times a day, and in fact by 10-30 operators) - those are checked at INSERT time. But after that check and INSERT, I'd like to tell the operator: "OK, your job is done, don't worry about the rest."

But there is more work to do with the batch. I have to 'route' the transactions to their relevant accounts, see how they change the 'status' of those accounts, and consequently store the updated status within the account itself. This is tedious and time consuming, but it has to be done, because 'customers' query the account status for that 'aggregate status information', and it would be *very* heavy for the database if those queries had to browse the entire 'transaction log'. The ratio of 'transactions' to accounts is like millions to thousands.

A plain INSERT of a batch takes 5-10 minutes on a desktop postgresql (800MHz machine, ATA disks). When I attach a trigger (a *very* simple function) to update the accounts, the INSERT takes hours (2-4). But when I make just one single update of all accounts at the end of the batch insert, it takes 20-30 minutes.

So my solution was:
1) an additional table "TABLE dirty_log(time timestamp, who text)", which gets a row inserted *after* a batch of INSERTs;
2) a 'server side process', run every 5-10 minutes, which does the accounts update, and which:
   A) does NOT launch when another such process is currently running;
   B) purges the DIRTY_LOG table after it's done.
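Here is a minimal sketch of the idea (table and column names are made up for illustration; I'm assuming the periodic job is a plain SQL file kicked off from cron via psql):

    CREATE TABLE accounts (
        account_id  integer PRIMARY KEY,
        status      numeric NOT NULL DEFAULT 0  -- the aggregate status
    );

    CREATE TABLE transactions (
        account_id  integer NOT NULL REFERENCES accounts,
        amount      numeric NOT NULL,
        routed      boolean NOT NULL DEFAULT false
        -- ...plus the heavyweight per-transaction details
    );

    -- one marker row per uploaded batch, written by the www server
    CREATE TABLE dirty_log (
        time  timestamp NOT NULL DEFAULT now(),
        who   text
    );

    -- a dedicated mutex table, so the updater never blocks the uploads
    CREATE TABLE updater_lock (dummy integer);

And the job itself, a single transaction:

    BEGIN;
    -- both UPDATEs below must see one and the same snapshot
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

    -- A) if another copy of the job is still running, this fails at
    --    once and aborts the whole transaction, so this round is a no-op
    LOCK TABLE updater_lock IN ACCESS EXCLUSIVE MODE NOWAIT;

    -- the one bulk UPDATE that takes 20-30 min instead of hours
    UPDATE accounts
       SET status = status + t.delta
      FROM (SELECT account_id, sum(amount) AS delta
              FROM transactions
             WHERE NOT routed
             GROUP BY account_id) AS t
     WHERE accounts.account_id = t.account_id;

    UPDATE transactions SET routed = true WHERE NOT routed;

    -- B) purge the log; marker rows from batches that arrive while
    --    this runs are invisible to the snapshot, so they survive
    --    and trigger the next run
    DELETE FROM dirty_log;
    COMMIT;

The NOWAIT lock is what gives me (A) without any queueing, and running the whole job as one snapshot-isolated transaction gives me (B) without losing batches that arrive in the meantime.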
This is quite obviously a 'user space implementation' of the 'background triggers' I needed. Naturally, with this 'background trigger' I lose accuracy of the account information, but I gain system efficiency - an engineering decision. When we have to make decisions like that (like the implementation above), it's good to have 'system tools' (like 'background triggers') that support us.

But of course I may be wrong altogether. I'd really like to know the technique, 'along the lines' of RDBMS design, that serves the same purpose. Is there a better solution?

NB: the 'batch INSERT' I mentioned above is done by a www server. It's quite vital that the server process terminates (meaning: it must not sit waiting for a LISTEN/NOTIFY event - meaning I cannot use LISTEN/NOTIFY), as apache will keep the connection open until the process ends.

In 'real life' this scenario also applies to a 'service network':

1. Say you have a network of 100,000-1,000,000 vending machines (VMs).
2. Each is loaded with ca. 100 item types (300 types in the entire network).
3. Each VM dispatches an item every 2-3 minutes, which puts the overall 'transaction traffic' at hundreds of transactions per second or more.
4. Assume that for 'customer management' you need to store quite a bit of data with each item-dispense 'transaction'. Meaning: transactions are not very light, and their details have to be kept for a long time.
5. Obviously you need to manage your stock (each of the 300 item types): you keep the VMs loaded and keep some stock at a central store.

(ATMs are a good example of such a network.)

So:

1. The 'transaction traffic' is so significant that you really have to 'optimise for it'.
2. You don't really need to know *exactly* when you run out of stock, because each VM holds a significant local item store. If you get notified that a particular VM is getting close to the bottom with a particular item, you may dispatch a reload in 10 minutes, but it's also OK to dispatch it in 2 hours - meaning the 'account information' does not have to be 'immediately accurate'. Far more important is 'dispatch transaction' performance.
3. Normally you 'keep an eye' on your VM network - meaning you issue a 'statistics' query quite frequently. If that were a 'heavy query', it would degrade database performance quite significantly - we really need the 'aggregate information' stored within the 'item-accounts'.
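To make point 3 concrete, the 'statistics' query should look like the first query below (against a hypothetical item_accounts table holding the precomputed stock), not like the log aggregation it replaces - all names here are again made up for illustration:

    -- cheap: read the precomputed stock straight off the item-accounts
    SELECT vm_id, item_id, stock
      FROM item_accounts
     WHERE stock < 10;

    -- heavy: recompute the dispensed counts by aggregating the raw
    -- transaction log (millions of long-lived rows) on every check -
    -- this is the query that would drag the whole database down
    SELECT vm_id, item_id, count(*) AS dispensed
      FROM dispense_log
     GROUP BY vm_id, item_id;

Is there a clean, 'along the framework' design that serves this reality?

-R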