Re: When to store data that could be derived

"Peter J. Holzer" <hjp-pgsql@xxxxxx> · Sun, 24 Mar 2019 13:41:09 +0100

On 2019-03-24 10:05:02 +0200, Frank wrote:
> 
> 
> On 2019-03-24 9:25 AM, Ron wrote:
> > On 3/24/19 1:42 AM, Frank wrote:
> > > As I understand it, a  general rule of thumb is that you should
> > > never create a physical column if the data could be derived from
> > > existing columns.

The main reason for this rule (which leads to the 3rd normal form) is to
avoid inconsistencies when data is changed.

> > Is this a historical data set that's never updated, or current data
> > that's constantly added to?
> > 
> 
> It is the latter - current data constantly added to.

So the important part here is not whether data is added, but whether
data is changed. Sure, new transactions are added all the time. But is
it expected that the data used to derive amount_cust and amount_local
(e.g. the exchange rate) is changed retroactively, and if it is should
the computed amount change? (I'm a bit worried about the join with the
customers table here - what happens when a customer moves their
headquarters to a country with a different currency?)

> > Sure the second query joins a lot of tables, but is pretty straightforward.
> > 
> > What REALLY worries me is whether or not the query optimiser would look
> > at the WHERE CASE, run away screaming and then make it use sequential
> > scans. Thus, even query #1 would be slow.
> > 
> 
> I had not realised that. I hope someone else chimes in on this.

Your condition is:

    CASE
        WHEN a.tran_type = 'ar_rec' THEN y.posted
        WHEN a.tran_type = 'cb_rec' THEN w.posted
    END = '1'

Think about how you would check that. The straightforward way is to
compute the value of the case/end clause and compare that to 1. But to
compute that value you first need the value of a.tran_type. There are
two possible values here, so maybe an index scan on a.tran_type might be
possible, but I'm not sure whether the optimizer is even smart enought
to figure that out and if it is, whether those to values are selective
enough (Maybe all or most records are either ar_rec or cb_rec). After
that you can retrieve the posted value from the correct table.

As a human I see that the condition can only ever be true for records
from y and w with posted = 1. So it might be better to use index scans
on those columns. But this is not that easy to see, and I don't know
whether the optimizer can do it.

Rewriting the condition as

    (a.tran_type = 'ar_rec' and y.posted = 1) or
    (a.tran_type = 'cb_rec' and w.posted = 1)

might make it easier for the optimizer to find a good plan.

(Please note that I haven't tested any of this. Use EXPLAIN to check
what the optimizer really does.

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@xxxxxx         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
Attachment:
signature.asc

Description: PGP signature