Re: Approach to Data Summary and Analysis

Roxanne Reid-Bennett <rox@xxxxxxxxxxx> · Tue, 15 Apr 2014 18:26:28 -0400

On 4/14/2014 12:27 PM, Robert DiFalco wrote:
And so on for calls_connected, calls_completed, call_errors, etc.

Occasionally I will want to know things like "When was the last time a 
user answered a call" or "How many times has a user been called".
...
Sometimes I might want to get this data for a whole bunch of users.
...
So the other option is to create a call_summary table that is updated 
with triggers.
...
My only issue with a summary table is that I don't want a bunch of 
null fields.
...
But over time all fields would eventually become non-NULL.

So that leads me to a summary table for EACH call state. This approach 
has the down side that it creates a lot of tables and triggers. It has 
the upside of being pretty efficient without having to deal with NULL 
values.  It's also pretty easy to reason about.
...
So for my question -- is the choice between these a personal 
preference sort of thing or is there a right or wrong approach? Am I 
missing another approach that would be better?  I'm okay with SQL but 
I'm not expert so I'm not sure if there is an accepted DESIGN PATTERN 
for this that I am missing.

There is no right or wrong - there is better, worse, best, and worst for 
any specific scenario.  In my experience, most people have time/money to 
get to an 80% "better" design than all the other answers during design 
and then it gets refined over time.  And yes, personal experience does 
play a part in how people interpret better/worse [aka religion] ;)

I didn't see anybody ask these questions - and to identify "better" -  
they have to be asked.
1. How much data are you feeding into your system how fast?
       this directly affects your choices on distribution, parallel 
processing...  writes vs updates vs triggers for copying vs all reads
       [and if on bare metal - potentially where you place your logs, 
indexes, core lookup tables, etc]
2. How much data are you reading out of your system - how fast?
       you have given "simple" use cases (how many calls completed 
within a time frame or to a number)
       you have given very slightly more complex use cases (when was 
the last time John answered a call)
       you have given a slightly more bulky processing question of (how 
many times have these users been called)
So...
      a) How many users executing read queries do you have?
      b) What is the expected load for simple queries (per 
week/day/hour/minute - depending upon your resolution on speed)
      c) What is the expected load for your mid-line complex queries
      d) What is the "maximum" volume you expect a bulk query to go 
after (like all users in the last 18 years, or this city's users in the 
last day?)  and how frequently will that kind of query be executed?  How 
much tolerance for delay do your users have?
      e) do you have any known really complex queries that might bog 
the system down?
      f) How much lag time can you afford between capture and reporting?

Answers to the above define your performance requirements - which 
defines the style of schema you need.  Queries can be written to pull 
data from any schema design - but how fast they can perform or how 
easily they can be created...

Chris and Vincent both targeted a balance between writes and reads - 
which adequately answers 80-85% of the usages out there.  But you didn't 
give us any of the above - so their recommendation (while very likely 
valid) may not actually fit your case at all.

As to design patterns -
"Generally" a database schema is more normalized for an operational 
system because normalization results in fewer writes/updates and lowers 
the risk of corruption if a failure takes place. It also isolates 
updates for any specific value to one location minimizing internally 
caused data corruption.
Reporting systems are generally less normalized because writes are more 
one-time and reads are where the load occurs.
Sometimes you have to use data replication to have a system that 
appropriately supports both.

you have shown you are already aware of normalization.
If you weren't aware of approaches to Data Warehousing... you can review 
information about how it is accomplished
- see the blogs on kimballgroup DOT com  they cover a lot of high(er) 
level concepts with enough specificity to be of some direct use.
[that website also covers some ideas for "Big Data" which aren't 
necessarily limited to RDBMS']

Specify your performance requirements, then figure out your schema design.

FWIW I don't understand your (or any other person's) hesitancy for "lots 
of" "NULL" values.  They provide meaning in a number of different 
ways... not the least of which is that you don't know (yet) - which is 
knowledge in and of itself.

Roxanne

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general