Re: Approach to Data Summary and Analysis

Gavin Flower <GavinFlower@xxxxxxxxxxxxxxxxx> · Wed, 16 Apr 2014 13:27:25 +1200



    On 16/04/14 13:10, Robert DiFalco
      wrote:

    
      1. >500K rows per day into the calls table.
        2. Very rarely. The only common query is gathering users
          that have not been called "today" (along with some other
          qualifying criteria). More analytical queries/reports are done
          for internal use and it is not essential that they be
          lickity-split.  
        a. Usually just one connection at a time executes read
          queries.
        b. the users not called today query will be done once a
          day.
        c. Daily
        d. All users for the last year (if you are asking about
          retention). We will also rarely have to run for all time.
        e. Not that I'm aware of (or seen) today.
        f. For the simple queries we cannot afford latency between
          calls and querying who was already called.
        

        While I don't seem to be getting much support for it here
          :D my write performance (which is most essential) has been
          much better since I further normalized the tables and made it
          so that NULL is never used and data is never updated (i.e. it
          is immutable once it is written). 
        

        As for wanting to avoid NULLs I don't really know what to
          say. Obviously some times NULL's are required. For this design
          I don't really need them and they make the data harder to
          reason about (because they are kind of open to
          interpretation).  They can also give you different results
          than you sometimes expect (for example when looking for a non
          matching key, you start having to inject some OR IS NULLs and
          such). Also, the absence of null can make a lot of queries
          more optimal). That said, I understand where you all are
          coming from with de-normalization. It's definitely the path of
          the least resistance. Our instinct is to want to see all
          related data in a single table when possible. 
        

        The summary table was really a separate point from whether
          or not people liked my schema or not -- I mean whether I
          de-normalize as people are asking or not, there would still be
          the question of a summary table for MAX and COUNT queries or
          to not have a summary table for those. I probably made the
          original question too open ended. 
         
      
        On Tue, Apr 15, 2014 at 3:26 PM,
          Roxanne Reid-Bennett <rox@xxxxxxxxxxx> wrote:

          
            On 4/14/2014 12:27 PM, Robert DiFalco wrote:

            
                And so on for calls_connected, calls_completed,
                call_errors, etc.

                
                Occasionally I will want to know things like "When was
                the last time a user answered a call" or "How many times
                has a user been called".

              
              ...
              

                Sometimes I might want to get this data for a whole
                bunch of users.

              
              ...
              

                So the other option is to create a call_summary table
                that is updated with triggers.

              
              ...
              

                My only issue with a summary table is that I don't want
                a bunch of null fields.

              
              ...
              

                But over time all fields would eventually become
                non-NULL.

                
              So that leads me to a summary table for EACH call state.
              This approach has the down side that it creates a lot of
              tables and triggers. It has the upside of being pretty
              efficient without having to deal with NULL values.  It's
              also pretty easy to reason about.

              ...
              

                So for my question -- is the choice between these a
                personal preference sort of thing or is there a right or
                wrong approach? Am I missing another approach that would
                be better?  I'm okay with SQL but I'm not expert so I'm
                not sure if there is an accepted DESIGN PATTERN for this
                that I am missing.

                
            There is no right or wrong - there is better, worse, best,
            and worst for any specific scenario.  In my experience, most
            people have time/money to get to an 80% "better" design than
            all the other answers during design and then it gets refined
            over time.  And yes, personal experience does play a part in
            how people interpret better/worse [aka religion] ;)

            
            I didn't see anybody ask these questions - and to identify
            "better" -  they have to be asked.

            1. How much data are you feeding into your system how fast?

                   this directly affects your choices on distribution,
            parallel processing...  writes vs updates vs triggers for
            copying vs all reads

                   [and if on bare metal - potentially where you place
            your logs, indexes, core lookup tables, etc]

            2. How much data are you reading out of your system - how
            fast?

                   you have given "simple" use cases (how many calls
            completed within a time frame or to a number)

                   you have given very slightly more complex use cases
            (when was the last time John answered a call)

                   you have given a slightly more bulky processing
            question of (how many times have these users been called)

            So...

                  a) How many users executing read queries do you have?

                  b) What is the expected load for simple queries (per
            week/day/hour/minute - depending upon your resolution on
            speed)

                  c) What is the expected load for your mid-line complex
            queries

                  d) What is the "maximum" volume you expect a bulk
            query to go after (like all users in the last 18 years, or
            this city's users in the last day?)  and how frequently will
            that kind of query be executed?  How much tolerance for
            delay do your users have?

                  e) do you have any known really complex queries that
            might bog the system down?

                  f) How much lag time can you afford between capture
            and reporting?

            
            Answers to the above define your performance requirements -
            which defines the style of schema you need.  Queries can be
            written to pull data from any schema design - but how fast
            they can perform or how easily they can be created...

            
            Chris and Vincent both targeted a balance between writes and
            reads - which adequately answers 80-85% of the usages out
            there.  But you didn't give us any of the above - so their
            recommendation (while very likely valid) may not actually
            fit your case at all.

            
            As to design patterns -

            "Generally" a database schema is more normalized for an
            operational system because normalization results in fewer
            writes/updates and lowers the risk of corruption if a
            failure takes place. It also isolates updates for any
            specific value to one location minimizing internally caused
            data corruption.

            Reporting systems are generally less normalized because
            writes are more one-time and reads are where the load
            occurs.

            Sometimes you have to use data replication to have a system
            that appropriately supports both.

            
            you have shown you are already aware of normalization.

            If you weren't aware of approaches to Data Warehousing...
            you can review information about how it is accomplished

            - see the blogs on kimballgroup DOT com  they cover a lot of
            high(er) level concepts with enough specificity to be of
            some direct use.

            [that website also covers some ideas for "Big Data" which
            aren't necessarily limited to RDBMS']

            
            Specify your performance requirements, then figure out your
            schema design.

            
            FWIW I don't understand your (or any other person's)
            hesitancy for "lots of" "NULL" values.  They provide meaning
            in a number of different ways... not the least of which is
            that you don't know (yet) - which is knowledge in and of
            itself.

                
                Roxanne
            
              
                -- 

                Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)

                To make changes to your subscription:

                http://www.postgresql.org/mailpref/pgsql-general

              
    Have you considered partial indexes? Using the WHERE predicate option of CREATE
      INDEX.

    
    This can be useful if you often look for things that are often only
    a small subset of keys.  For example a partial index on sex would
    useful for nurses, only indexing those that are male as they are in
    a very small minority.

    
    Cheers,

    Gavin