Thanks Roxanne, I suppose when it comes down to it -- for the current use cases and data size -- my only concern is the "calling" query that will need to use MAX to determine whether a user has already had a call today. For a large user set, I would either have to take the MAX of each user's answered timestamps and compare it against today, or do an EXISTS query to see if any timestamp for that user is greater than or equal to "today".
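Concretely, the two shapes I'm weighing look something like this (just a sketch -- assuming a hypothetical calls table with user_id and answered timestamp columns):

    -- MAX shape: aggregate each user's latest answered call,
    -- keep users whose latest call is before midnight today (or who have none)
    SELECT u.id
      FROM users u
      LEFT JOIN calls c ON c.user_id = u.id
     GROUP BY u.id
    HAVING MAX(c.answered) IS NULL
        OR MAX(c.answered) < date_trunc('day', now());

    -- EXISTS shape: keep users with no call answered on or after midnight today
    SELECT u.id
      FROM users u
     WHERE NOT EXISTS (SELECT 1
                         FROM calls c
                        WHERE c.user_id = u.id
                          AND c.answered >= date_trunc('day', now()));

With an index on calls (user_id, answered), the EXISTS form can typically stop at the first qualifying row per user rather than scanning every call.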
But I suppose I just need to construct a huge dataset and see. I was thinking that by keeping a summary, so I always know the last answer or call time for each user, I could keep this from becoming an issue. Over time a single user could have answered calls thousands of times, so the summary would make a "<=" timestamp query scan just (# of users) rows instead of (# of users x 1000), or however many calls they have answered over the non-archived time period.
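The summary I have in mind is roughly this (again just a sketch with hypothetical names; the row would be maintained by the application or an insert trigger, not shown here):

    -- one row per user recording the most recent answered call
    CREATE TABLE user_call_summary (
        user_id       bigint      PRIMARY KEY REFERENCES users (id),
        last_answered timestamptz NOT NULL
    );

    -- "users not called today" then touches one row per user, not one per call
    SELECT u.id
      FROM users u
      LEFT JOIN user_call_summary s ON s.user_id = u.id
     WHERE s.user_id IS NULL
        OR s.last_answered < date_trunc('day', now());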
On Wed, Apr 16, 2014 at 8:42 AM, Roxanne Reid-Bennett <rox@xxxxxxxxxxx> wrote:
On 4/15/2014 9:10 PM, Robert DiFalco wrote:
> 1. >500K rows per day into the calls table.
> 2. Very rarely. The only common query is gathering users that have not been called "today" (along with some other qualifying criteria). More analytical queries/reports are done for internal use, and it is not essential that they be lickety-split.
>    a. Usually just one connection at a time executes read queries.
>    b. The users-not-called-today query will be done once a day.
>    c. Daily.
>    d. All users for the last year (if you are asking about retention). We will also rarely have to run it for all time.
>    e. Not that I'm aware of (or have seen) today.
>    f. For the simple queries we cannot afford latency between calls and querying who was already called.

Based on the above you are primarily capturing data and feeding back essentially one easy-to-find result set [who has NOT been successfully called] on an ongoing single-threaded basis [once per day?]. So you are absolutely correct - tune for writing speed.
> While I don't seem to be getting much support for it here :D my write performance (which is most essential) has been much better since I further normalized the tables and made it so that NULL is never used and data is never updated (i.e. it is immutable once it is written).
Do you know your answer?
> The summary table was really a separate point from whether or not people liked my schema -- whether I de-normalize as people are asking or not, there would still be the question of keeping a summary table for MAX and COUNT queries or not having one. I probably made the original question too open-ended.
You said: "Occasionally I will want to know things like ..."
On query frequency you answered "the users not called today query will be done once a day," as was (c) [I'm assuming once?],
and (d) appears to be "ad-hoc", and you said your users can deal with latency in response for those.
So finding Min/Max/Count quickly really *doesn't* matter for tuning.
So the only reason I can see to add a summary table is to ... simplify maintenance [note I did NOT say "development"] and then only IF it doesn't impact the write speeds beyond an acceptable level. Proper internal / external documentation can mitigate maintenance nightmares. If your developer(s) can't figure out how to get the data they need from the schema - then give them the queries to run. [you are likely better at tuning those anyway]
Last consideration - business consumption of data does change over time. Disk space is cheap [but getting and keeping speed sometimes isn't]. You might consider including ongoing partial archival of the operational data during slow usage (write) periods.
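For example, something along these lines run during a quiet window (purely illustrative - hypothetical names, and it assumes an archive table shaped like your calls table):

    -- create the archive once, matching the operational table's shape
    CREATE TABLE calls_archive (LIKE calls INCLUDING ALL);

    -- periodically move rows older than the retention window in one statement
    WITH moved AS (
        DELETE FROM calls
         WHERE answered < now() - interval '1 year'
         RETURNING *
    )
    INSERT INTO calls_archive
    SELECT * FROM moved;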
Roxanne