Rob W wrote:
Can anyone point me towards good articles or books that would help a PostgreSQL novice (i.e. me) learn the optimal approaches to setting up a DB for analytics?
In this particular case, I need to efficiently analyze approximately 300 million system log events (i.e. time series data). It's log data, so it's only appended to the table, not inserted and is never modified. Only 90 days worth of data will be retained, so old records need to be deleted periodically. Query performance will only be important for small subsets of the data (e.g. when analyzing a week or day's worth of data), the rest of the reports will be run in batch mode. There will likely only be one user at a time doing ad-hoc queries.
This is a a follow-up to the earlier suggestions that PostgreSQL will handle the volumes of data I plan to work with, so I figured I'd give it a shot.
one approach to speeding up the handling of time expired data like this
is to partition it, maybe by week. eg, you create a seperate table for
each of 14 weeks, and have a view that joins them all for doing
queries. you insert your new records to the latest week table, then
each week truncate the oldest week table and switch to using that one
for the new inserts.... this is more efficient than having one large
table and deleting individual rows.
you can speed up the inserts some by doing them in batches, for
instance, collecting a few minutes worth of new records, and inserting
them all as one transaction. depending on how many fields of these
tables are indexed, this can greatly reduce the overhead of maintaining
those indices.
see http://www.postgresql.org/docs/current/static/ddl-partitioning.html
for more on this sort of partitioning. Above, I mentioned using a view
to read the whole table as a join, this page discusses using inheritance
instead, which has advantages.
if your reporting requirements include the sorts of statistics that can
be precomputed, it can be advantageous to keep a set of running tallies
in separate tables, like per hour and per day counts for each event
class, which can be used to reduce the amount of bulk querying required
to generate statistical count reports. of course, these tally tables
also need aging, but there's much MUCH less data in them so conventional
row deletes is probably fine.
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general