
Re: Optimizing tuning and table design for large analytics DB

Rob W wrote:
Can anyone point me towards good articles or books that would help a PostgreSQL novice (i.e. me) learn the optimal approaches to setting up a DB for analytics?

In this particular case, I need to efficiently analyze approximately 300 million system log events (i.e. time series data). It's log data, so rows are only ever appended to the table and never modified. Only 90 days' worth of data will be retained, so old records need to be deleted periodically. Query performance will only be important for small subsets of the data (e.g. when analyzing a day's or week's worth); the rest of the reports will be run in batch mode. There will likely be only one user at a time doing ad-hoc queries.

This is a follow-up to the earlier suggestions that PostgreSQL will handle the volumes of data I plan to work with, so I figured I'd give it a shot.


One approach to speeding up the handling of time-expired data like this is to partition it, maybe by week. E.g., you create a separate table for each of 14 weeks, and have a view that UNIONs them all together for queries. You insert your new records into the latest week's table, then each week truncate the oldest week's table and switch to using that one for the new inserts. This is much more efficient than having one large table and deleting individual rows, since TRUNCATE reclaims the space almost instantly. A rough sketch of the rotation follows.
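Something like this, say (table and column names are invented for illustration):

    CREATE TABLE log_week_01 (
        logged_at  timestamptz NOT NULL,
        event_type text        NOT NULL,
        message    text
    );
    -- ... repeat for log_week_02 through log_week_14 ...

    CREATE VIEW log_events AS
        SELECT * FROM log_week_01
        UNION ALL
        SELECT * FROM log_week_02
        -- ... UNION ALL the remaining week tables ...
        ;

    -- Weekly rotation: empty the oldest slot, then point new inserts at it.
    TRUNCATE log_week_01;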

You can also speed up the inserts by doing them in batches: for instance, collect a few minutes' worth of new records and insert them all as one transaction. Depending on how many fields of these tables are indexed, this can greatly reduce the overhead of maintaining those indexes.
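For example, again with the invented names from above, flushing a collected batch might look like:

    BEGIN;
    INSERT INTO log_week_01 (logged_at, event_type, message) VALUES
        ('2009-06-01 10:00:01', 'sshd',   'accepted publickey'),
        ('2009-06-01 10:00:02', 'kernel', 'eth0 link up'),
        -- ... the rest of the batch's rows ...
        ('2009-06-01 10:00:03', 'sshd',   'session closed');
    COMMIT;

    -- For large batches, COPY is faster still (file path is hypothetical):
    -- COPY log_week_01 FROM '/var/tmp/batch.csv' WITH CSV;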

See http://www.postgresql.org/docs/current/static/ddl-partitioning.html for more on this sort of partitioning. Above I mentioned using a view to read all the partitions as one table; that page discusses using inheritance instead, which has advantages, e.g. the planner can skip child tables that can't match the query.
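A sketch of the inheritance flavor (again with invented names; each child carries a CHECK constraint describing its week):

    CREATE TABLE logs (
        logged_at  timestamptz NOT NULL,
        event_type text        NOT NULL,
        message    text
    );

    -- One child per week; it inherits the parent's columns.
    CREATE TABLE logs_2009w23 (
        CHECK (logged_at >= '2009-06-01' AND logged_at < '2009-06-08')
    ) INHERITS (logs);

    SET constraint_exclusion = on;
    -- Querying the parent scans only children whose CHECK can match:
    SELECT count(*) FROM logs
    WHERE logged_at >= '2009-06-05' AND logged_at < '2009-06-06';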

If your reporting requirements include the sorts of statistics that can be precomputed, it can be advantageous to keep a set of running tallies in separate tables, like per-hour and per-day counts for each event class; these can be used to reduce the amount of bulk querying required to generate statistical count reports. Of course, these tally tables also need aging, but there's much, much less data in them, so conventional row deletes are probably fine.
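One way to sketch that (schema invented; here each batch load appends its own aggregates, and reports just sum over them):

    CREATE TABLE hourly_tally (
        hour       timestamptz NOT NULL,
        event_type text        NOT NULL,
        n          bigint      NOT NULL
    );
    CREATE INDEX hourly_tally_hour_idx ON hourly_tally (hour);

    -- Run as part of each batch load; batch_staging is assumed to hold
    -- only the current batch's rows.
    INSERT INTO hourly_tally (hour, event_type, n)
    SELECT date_trunc('hour', logged_at), event_type, count(*)
      FROM batch_staging
     GROUP BY 1, 2;

    -- Count reports read the small tally table instead of the raw logs:
    SELECT event_type, sum(n)
      FROM hourly_tally
     WHERE hour >= now() - interval '7 days'
     GROUP BY event_type;

    -- Aging the tallies out is an ordinary delete:
    DELETE FROM hourly_tally WHERE hour < now() - interval '90 days';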


