david@xxxxxxx wrote:
On Tue, 21 Apr 2009, Stephen Frost wrote:
* david@xxxxxxx (david@xxxxxxx) wrote:
while I fully understand the 'benchmark your situation' need, this isn't
that simple.
It really is. You know your application, you know it's primary use
cases, and probably have some data to play with. You're certainly in a
much better situation to at least *try* and benchmark it than we are.
rsyslog is a syslog server. it replaces (or for debian and fedora, has
replaced) your standard syslog daemon. it recieves log messages from
every app on your system (and possibly others), filters, maniulates
them, and then stores them somewhere. among the places that it can store
the logs are database servers (native support for MySQL, PostgreSQL, and
Oracle. plus libdbi for others)
Well, from a performance standpoint the obvious things to do are:
1. Keep a connection open, do NOT reconnect for each log-statement
2. Batch log statements together where possible
3. Use prepared statements
4. Partition the tables by day/week/month/year (configurable I suppose)
The first two are vital, the third takes you a step further. The fourth
is a long-term admin thing.
And possibly
5. Have two connections, one for fatal/error etc and one for info/debug
level log statements (configurable split?). Then you can use the
synchronous_commit setting on the less important ones. Might buy you
some performance on a busy system.
http://www.postgresql.org/docs/8.3/interactive/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS
other apps then search and report on the data after it is stored. what
apps?, I don't know either. pick your favorite reporting tool and you'll
be a step ahead of me (I don't know a really good reporting tool)
as for sample data, you have syslog messages, just like I do. so you
have the same access to data that I have.
how would you want to query them? how would people far less experianced
that you want to query them?
I can speculate that some people would do two columns (time, everything
else), others will do three (time, server, everything else), and others
will go further (I know some who would like to extract IP addresses
embedded in a message into their own column). some people will index on
the time and host, others will want to do full-text searches of everything.
Well, assuming it looks much like traditional syslog, I would do
something like: (timestamp, host, facility, priority, message). It's
easy enough to stitch back together if people want that.
PostgreSQL's full-text indexing is quite well suited to logfiles I'd
have thought, since it knows about filenames, urls etc already.
If you want to get fancy, add a msg_type column and one subsidiary table
for each msg_type. So - you might have smtp_connect_from (hostname,
ip_addr). A set of perl regexps can match and extract the fields for
these extra tables, or you could do it with triggers inside the
database. I think it makes sense to do it in the application. Easier for
users to contribute new patterns/extractions. Meanwhile, the core table
is untouched so you don't *need* to know about these extra tables.
If you have subsidiary tables, you'll want to partition those too and
perhaps stick them in their own schema (logs200901, logs200902 etc).
--
Richard Huxton
Archonet Ltd
--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance