Re: performance for high-volume log insertion

Richard Huxton <dev@xxxxxxxxxxxx> · Tue, 21 Apr 2009 09:56:23 +0100

david@xxxxxxx wrote:
On Tue, 21 Apr 2009, Stephen Frost wrote:

* david@xxxxxxx (david@xxxxxxx) wrote:
while I fully understand the 'benchmark your situation' need, this isn't
that simple.

It really is.  You know your application, you know it's primary use
cases, and probably have some data to play with.  You're certainly in a
much better situation to at least *try* and benchmark it than we are.

rsyslog is a syslog server. it replaces (or for debian and fedora, has 
replaced) your standard syslog daemon. it recieves log messages from 
every app on your system (and possibly others), filters, maniulates 
them, and then stores them somewhere. among the places that it can store 
the logs are database servers (native support for MySQL, PostgreSQL, and 
Oracle. plus libdbi for others)

Well, from a performance standpoint the obvious things to do are:
1. Keep a connection open, do NOT reconnect for each log-statement
2. Batch log statements together where possible
3. Use prepared statements
4. Partition the tables by day/week/month/year (configurable I suppose)

The first two are vital, the third takes you a step further. The fourth 
is a long-term admin thing.

And possibly
5. Have two connections, one for fatal/error etc and one for info/debug 
level log statements (configurable split?). Then you can use the 
synchronous_commit setting on the less important ones. Might buy you 
some performance on a busy system.

http://www.postgresql.org/docs/8.3/interactive/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS

other apps then search and report on the data after it is stored. what 
apps?, I don't know either. pick your favorite reporting tool and you'll 
be a step ahead of me (I don't know a really good reporting tool)

as for sample data, you have syslog messages, just like I do. so you 
have the same access to data that I have.

how would you want to query them? how would people far less experianced 
that you want to query them?

I can speculate that some people would do two columns (time, everything 
else), others will do three (time, server, everything else), and others 
will go further (I know some who would like to extract IP addresses 
embedded in a message into their own column). some people will index on 
the time and host, others will want to do full-text searches of everything.

Well, assuming it looks much like traditional syslog, I would do 
something like: (timestamp, host, facility, priority, message). It's 
easy enough to stitch back together if people want that.

PostgreSQL's full-text indexing is quite well suited to logfiles I'd 
have thought, since it knows about filenames, urls etc already.

If you want to get fancy, add a msg_type column and one subsidiary table 
for each msg_type. So - you might have smtp_connect_from (hostname, 
ip_addr). A set of perl regexps can match and extract the fields for 
these extra tables, or you could do it with triggers inside the 
database. I think it makes sense to do it in the application. Easier for 
users to contribute new patterns/extractions. Meanwhile, the core table 
is untouched so you don't *need* to know about these extra tables.

If you have subsidiary tables, you'll want to partition those too and 
perhaps stick them in their own schema (logs200901, logs200902 etc).

--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance