Re: performance for high-volume log insertion

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



david@xxxxxxx wrote:
On Tue, 21 Apr 2009, Stephen Frost wrote:

* david@xxxxxxx (david@xxxxxxx) wrote:
while I fully understand the 'benchmark your situation' need, this isn't
that simple.

It really is.  You know your application, you know it's primary use
cases, and probably have some data to play with.  You're certainly in a
much better situation to at least *try* and benchmark it than we are.

rsyslog is a syslog server. it replaces (or for debian and fedora, has replaced) your standard syslog daemon. it recieves log messages from every app on your system (and possibly others), filters, maniulates them, and then stores them somewhere. among the places that it can store the logs are database servers (native support for MySQL, PostgreSQL, and Oracle. plus libdbi for others)

Well, from a performance standpoint the obvious things to do are:
1. Keep a connection open, do NOT reconnect for each log-statement
2. Batch log statements together where possible
3. Use prepared statements
4. Partition the tables by day/week/month/year (configurable I suppose)

The first two are vital, the third takes you a step further. The fourth is a long-term admin thing.

And possibly
5. Have two connections, one for fatal/error etc and one for info/debug level log statements (configurable split?). Then you can use the synchronous_commit setting on the less important ones. Might buy you some performance on a busy system.

http://www.postgresql.org/docs/8.3/interactive/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS

other apps then search and report on the data after it is stored. what apps?, I don't know either. pick your favorite reporting tool and you'll be a step ahead of me (I don't know a really good reporting tool)

as for sample data, you have syslog messages, just like I do. so you have the same access to data that I have.

how would you want to query them? how would people far less experianced that you want to query them?

I can speculate that some people would do two columns (time, everything else), others will do three (time, server, everything else), and others will go further (I know some who would like to extract IP addresses embedded in a message into their own column). some people will index on the time and host, others will want to do full-text searches of everything.

Well, assuming it looks much like traditional syslog, I would do something like: (timestamp, host, facility, priority, message). It's easy enough to stitch back together if people want that.

PostgreSQL's full-text indexing is quite well suited to logfiles I'd have thought, since it knows about filenames, urls etc already.

If you want to get fancy, add a msg_type column and one subsidiary table for each msg_type. So - you might have smtp_connect_from (hostname, ip_addr). A set of perl regexps can match and extract the fields for these extra tables, or you could do it with triggers inside the database. I think it makes sense to do it in the application. Easier for users to contribute new patterns/extractions. Meanwhile, the core table is untouched so you don't *need* to know about these extra tables.

If you have subsidiary tables, you'll want to partition those too and perhaps stick them in their own schema (logs200901, logs200902 etc).

--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux