Re: performance for high-volume log insertion

On Tue, 21 Apr 2009, Stephen Frost wrote:

> * david@xxxxxxx (david@xxxxxxx) wrote:
>> while I fully understand the 'benchmark your situation' need, this isn't
>> that simple.
>
> It really is.  You know your application, you know its primary use
> cases, and probably have some data to play with.  You're certainly in a
> much better situation to at least *try* and benchmark it than we are.

rsyslog is a syslog server. it replaces (or, for debian and fedora, has replaced) your standard syslog daemon. it receives log messages from every app on your system (and possibly others), filters and manipulates them, and then stores them somewhere. among the places it can store the logs are database servers (native support for MySQL, PostgreSQL, and Oracle, plus libdbi for others)

other apps then search and report on the data after it is stored. what apps? I don't know either. pick your favorite reporting tool and you'll be a step ahead of me (I don't know a really good reporting tool)

as for sample data, you have syslog messages, just like I do. so you have the same access to data that I have.

how would you want to query them? how would people far less experienced than you want to query them?


I can speculate that some people would do two columns (time, everything else), others will do three (time, server, everything else), and others will go further (I know some who would like to extract IP addresses embedded in a message into their own column). some people will index on the time and host, others will want to do full-text searches of everything.
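As a concrete illustration of the middle option above, here is a minimal sketch of a three-column schema (time, server, everything else) with the time and time+host indexes mentioned. It uses Python's stdlib sqlite3 purely so the sketch is self-contained; the table and column names are hypothetical, and a real deployment would target PostgreSQL.

```python
import sqlite3

# sqlite3 is a stand-in here so the sketch runs anywhere; the real target
# would be PostgreSQL.  Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE syslog (
        received_at TEXT NOT NULL,   -- message timestamp
        host        TEXT NOT NULL,   -- server that sent it
        message     TEXT NOT NULL    -- everything else, unparsed
    )
""")
# the two index choices speculated about above: time, and host+time
conn.execute("CREATE INDEX syslog_time_idx ON syslog (received_at)")
conn.execute("CREATE INDEX syslog_host_idx ON syslog (host, received_at)")

conn.execute("INSERT INTO syslog VALUES (?, ?, ?)",
             ("2009-04-21T10:00:00", "web1", "sshd[123]: accepted publickey"))
rows = conn.execute("SELECT host, message FROM syslog").fetchall()
print(rows)
```

The two-column variant just drops `host`, and the "extract IP addresses" variant adds more columns; the insert statement changes shape accordingly, which is exactly why rsyslog can't assume one schema.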


I can talk about the particular use case I have at work, but that would be far from normal (full-text searches on 10s of TB of data, plus reports, etc). we don't (currently) use postgres to do that, and I'm not sure how I would configure postgres for that sort of load, so I don't think that my personal situation is a good fit. I looked at bizgres a few years ago, but at the time I didn't know enough about what I was trying to do, or how much data I was trying to deal with, to go forward with it.


do I do the benchmark on the type of hardware that I use for the system above (after spending how much time experimenting to find the correct tuning)? or on a stock, untuned postgres running on a desktop-type system (we all know how horrible the defaults are)? how am I supposed to know whether the differences I see in my 'benchmarks' are the result of the differences between the commands, and not that I missed some critical knob to turn?

benchmarking is absolutely the right answer for some cases, especially when someone is asking exactly how something will work for them. but in this case I don't have the specific use case. I am trying to find out where the theoretical advantages are for these things that 'everybody knows you should do with a database', to understand the probability that they will make a difference in this case.

>> in this case we are trying to decide what API/interface to use in an
>> infrastructure tool that will be distributed in common distros (it's now
>> the default syslog package of debian and fedora). there are so many
>> variables in hardware, software, and load that trying to benchmark it
>> becomes effectively impossible.

> You don't need to know how it will perform in every situation.  The main
> question you have is whether using prepared queries is faster or not, so
> pick a common structure, create a table, get some data, and test.  I can
> say that prepared queries will be more likely to give you a performance
> boost with wider tables (more columns).

this is very helpful. I can't say what the schema would look like, but I would guess that it will tend towards the narrow side (or at least not update very many columns explicitly)
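Stephen's "create a table, get some data, and test" suggestion can be sketched in a few lines. This is an illustration only, again using stdlib sqlite3 as a stand-in: sqlite3 caches the compiled parameterized statement across `executemany`, which is analogous to (but not the same as) PostgreSQL's prepared queries, so real numbers would have to come from PostgreSQL itself.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE syslog (received_at TEXT, host TEXT, message TEXT)")

rows = [("2009-04-21T10:00:%02d" % (i % 60), "web1", "message %d" % i)
        for i in range(20000)]

# variant 1: build a fresh SQL string per message (server re-parses each one)
start = time.time()
for r in rows:
    conn.execute("INSERT INTO syslog VALUES ('%s', '%s', '%s')" % r)
literal_secs = time.time() - start

# variant 2: one parameterized statement reused for every message
# (the compiled statement is cached, analogous to a prepared query)
start = time.time()
conn.executemany("INSERT INTO syslog VALUES (?, ?, ?)", rows)
prepared_secs = time.time() - start

total = conn.execute("SELECT COUNT(*) FROM syslog").fetchone()[0]
print("literal: %.3fs  prepared-style: %.3fs  rows: %d"
      % (literal_secs, prepared_secs, total))
```

Swapping in a wider table (more columns per row) is where Stephen predicts the gap would widen, since the per-statement parse cost grows with the statement.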

>> based on Stephen's explanation of where binary could help, I think the
>> easy answer is to decide not to bother with it (the postgres text to X
>> converters get far more optimization attention than anything rsyslog
>> could deploy)

> While that's true, there's no substitute for not having to do a
> conversion at all.  After all, it's a lot cheaper to do a bit of
> byte-swapping on an integer value that's already an integer in memory
> than to sprintf and atoi it.

but it's not an integer in memory; it's text that arrived over the network or through a socket as a log message from another application.
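That point can be made concrete. When the source value is already text, binary transfer doesn't avoid the parse, it just moves it to the client: the client does the atoi before doing the byte-swap. A sketch (in Python for brevity; `"!i"` packs a 4-byte network-byte-order integer, which is also what PostgreSQL's binary int4 wire format expects; the field value is hypothetical):

```python
import struct

# a numeric field in a syslog message arrives as text, e.g. a port number
raw_field = "514"

# text-mode transfer: pass the bytes through unchanged, server parses them
text_param = raw_field.encode("ascii")

# binary-mode transfer: the client must parse the text first (the atoi
# step), then emit a network-byte-order integer (the byte-swap step)
value = int(raw_field)                   # atoi now happens client-side
binary_param = struct.pack("!i", value)  # 4 bytes, big-endian

print(text_param, binary_param)
```

So for data that is text end-to-end, binary parameters only shift where the conversion runs, rather than eliminating it, which supports the "don't bother" conclusion above.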

David Lang

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
