Re: performance for high-volume log insertion

On Tue, 21 Apr 2009, Stephen Frost wrote:

> * david@xxxxxxx (david@xxxxxxx) wrote:
>> while I fully understand the 'benchmark your situation' need, this isn't
>> that simple.
>
> It really is.  You know your application, you know its primary use
> cases, and probably have some data to play with.  You're certainly in a
> much better situation to at least *try* and benchmark it than we are.

rsyslog is a syslog server. it replaces (or, for debian and fedora, has replaced) your standard syslog daemon. it receives log messages from every app on your system (and possibly others), filters and manipulates them, and then stores them somewhere. among the places it can store the logs are database servers (native support for MySQL, PostgreSQL, and Oracle, plus libdbi for others)

other apps then search and report on the data after it is stored. what apps? I don't know either. pick your favorite reporting tool and you'll be a step ahead of me (I don't know a really good reporting tool)

as for sample data, you have syslog messages, just like I do. so you have the same access to data that I have.

how would you want to query them? how would people far less experienced than you want to query them?


I can speculate that some people would do two columns (time, everything else), others will do three (time, server, everything else), and others will go further (I know some who would like to extract IP addresses embedded in a message into their own column). some people will index on the time and host, others will want to do full-text searches of everything.
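As a concrete illustration of the middle option above, here is a minimal sketch of a three-column schema (time, server, everything else) with the time and time+host indexes mentioned. It uses Python's stdlib sqlite3 purely so the sketch is self-contained; the table and column names are hypothetical, and a real deployment would target PostgreSQL.

```python
import sqlite3

# sqlite3 is a stand-in here so the sketch runs anywhere; the real target
# would be PostgreSQL.  Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE syslog (
        received_at TEXT NOT NULL,   -- message timestamp
        host        TEXT NOT NULL,   -- server that sent it
        message     TEXT NOT NULL    -- everything else, unparsed
    )
""")
# the two index choices speculated about above: time, and host+time
conn.execute("CREATE INDEX syslog_time_idx ON syslog (received_at)")
conn.execute("CREATE INDEX syslog_host_idx ON syslog (host, received_at)")

conn.execute("INSERT INTO syslog VALUES (?, ?, ?)",
             ("2009-04-21T10:00:00", "web1", "sshd[123]: accepted publickey"))
rows = conn.execute("SELECT host, message FROM syslog").fetchall()
print(rows)
```

The two-column variant just drops `host`, and the "extract IP addresses" variant adds more columns; the insert statement changes shape accordingly, which is exactly why rsyslog can't assume one schema.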


I can talk about the particular use case I have at work, but that would be far from normal (full-text searches on 10s of TB of data, plus reports, etc). we don't (currently) use postgres to do that, and I'm not sure how I would configure postgres for that sort of load, so I don't think that my personal situation is a good fit. I looked at bizgres a few years ago, but at the time I didn't know enough about what I was trying to do, or how much data I was trying to deal with, to go forward with it.


do I do the benchmark on the type of hardware that I use for the system above (after spending how much time experimenting to find the correct tuning)? or on a stock, untuned postgres running on a desktop-type system (we all know how horrible the defaults are)? how am I supposed to know whether the differences I see in my 'benchmarks' are the result of the differences between the commands, and not that I missed some critical knob to turn?

benchmarking is absolutely the right answer for some cases, especially when someone is asking exactly how something will work for them. but in this case I don't have the specific use case. I am trying to find out where the theoretical advantages are for these things that 'everybody knows you should do with a database', to understand the probability that they will make a difference in this case.

>> in this case we are trying to decide what API/interface to use in an
>> infrastructure tool that will be distributed in common distros (it's now
>> the default syslog package of debian and fedora). there are so many
>> variables in hardware, software, and load that trying to benchmark it
>> becomes effectively impossible.

> You don't need to know how it will perform in every situation.  The main
> question you have is whether using prepared queries is faster or not, so
> pick a common structure, create a table, get some data, and test.  I can
> say that prepared queries will be more likely to give you a performance
> boost with wider tables (more columns).

this is very helpful. I can't say what the schema would look like, but I would guess that it will tend towards the narrow side (or at least not update very many columns explicitly)
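Stephen's "create a table, get some data, and test" suggestion can be sketched in a few lines. This is an illustration only, again using stdlib sqlite3 as a stand-in: sqlite3 caches the compiled parameterized statement across `executemany`, which is analogous to (but not the same as) PostgreSQL's prepared queries, so real numbers would have to come from PostgreSQL itself.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE syslog (received_at TEXT, host TEXT, message TEXT)")

rows = [("2009-04-21T10:00:%02d" % (i % 60), "web1", "message %d" % i)
        for i in range(20000)]

# variant 1: build a fresh SQL string per message (server re-parses each one)
start = time.time()
for r in rows:
    conn.execute("INSERT INTO syslog VALUES ('%s', '%s', '%s')" % r)
literal_secs = time.time() - start

# variant 2: one parameterized statement reused for every message
# (the compiled statement is cached, analogous to a prepared query)
start = time.time()
conn.executemany("INSERT INTO syslog VALUES (?, ?, ?)", rows)
prepared_secs = time.time() - start

total = conn.execute("SELECT COUNT(*) FROM syslog").fetchone()[0]
print("literal: %.3fs  prepared-style: %.3fs  rows: %d"
      % (literal_secs, prepared_secs, total))
```

Swapping in a wider table (more columns per row) is where Stephen predicts the gap would widen, since the per-statement parse cost grows with the statement.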

>> based on Stephen's explanation of where binary could help, I think the
>> easy answer is to decide not to bother with it (the postgres text to X
>> converters get far more optimization attention than anything rsyslog
>> could deploy)

> While that's true, there's no substitute for not having to do a
> conversion at all.  After all, it's a lot cheaper to do a bit of
> byte-swapping on an integer value that's already an integer in memory
> than to sprintf and atoi it.

but it's not an integer in memory; it's text that arrived over the network or through a socket as a log message from another application.
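That point can be made concrete. When the source value is already text, binary transfer doesn't avoid the parse, it just moves it to the client: the client does the atoi before doing the byte-swap. A sketch (in Python for brevity; `"!i"` packs a 4-byte network-byte-order integer, which is also what PostgreSQL's binary int4 wire format expects; the field value is hypothetical):

```python
import struct

# a numeric field in a syslog message arrives as text, e.g. a port number
raw_field = "514"

# text-mode transfer: pass the bytes through unchanged, server parses them
text_param = raw_field.encode("ascii")

# binary-mode transfer: the client must parse the text first (the atoi
# step), then emit a network-byte-order integer (the byte-swap step)
value = int(raw_field)                   # atoi now happens client-side
binary_param = struct.pack("!i", value)  # 4 bytes, big-endian

print(text_param, binary_param)
```

So for data that is text end-to-end, binary parameters only shift where the conversion runs, rather than eliminating it, which supports the "don't bother" conclusion above.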

David Lang

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
