Re: performance for high-volume log insertion

On Tue, 21 Apr 2009, Stephen Frost wrote:

David,

* david@xxxxxxx (david@xxxxxxx) wrote:
I thought that part of the 'efficiency' and 'performance' to be gained
from binary modes was avoiding the need to parse commands; if it's only
the savings in converting column contents from text to specific types,
it's much less important.

No, binary mode is about the column contents.  Prepared queries are about
avoiding having to parse commands (which is why the command and the data
elements are handled separately).

thanks for the clarification.
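
to make the distinction concrete, here's a rough libpq sketch (the logs table, its columns, and the 'ins_log' statement name are all made up for illustration): PQprepare() pays the parse/plan cost once per connection, and the separate paramFormats argument to PQexecPrepared() is where text-vs-binary transfer of column contents gets chosen.

    #include <libpq-fe.h>

    /* prepare once per connection: this is where the parse/plan cost is
       paid (hypothetical statement name and table) */
    static int setup_stmt(PGconn *conn)
    {
        PGresult *res = PQprepare(conn, "ins_log",
            "INSERT INTO logs (ts, server, msg) VALUES ($1, $2, $3)",
            3, NULL);
        int ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
        PQclear(res);
        return ok;
    }

    /* per message: no SQL text is sent, so nothing to escape or re-parse */
    static int insert_log(PGconn *conn, const char *ts,
                          const char *server, const char *msg)
    {
        const char *values[3] = { ts, server, msg };
        /* NULL lengths/formats = all parameters sent as text; passing a
           paramFormats array of 1s instead would send columns in binary,
           a choice independent of the statement being prepared */
        PGresult *res = PQexecPrepared(conn, "ins_log", 3,
                                       values, NULL, NULL, 0);
        int ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
        PQclear(res);
        return ok;
    }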

I'd avoid having the user provide actual SQL, because that becomes
difficult to deal with unless you embed an SQL parser in rsyslog, and
I don't really see the value in that.

there's no need for rsyslog to parse the SQL, just to be able to escape
it appropriately and then pass it to the database for execution

If the user is providing SQL, then you need to be able to parse that SQL
if you're going to do prepared queries.  It might not require you to be
able to fully parse SQL the way the back-end does, but anything you do
that's not a full SQL parser is going to end up having limitations that
won't be easy to find or document.

the current situation is that rsyslog never parses the SQL (other than as text for a template, just as if you were going to write the log message to disk)

if we stick with the string based API we never need to

the user gives us one string 'prepare...' that we send to the database. the user then gives us another string 'execute...' that we send to the database. at no point do we ever need to parse the SQL, or even really know that it is SQL (the one exception is an escaping routine that replaces ' with '' in the strings coming from the outside world). it's just strings assembled using the same string assembly logic that is used for writing files to disk, crafting the payload of network packets to other servers, etc.
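
in libpq terms the whole interaction is just two PQexec() calls plus that one escape routine. a rough sketch, with made-up statement and table names standing in for user-supplied template output:

    #include <string.h>
    #include <libpq-fe.h>

    /* the one transformation applied to outside data: double any single
       quote. dst must have room for 2*strlen(src)+1 bytes. (libpq's
       PQescapeStringConn() does a more thorough job of the same thing.) */
    static void escape_quotes(char *dst, const char *src)
    {
        while (*src) {
            if (*src == '\'')
                *dst++ = '\'';
            *dst++ = *src++;
        }
        *dst = '\0';
    }

    /* both strings below stand in for user-supplied template output and
       are passed through verbatim -- rsyslog never interprets them as SQL */
    static void string_api_demo(PGconn *conn)
    {
        PQclear(PQexec(conn,
            "PREPARE ins_log (timestamptz, text, text) AS "
            "INSERT INTO logs VALUES ($1, $2, $3)"));
        PQclear(PQexec(conn,
            "EXECUTE ins_log ('2009-04-21 03:15:08', 'host1', 'escaped text here')"));
    }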

I do agree that there is a reduction in security risk. but since rsyslog is rather draconian about forcing the escaping, I'm not sure this is enough to tip the scales.


one huge advantage of putting the sql into the configuration is the
ability to work around other users of the database.

See, I just don't see that.

moving a bit away from the traditional syslog use case for a moment: with the ability to accept messages from many different types of sources (some unreliable like UDP syslog, others very reliable with full application-level acknowledgements), the ability to filter messages to different destinations, and the ability to configure it to craft arbitrary SQL statements, rsyslog can be useful as an 'impedance match' between different applications. you can coerce just about any app to write some sort of message to a file/pipe, and rsyslog can take that and get it into a database elsewhere. yes, you or I could write a quick program (in perl/python/etc) that would reformat the message and submit it, but extending that to handle outages, high-volume bursts of traffic, etc starts to be hard.

this is very much _not_ a common use case, but it's a useful side-use of rsyslog today.

I guess you could give examples of the SQL in the documentation for how
to create the prepared statement etc in the databases, but how is that
much better than having it in the config file?

for many users it's easier to do middling-fancy stuff in the SQL than
loading things into the database (can you pre-load prepared statements in
the database? or are they a per-connection thing?)

Prepared statements, at least under PG, are a per-connection thing.
Triggers aren't the same, those are attached to tables and get called
whenever a particular action is done on those tables (defined in the
trigger definition).  The trigger is then called with the row which is
being inserted, etc, and can do whatever it wants with that row (put it
in a different table, ignore it, etc).
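
a hedged sketch of what that trigger approach looks like, run once as DDL (every name here is invented); returning NULL from a BEFORE trigger is what lets it swallow the original insert after re-routing the row:

    #include <libpq-fe.h>

    /* create a trigger function and attach it to a hypothetical logs
       table; from then on the server re-routes each inserted row itself */
    static void create_routing_trigger(PGconn *conn)
    {
        PQclear(PQexec(conn,
            "CREATE FUNCTION route_log() RETURNS trigger AS $$ "
            "BEGIN "
            "  INSERT INTO logs_archive (ts, server, msg) "
            "    VALUES (NEW.ts, NEW.server, NEW.msg); "
            "  RETURN NULL; "
            "END; $$ LANGUAGE plpgsql"));
        PQclear(PQexec(conn,
            "CREATE TRIGGER route_log_trg BEFORE INSERT ON logs "
            "FOR EACH ROW EXECUTE PROCEDURE route_log()"));
    }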

that sounds like a lot of work at the database level to avoid some complexity on the app side (and it seems that the need to fire a trigger probably costs more than the prepared statement ever hoped to gain).

so back to the main questions of the advantages

prepared statements avoid needing to escape things, but at the
complication of a more complex API.

there's still the question of the performance difference. I have been
thinking that the overhead of doing the work itself would overwhelm the
performance benefits of prepared statements.

What work is it that you're referring to here?

doing the inserts themselves (putting the data in the tables, updating indexes, issuing an fsync)

Based on what you've
said about your application so far, I would expect the run-time
cost to prepare the statement (which you do only once) to be a bit of a
cost, but not much, and the actual inserts to be almost free
from the application side, and much easier for the database to
parse/use.

the inserts are far from free ;-)

but I agree that with prepared statements, the overhead of the insert is small. I'm trying to get a guess as to how small.

as I understand it, the primary performance benefit is the ability to
avoid the parsing and planning stages of the command. for simple commands
(which I assume inserts to be, even if inserting a lot of stuff), the
planning would seem to be cheap compared to the work of doing the inserts

The planning isn't very expensive, no, but it's also not free.  The
parsing is more expensive.

ok, that makes the multi-value insert process (or copy) sound more attractive (less parsing)
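
as a sketch (same hypothetical table and columns as above), the two lower-parse-overhead paths look like:

    #include <string.h>
    #include <libpq-fe.h>

    static void bulk_demo(PGconn *conn)
    {
        /* (a) multi-value insert: one parse/plan covers many rows */
        PQclear(PQexec(conn,
            "INSERT INTO logs (ts, server, msg) VALUES "
            "('2009-04-21 03:15:08', 'host1', 'first message'), "
            "('2009-04-21 03:15:09', 'host2', 'second message')"));

        /* (b) COPY: no per-row SQL at all, just tab-separated data */
        const char *row = "2009-04-21 03:15:10\thost1\tthird message\n";
        PQclear(PQexec(conn, "COPY logs (ts, server, msg) FROM STDIN"));
        PQputCopyData(conn, row, (int)strlen(row));
        PQputCopyEnd(conn, NULL);
        PQclear(PQgetResult(conn));   /* collect the COPY's final status */
    }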

on a fully tuned database are we talking about a 10% performance difference? 1%? 0.01%?

any ideas?

It depends a great deal on your application.  Do you have some example
data that we could use to test with?  Some default templates that you
think most people end up using which we could create dummy data for?

take the contents of /var/log/messages on your system, split it into timestamp, server, log and you have at least a reasonable case to work with.
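
for example, a quick-and-dirty splitter for the traditional 'MMM dd hh:mm:ss host message' layout (the fixed-width timestamp assumption and buffer sizes are mine):

    #include <stdio.h>
    #include <string.h>

    /* split one classic syslog line into the three columns. caller
       supplies ts[16], server[64], msg[512]; returns 1 on success */
    static int split_syslog(const char *line, char *ts,
                            char *server, char *msg)
    {
        if (strlen(line) < 17)
            return 0;
        memcpy(ts, line, 15);   /* the timestamp is a fixed 15 characters */
        ts[15] = '\0';
        return sscanf(line + 16, "%63s %511[^\n]", server, msg) == 2;
    }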

Of course, in the end, batching your inserts into fewer transactions
than 1-per-insert should give you a huge benefit right off the bat.

agreed, that was my expectation, but then people started saying how important it was to use prepared statements.
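
for concreteness, the two combined look something like this (reusing the hypothetical 'ins_log' prepared statement from earlier): one transaction per batch of n messages, so one commit/fsync covers them all.

    #include <libpq-fe.h>

    static int insert_batch(PGconn *conn, const char *const ts[],
                            const char *const server[],
                            const char *const msg[], int n)
    {
        PQclear(PQexec(conn, "BEGIN"));
        for (int i = 0; i < n; i++) {
            const char *values[3] = { ts[i], server[i], msg[i] };
            PGresult *res = PQexecPrepared(conn, "ins_log", 3,
                                           values, NULL, NULL, 0);
            if (PQresultStatus(res) != PGRES_COMMAND_OK) {
                PQclear(res);
                PQclear(PQexec(conn, "ROLLBACK"));
                return 0;
            }
            PQclear(res);
        }
        PQclear(PQexec(conn, "COMMIT"));
        return 1;
    }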

What we're talking about is really the steps after that, which might not
ever be necessary for your particular application.  On the other hand,
I'd never let production code go out that isn't using prepared queries
wherever possible.

I do understand the sentiment, really I do. I'm just trying to understand the why.

in this case, moving from being able to insert one record per transaction to inserting several hundred per transaction will _probably_ swamp everything else. but I really hate the word 'probably'; I've been surprised too many times.

one of the more recent surprises was on my multi-TB (4-core, 32G ram, 16 spindles of data) log query server: I discovered that there was no noticeable difference between raid 10 and raid 6 for the bulk of my query workload (an example is finding 14 log entries out of 12B). it turns out that if you are read-only with a seek-heavy workload you can keep all spindles busy with either approach. so in spite of the fact that 'everybody knows' that raid 10 is _far_ better than raid 6, especially for databases, I discovered that that really only applies to writes.


so I really do want to understand the 'why' if possible.

you have helped a lot.

David Lang
