Re: performance for high-volume log insertion

On Tue, 21 Apr 2009, Stephen Frost wrote:

David,

* david@xxxxxxx (david@xxxxxxx) wrote:
the database structure is not being defined by (or specifically for)
rsyslog, so at compile time we have _no_ idea how many variables of what
type there are going to be. my example of ($timestamp,$msg) was intended
to just be a sample (avoiding typing out some elaborate set of
parameters)

That's fine.  I don't see any reason that the API I suggested be a
compile-time option.  Certainly, libpq has no idea how many, or what
kind, of types are being passed to it by the caller, that's why it *has*
the API that it does.  You just need to work out what each prepared
query's parameters are and construct the necessary arrays.

I misunderstood what you were saying (more below)

rsyslog provides the following items, which can be sliced and diced with
substitutions, substrings, and additional inserted text.

[...]

Looks like mainly text fields still, so you might want to just stick
with text for now.

yes, almost exclusively text fields, and the fields that could be numbers (or dates) are in text format when we get them anyway.

rsyslog message formatting provides tools for doing the necessary
escaping (and is using it for the single insert messages today)

prepared statements in text mode have similar problems (although they
_are_ better in defending against sql injection attacks, so a bit safer).

Uhh, if you use prepared statements with PQexecPrepared, there *is no*
escaping necessary.  I'm not sure what you mean by 'similar problems'.
Can you elaborate on that?  If you mean doing 'prepared queries' by
creating a string and then using PQexec with
'EXECUTE blah (1,2,3);' then you really need to go read the
documentation I suggested.  That's *not* what I'm getting at when I say
'prepared queries', I'm talking about a protocol-level well-defined
format for passing arguments independently of commands.  A call to
PQexecPrepared looks like this:

PQprepare(conn, "myquery", "INSERT INTO TAB1 VALUES ($1, $2);", 0, NULL);

const char *values[2];   /* parameter values, passed out-of-band from the command */
values[0] = "a";
values[1] = "b";
PQexecPrepared(conn, "myquery", 2, values, NULL, NULL, 0);

Note that we don't ever send an 'EXECUTE myquery (1,2,3);' type of thing
to libpq.  libpq will handle the execute and the parameters and whatnot
as part of the PG 3.0 protocol.

when you said to stick with text mode, I thought you meant that we would create a string with EXECUTE.... in it and send that. it would have similar escaping issues (although with fewer vulnerabilities if they mess up)


so the binary mode only makes a difference on things like timestamps and
numbers? (i.e. no significant added efficiency in processing the command
itself?)

I'm slightly confused by what you mean by this?  Binary mode is for
parameters only, commands are never 'binary'.  Binary mode just means
that the application gives the value to the database in the format which
it expects, and so the database doesn't have to translate a textual
representation of a value into the binary format the database needs.

I thought that part of the 'efficiency' and 'performance' to be gained from binary mode was avoiding the need to parse commands; if it's only the savings in converting column contents from text to specific types, it's much less important.
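
for reference, my understanding of the binary parameter path is something like the following (rough sketch, made-up statement and column names, not tested):

#include <arpa/inet.h>   /* htonl */
#include <stdint.h>
#include <libpq-fe.h>

/* assumes a statement already prepared under the name "ins_sev" as
 *   "INSERT INTO log (severity, msg) VALUES ($1, $2)"             */
PGresult *insert_binary(PGconn *conn, int severity, const char *msg)
{
    uint32_t sev_net = htonl((uint32_t) severity);    /* int4 must be in network byte order */

    const char *values[2]  = { (const char *) &sev_net, msg };
    int         lengths[2] = { sizeof(sev_net), 0 };  /* length is ignored for text params */
    int         formats[2] = { 1, 0 };                /* 1 = binary, 0 = text */

    /* the command is still the prepared statement; only the parameter
     * values skip the text-to-internal conversion on the server side */
    return PQexecPrepared(conn, "ins_sev", 2, values, lengths, formats, 0);
}

so binary format only changes how the parameter values travel, not how the command itself is handled.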

I think the huge complication is that when RedHat compiles rsyslog to
ship it in the distro, they have no idea how it is going to be used (if
it will go to a database, what database engine it will interface with, or
what the schema of that database would look like). Only the
sysadmin(s)/dba(s) know that and they need to be able to tell rsyslog
what to do to get the data where they want it to be, and in the format
they want it to be in.

That's really all fine.  You just need to get from the user, at runtime,
what they want their commands to look like.  Once you have that, it
should be entirely possible to dynamically construct the prepared
queries, most likely without the user having to know anything about COPY
or prepared statements or anything.  For my part, I'd want something
like:

table = "mytable";
data = "$Y, $m, $d, $H, $msg";

if the user creates the data this way, you just reintroduced the escaping problem. they would have to do something like

data = "$Y"
data = "$m"
data = "$d"
data = "$H"
data = "$msg"

one key thing is that it's very probable that the user will want to manipulate the string, not just send a single variable as-is
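
to be concrete, I could see building the statement itself at connect time from however many fields the config ends up defining, something like this (rough, untested sketch with made-up names):

#include <stdio.h>
#include <libpq-fe.h>

/* build "INSERT INTO <table> VALUES ($1,...,$n)" from the runtime field
 * count and prepare it once per connection.  the table name is the one
 * piece still spliced into the SQL text, so it has to come from trusted
 * config. */
static int prepare_insert(PGconn *conn, const char *table, int nfields)
{
    char sql[2048];
    int  len = snprintf(sql, sizeof(sql), "INSERT INTO %s VALUES (", table);

    for (int i = 1; i <= nfields && len < (int) sizeof(sql); i++)
        len += snprintf(sql + len, sizeof(sql) - len, "%s$%d", (i > 1) ? ", " : "", i);
    if (len < (int) sizeof(sql))
        snprintf(sql + len, sizeof(sql) - len, ")");

    PGresult *res = PQprepare(conn, "log_insert", sql, 0, NULL);
    int ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);
    return ok;
}

/* per message: values[i] holds the already-formatted text of field i+1  */
/* PQexecPrepared(conn, "log_insert", nfields, values, NULL, NULL, 0);   */

the per-message values then go to PQexecPrepared as plain strings, with no escaping anywhere.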

I'd avoid having the user provide actual SQL, because that becomes
difficult to deal with unless you embed an SQL parser in rsyslog, and
I don't really see the value in that.

there's no need for rsyslog to parse the SQL, just to be able to escape the substituted values appropriately and then pass the result to the database for execution
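
right, something along these lines, using libpq's own escaping helper (sketch only, buffer handling simplified):

#include <stdlib.h>
#include <string.h>
#include <libpq-fe.h>

/* escape one substituted value for inclusion inside single quotes in the
 * configured INSERT string.  caller free()s the result. */
static char *escape_field(PGconn *conn, const char *raw)
{
    size_t rawlen  = strlen(raw);
    char  *escaped = malloc(2 * rawlen + 1);   /* worst case per the libpq docs */
    int    err     = 0;

    if (escaped == NULL)
        return NULL;
    PQescapeStringConn(conn, escaped, raw, rawlen, &err);
    if (err) {
        free(escaped);
        return NULL;
    }
    return escaped;
}

(rsyslog's own template escaping does the equivalent job today; this is just the libpq flavor of it.)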


If the user wants to do
something fancy with the data in the database, I would encourage them
to put an 'ON INSERT' trigger on 'mytable' to do whatever they want with
the data that's coming in.  This gives you the freedom necessary to
build an appropriate statement for any database you're connecting to,
dynamically, using prepared queries, and even binary mode if you want.



one huge advantage of putting the SQL into the configuration is the ability to work around other users of the database.

for example, what if the database has additional columns that you don't want to touch (say an item# sequence)? if the SQL is in the config this is easy to work around; if it's separate (or created by the module), this is very hard to do.
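
for illustration (made-up table/column names): with the statement text coming from the config, the user can just name the columns rsyslog fills and let the item# sequence take its default, e.g.

PQprepare(conn, "log_insert",
          "INSERT INTO syslog (recv_time, message) VALUES ($1, $2)", 0, NULL);
/* the item# serial column isn't named, so it fills itself in */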

I guess you could give examples of the SQL in the documentation for how to create the prepared statement etc in the databases, but how is that much better than having it in the config file?

for many users it's easier to do middling-fancy stuff in the SQL than to load things into the database (can you pre-load prepared statements in the database? or are they a per-connection thing?)


so, back to the main questions about the advantages:

prepared statements avoid the need to escape things, but at the cost of a more complex API.

there's still the question of the performance difference. I have been thinking that the overhead of doing the work itself would overwhelm the performance benefits of prepared statements.

as I understand it, the primary performance benefit is the ability to avoid the parsing and planning stages of the command. for simple commands (which I assume inserts to be, even if inserting a lot of stuff), the planning would seem to be cheap compared to the work of actually doing the inserts.

on a fully tuned database, are we talking about a 10% performance difference? 1%? 0.01%?

any ideas?

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
