On Tue, Apr 21, 2009 at 08:37:54AM -0700, david@xxxxxxx wrote:
> Kenneth,
>   could you join the discussion on the rsyslog mailing list?
> rsyslog-users <rsyslog@xxxxxxxxxxxxxxxxx>
>
> I'm surprised to hear you say that rsyslog can already do batch inserts
> and am interested in how you did that.
>
> what sort of insert rate did you manage to get?
>
> David Lang

David,

I would be happy to join the discussion. I did not mean to say that
rsyslog currently supported batch inserts, just that the pieces that
provide "stand-by queuing" could be used to manage batched inserts.

Cheers,
Ken

> On Tue, 21 Apr 2009, Kenneth Marshall wrote:
>
>> Date: Tue, 21 Apr 2009 08:33:30 -0500
>> From: Kenneth Marshall <ktm@xxxxxxxx>
>> To: Richard Huxton <dev@xxxxxxxxxxxx>
>> Cc: david@xxxxxxx, Stephen Frost <sfrost@xxxxxxxxxxx>,
>>     Greg Smith <gsmith@xxxxxxxxxxxxx>, pgsql-performance@xxxxxxxxxxxxxx
>> Subject: Re: performance for high-volume log insertion
>>
>> Hi,
>>
>> I just finished reading this thread. We are currently working on
>> setting up a central log system using rsyslog and PostgreSQL. It
>> works well once we patched the memory leak. We also looked at what
>> could be done to improve the efficiency of the DB interface. On the
>> rsyslog side, moving to prepared queries allows you to remove the
>> escaping that currently needs to be done before attempting to
>> insert the data into the SQL backend, as well as removing the parsing
>> and planning time from the insert. This is a big win for high insert
>> rates, which is what we are talking about. The escaping process is
>> also a big CPU user in rsyslog, which then hands the escaped string
>> to the backend, which in turn has to undo everything that had been done
>> and parse/plan the resulting query. This can use a surprising amount
>> of additional CPU. Even if you cannot support a general prepared
>> query interface, by specifying what the query should look like you
>> can handle much of the low-hanging fruit query-wise.
>>
>> We are currently using a date-based trigger to use a new partition
>> each day and currently keep 2 months of logs. This can be usefully
>> managed on the backend database, but if rsyslog supported changing
>> the insert to the new table on a time basis, the CPU used by the
>> trigger to support this on the backend could be reclaimed. This
>> would be a win for any DB backend. As you move to the new partition,
>> issuing a truncate to clear the table would simplify the DB interfaces.
>>
>> Another performance enhancement already mentioned would be to
>> allow certain extra fields in the DB to be automatically populated
>> as a function of the log messages. For example, logging the mail queue
>> id for messages from mail systems would make it much easier to locate
>> particular mail transactions in large amounts of data.
>>
>> To sum up, eliminating the escaping in rsyslog through the use of
>> prepared queries would reduce the CPU load on the DB backend. Batching
>> the inserts will also net you a big performance increase. Some DB-based
>> applications allow for the specification of several types of queries,
>> one for single inserts and then a second to support multiple inserts
>> (COPY). Rsyslog already supports the queuing pieces to allow you to
>> batch inserts. Just some ideas.
>>
>> Regards,
>> Ken
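To make the prepared-query and batch-insert ideas concrete, here is a minimal
SQL sketch. The table layout, statement name, and sample values are only
placeholders for illustration, not rsyslog's actual schema:

    -- Hypothetical core log table (placeholder names).
    CREATE TABLE syslog (
        log_time  timestamptz NOT NULL,
        host      text        NOT NULL,
        facility  text,
        priority  text,
        message   text
    );

    -- Prepared once per connection: the INSERT is parsed and planned a single
    -- time, and each record then ships only bind values -- no string escaping
    -- and no per-row parse/plan work on the backend.
    PREPARE log_insert (timestamptz, text, text, text, text) AS
        INSERT INTO syslog (log_time, host, facility, priority, message)
        VALUES ($1, $2, $3, $4, $5);

    EXECUTE log_insert (now(), 'mail01', 'mail', 'info',
                        'connect from relay.example.com');

    -- Batched path: drain the rsyslog queue into one COPY per flush instead
    -- of one INSERT per message; the client streams tab-separated rows and
    -- ends the stream with a backslash-period line.
    COPY syslog (log_time, host, facility, priority, message) FROM STDIN;

With the statement prepared once per connection, each message costs only the
bind values on the wire, and a COPY per queue flush replaces many single-row
INSERTs.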
>> On Tue, Apr 21, 2009 at 09:56:23AM +0100, Richard Huxton wrote:
>>> david@xxxxxxx wrote:
>>>> On Tue, 21 Apr 2009, Stephen Frost wrote:
>>>>> * david@xxxxxxx (david@xxxxxxx) wrote:
>>>>>> while I fully understand the 'benchmark your situation' need, this
>>>>>> isn't that simple.
>>>>>
>>>>> It really is. You know your application, you know its primary use
>>>>> cases, and probably have some data to play with. You're certainly in a
>>>>> much better situation to at least *try* and benchmark it than we are.
>>>>
>>>> rsyslog is a syslog server. it replaces (or for debian and fedora, has
>>>> replaced) your standard syslog daemon. it receives log messages from
>>>> every app on your system (and possibly others), filters, manipulates
>>>> them, and then stores them somewhere. among the places that it can
>>>> store the logs are database servers (native support for MySQL,
>>>> PostgreSQL, and Oracle, plus libdbi for others)
>>>
>>> Well, from a performance standpoint the obvious things to do are:
>>> 1. Keep a connection open, do NOT reconnect for each log-statement
>>> 2. Batch log statements together where possible
>>> 3. Use prepared statements
>>> 4. Partition the tables by day/week/month/year (configurable I suppose)
>>>
>>> The first two are vital, the third takes you a step further. The fourth
>>> is a long-term admin thing.
>>>
>>> And possibly
>>> 5. Have two connections, one for fatal/error etc and one for info/debug
>>> level log statements (configurable split?). Then you can use the
>>> synchronous_commit setting on the less important ones. Might buy you
>>> some performance on a busy system.
>>>
>>> http://www.postgresql.org/docs/8.3/interactive/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS
>>>
>>>> other apps then search and report on the data after it is stored. what
>>>> apps? I don't know either. pick your favorite reporting tool and you'll
>>>> be a step ahead of me (I don't know a really good reporting tool)
>>>>
>>>> as for sample data, you have syslog messages, just like I do. so you
>>>> have the same access to data that I have.
>>>>
>>>> how would you want to query them? how would people far less experienced
>>>> than you want to query them?
>>>>
>>>> I can speculate that some people would do two columns (time, everything
>>>> else), others will do three (time, server, everything else), and others
>>>> will go further (I know some who would like to extract IP addresses
>>>> embedded in a message into their own column). some people will index on
>>>> the time and host, others will want to do full-text searches of
>>>> everything.
>>>
>>> Well, assuming it looks much like traditional syslog, I would do
>>> something like: (timestamp, host, facility, priority, message). It's
>>> easy enough to stitch back together if people want that.
>>>
>>> PostgreSQL's full-text indexing is quite well suited to logfiles I'd
>>> have thought, since it knows about filenames, urls etc already.
>>>
>>> If you want to get fancy, add a msg_type column and one subsidiary table
>>> for each msg_type. So - you might have smtp_connect_from (hostname,
>>> ip_addr). A set of perl regexps can match and extract the fields for
>>> these extra tables, or you could do it with triggers inside the
>>> database. I think it makes sense to do it in the application. Easier
>>> for users to contribute new patterns/extractions. Meanwhile, the core
>>> table is untouched so you don't *need* to know about these extra tables.
>>>
>>> If you have subsidiary tables, you'll want to partition those too and
>>> perhaps stick them in their own schema (logs200901, logs200902 etc).
>>>
>>> --
>>>   Richard Huxton
>>>   Archonet Ltd
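A rough sketch of Richard's points 4 and 5 (and of the daily-partition
trigger Ken describes), using the inheritance-plus-trigger style of
partitioning from the 8.3 documentation. The table and function names
continue the placeholder syslog example above; they are assumptions, not
anything rsyslog or PostgreSQL ships:

    -- One child table per day, attached to the placeholder parent above.
    CREATE TABLE syslog_20090421 (
        CHECK (log_time >= DATE '2009-04-21' AND log_time < DATE '2009-04-22')
    ) INHERITS (syslog);

    -- Trigger that redirects inserts on the parent into the current day's
    -- child. A nightly job would create the next child table and recreate
    -- this function to point at it.
    CREATE OR REPLACE FUNCTION syslog_partition_insert() RETURNS trigger AS $$
    BEGIN
        IF NEW.log_time >= DATE '2009-04-21'
           AND NEW.log_time < DATE '2009-04-22' THEN
            INSERT INTO syslog_20090421 VALUES (NEW.*);
        ELSE
            RAISE EXCEPTION 'no partition for timestamp %', NEW.log_time;
        END IF;
        RETURN NULL;   -- the row has already been stored in the child table
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER syslog_partition
        BEFORE INSERT ON syslog
        FOR EACH ROW EXECUTE PROCEDURE syslog_partition_insert();

    -- Point 5: on the connection carrying info/debug traffic only, trade a
    -- little durability for throughput.
    SET synchronous_commit TO off;

If rsyslog switched the target table itself on a time basis, as Ken suggests,
the trigger and its per-row CPU cost could be dropped entirely; expiring old
logs then becomes a cheap DROP TABLE or TRUNCATE of the oldest child.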
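And a sketch of the msg_type / subsidiary-table idea together with full-text
search on the core message column. Again, every name here is illustrative
only, and the regexp extraction itself (done in the application, per
Richard's preference) is not shown:

    -- Subsidiary table for one msg_type, parked in a per-month schema so it
    -- can be partitioned and dropped alongside the core logs.
    CREATE SCHEMA logs200904;

    CREATE TABLE logs200904.smtp_connect_from (
        log_time  timestamptz NOT NULL,
        hostname  text,
        ip_addr   inet
    );

    -- Full-text index on the core table; the built-in parser already
    -- recognises filenames, URLs and host names in the message text. (With
    -- inheritance partitioning, each child table needs its own such index.)
    CREATE INDEX syslog_message_fts
        ON syslog USING gin (to_tsvector('english', message));

    -- Example: find every message that mentions a particular config file.
    SELECT log_time, host, message
      FROM syslog
     WHERE to_tsvector('english', message)
           @@ plainto_tsquery('english', 'postgresql.conf');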