Re: High inserting by syslog

Steve Crawford <scrawford@xxxxxxxxxxxxxxxxxxxx> · Thu, 03 Jul 2008 09:32:53 -0700

Valter Douglas Lisbôa Jr. wrote:
Hello all, I have a Perl script that loads an entire day's Squid log into a 
PostgreSQL table. I run it at midnight from a cron job and turn off the indexes 
before the load (turning them back on afterwards). The script works fine, but I 
want to switch to a different approach.

I'd like to insert the log lines on the fly, as they are generated, so that 
the data is available online. But the table has some indexes and the load is 
about 300,000 lines/day, so the average insert rate is about 3.47/sec. I think 
this could overload the database server (I have not tested it yet), so I am 
thinking of creating an unindexed table to receive the online inserts, with a 
job that moves all the lines to the main indexed table at midnight.

My question is: is there a better solution, or is this tactic a good way to 
do this?

The average matters less than the peak. Unless your traffic is spread evenly 
around the clock, your busy-hour rate will be higher than that average. If 
your log is concentrated in an 8-hour workday, 300,000 lines in 28,800 
seconds puts your average daytime rate closer to 10/second, with 
peaks that are much higher. You might consider some form of buffering 
between the Squid log and the database to avoid blocking. Your current 
method has the advantage of moving the database workload to off-hours.
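
For what it's worth, here is a minimal sketch of the kind of buffering I mean, written in Python rather than Perl. Everything in it is hypothetical (the class name BufferedWriter, the batch_size and max_wait parameters, the flush callback); the idea is only that the log reader hands lines to an in-memory queue and a separate thread writes them to the database in batches.

```python
import queue
import threading

_STOP = object()  # sentinel: tells the writer thread to flush and exit
_TICK = object()  # marker: no line arrived within max_wait seconds


class BufferedWriter:
    """Hypothetical helper: queue log lines, write them in batches."""

    def __init__(self, flush, batch_size=500, max_wait=1.0):
        self._flush = flush            # callable that receives a list of lines
        self._batch_size = batch_size  # flush when this many lines are waiting
        self._max_wait = max_wait      # ...or after this many idle seconds
        self._queue = queue.Queue()    # unbounded, so add() never blocks
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def add(self, line):
        """Called by the log reader; returns immediately."""
        self._queue.put(line)

    def close(self):
        """Flush whatever is left and stop the writer thread."""
        self._queue.put(_STOP)
        self._thread.join()

    def _run(self):
        batch = []
        while True:
            try:
                item = self._queue.get(timeout=self._max_wait)
            except queue.Empty:
                item = _TICK
            if item is not _STOP and item is not _TICK:
                batch.append(item)
            full = len(batch) >= self._batch_size
            if batch and (full or item is _TICK or item is _STOP):
                self._flush(batch)
                batch = []
            if item is _STOP:
                return
```

In real use the flush callback would run one multi-row insert (or a COPY) per batch, so a slow moment in the database delays the writer thread and not whatever is reading the Squid log. The queue here is unbounded, so a long database outage would grow memory; a bounded queue or a spool file would be the next refinement.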

Instead of moving data, you might look into partitioning your data. How 
long do you keep your logs actively available in PostgreSQL? I know one 
company that partitions their log data by month (a parent table with a 
child table for each month). They keep 12 months of data live, so they 
rotate through the child tables: at the start of a month, that month's 
table is truncated. Modify as appropriate for your load: perhaps a 
partition (child table) for each day, or a current-day child table that 
is migrated into a main table nightly. Either way you can make it appear 
that the parent table is an up-to-date, complete table.

You will need to do some reading on table partitioning if you go this 
route. Pay special attention to the requirements needed to optimize queries.
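
To make the parent/child idea concrete, here is a small Python sketch that only generates the DDL for one monthly child table using table inheritance. The table name squidlog and the function names are assumptions for illustration; squidlog_timestamp is used as the timestamp column, and your real schema will differ.

```python
from datetime import date


def month_bounds(year, month):
    """Return (first day of the month, first day of the next month)."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end


def child_table_ddl(year, month, parent="squidlog",
                    column="squidlog_timestamp"):
    """Build CREATE TABLE for a monthly child of a hypothetical parent table.

    The CHECK constraint describes which rows belong in this child.
    """
    start, end = month_bounds(year, month)
    child = f"{parent}_{year:04d}_{month:02d}"
    return (
        f"CREATE TABLE {child} (\n"
        f"    CHECK ({column} >= '{start}' AND {column} < '{end}')\n"
        f") INHERITS ({parent});"
    )


if __name__ == "__main__":
    print(child_table_ddl(2008, 7))
```

The CHECK constraints are the part that matters for query optimization: with constraint_exclusion enabled, the planner can skip child tables whose constraint rules out the rows a query asks for. Note that this sketch does nothing about routing inserts; rows still have to be inserted into the correct child, either directly by the loader or by a rule or trigger on the parent.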

You might also want to check your stats tables to make sure the indexes 
you currently maintain are actually used by your queries and remove any 
that are unnecessary to reduce index-maintenance overhead.
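
As a rough example of checking the stats, the sketch below reads pg_stat_user_indexes and lists indexes that have never been scanned. It assumes a DB-API cursor from a driver that uses %s placeholders (psycopg2 does); the function name and the table name in the usage note are hypothetical.

```python
# idx_scan counts index scans since the statistics were last reset.
INDEX_USAGE_SQL = """
SELECT relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = %s
ORDER BY idx_scan
"""


def unused_indexes(cursor, table):
    """Return the names of indexes on `table` with zero recorded scans."""
    cursor.execute(INDEX_USAGE_SQL, (table,))
    return [index for _rel, index, scans in cursor.fetchall() if scans == 0]
```

Calling unused_indexes(cur, "squidlog") after the server has seen a representative period of queries gives a list of candidates to drop. A zero only means "not used since the counters were reset", so check that the stats cover your monthly or occasional reports before removing anything.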

Another possible technique would be to have a nightly process that 
creates partial-indexes. One set of indexes would cover all data prior 
to midnight and the other set all data after midnight. Depending on the 
nature of your "real-time" vs. historical queries, these might even be 
different indexes. You will have to tweak your queries to make use of 
your indexes but your live data won't have to update your "historical" 
indexes. Warning: the date-constraint in the partial index must be 
static - you can't do something like "...where squidlog_timestamp > 
current_date...".  Your nightly process will be creating new indexes 
with a new date-constraint. You might even be able to get away with 
having no indexes on the current-day's data and just recreate historical 
indexes nightly (similar to your no-index with nightly-insert).
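
A sketch of what the nightly job might generate, again in Python and again with made-up names (the squidlog table, the index naming scheme, the function itself). The point is that the cutoff date is written into the index definition as a literal each night instead of being computed by the database.

```python
from datetime import date, timedelta


def nightly_index_statements(today, table="squidlog",
                             column="squidlog_timestamp"):
    """Build the statements for one nightly run.

    `today` is the date that has just started; the new historical index
    covers everything before it. The date goes into the predicate as a
    literal because a partial-index predicate cannot use current_date.
    """
    yesterday = today - timedelta(days=1)
    new_name = f"{table}_hist_{today:%Y%m%d}"
    old_name = f"{table}_hist_{yesterday:%Y%m%d}"
    return [
        # Create the new index first so historical queries stay covered.
        f"CREATE INDEX {new_name} ON {table} ({column}) "
        f"WHERE {column} < '{today}';",
        f"DROP INDEX IF EXISTS {old_name};",
    ]


if __name__ == "__main__":
    for statement in nightly_index_statements(date.today()):
        print(statement)
```

Historical queries then need a condition the planner can match against the index predicate, for example the same "squidlog_timestamp < '2008-07-04'" literal, which is the query tweaking mentioned above.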

But don't try any of the above until you have determined that you actually 
have a problem. On modest 3-year-old non-dedicated (also running file-storage, 
rsync backup, mail...) hardware with basic SATA RAID1 we are handling a 
similar load without strain.

Cheers,
Steve