David Garamond <lists@zara.6.isreserved.com> writes:
> Greg Stark wrote:
>
> >> Actually, each record will be incremented probably only thousands of times
> >> a day. But there are many banners. Each record has (bannerid, campaignid,
> >> websiteid, date, countrycode) "dimensions" and (impression, click)
> >> "measures".
>
> > In the past when I had a very similar situation we kept the raw impression
> > and click event data. Ie, one record per impression in the impression table
> > and one record per click in the click table.
> >
> > That makes the tables insert-only, which is efficient and not prone to
> > locking contention. They would never have to be vacuumed except after
> > purging old data.
>
> Assuming there are 10 million impressions per day, the impression table will
> grow at least 200-400MB per day, is that correct? What do you do and how
> often do you purge old data? Do you do a mass DELETE on the impression table
> itself or do you switch to another table? I've found that deleting
> tens/hundreds of thousands of rows, at least in InnoDB, takes a long, long
> time (plus it sucks CPU and slows other queries).

Well, this was actually under Oracle, but I can extrapolate to Postgres given
my experience.

The ideal tool for the job is a feature that Postgres has discussed but hasn't
implemented yet: "partitioned tables". Under Oracle with partitioned tables we
were able to drop entire partitions virtually instantaneously. It also made
copying the data out to near-line backups much more efficient than index
scanning.

Before we implemented partitioned tables we used both techniques you described.
At first we had an ad-hoc procedure of creating a new table and swapping it in,
but that involved a short downtime and was a manual process. Eventually we set
up an automated batch job which used deletes.

Deletes under Postgres should be fairly efficient. The I/O use is unavoidable,
so doing it during off-peak hours is still a good idea, but it shouldn't
otherwise interfere with other queries. There should be no locking contention,
no additional work for other queries (like checking rollback segments or logs),
nor any of the other problems other databases suffer from with large updates.

I find the 10 million impressions per day pretty scary, though. That's over
100/s across the entire 24-hour period, and probably twice that at peak hours.
That would have to be one pretty beefy server just to handle the transaction
processing itself. (And updates under Postgres are essentially inserts where
vacuum cleans up the old tuple later, so they would be no less taxing.) A
server capable of handling that ought to be able to make quick work of
deleting a few hundred megabytes of records.

Another option is simply logging this data to a text file, or multiple text
files, one per server. Then you can load the text files with batch loads
offline. This avoids slowing down the servers handling the transactions in the
critical path, but it's yet more complex, with more points of failure.

Something else you might be interested in is a tool like this:

  http://www.danga.com/memcached/

I could see it being useful for caching the counts you were looking to keep,
so that the ad server doesn't need to consult the database to calculate which
ad to show. A separate job could periodically sync the counts to the database
or from the database.
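
To make the insert-only event tables concrete, here's a rough sketch of what I
have in mind. The column names and types are just guesses based on the
dimensions you listed, not your actual schema:

    CREATE TABLE impressions (
        bannerid    integer    NOT NULL,
        campaignid  integer    NOT NULL,
        websiteid   integer    NOT NULL,
        countrycode char(2)    NOT NULL,
        event_time  timestamp  NOT NULL DEFAULT now()
    );

    CREATE TABLE clicks (
        -- same dimensions, one row per click event
        bannerid    integer    NOT NULL,
        campaignid  integer    NOT NULL,
        websiteid   integer    NOT NULL,
        countrycode char(2)    NOT NULL,
        event_time  timestamp  NOT NULL DEFAULT now()
    );

    -- Serving a hit is then a single append-only insert, nothing to update
    -- and nothing to contend on:
    INSERT INTO impressions (bannerid, campaignid, websiteid, countrycode)
        VALUES (42, 7, 13, 'US');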
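
As for partitioned tables, until Postgres grows them natively you can
approximate the effect with table inheritance: an empty parent with one child
table per month. This is only a sketch (it leaves out the rule or trigger you'd
want to route inserts, and the names are made up):

    -- SELECTs against "impressions" automatically see all the children.
    CREATE TABLE impressions_2004_06 () INHERITS (impressions);
    CREATE TABLE impressions_2004_07 () INHERITS (impressions);

    -- The application (or a rule/trigger) inserts into the current month:
    INSERT INTO impressions_2004_07 (bannerid, campaignid, websiteid, countrycode)
        VALUES (42, 7, 13, 'US');

    -- Purging an old month is then just a drop, no row-by-row delete:
    DROP TABLE impressions_2004_06;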
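
The automated delete job I mentioned would amount to something like the
following, run from cron during off-peak hours (the 90-day retention window is
purely an example):

    DELETE FROM impressions WHERE event_time < now() - interval '90 days';
    DELETE FROM clicks      WHERE event_time < now() - interval '90 days';

    -- Then let vacuum reclaim the dead tuples:
    VACUUM ANALYZE impressions;
    VACUUM ANALYZE clicks;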
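
If you go the text-file route, the offline batch load is just a COPY. The file
path here is obviously made up, and note that COPY FROM a file reads from the
database server's filesystem and needs superuser; from another machine you'd
use psql's \copy instead:

    -- One tab-separated line per impression, accumulated by the ad servers:
    COPY impressions (bannerid, campaignid, websiteid, countrycode, event_time)
        FROM '/var/log/adserver/impressions.2004-06-15.log';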
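
Finally, for the periodic sync between the raw event tables and the aggregated
counts (whether those live in memcached or in a summary table), the database
half could be as simple as a nightly rollup like the one below. "banner_stats"
and its columns are assumed names matching the record layout you described,
and the clicks rollup would look the same:

    INSERT INTO banner_stats
           (bannerid, campaignid, websiteid, date, countrycode, impressions)
    SELECT bannerid, campaignid, websiteid, event_time::date, countrycode, count(*)
      FROM impressions
     WHERE event_time >= current_date - 1
       AND event_time <  current_date
     GROUP BY bannerid, campaignid, websiteid, event_time::date, countrycode;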
-- 
greg