On Tue, Jun 05, 2007 at 01:46:42PM +0800, Erick Papadakis wrote: > > GAME_COUNTS Table (also ~5 million rows of course) > --------------------------------------------------- > GAME_ID > VIEWS_COUNT > PLAYED_COUNT > PLAYED_COUNT_UNIQUE This is a poor normalisation. While views_count is necessarily a superset of played_count, the two values are not functions of one another really (for instance, there will be games that have been viewed but never played); therefore they should be in separate tables, I'd say. I'd also like to point out that a 5 million row table is actually not very big. In order to make this fast, I'd probably implement a dirtyish hack along the following lines. create table game_views_summ ( game_id int references games(id), views bigint, counted_at timestamp with time zone); create table game_views ( game_id int references games(id), viewed_at timestamp with time zone); Then, you have a daemon that summarizes data in game_views regularly into game_views_summ and deletes the data in game_views that's just been updated. If you do this more or less all the time, you can keep the vacuums up to date and keep the bloat to a minimum. It's an increase in overall I/O, but it ought to be smoother than just trying to cope with it in big lumps. (A similar strategy will work for the play counts.) One thing to be careful about is that you don't end up with game_views with very few rows, but a huge number of dead rows. This two-table approach can rapidly become a pessimal implementation in the event you are too efficient at eliminating the rows-to-be-summarized, but have a lot of dead rows that are unrecoverable because of running transactions. You'll end up with a seqscan on a table that contains four live rows, except that it's 400M. That pattern is a killer. > 1.5. After we update the GAME_COUNTS table, we also truncate the > GAME_PLAYED_LOG table because its records have no significance > anymore. This hourly deletion leads to fragmentation etc in the table. Are you truncating, or deleting? TRUNCATE leaves no dead rows, fragmentation, &c. That's its point. > (remember, we have 5,000,000 games). So every hour we end up running > thousands of small UPDATE queries like: > > update GAME_COUNTS set VIEWS_COUNT = VIEWS_COUNT + 3, PLAYED_COUNT + 1... Why is this bad? (You don't actually need thousands of these, I think, because you ought to be able to design one query to do it all. But I'd be unhappy with the locking, I think, given what you're trying to do.) > 3. Finally, the JOIN queries between the GAME table and GAME_COUNTS > table are not very fast. They take about 1 second each, even if I do a > LIMIT 20 in every query. This sounds like something's wrong in your query or your plan. EXPLAIN ANALYSE is your friend here. I'd suspect vacuum issues. Oh, one other thing. I noted you're storing the player's IP address. You do know that maps very poorly to actual individuals on the other end, right? A -- Andrew Sullivan | ajs@xxxxxxxxxxxxxxx Unfortunately reformatting the Internet is a little more painful than reformatting your hard drive when it gets out of whack. --Scott Morris