Re: Database design wisdom needed

Andrew Sullivan <ajs@xxxxxxxxxxxxxxx> · Tue, 5 Jun 2007 09:51:24 -0400

On Tue, Jun 05, 2007 at 01:46:42PM +0800, Erick Papadakis wrote:
> 
>    GAME_COUNTS Table (also ~5 million rows of course)
>    ---------------------------------------------------
>    GAME_ID
>    VIEWS_COUNT
>    PLAYED_COUNT
>    PLAYED_COUNT_UNIQUE

This is a poor normalisation.  While views_count is necessarily a
superset of played_count, the two values are not functions of one
another really (for instance, there will be games that have been
viewed but never played); therefore they should be in separate
tables, I'd say.  I'd also like to point out that a 5 million row
table is actually not very big.

In order to make this fast, I'd probably implement a dirtyish hack
along the following lines.

create table game_views_summ (
	game_id int references games(id),
	views bigint,
	counted_at timestamp with time zone);

create table game_views (
	game_id int references games(id),
	viewed_at timestamp with time zone);

Then, you have a daemon that summarizes data in game_views regularly
into game_views_summ and deletes the data in game_views that's
just been updated.  If you do this more or less all the time, you can
keep the vacuums up to date and keep the bloat to a minimum.  It's
an increase in overall I/O, but it ought to be smoother than just
trying to cope with it in big lumps.  (A similar strategy will work
for the play counts.)

One thing to be careful about is that you don't end up with
game_views with very few rows, but a huge number of dead rows.  This
two-table approach can rapidly become a pessimal implementation in
the event you are too efficient at eliminating the
rows-to-be-summarized, but have a lot of dead rows that are
unrecoverable because of running transactions.  You'll end up with a
seqscan on a table that contains four live rows, except that it's
400M.  That pattern is a killer.

> 1.5. After we update the GAME_COUNTS table, we also truncate the
> GAME_PLAYED_LOG table because its records have no significance
> anymore. This hourly deletion leads to fragmentation etc in the table.

Are you truncating, or deleting?  TRUNCATE leaves no dead rows,
fragmentation, &c.  That's its point. 

> (remember, we have 5,000,000 games). So every hour we end up running
> thousands of small UPDATE queries like:
> 
>   update GAME_COUNTS set VIEWS_COUNT = VIEWS_COUNT + 3, PLAYED_COUNT + 1...

Why is this bad?  (You don't actually need thousands of these, I
think, because you ought to be able to design one query to do it all. 
But I'd be unhappy with the locking, I think, given what you're
trying to do.)

> 3. Finally, the JOIN queries between the GAME table and GAME_COUNTS
> table are not very fast. They take about 1 second each, even if I do a
> LIMIT 20 in every query.

This sounds like something's wrong in your query or your plan. 
EXPLAIN ANALYSE is your friend here.  I'd suspect vacuum issues.

Oh, one other thing.  I noted you're storing the player's IP address. 
You do know that maps very poorly to actual individuals on the other
end, right?

A

-- 
Andrew Sullivan  | ajs@xxxxxxxxxxxxxxx
Unfortunately reformatting the Internet is a little more painful 
than reformatting your hard drive when it gets out of whack.
		--Scott Morris