On 06/05/07 00:46, Erick Papadakis wrote:
Hi,

Sorry for this somewhat long email, but I think it is relevant to most people who run online databases. I am having trouble optimizing UPDATE queries on a semi-large table that is only growing larger. I've come across some very interesting thoughts on this list, so I thought I'd post my conundrum here. I would truly appreciate any guidance or pointers.

I have done the messy DBA's job of breaking one table into three, which really should have been one. I did it to make sure the large main table doesn't keep getting updated with small crumbs of information.

This main table stores what users of our online service have created. These are small Flash games, and over the last seven years we have collected about 5 million of them. The primary key of this table is "GAME_ID". In our online account management interface I also need to show each producer's creations only to himself, so of course I need a "PRODUCER_ID". On top of that, our service has tracked three more pieces of information for each created game:

- Number of views of the game (online views)
- Number of plays of the game (online people playing)
- Number of unique plays of the game (based on IP or subscriber_id; some games do not require login to play)

Because each of these pieces of information was individual to each game, my original table looked like this:

GAME Table
-----------
GAME_ID
PRODUCER_ID
VIEWS_COUNT
PLAYED_COUNT
PLAYED_COUNT_UNIQUE

In the beginning, every time someone played a game online, we updated the PLAYED_COUNT in this table. When someone viewed a game, we updated the VIEWS_COUNT. It was simple, and it worked because it reflected the business logic.

Over time, however, I realized that while GAME_ID and PRODUCER_ID remained basically static, the next three columns had a very high rate of change. For example, VIEWS_COUNT would increment every second for popular games.
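The original per-event scheme can be sketched as follows. This is a minimal illustration only, using SQLite in Python as a stand-in for the Postgres table; the table and column names follow the post, while the helper functions (`record_view`, `record_play`) are hypothetical names for whatever the application layer does on each event:

```python
import sqlite3

# In-memory SQLite stands in for the Postgres table described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE GAME (
        GAME_ID             INTEGER PRIMARY KEY,
        PRODUCER_ID         INTEGER,
        VIEWS_COUNT         INTEGER DEFAULT 0,
        PLAYED_COUNT        INTEGER DEFAULT 0,
        PLAYED_COUNT_UNIQUE INTEGER DEFAULT 0
    )
""")
conn.execute("INSERT INTO GAME (GAME_ID, PRODUCER_ID) VALUES (1, 42)")

# The original scheme: one UPDATE on the main table per view/play event.
def record_view(game_id):
    conn.execute(
        "UPDATE GAME SET VIEWS_COUNT = VIEWS_COUNT + 1 WHERE GAME_ID = ?",
        (game_id,))

def record_play(game_id):
    conn.execute(
        "UPDATE GAME SET PLAYED_COUNT = PLAYED_COUNT + 1 WHERE GAME_ID = ?",
        (game_id,))

record_view(1)
record_view(1)
record_play(1)
views, plays = conn.execute(
    "SELECT VIEWS_COUNT, PLAYED_COUNT FROM GAME WHERE GAME_ID = 1").fetchone()
print(views, plays)  # 2 1
```

The point is that every single event becomes its own UPDATE against the main table, which is exactly the write pattern the rest of the post is trying to escape.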
PLAYED_COUNT would increment every time someone played, but the "unique" requirement meant I had to break this out into a log table, so that I could GROUP BY IP address and then update the columns in the GAME table accordingly. So I had:

GAME_PLAYED_LOG Table
----------------------
GAME_ID
PLAYER_IP_ADDR

From here, I would select COUNT(PLAYER_IP_ADDR) and COUNT(DISTINCT PLAYER_IP_ADDR), grouped by GAME_ID, and then update the main GAME table's columns with these values. This was done every hour, to make sure we didn't have a huge number of rows to manipulate.

Anyway, we are now becoming popular enough, and our tables large enough, to require optimization. Because the GAME table holds mostly static information, and also needs to be highly available for our online interface, I have just moved VIEWS_COUNT and PLAYED_COUNT out into another table:

GAME Table (~5 million rows)
----------------------------
GAME_ID
PRODUCER_ID

GAME_COUNTS Table (also ~5 million rows, of course)
---------------------------------------------------
GAME_ID
VIEWS_COUNT
PLAYED_COUNT
PLAYED_COUNT_UNIQUE

This way, the hourly job driven from GAME_PLAYED_LOG updates only GAME_COUNTS, not the main GAME table. This leaves the GAME table free to do its job. That is our main table, so keeping it static was our priority.
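The hourly rollup described above can be sketched like this, again with SQLite in Python standing in for Postgres (schema and aggregate queries follow the post; the sample IP values are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE GAME_COUNTS (
        GAME_ID             INTEGER PRIMARY KEY,
        VIEWS_COUNT         INTEGER DEFAULT 0,
        PLAYED_COUNT        INTEGER DEFAULT 0,
        PLAYED_COUNT_UNIQUE INTEGER DEFAULT 0
    );
    CREATE TABLE GAME_PLAYED_LOG (
        GAME_ID        INTEGER,
        PLAYER_IP_ADDR TEXT
    );
    INSERT INTO GAME_COUNTS (GAME_ID) VALUES (1), (2);
    INSERT INTO GAME_PLAYED_LOG VALUES
        (1, '10.0.0.1'), (1, '10.0.0.1'), (1, '10.0.0.2'),
        (2, '10.0.0.3');
""")

# Hourly rollup: total plays and distinct players per game, as in the post.
rows = conn.execute("""
    SELECT GAME_ID,
           COUNT(PLAYER_IP_ADDR)          AS played,
           COUNT(DISTINCT PLAYER_IP_ADDR) AS played_unique
    FROM GAME_PLAYED_LOG
    GROUP BY GAME_ID
""").fetchall()

# Apply the aggregates to GAME_COUNTS, then clear the log for the next hour.
conn.executemany("""
    UPDATE GAME_COUNTS
    SET PLAYED_COUNT        = PLAYED_COUNT + ?,
        PLAYED_COUNT_UNIQUE = PLAYED_COUNT_UNIQUE + ?
    WHERE GAME_ID = ?
""", [(played, uniq, gid) for gid, played, uniq in rows])
conn.execute("DELETE FROM GAME_PLAYED_LOG")

result = conn.execute(
    "SELECT PLAYED_COUNT, PLAYED_COUNT_UNIQUE FROM GAME_COUNTS WHERE GAME_ID = 1"
).fetchone()
print(result)  # (3, 2)
```

Note that because the log is cleared each hour, the "unique" column accumulates per-hour unique counts rather than all-time distinct players; that is inherent to the truncate-and-roll-up scheme the post describes.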
And good database theory, too, since the number of times that a game is played is not "tied" to the PRODUCER_ID.
<golf clap>
My problems:

1. Because we are a popular website, the GAME_PLAYED_LOG table grows at a massive rate: about 1,000,000 records every hour. COUNT queries on this table are already pretty resource-hogging, even run hourly, and especially the DISTINCT grouping needed for the unique played count.
Any index on GAME_PLAYED_LOG?
1.5. After we update the GAME_COUNTS table, we truncate the GAME_PLAYED_LOG table, because its records no longer have any significance. This hourly deletion leads to fragmentation etc. in the table.

2. The UPDATE queries on the GAME_COUNTS table are also a problem. The sheer number of UPDATE queries is massive. Of the 1,000,000 views every hour, many are just one or two views each of many, many games (remember, we have 5,000,000 games). So every hour we end up running thousands of small UPDATE queries like:

update GAME_COUNTS set VIEWS_COUNT = VIEWS_COUNT + 3, PLAYED_COUNT = PLAYED_COUNT + 1, ...

3. Finally, the JOIN queries between the GAME table and the GAME_COUNTS table are not very fast. They take about 1 second each, even if I put a LIMIT 20 on every query.

Now I suspect this scenario is *very* common in online databases that must report viewer statistics. How are you all doing it? What am I missing? Isn't decoupling the static information from the frequently updated information a good idea? Would using different storage engines for different kinds of tables help: engines that are better for INSERT and SELECT, and others that are good for UPDATE?

We ran MySQL until a year ago, but we switched to Pg after data corruption issues. I am sure I'm missing some tricks in the Pg world, and would truly appreciate any ideas.

If you are reading until here, well done! And so many thanks in advance for any insight you can shed on this matter.
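One common way to attack problem 2 is to collapse the thousands of per-game UPDATEs into a single set-based statement. The sketch below is only an assumption about how that could look, using SQLite in Python for the demo (in Postgres this is usually written as `UPDATE ... FROM (SELECT ... GROUP BY ...)`; a correlated subquery is used here purely to keep the demo portable):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE GAME_COUNTS (
        GAME_ID      INTEGER PRIMARY KEY,
        PLAYED_COUNT INTEGER DEFAULT 0
    );
    CREATE TABLE GAME_PLAYED_LOG (GAME_ID INTEGER, PLAYER_IP_ADDR TEXT);
    INSERT INTO GAME_COUNTS (GAME_ID) VALUES (1), (2), (3);
    INSERT INTO GAME_PLAYED_LOG VALUES (1, 'a'), (1, 'b'), (3, 'c');
""")

# One set-based statement instead of thousands of per-game UPDATEs:
# each game touched in the log gets its aggregate applied in a single pass.
conn.execute("""
    UPDATE GAME_COUNTS
    SET PLAYED_COUNT = PLAYED_COUNT +
        (SELECT COUNT(*) FROM GAME_PLAYED_LOG l
         WHERE l.GAME_ID = GAME_COUNTS.GAME_ID)
    WHERE GAME_ID IN (SELECT GAME_ID FROM GAME_PLAYED_LOG)
""")

totals = conn.execute(
    "SELECT GAME_ID, PLAYED_COUNT FROM GAME_COUNTS ORDER BY GAME_ID").fetchall()
print(totals)  # [(1, 2), (2, 0), (3, 1)]
```

The WHERE clause restricts the statement to games that actually appear in the log, so the other ~5 million rows are never touched.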
What index(es) is/are on GAME and GAME_COUNTS?

What version of PG are you running?

Are you pushing the box too hard?

--
Ron Johnson, Jr.
Jefferson LA USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!