Re: High update activity, PostgreSQL vs BigDBMS

Guy,

The application is fairly straightforward, but as you say, what is working okay with BigDBMS isn't working as well under PG. I'm going to try the configuration suggestions others have made before I attempt logic changes. The core logic is unchangeable; millions of rows of data in a single table will be updated throughout the day. If PG can't handle high-volume updates well, this may be a brick wall.

Here are a couple things I learned.

ANALYZE is VERY important, surprisingly so even for small tables.  I had a case last week where a temporary "scratch" table with just 100 rows was joined to two more tables of 6 and 12 million rows.  You might think that a 100-row table wouldn't need to be analyzed, but it does: without the ANALYZE, Postgres generated a horrible plan that took many minutes to run; with the ANALYZE, it took milliseconds.  Any time a table's contents change dramatically, ANALYZE it, ESPECIALLY if it's a small table.  After all, changing 20 rows in a 100-row table has a much larger effect on its statistics than changing 20 rows in a million-row table.
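
For instance, something like this (table and column names invented for illustration):

  create temporary table scratch (id integer);
  -- load the ~100 working rows here, then refresh the stats:
  analyze scratch;
  -- now the planner knows scratch is tiny and picks a sane plan
  select b1.payload
    from scratch
    join big_table_1 b1 using (id)
    join big_table_2 b2 using (id);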

Postgres aggregate functions like count() and max() are "plug-ins", which has huge architectural advantages.  But in pre-8.1 releases there was a big speed penalty for this: functions like count() and max() were very, very slow, requiring a full table scan.  I think this is vastly improved from 8.0.x to 8.1 and forward; others might be able to comment on whether count() is now as fast in Postgres as in Oracle.  The "idiom" to replace max() was "select col from tbl order by col desc limit 1".  It worked miracles for my app.
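
For example, assuming an index on col, these two queries return the same value, but on the older releases only the second form could use the index:

  -- pre-8.1: forces a full table scan
  select max(col) from tbl;

  -- the replacement idiom: walks the index backwards one step
  select col from tbl order by col desc limit 1;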

Postgres has explicit garbage collection via VACUUM, and you have to design your application with this in mind.  In Postgres, an update is a delete plus an insert, so every update creates garbage.  If you have very "wide" tables but only a subset of the columns are updated frequently, put those columns in a separate table joined to the main table by its key (see the sketch after this example).  For example, my original design was something like this:

  integer primary key
  very large text column
  ... a bunch of integer columns, float columns, and small text columns

The properties were updated frequently by the application, but the large text column never changed.  This led to huge garbage-collection problems, because the large text field was repeatedly deleted and reinserted by every update.  By separating these into two tables, one with the large text column and the other with the dynamic but smaller columns, garbage is massively reduced and performance improves, both immediately (a smaller data set to update) and over the long term (smaller vacuums).  You can use a view to recreate the original combined columns, so the changes to your app are limited to the places where updates occur.
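
A minimal sketch of that split (names invented for illustration):

  create table doc_static (
      id    integer primary key,
      body  text               -- large, effectively read-only
  );

  create table doc_dynamic (
      id     integer primary key references doc_static (id),
      hits   integer,
      score  float
  );

  -- the view recreates the original combined row layout
  create view doc as
      select s.id, s.body, d.hits, d.score
        from doc_static s
        join doc_dynamic d using (id);

Updates then touch only the narrow doc_dynamic rows, so each one generates far less garbage for VACUUM to reclaim.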

If you have a column that is *frequently* updated (say, for example, a user's last-access timestamp each time s/he hits your web server) then you definitely want this in its own table, not mixed in with the user's name, address, etc.
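
Something like this (again, hypothetical names):

  create table user_last_access (
      user_id      integer primary key,
      last_access  timestamptz not null default now()
  );

  -- each hit rewrites only this narrow row, not the whole user record
  update user_last_access
     set last_access = now()
   where user_id = 42;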

Partitioning in Postgres is more powerful than in Oracle.  Use it if you can.
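
In the 8.x releases partitioning is done with table inheritance plus CHECK constraints; a rough sketch (hypothetical names, and it assumes constraint exclusion, which arrived in 8.1):

  create table hits (
      id       integer,
      hit_date date not null
  );

  create table hits_2007_01 (
      check (hit_date >= date '2007-01-01'
         and hit_date <  date '2007-02-01')
  ) inherits (hits);

  -- lets the planner skip children whose CHECK rules them out
  set constraint_exclusion = on;
  select count(*) from hits where hit_date = date '2007-01-15';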

Partial indexes are VERY nice in Postgres if your data is heavily skewed (for example, a mostly-NULL column with a small percentage of very important values).
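
For example, if only a small fraction of rows have the column set, an index restricted to those rows stays tiny and stays in cache (hypothetical names):

  create index orders_flagged_idx
      on orders (flagged_at)
   where flagged_at is not null;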

I'm sure there are more things that others can contribute.

Craig


