Hello all,

I've been struggling with some performance questions regarding our Postgres databases. Here's the background:

We run four ~25-30 GB databases which cache information from eBay. These databases have had performance issues since before I joined the company, and they have gone through a number of iterations. Initially they were deployed as one huge database; performance was apparently unacceptable. They were split and then tried on a variety of hardware platforms. When I joined the company last year, the databases were deployed on 12-disk RAID5 arrays on dual-proc AMD machines with 4 GB of RAM, running Debian Woody and Postgres 7.2.

These systems suffered gradually decreasing performance accompanied by steadily growing disk usage. The DBA had concluded that the VACUUM command did/does not work on these systems, because even after a VACUUM FULL the size of the database kept increasing. So, as things stand with the PG 7.2 machines, vacuuming is run nightly, and whenever the database size reaches 40 GB on disk (the point at which performance has degraded below tolerance), the DBA exports the data, deletes the database, and re-imports the data, shrinking it back to the actual size of the dataset. This process is time-consuming and costly, and the servers we are deployed on do not meet our stability requirements.

So, after much pushing, I was able to deploy a 12-disk RAID5 dual-proc AMD64 machine with 16 GB of RAM running FreeBSD and Postgres 8.1. The performance increase was immediate, obvious, and dramatic, as you might expect from such a large boost in the underlying hardware. The upgrade also appears to have solved the VACUUM issue: regular VACUUM commands now seem able to hold the database size at a steady state (allowing, of course, for fluctuations from actual changes in the dataset). We are now planning to move the other three databases to the new platform and hardware. However, we are still seeing a gradual decrease in performance over time - or so the application engineers claim. The DBA and I have been banging our heads against this for a month.

Which brings me to the questions:

1) How does one define 'performance' anyway? Is it average time to complete a query? If so, what kind of query? Is it some other metric?

2) I've combed the archives and seen evidence that people out there are running much, much larger databases on comparable hardware with decent performance. Is this true, or is my dataset at about the limit of my hardware?

3) Though this may seem irrelevant since we are moving away from the platform, it would still be good to know: was VACUUM actually completely useless on PG 7.2, or is there some other culprit on those legacy machines?

4) Much of my reading of the PG docs and list archives suggests that a lot of performance tuning is done at the query level - you have to know how to ask for information in an efficient way. To that end, I took a look at some of the queries we get on a typical day. On average, 24 times per minute, our application causes a unique key violation. This struck me as strange, but the VP of Engineering says this is a performance ENHANCEMENT - the code doesn't bother checking for the unique key because it depends on the database to enforce it. My reading of the archives and docs seems to indicate that this could be causing the constantly increasing database size (a sketch of the pattern is below).
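To make question 4 concrete, here is a rough sketch of the pattern as I understand it. The table and column names are invented for illustration - the real schema is different - but the shape is what the application does:

    -- What the application does today: INSERT unconditionally and
    -- rely on the unique constraint to reject duplicates.
    INSERT INTO ebay_items (item_id, title, price, fetched_at)
    VALUES (12345, 'Example item', 9.99, now());
    -- About 24 times a minute a statement like this fails with a
    -- unique key violation and the transaction is rolled back.

    -- One alternative that occurs to me (8.1 has no built-in UPSERT):
    -- only insert when the key is not already present. Still racy
    -- under concurrency, but it avoids the failed transaction in
    -- the common case.
    INSERT INTO ebay_items (item_id, title, price, fetched_at)
    SELECT 12345, 'Example item', 9.99, now()
    WHERE NOT EXISTS
          (SELECT 1 FROM ebay_items WHERE item_id = 12345);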
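And for reference, this is the sort of check I think would show whether those failed inserts are leaving anything behind for VACUUM to clean up (same invented table name; corrections welcome if I'm misreading the 8.1 docs):

    -- Per-table: VACUUM VERBOSE reports how many dead row versions
    -- it found and removed.
    VACUUM VERBOSE ebay_items;

    -- Database-wide: the summary at the end reports free space map
    -- usage, which should show whether max_fsm_pages is set high
    -- enough for the nightly VACUUM to keep up.
    VACUUM VERBOSE;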
So, now that I've rambled about it: does an INSERT transaction that is rolled back due to a unique key violation leave dead rows in the table? If so, why?

I really appreciate any information you guys can give me. I'm convinced that PG is the best database for our needs, but I need to be able to get this database performing well enough to convince the bigwigs.

Regards,
Paul Lathrop
Systems Administrator