Hi, everyone. I've been using PostgreSQL for a decade, and it hasn't
failed me yet. But I've been having some serious performance problems
on a database that I've been using in my grad-school research group, and
it's clear that I need help from some more experienced hands.
Basically, we've collected large amounts of data from students who used
our chemistry simulation software. That data, collected in XML logfiles
and then imported into a normalized set of PostgreSQL tables, keeps
track of every action the student took, as well as what the chemistry
simulation was doing. So if the number of molecules changes, we keep
track of that. If the user changes the color representation of the
molecules, we keep track of that, too. There is a timestamp
(unfortunately, with one-second resolution) associated with each
action. The simulations take place in several different activities,
each of which has a set of sections and nodes through which the user passes.
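To give a concrete (if heavily simplified) picture, the data ends up in
tables shaped roughly like these; the names here are placeholders rather
than our exact schema:

CREATE TABLE activities (
    id   SERIAL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE nodes (
    id          SERIAL PRIMARY KEY,
    activity_id INTEGER NOT NULL REFERENCES activities(id),
    section     TEXT NOT NULL,
    name        TEXT NOT NULL
);

CREATE TABLE actions (
    id          SERIAL PRIMARY KEY,
    student_id  INTEGER NOT NULL,
    node_id     INTEGER NOT NULL REFERENCES nodes(id),
    action_type TEXT NOT NULL,      -- user action or simulation event
    occurred_at TIMESTAMP NOT NULL  -- one-second resolution, unfortunately
);
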
We're trying to look for patterns in this data, which means that we're
heading into the worlds of data warehousing and data mining. These are
relatively new topics for me, although I'm trying to get up to speed on
them. (The patterns for which we're searching have already been
determined. So we're trying to describe particular patterns, rather
than simply let data-mining software go wild and find correlations.)
My first plan was to create a single fact table (which I called
"transactions"), with foreign keys to various other relevant tables, and
pre-computed data that I would need in my calculations. I then figured
that I would write some queries, in a combination of Perl, PL/pgSQL, and
straight SQL, to tag particular rows as being in a pattern (or not).
Once I've tagged the rows that are in the pattern, I can retrieve them
with a SQL query, and hand them off to the researcher analyzing them.
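To make that concrete, here is a simplified sketch of the idea; the
columns and the example pattern are illustrative placeholders, not the
real thing:

CREATE TABLE transactions (
    id             SERIAL PRIMARY KEY,
    student_id     INTEGER NOT NULL,
    node_id        INTEGER REFERENCES nodes(id),
    variable_name  TEXT,
    value          TEXT,
    previous_value TEXT,              -- pre-computed after import
    occurred_at    TIMESTAMP,
    in_pattern     BOOLEAN DEFAULT FALSE
);

-- Tagging is then an UPDATE per pattern; here, a made-up pattern
-- ("the same variable was assigned twice within ten seconds"):
UPDATE transactions t
   SET in_pattern = TRUE
 WHERE EXISTS (SELECT 1
                 FROM transactions prev
                WHERE prev.id            <> t.id
                  AND prev.student_id    =  t.student_id
                  AND prev.variable_name =  t.variable_name
                  AND prev.occurred_at BETWEEN t.occurred_at - INTERVAL '10 seconds'
                                           AND t.occurred_at);

-- Retrieving the tagged rows for the researcher:
SELECT *
  FROM transactions
 WHERE in_pattern
 ORDER BY student_id, occurred_at;
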
In theory, this would be the way to go. In practice, every step has
become a performance nightmare, taking many more hours than I might have
hoped or expected. For example, I've got a PL/pgSQL function that goes
through each variable-assignment row and UPDATEs its previous_value
column with whatever the previous value might have been. This function
takes 4-6 hours to run across 2 million rows, representing two of our
six activities. (I've removed the other four activities' worth of data,
in the hope that I'll see a performance improvement.)
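To show the kind of thing I mean, here is a stripped-down sketch of that
row-by-row approach (not the actual function, just its general shape):

CREATE OR REPLACE FUNCTION set_previous_values() RETURNS void AS $$
DECLARE
    r    RECORD;
    prev TEXT;
BEGIN
    -- Walk the variable-assignment rows one at a time...
    FOR r IN
        SELECT id, student_id, variable_name, occurred_at
          FROM transactions
         ORDER BY student_id, variable_name, occurred_at, id
    LOOP
        -- ...find the most recent earlier assignment of the same
        -- variable by the same student...
        SELECT t.value
          INTO prev
          FROM transactions t
         WHERE t.student_id    = r.student_id
           AND t.variable_name = r.variable_name
           AND (t.occurred_at, t.id) < (r.occurred_at, r.id)
         ORDER BY t.occurred_at DESC, t.id DESC
         LIMIT 1;

        -- ...and store it, one UPDATE per row.
        UPDATE transactions SET previous_value = prev WHERE id = r.id;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
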
When I only had 750,000 rows in our fact table, things ran at a somewhat
acceptable speed. Now, the database is getting seriously bogged down
with what I'm doing.
I've tried this on a few different pieces of hardware, including some
with multiple CPUs and lots of memory. And yet, nothing is going
quickly, or even remotely quickly. I'm stuck with the nagging feeling
that (a) I have some seriously unoptimized parts of my queries, (b) faster
disks would be helpful, and/or (c) what I'm doing is inherently slow and
painful, and there's no way around it.
Numerous invocations of EXPLAIN and frequent uses of VACUUM tell me
that I'm dealing with a fair amount of data here. But let's be honest:
the whole database is about 5 GB unvacuumed, and I know that people are
doing all sorts of things with 100 GB and larger disks. So perhaps I've
hit a hardware bottleneck, and need some zippier disks? I'm not sure.
So I'm looking for suggestions and advice from the PostgreSQL
community. I haven't included a lot of details here, because I'm not
even sure what details would be relevant, given the complexity of our
situation. So if there are particular details that I should share with
this group, from the hardware configuration to table definitions to the
queries that are getting bogged down, I'd be happy to do so.
Thanks in advance,
Reuven