Hi, Bucky,

Bucky Jordan wrote:

> Each postgres process also uses shared memory (aka the buffer cache) so
> as to not fetch data that another process has already requested,
> correct?

Yes. Additionally, the OS caches disk blocks. Most unixoid systems like
Linux use (nearly) all unused memory for this block cache; I don't know
about Windows.

> Commonly these are referred to as OLAP applications, correct? Which is
> where I believe my application is more focused (it may be handling some
> transactions in the future, but at the moment, it follows the "load lots
> of data, then analyze it" pattern).

Yes, most OLAP apps fall into this category. But I also think that most
OLAP apps mainly generate sequential data access (sequential scans), for
which the OS prefetching of data works fine. Btw, some weeks ago, a patch
was mentioned here that improves the Linux kernel I/O scheduler w.r.t.
those prefetching capabilities.

> Does this have anything to do with postgres indexes not storing data, as
> some previous posts to this list have mentioned? (In other words, having
> the index in memory doesn't help? Or are we talking about indexes that
> are too large to fit in RAM?)

Yes, it does, but only in the cases where your query fetches only columns
that are in the index. In cases where you fetch other columns, PostgreSQL
has to access the heap anyway to fetch those.

The overhead of checking outdated row versions (those that were updated or
deleted, but not yet vacuumed) is zero here, as those "load bulk, then
analyze" applications typically don't create invalid rows, so every row
fetched from the heap is valid. This is very different in OLTP
applications.

> So this issue would be only on a per query basis? Could it be alleviated
> somewhat if I ran multiple smaller queries? For example, I want to
> calculate a summary table on 500m records - fire off 5 queries that
> count 100m records each and update the summary table, leaving MVCC to
> handle update contention?

Yes, you could do that, but only if you're CPU bound and have a multi-core
machine. And you need table partitioning, as LIMIT/OFFSET is expensive.
Btw, the Bizgres people do exactly this under their hood, so it may be
worth a look.

If you're I/O bound, and your query is a full table scan or something else
that results in (nearly) sequential disk access, the OS prefetch algorithm
will work. You can use some I/O monitoring tools to compare the actual
speed at which the data comes in when PostgreSQL does the sequential scan,
and compare it to dd'ing the database table files. For simple aggregates
like sum(), you usually get near the "raw" speed, and the real bottlenecks
are the disk I/O rate, bad RAID implementations, or PCI bus contention.

> Actually, now that I think about it - that would only work if the
> sections I mentioned above were on different disks right? So I would
> actually have to do table partitioning with tablespaces on different
> spindles to get that to be beneficial? (which is basically not feasible
> with RAID, since I don't get to pick what disks the data goes on...)

If you really need that much throughput, you can always put the different
partitions on different RAIDs. But hardware gets very expensive in those
dimensions, and it may be better to partition the data across different
machines altogether. AFAIK, Bizgres MPP does exactly that.
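To illustrate the partition-per-query idea from above, here's a minimal
sketch using inheritance-based partitioning. All table and column names
("facts", "bucket", "summary") are made up for the example:

    -- parent table plus two child partitions, split on a hypothetical
    -- "bucket" column; the CHECK constraints let the planner skip
    -- partitions when constraint_exclusion is enabled
    CREATE TABLE facts (bucket integer, value numeric);
    CREATE TABLE facts_p0 (CHECK (bucket BETWEEN 0 AND 99))
        INHERITS (facts);
    CREATE TABLE facts_p1 (CHECK (bucket BETWEEN 100 AND 199))
        INHERITS (facts);

    CREATE TABLE summary (bucket_range text, total numeric);

    -- each session then aggregates "its" partition directly, instead of
    -- paging through the parent table with LIMIT/OFFSET:
    INSERT INTO summary (bucket_range, total)
        SELECT '0-99', sum(value) FROM facts_p0;
    INSERT INTO summary (bucket_range, total)
        SELECT '100-199', sum(value) FROM facts_p1;

Run from separate connections in parallel, each of those statements scans
its own partition sequentially, which is where the multi-core (or
multi-spindle) gain would come from.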
> Are there any other workarounds for current postgres?

Are your questions of a theoretical nature, or do you have a concrete
problem? In the latter case, you could post your details here, and we'll
see whether we can help. Btw, I'm not related to Bizgres in any way. :-)

HTH,
Markus

-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org