Re: Transparent table partitioning in future version of PG?

On Fri, 8 May 2009, Robert Haas wrote:

> On Thu, May 7, 2009 at 10:52 PM,  <david@xxxxxxx> wrote:
>>> Hopefully, notions of partitioning won't be directly tied to chunking of
>>> data for parallel query access. Most queries access recent data and
>>> hence only a single partition (or stripe), so partitioning and
>>> parallelism are frequently exactly orthogonal.
>
> Yes, I think those things are unrelated.
>
>> I'm not so sure (warning, I am relatively inexperienced in this area)
>>
>> it sounds like you can take two basic approaches to partitioning a database
>>
>> 1. The Isolation Plan
>> [...]
>> 2. The Load Balancing Plan
>
> Well, even if the table is not partitioned at all, I don't see that it
> should preclude parallel query access.  If I've got a 1 GB table that
> needs to be sequentially scanned for rows meeting some restriction
> clause, and I have two CPUs and plenty of I/O bandwidth, ISTM it
> should be possible to have them each scan half of the table and
> combine the results.  Now, this is not easy and there are probably
> substantial planner and executor changes required to make it work, but
> I don't know that it would be particularly easier if I had two 500 MB
> partitions instead of a single 1 GB table.
>
> IOW, I don't think you should need to partition if all you want is
> load balancing.  Partitioning should be for isolation, and load
> balancing should happen when appropriate, whether there is
> partitioning involved or not.
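
For what it's worth, later PostgreSQL releases (9.6 and up) implemented essentially this: the executor can split a sequential scan of an unpartitioned table across worker processes and combine the partial results. A minimal sketch, with hypothetical table and column names:

-- parallel scan of an unpartitioned table (PostgreSQL 9.6+);
-- table and column names are invented for illustration
CREATE TABLE big_table (id bigint, val numeric);

SET max_parallel_workers_per_gather = 2;

EXPLAIN SELECT sum(val) FROM big_table;
-- representative (abridged) plan shape, assuming big_table is large
-- enough that the planner chooses to use workers:
--   Finalize Aggregate
--     ->  Gather
--           Workers Planned: 2
--           ->  Partial Aggregate
--                 ->  Parallel Seq Scan on big_table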

Actually, I will contradict myself slightly.

With the Isolation Plan, there is not necessarily a need to run the query on each partition in parallel.

If parallel queries are possible, they will benefit Isolation Plan partitioning, but the biggest win with this plan is simply reducing the number of partitions that need to be queried.
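
To make that concrete, here is a minimal sketch using the declarative partitioning syntax later added in PostgreSQL 10 (in 2009 the same effect required inheritance plus constraint_exclusion); the table and column names are made up:

-- hypothetical log table, partitioned by month
CREATE TABLE logs (
    ts      timestamptz NOT NULL,
    payload text
) PARTITION BY RANGE (ts);

CREATE TABLE logs_2009_04 PARTITION OF logs
    FOR VALUES FROM ('2009-04-01') TO ('2009-05-01');
CREATE TABLE logs_2009_05 PARTITION OF logs
    FOR VALUES FROM ('2009-05-01') TO ('2009-06-01');

-- partition pruning means this only ever touches logs_2009_05,
-- entirely serially; the win needs no parallelism at all
SELECT count(*) FROM logs WHERE ts >= '2009-05-01';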

With the Load Balancing Plan, there is no benefit in partitioning unless you have the ability to run queries on each partition in parallel.
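
A sketch of the Load Balancing Plan, using the hash partitioning syntax later added in PostgreSQL 11 (names again hypothetical). Because the rows are spread evenly, a query that cannot be pruned must visit every partition, so there is no win unless those scans actually run concurrently:

-- hypothetical event table, rows spread evenly by hash
CREATE TABLE events (
    id      bigint NOT NULL,
    payload text
) PARTITION BY HASH (id);

CREATE TABLE events_p0 PARTITION OF events
    FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events
    FOR VALUES WITH (MODULUS 2, REMAINDER 1);

-- no pruning is possible here: both partitions get scanned, one after
-- the other unless the executor can run the scans in parallel
SELECT count(*) FROM events;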


Using a separate back-end process to run a query against a separate partition is fairly straightforward, but not trivial. There are complications in merging the result sets, including the need to be able to do part of a query, merge the results, and then use those results for the next step in the query.
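
As a sketch of that merge step (reusing the hypothetical events_p0 / events_p1 partitions from above, with one backend assigned to each): each backend computes a partial result, and a final query combines them:

SELECT sum(partial) AS total
FROM (
    SELECT count(*) AS partial FROM events_p0   -- backend 1
    UNION ALL
    SELECT count(*) AS partial FROM events_p1   -- backend 2
) AS partials;

-- the multi-step problem shows up with something like avg(): each
-- backend has to return both sum() and count(), and the merge step
-- computes sum(sums) / sum(counts) before the query can continue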

I would also note that there does not seem to be a huge conceptual difference between running these parallel queries on one computer and shipping the queries off to other computers.
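
For example, the merge query above barely changes if one of the partial queries is shipped to another machine with the dblink extension (the host name and connection string here are hypothetical):

CREATE EXTENSION IF NOT EXISTS dblink;

SELECT sum(partial) AS total
FROM (
    SELECT count(*) AS partial FROM events_p0          -- local scan
    UNION ALL
    SELECT * FROM dblink('host=node2 dbname=app',      -- hypothetical remote node
                         'SELECT count(*) FROM events_p1')
           AS remote(partial bigint)                   -- remote scan
) AS partials;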


However, trying to split the work on a single table runs into all sorts of 'interesting' issues with things needing to be shared between the multiple processes (they all need to use the same indexes, for example).

So I think it is much easier for the database engine to efficiently search two 500 GB tables than one 1 TB table.

David Lang

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
