Re: Transparent table partitioning in future version of PG?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, 7 May 2009, Robert Haas wrote:

On Wed, May 6, 2009 at 6:08 PM, Simon Riggs <simon@xxxxxxxxxxxxxxx> wrote:
Agreed. Perhaps I should say then that the syntax needs to express the
requirements of the planner/executor behaviour, rather than being the
main aspect of the feature, as some have suggested.

Agreed.

Hopefully, notions of partitioning won't be directly tied to chunking of
data for parallel query access. Most queries access recent data and
hence only a single partition (or stripe), so partitioning and
parallelism and frequently exactly orthogonal.

Yes, I think those things are unrelated.

I'm not so sure (warning, I am relativly inexperianced in this area)

it sounds like you can take two basic approaches to partition a database

1. The Isolation Plan

   you want to have it so that your queries match your partitioning.

this is with the goal of only having to query a small number of paritions, minimizing the total amount of data touched (including minimumizing the number of indexes searched)

this matches the use case mentioned above, with the partition based on date and only looking at the most recent date range.

2. The Load Balancing Plan

you want to have your partitioning and your queries _not_ match as much as possible

this is with the goal of having the query hit as many partitions as possible, so that the different parts of the search can happen in parallel


However, with either partitioning plan, you will have queries that degenerate to look like the other plan.

In the case of the isolation plan, you may need to search for all instances of a rare thing over the entire history (after all, if you never need to access that history, why do you pay for disks to store it? ;-)

and even when you are searching a narrow time window, it may still span multiple partitions. I have a log analysis setup using the Splunk prioriatary database, it paritions by time, creating a new parition as the current one hits a configurable size (by default 10G on 64 bit systems). for my volume of logs I end up with each parition only covering a few hours. it's very common to want to search over a few days, which can be a few dozen partitions (this is out of many hundreds of partitions, so it's still a _huge_ win to narrow the timeframe)


In the case of the load balancing plan, you may run into a query that happens to only fall into one partition (the query matches your paritioning logic)




I think the only real difference is how common it is to need to search multiple partitions.

If the expectation is that you will frequently need to search most/all of the partitions (the load balancing plan), then it's a waste of time to analyse the query to try and figure out which paritions you need to look at.

If the expectation is that you will frequently only need to search a small number of the partitions (the isolation plan), then it's extremely valuble to spend as much time as needed working to analyse the query to try and figure out which partitions you need to look at.


I believe that the isolation plan is probably more common than the load balancing plan, but I don't see them as being that different for the database engine point of view. To tune a system that can handle the isolation plan for load balancing, the key thing to do would be to have a knob to disable the partition planning, and just blindly send the search out to every partition.

David Lang

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux