On Sat, Jun 18, 2011 at 9:06 PM, Jose Ildefonso Camargo Tolosa <ildefonso.camargo@xxxxxxxxx> wrote:
Greetings,
I have been thinking a lot about pgsql performance when it is dealing with lots of rows in a single table (several million, maybe billions). Take the large object use case: one table references large objects (each row has a pointer to one object). The large object table stores each object in 2000-byte chunks (iirc), so if we have something like 1TB of data stored in large objects, the large object table would have something like 550M rows; if we get to 8TB, we will have 4400M rows (or so).

I have read in several places that huge tables should be partitioned to improve performance... now, my first question: does the large object system automatically partition itself? If not, will large object performance degrade as we add more data? (I guess it would.)
Now... I can't fully understand this: why does performance actually degrade? I mean, when we do partitioning, we take a certain parameter to "divide" the data, and then use the same parameter to issue the request against the correct table... shouldn't the DB do something similar with the indexes? I have always thought about indexes exactly like that: an approximation search. I know I'm looking for, say, a date that is less than 2010-03-02, so the system should just position itself on the index around that date and scan from that point backward. As far as my understanding goes, partitioning only adds this "auxiliary" index, making the system, for example, go to a certain table if the query targets one particular year (assuming we partitioned by year). What if the DB actually implemented something like an index for the index, so that the first search on a huge table scans a smaller index that points to a position in the larger index, avoiding the initial scan of the large index?
Well... I'm writing all of this half-asleep now, so... I'll re-read it tomorrow. In the meantime, just ignore anything that doesn't make a lot of sense :).
Partitioning helps in a number of ways.

First, if you run a query that must scan an entire table, and the table is very large, that scan will be expensive. Partitioning can allow the query planner to do a sequential scan over just some of the data and skip the rest (or process it in some other manner, such as an index lookup).

Also, the larger an index is, the more expensive it is to maintain: inserts and lookups will both take longer. Partitioning gives you n indexes, each with m/n entries (assuming a fairly even distribution of data among partitions), so any given index will be smaller, which means inserts into a partition can be much faster. Since large tables often also have very high insert rates, this can be a big win.

You can also get better resource utilization by moving less frequently used partitions onto slower media (via a tablespace), freeing up space on your fastest I/O devices for the most important data. A lot of partitioning tends to happen by time, and the most frequently run queries are often against the partitions containing the most recent data, so it can be very beneficial to keep only the most recent partitions on the fastest storage.

Then there is caching. Indexes and tables are cached by page. Without clustering a table on a particular index, the contents of a single page may be quite arbitrary. Even with clustering, depending on the usage patterns of the table in question, it is entirely possible that any given page holds only a fairly small percentage of highly relevant data if the table is very large. By partitioning, you can (potentially) ensure that any given page in cache has a higher density of relevant entries, so you get better efficiency out of the caching layers. And with smaller indexes, it is less likely that loading an index into shared buffers will push some other useful chunk of data out of the cache.
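As a rough illustration, here is a minimal sketch of time-based partitioning using table inheritance and CHECK constraints; all table and column names are made up for the example:

    -- Hypothetical parent table; names are for illustration only.
    CREATE TABLE measurements (
        id         bigserial,
        logged_on  date NOT NULL,
        payload    bytea
    );

    -- One child table per year, each with a CHECK constraint describing its range.
    CREATE TABLE measurements_2010 (
        CHECK (logged_on >= '2010-01-01' AND logged_on < '2011-01-01')
    ) INHERITS (measurements);

    CREATE TABLE measurements_2011 (
        CHECK (logged_on >= '2011-01-01' AND logged_on < '2012-01-01')
    ) INHERITS (measurements);

    -- Each partition gets its own (smaller) index.
    CREATE INDEX measurements_2010_logged_on_idx ON measurements_2010 (logged_on);
    CREATE INDEX measurements_2011_logged_on_idx ON measurements_2011 (logged_on);

    -- Older, rarely-touched partitions can live on slower storage,
    -- e.g. via a hypothetical tablespace:
    -- ALTER TABLE measurements_2010 SET TABLESPACE slow_disks;

(Routing inserts to the right child still needs a trigger on the parent, or the application can insert into the correct child directly.)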
As for the large object tables, I'm not sure about the internals. Assuming that each table gets its own table for large objects, partitioning the main table would have the effect of partitioning the large object table too, keeping index maintenance more reasonable and ensuring that lookups are as fast as possible. There's probably a debate to be had about the benefit of storing very large numbers of large objects in the db at all, as opposed to keeping references to them in the db and accessing the objects themselves via some other mechanism; both product requirements and performance are significant factors in that discussion.
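For a rough sense of scale on your numbers: if I remember correctly, pg_largeobject stores each object in chunks of LOBLKSIZE bytes (2048 by default, i.e. BLCKSZ/4), so 1TB of large object data works out to roughly 1TB / 2048B, about 537 million rows, which is in line with your estimate, and 8TB would be around 4.3 billion. Something like this will show the actual numbers (reading pg_largeobject directly may require superuser privileges, and the count(*) is a full scan, so it can take a while on a big table):

    -- Chunk rows and total on-disk size of the large object store in this database.
    SELECT count(*) AS chunk_rows,
           pg_size_pretty(pg_total_relation_size('pg_largeobject')) AS total_size
    FROM pg_largeobject;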
As for your suggestion that the db maintain an index on an index: how would the database do so in an intelligent manner? It would have to maintain such indexes on every index and guess at which values to use as boundaries for each bucket. Partitioning solves the same problem, but lets you direct the database so that it only does the extra work where the dba, who knows far more about the structure of the data and how it will be used than the database itself, tells it to. And the dba gets to tell the db what buckets to use when partitioning, via the CHECK constraints on the partitions. Without that, the db would have to guess at appropriate bucket sizes and the distribution of values within them.
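To make that concrete, with constraint_exclusion enabled the planner compares the query's WHERE clause against each partition's CHECK constraint and simply skips any partition that cannot contain matching rows. Using the hypothetical tables from the sketch above:

    SET constraint_exclusion = partition;  -- the default since 8.4

    EXPLAIN SELECT *
    FROM measurements
    WHERE logged_on >= '2011-06-01' AND logged_on < '2011-07-01';
    -- The plan should touch only the parent and measurements_2011;
    -- measurements_2010 is excluded because its CHECK constraint rules it out.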
I'm sure there are reasons beyond even those I've listed here. I'm not one of the PostgreSQL devs, so my understanding of how it benefits from partitioning is shallow at best. If the usage pattern of your very large table is such that every query tends to touch all of the table, then I'm not sure partitioning really offers much gain. The benefits of partitioning are, at least in part, predicated on only a subset of the data being useful to any one query, and they get that much stronger if some portion of the data is rarely used by any query.