On Tue, May 15, 2012 at 7:53 PM, Greg Sabino Mullane <greg@xxxxxxxxxxxx> wrote: > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: RIPEMD160 > > >>>> Is it established practice in the Postgres world to separate indexes >>>> from tables? I would assume that the reasoning of Richard Foote - >>>> albeit for Oracle databases - is also true for Postgres: >> >>> Yes, it's an established practice. I'd call it something just short of >>> a best practice though, as it really depends on your situation. >> >> What are the benefits? > > Disk seeks, basically. Yes, there are a lot of complications regarding > all the various hardware/OS/PG level cachings, but at the end of the > day, it's less work to have each drive concentrate on a single area Hmm... I see your point. OTOH, the whole purpose of using NAS or SAN with cache, logical volumes and multiple spindles per volume is to reduce the impact of slow disk operations like seeks. If in such a situation your database operations are impacted by those seek operations then the setup does not seem optimal anyway. Bottom line is: with a setup properly tailored to the workload there should be no seeks "visible" to the database. > (especially as we always require a heap scan at the moment). Are you referring to the scan along tuple versions? http://facility9.com/2011/03/postgresql-row-storage-fundamentals/ >>> I also find his examples a bit contrived, and the whole "multi-user" >>> argument irrelevant for common cases. >> >> Why is that? > > Because most Postgres servers are dedicated to serving the same data > or sets of data, and the number of "other users" calling ad-hoc queries > against lots of different tables (per his example) is small. I don't see how it should be relevant for this discussion whether selects are "ad hoc" or other. The mere fact that concurrent accesses to the same set of tables and indexes albeit to different data (keys) is sufficient to have a potential for seeks - even if disks for index and table are separated. And this will typically happen in a multiuser application - even if all users use the same set of queries. > So this sentence just doesn't ring true to me: > > " ... by the time weâve read the index leaf block, processed and > read all the associated table blocks referenced by the index leaf > block, the chances of there being no subsequent physical activity > in the index tablespace due to another user session is virtually > nil. We would still need to re-scan the disk to physically access > the next index leaf block (or table block) anyways." > > That's certainly not true for Postgres servers, and I doubt if it > is quite that bad on Oracle either. I don't think this has much to do with the brand. Richard just describes logical consequences of concurrent access (see my attempt at explanation above). Fact remains that concurrent accesses rarely target for the same data and because of that you would see quite erratic access patterns to blocks. How they translate to actual disk accesses depends on various caching mechanisms in place and the physical distribution of data across disks (RAID). But I think we cannot ignore the fact that the data requested by concurrent queries most likely resides on different blocks. >>> I lean towards using separate tablespaces in Postgres, as the >>> performance outweighs the additional>> complexity. > >> What about his argument with regards to access patterns (i.e. >> interleaving index and table access during an index scan)? Also, >> Shaun's advice to have more spindles available sounds convincing to >> me, too. > > I don't buy his arguments. To do so, you'd have to buy a key point: > > "when most physical I/Os in both index and table segments are > effectively random, single block reads" > > They are not; hence, the rest of his argument falls apart. Certainly, > if things were as truly random and uncached as he states, there would > be no benefit to separation. Your argument with seeks also only works in absence of caching (see above). I think Richard was mainly pointing out that /in absence of caching/ different blocks need to be accessed here. > As far as spindles, yes: like RAM, it's seldom the case to have > too litte :) But as with all things, one should get some benchmarks > on your specific workload before making hardware changes. (Well, RAM > may be an exception to that, up to a point). Can you share some measurement data which backs the thesis that the distribution of index and table to different disks is advantageous? That would be interesting to see. Then one could also balance performance benefits against other effects (manageability etc.) and see on which side the advantage comes out. Even though I'm not convinced: Thank you for the interesting discussion! Cheers robert -- remember.guy do |as, often| as.you_can - without end http://blog.rubybestpractices.com/ -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance