Hi, Bucky, Bucky Jordan wrote: >> We can implement multiple scanners (already present in MPP), or we > could >> implement AIO and fire off a number of simultaneous I/O requests for >> fulfillment. > > So this might be a dumb question, but the above statements apply to the > cluster (e.g. postmaster) as a whole, not per postgres > process/transaction correct? So each transaction is blocked waiting for > the main postmaster to retrieve the data in the order it was requested > (i.e. not multiple scanners/aio)? No, that's a wrong assumption. It applies per active backend. When connecting, the Postmaster forks a new backend process. Each backend process has its own scanner and executor. The main postmaster is only for coordination (forking, config reload etc.), all the work is done in the forked per-connection backends. Furthermore, the PostgreSQL MVCC system ensures that readers are neither ever blocked nor blocking other backends. Writers can block each other due to the ACID transaction semantics, however the MVCC limits that to a minimum. > In this case, the only way to take full advantage of larger hardware > using normal postgres would be to run multiple instances? (Which might > not be a bad idea since it would set your application up to be able to > deal with databases distributed on multiple servers...) Typical OLTP applications (Web UIs, Booking systems, etc.) have multiple connections, and those run fully parallel. So if your application is of this type, it will take full advantage of larger hardware. In the list archive, you should find some links to benchmarks that prove this statement, PostgreSQL scales linearly, up to 8 CPUs and 32 "hyperthreads" in this benchmarks. Our discussion is about some different type of application, where you have a single application issuing a single query at a time dealing with a large amount (several gigs up to teras) of data. Now, when such a query is generating sequential disk access, the I/O scheduler of the underlying OS can easily recognize that pattern, and prefetch the data, thus giving the full speed benefit of the underlying RAID. The discussed problem arises when such large queries generate random (non-continous) disk access (e. G. index scans). Here, the underlying RAID cannot effectively prefetch data as it does not know what the application will need next. This effectively limits the speed to that of a single disk, regardless of the details of the underlying RAID, as it can only process a request at a time, and has to wait for the application for the next one. Now, Bizgres MPP goes the way of having multiple threads per backend, each one processing a fraction of the data. So there are always several outstanding read requests that can be scheduled to the disks. My proposal was to use posix_fadvise() in the single-threaded scanner, so it can tell the OS "I will need those blocks in the near future". So the OS can pre-fetch those blocks into the cache, while PostgreSQL still processes the previous block of data. Another proposal would be to use so-called asynchroneous I/O. This is definitely an interesting and promising idea, but needs much more changes to the code, compared to posix_fadvise(). I hope that this lengthy mail is enlightening, if not, don't hesitate to ask. Thanks for your patience, Markus -- Markus Schaber | Logical Tracking&Tracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in Europe! www.ffii.org www.nosoftwarepatents.org