Hi Mark, Greg,

On 9/10/07 3:08 PM, "Mark Mielke" <mark@xxxxxxxxxxxxxx> wrote:
> One suggestion: The plan is already in a tree. With some dependency analysis,
> I assume the tree could be executed in parallel (multiple threads or event
> triggered entry into a state machine), and I/O to fetch index pages or table
> pages could be scheduled in parallel. At this point, AIO becomes necessary to
> let the underlying system (and hardware with tagged queueing?) schedule which
> pages should be served best first.

Right now the pattern for an index scan goes like this:

- Find a qualifying TID in the index
- Seek to the TID's location in the relfile
- Acquire the tuple from the relfile and return it

When the tuples are widely distributed in the table, as with a very selective predicate against an evenly distributed attribute on a relation more than 2x larger than the I/O cache + bufcache, this pattern results in effectively "random I/O". In actual fact, the use of the in-memory bitmap index makes the I/Os sequential but sparse, which is another version of "random" when the gaps between the sequential I/Os exceed what the gather/scatter I/O aggregation in the OS scheduler can bridge (say 1MB). This is a very common circumstance for DSS / OLAP / DW workloads.

For plans that qualify under the above conditions, the executor issues blocking lseek()/read() calls, each of which translates to a single disk actuator moving to the needed location in one seek time, approximately 8ms. The seek time seen by a single query does not improve as disks are added to the underlying RAID pool, so we can do about 1000/8 = 125 seeks per second no matter what I/O subsystem we have.

If we implement AIO and allow multiple pending I/Os to prefetch groups of qualifying tuples, basically a form of random readahead, we can improve the throughput of any given query by taking advantage of multiple disk actuators. This works for RAID5, RAID10, and other disk pooling mechanisms because the reads are issued as parallel events.
Note that the same approach would also work to speed sequential access by overlapping compute and I/O.

- Luke