Hi Mark, Greg,

On 9/10/07 3:08 PM, "Mark Mielke" <mark@xxxxxxxxxxxxxx> wrote:
> One suggestion: The plan is already in a tree. With some dependency analysis,
> I assume the tree could be executed in parallel (multiple threads or event
> triggered entry into a state machine), and I/O to fetch index pages or table
> pages could be scheduled in parallel. At this point, AIO becomes necessary to
> let the underlying system (and hardware with tagged queueing?) schedule which
> pages should be served best first.

Right now the pattern for an index scan goes like this:

- Find a qualifying TID in the index
- Seek to the TID's location in the relfile
- Acquire the tuple from the relfile and return it

When the tuples are widely distributed in the table, as with a very selective predicate against an evenly distributed attribute on a relation more than 2x larger than the I/O cache + bufcache, this pattern results in effectively "random I/O". In actual fact, the use of the in-memory bitmap index makes the I/Os sequential but sparse, which is another version of "random" when the gaps between the sequential I/Os exceed what the gather/scatter I/O aggregation in the OS scheduler can bridge (say 1MB). This is a very common circumstance for DSS / OLAP / DW workloads.

For plans that qualify under the above conditions, the executor issues blocking lseek()/read() calls, each of which translates to a single disk actuator moving to the needed location in one seek time, approximately 8ms. The seek time seen by a single query does not improve as disks are added to the underlying RAID pool, so we can do about 1000/8 = 125 seeks per second no matter what I/O subsystem we have.

If we implement AIO and allow multiple pending I/Os to prefetch groups of qualifying tuples, basically a form of random readahead, we can improve the throughput of any given query by taking advantage of multiple disk actuators. This works for RAID5, RAID10, and other disk pooling mechanisms because the reads are issued as parallel events.
Note that the same approach would also work to speed sequential access by overlapping compute and I/O.

- Luke