James Mansion wrote:
> Mark Mielke wrote:
>> PostgreSQL or the kernel should already have the hottest pages in
>> memory, so the value of doing async I/O is very likely the cooler
>> pages that are unique to the query. We don't know what the cooler
>> pages are until we follow the tree down.
> I'm assuming that at the time we start to search the index, we have
> some idea of the value or values that we are looking for. Or, as you
> say, we are applying a function to 'all of it'.
>
> Think of a 'between' query. The subset of the index that can be a
> match can be bounded by the leaf pages that contain the end point(s).
> Similarly, if we have a merge with a sorted intermediate set from a
> prior step, then we also have bounds on the values.
How do you find the bounding points for these pages? Does this not
require descending through the tree in a more traditional way?
> I'm not convinced that your assertion that the index leaf pages must
> necessarily be processed in-order is true - it depends what sort of
> operation is under way. I am assuming that we try hard to keep
> interior index nodes and data in memory, and that having identified
> the subset of these that we want, we can immediately infer the set of
> leaves that are potentially of interest.
It is because you are missing my point. In order to find a reduced set
of pages to load, one must descend through the B-Tree. Identifying the
subset requires sequential loading of pages. There is some theoretical
potential for async I/O here, but I doubt your assertion that it
matters to any significant degree. I ask you again - how do you know
which lower-level pages to load before you have loaded the higher-level
pages? The only time the B-Tree can be processed out of order in this
regard is if you are doing a bitmap index scan or some similar
operation that will scan the entire tree, and you do not care what
order the pages arrive in. If you are looking for a specific key, one
parent page leads to one child page, and the operation is sequential.
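To make the dependency concrete, here is a toy sketch (mine, not
PostgreSQL code - the two-level "disk", the page layout, and
read_page() are all invented for illustration). Each read depends on
the result of the previous one, so there is nothing to hand the disk
queue ahead of time:

    /* Toy illustration, not PostgreSQL code: descending a tree is a
     * chain of dependent reads.  The child block number is only known
     * after the parent page has been fetched and searched. */
    #include <stdio.h>

    typedef struct {
        int is_leaf;
        int sep;          /* separator key */
        int left, right;  /* child block numbers */
    } Page;

    /* toy "disk": block 0 is the root, blocks 1 and 2 are leaves */
    static const Page disk[] = {
        {0, 50, 1, 2},    /* root: keys < 50 go left, others right */
        {1, 0, -1, -1},   /* leaf for keys < 50 */
        {1, 0, -1, -1},   /* leaf for keys >= 50 */
    };

    static const Page *read_page(int blkno)
    {
        printf("synchronous read of block %d\n", blkno);
        return &disk[blkno];  /* stands in for a blocking disk read */
    }

    int main(void)
    {
        int key = 42;
        int blkno = 0;                    /* start at the root */
        const Page *page = read_page(blkno);
        while (!page->is_leaf) {
            /* the next block to read is unknown until this page is
             * in memory, so these I/Os cannot be issued in parallel */
            blkno = (key < page->sep) ? page->left : page->right;
            page = read_page(blkno);
        }
        printf("key %d lands on leaf block %d\n", key, blkno);
        return 0;
    }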
>> The difference between preload and handling async I/O in terms of
>> performance is debatable. Greg reports that async I/O on Linux sucks,
>> but posix_fadvise*() has substantial benefits. posix_fadvise*() is
>> preload, not async I/O (he also reported that async I/O on Solaris
>> seems to work well). Being able to do work as the first page is
>> available is a micro-optimization as far as I am concerned at this
>> point (one that may not yet work on Linux), as the real benefit comes
>> from utilizing all 12 disks in Matthew's case, not from guaranteeing
>> that data is processed as soon as possible.
> I see it as part of the same problem. You can partition the data
> across all the disks and run queries in parallel against the
> partitions, or you can lay out the data in the RAID array, in which
> case the optimiser has very little idea how the data will map to the
> physical layout - so its best bet is to let the systems that DO know
> decide the access strategy. And those systems can only do that if you
> give them a lot of requests that CAN be reordered, so they can choose
> a good ordering.
You can see it however you like - what remains is that the 12X speed-up
is going to come from using 12 disks, and loading the 12 disks in
parallel. The more theoretical improvements - the ability of a
particular hard disk to schedule reads and return results out of order
- have not, in my reading, been shown to reliably improve performance
to a significant degree. Unfortunately, Linux doesn't support NCQ on my
SATA hardware yet, so I haven't been able to experiment myself. Gregory
provided numbers showing a 2X - 3X performance improvement when using
three disks. This has the potential for significant improvement with
real hardware, modest cost, and perhaps trivial changes to PostgreSQL.
What you describe is a re-design of the I/O strategy around
asynchronous I/O, with some sort of state machine that will be able to
deal efficiently with either index pages or table pages out of order.
Do you have numbers to show that such a significant change would result
in a reasonable return on the investment?
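For what it's worth, the preload pattern under discussion has roughly
this shape - my sketch, not Gregory's patch; the file name and page
numbers are invented, and error handling is trimmed:

    /* Sketch of hint-then-read with posix_fadvise(); the file name
     * and page list are invented for illustration. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define PAGESZ 8192
    #define NPAGES 4

    int main(void)
    {
        off_t pages[NPAGES] = {17, 4123, 88, 920};  /* invented */
        char buf[PAGESZ];
        int fd = open("relation.dat", O_RDONLY);    /* invented */
        if (fd < 0) { perror("open"); return 1; }

        /* 1. tell the kernel everything we will need, so all 12
         *    spindles can be kept busy at once ... */
        for (int i = 0; i < NPAGES; i++)
            posix_fadvise(fd, pages[i] * PAGESZ, PAGESZ,
                          POSIX_FADV_WILLNEED);

        /* 2. ... then read synchronously as usual; with luck the
         *    pages are already in the OS cache by the time we get
         *    to them */
        for (int i = 0; i < NPAGES; i++) {
            if (pread(fd, buf, PAGESZ, pages[i] * PAGESZ) != PAGESZ)
                perror("pread");
            /* process the page here */
        }
        close(fd);
        return 0;
    }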
>> Micro-optimization.
> Well, you like to assert this - but why? I'll quote from:
> http://en.wikipedia.org/wiki/Native_Command_Queuing
Most reading I have done shows NCQ to have limited gains, with most
benchmarks being suspect. Also, latency only matters for the first
page. One presumes that async I/O will be mostly valuable in the case
where many pages can be scheduled for reading at the same time. In the
case that many pages are scheduled for reading, the first page will
eventually be served, and the overall bandwidth is still the same: if a
hundred pages are queued, receiving the first one a few milliseconds
sooner changes little, because the batch takes the same total time to
drain. In your theoretical model, you are presuming the CPU is the
bottleneck, either for processing or for scheduling the next batch of
reads. I think you are hand waving, and given that Linux doesn't yet
support async I/O well, Gregory's model will serve more PostgreSQL
users than yours, with less complexity.
> Clearly a hint to preload is better than nothing. But it seems to me
> that the worst case is that we wait for the slowest page to load and
> then start processing, hoping that the rest of the data stays in the
> buffer cache and is 'instant', while AIO and evaluate-when-ready
> means that the process is still bound by the slowest data to arrive,
> but at that point there is little processing still to do, and the
> already-processed buffers can be reused earlier. In the case where
> there is significant pressure on the buffer cache, this can be
> significant.
It seems to me that you have yet to prove that there will be
substantial gains in overall performance from preload. Leaping on to
presuming that PostgreSQL should switch to a fully async I/O model
seems a greater stretch. By the sounds of it, Gregory could have the
first implemented very soon. When will you have yours? :-)
>> In your hand waving, you have assumed that the PostgreSQL B-Tree
>> index might need to be replaced? :-)
> Sure, if the intent is to make the system thread-hot or AIO-hot, then
> the change is potentially very invasive. The strategy to evaluate
> queries based on parallel execution and async IO is not necessarily
> very like a strategy where you delegate to the OS buffer cache.
>
> I'm not too bothered, for the purpose of this discussion, whether the
> way that postgres currently navigates indexes is amenable to this.
> This is bikeshed land, right?
I am only interested in juicy projects that have a hope of success.
This subject does interest me - I am hoping my devil's advocate
participation encourages people to seek a practical implementation that
will benefit me.
> I think it is foolish to disregard strategies that will allow
> overlapping IO and processing - and we want to keep disks reading and
> writing rather than seeking. To me that suggests AIO and disk-native
> queuing are quite a big deal. And parallel evaluation will be too, as
> the number of cores goes up and there is an expectation that this
> should reduce the latency of an individual query, not just allow
> throughput with lots of concurrent demand.
I am more amenable to multi-threaded index processing for the same
query than to async I/O to take advantage of NCQ. Guess we each come
from a different background. :-)
Cheers,
mark
--
Mark Mielke <mark@xxxxxxxxx>