"Gregory Stark" <stark@xxxxxxxxxxxxxxxx> writes: > "Luke Lonergan" <llonergan@xxxxxxxxxxxxx> writes: > >> Right now the pattern for index scan goes like this: >> >> - Find qualifying TID in index >> - Seek to TID location in relfile >> - Acquire tuple from relfile, return >>... >> If we implement AIO and allow for multiple pending I/Os used to prefetch >> groups of qualifying tuples, basically a form of random readahead > > Ah, I see what you mean now. It makes a lot more sense if you think of it for > bitmap index scans. So, for example, the bitmap index scan could stream tids > to the executor and the executor would strip out the block numbers and pass > them to the i/o layer saying "i need this block now but following that I'll > need these blocks so get them moving now". Wow, I've done some preliminary testing here on Linux using posix_fadvise and Solaris using libaio to prefetch blocks and then access them randomly and I think there's a lot of low hanging fruit here. The use case where this helps is indeed on a raid array where you're not maxing out the bandwidth of the array and care about the transaction latency, perhaps a narrow use case but still, quite common. Since our random access is synchronous it means we have to wait for one seek, process that page, then wait for the next seek on another drive which was sitting idle while we were processing the first page. By prefetching the pages we'll need next we can get all the members of the array working for us simultaneously even if they're all doing seeks. What I've done is write a test program which generates a 1G file, syncs it and drops the caches (not working yet on Solaris but doesn't seem to affect the results) and then picks 4096 8k buffers and reads them in random order. The machines it's running on have a small raid array with 4 drives. Just seeking without any prefetch it takes about 12.5s on Linux and 13.5s on Solaris. If I prefetch even a single buffer using posix_fadvise or libaio I see a noticeable improvement, over 25%. At 128 buffers of prefetch both systems are down to about 2.5-2.7s. That's on the small raid array. On the boot both have a small beneficial effect but only at very large prefetch set sizes which I would chalk down to being able to re-order the reads even if it can't overlap them. I want to test how much of this effect evaporates when I compare it to a bitmap index style scan but that depends on a lot of factors like the exact pattern of file extensions on the database files. In any case bitmap index scans get us the reordering effect, but not the overlapping i/o requests assuming they're spread quite far apart in the data files. > I think this seems pretty impractical for regular (non-bitmap) index probes > though. You might be able to do it sometimes but not very effectively and you > won't know when it would be useful. How useful this is depends a lot on how invasively we let it infect things like regular index scans. If we can prefetch right siblings and deeper index pages as we descend an index tree and future heap pages it could help a lot as those aren't sorted like bitmap index scans. But even if we only fetch heap pages all together before processing the heap pages it could be a big help. Incidentally we do need to try to make use of both as Solaris doesn't have posix_fadvise as far as I can tell and Linux's libaio doesn't support non-O_DIRECT files. 
Raw data (elapsed time in seconds for 4096 random 8k reads of a 1G file):

Blocks        Linux            Solaris
Prefetched    posix_fadvise    libaio
---------------------------------------
     1        12.473           13.597
     2         9.053            9.830
     4         6.787            7.594
     8         5.303            6.588
    16         4.209            5.120
    32         3.388            4.014
    64         2.869            3.216
   128         2.515            2.710
   256         2.312            2.327
   512         2.168            2.099
  1024         2.139            1.974
  2048         2.242            1.903
  4096         2.222            1.890

-- 
Gregory Stark
EnterpriseDB                        http://www.enterprisedb.com