http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/3318.html
On Sat, Sep 26, 2009 at 12:48 AM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
On Fri, Sep 25, 2009 at 11:29 PM, shailesh jain
<coolworldofshail@xxxxxxxxx> wrote:
> Yes I understand that. For random reads and other non-sequential
> workloads, the readahead logic will not ramp up to the max size
> anyway. What I want is to bump up the max size, so that when the
> kernel detects a sequential workload
it puzzles me how one would distinguish between sequential and random
reads... does the kernel actually detect and check that a series of
reads is contiguous? that does not seem sensible either: read-ahead
means reading ahead of expectation, so by the time it has detected
that the series of reads is contiguous, it no longer really qualifies
as "read-ahead".
anyway, i did an ftrace stack trace while reading /var/log/messages:
1197 => ext3_get_blocks_handle
1198 => ext3_get_block
1199 => do_mpage_readpage
1200 => mpage_readpages
1201 => ext3_readpages
1202 => __do_page_cache_readahead
1203 => ra_submit
1204 => filemap_fault
1205 head-25243 [000] 20698.351148: blk_queue_bounce <-__make_request
1206 head-25243 [000] 20698.351148: <stack trace>
1207 => __make_request
1208 => generic_make_request
1209 => submit_bio
1210 => mpage_bio_submit
1211 => do_mpage_readpage
1212 => mpage_readpages
1213 => ext3_readpages
1214 => __do_page_cache_readahead
1215 head-25243 [000] 20698.351159: blk_rq_init <-get_request
1216 head-25243 [000] 20698.351159: <stack trace>
1217 => get_request
1218 => get_request_wait
1219 => __make_request
1220 => generic_make_request
1221 => submit_bio
1222 => mpage_bio_submit
1223 => do_mpage_readpage
1224 => mpage_readpages
so from the above, we can guess that __do_page_cache_readahead() is the
key function involved. cut-and-pasted below (read the comments):
/*
 * do_page_cache_readahead actually reads a chunk of disk.  It allocates all
 * the pages first, then submits them all for I/O.  This avoids the very bad
 * behaviour which would occur if page allocations are causing VM writeback.
 * We really don't want to intermingle reads and writes like that.
 *
 * Returns the number of pages requested, or the maximum amount of I/O allowed.
 *
 * do_page_cache_readahead() returns -1 if it encountered request queue
 * congestion.
 */
static int
__do_page_cache_readahead(struct address_space *mapping, struct file *filp,
			pgoff_t offset, unsigned long nr_to_read)
{
	struct inode *inode = mapping->host;
	struct page *page;
	unsigned long end_index;	/* The last page we want to read */
	LIST_HEAD(page_pool);
	int page_idx;
	int ret = 0;
	loff_t isize = i_size_read(inode);

	if (isize == 0)
		goto out;

	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);

	/*
	 * Preallocate as many pages as we will need.
	 */
	read_lock_irq(&mapping->tree_lock);
	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
		pgoff_t page_offset = offset + page_idx;

		if (page_offset > end_index)
			break;

		page = radix_tree_lookup(&mapping->page_tree, page_offset);
		if (page)
			continue;

		read_unlock_irq(&mapping->tree_lock);
		page = page_cache_alloc_cold(mapping);
		read_lock_irq(&mapping->tree_lock);
		if (!page)
			break;
		page->index = page_offset;
		list_add(&page->lru, &page_pool);
		ret++;
	}
	read_unlock_irq(&mapping->tree_lock);

	/*
	 * Now start the IO.  We ignore I/O errors - if the page is not
	 * uptodate then the caller will launch readpage again, and
	 * will then handle the error.
	 */
	if (ret)
		read_pages(mapping, filp, &page_pool, ret);
	BUG_ON(!list_empty(&page_pool));
out:
	return ret;
}
the HEART OF the algo is the last few lines --> read_pages(), and there
is no conditional logic in it; it just reads ahead blindly.
interesting... can you share the link, so perhaps i can learn something?
> it does not restrict itself to 32 pages.
>
> I looked around and saw an old patch that tried to account for the actual
> memory on the system and set max_readahead accordingly. Restricting it to
> arbitrary limits -- for instance, think of a 512MB system vs a 4GB system
> -- is not sane IMO.
>
thanks pal!!!
--
>
>
> Shailesh Jain
>
>
> On Fri, Sep 25, 2009 at 6:00 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>
>> On Fri, Sep 25, 2009 at 12:05 AM, shailesh jain
>> <coolworldofshail@xxxxxxxxx> wrote:
>> > Hi,
>> > Is the maximum limit of readahead 128KB ? .. Can it be changed by FS
>> > kernel module ?
>> >
>> >
>> > Shailesh Jain
>> >
>>
>> not sure why you want to change that? for a specific performance
>> tuning scenario (lots of sequential reads)? this readahead feature is
>> useful only if you are intending to read large files. But if you
>> switch to different files, say many small files, you defeat the
>> purpose of readahead. i think this is an OS-independent feature,
>> which is specifically tuned to the normal usage of the filesystem.
>>
>> so, for example for AIX:
>>
>>
>> http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/seq_read_perf_tuning.htm
>>
>> their readahead is only 16 x pagesize at most. not sure how big that
>> is, but our 128KB should be > 16 x pagesize (how big is our I/O
>> block size anyway?)
>>
>> for another reputable references:
>>
>> http://www.dba-oracle.com/t_read_ahead_cache_windows.htm
>>
>> (in Oracle database).
>>
>> The problem is that if you read ahead too much, and the entire
>> buffer then gets thrown away unused, a lot of time is wasted on the
>> readahead itself.
>>
>> --
>> Regards,
>> Peter Teoh
>
>
Regards,
Peter Teoh