Re: Readahead - is 128KB the limit ?

On Fri, Sep 25, 2009 at 11:29 PM, shailesh jain
<coolworldofshail@xxxxxxxxx> wrote:
> Yes I understand that. For random reads and other non-sequential
> workloads, the readahead logic will not ramp up to the max size anyway.
> What I want is to bump up the max size, so that when the kernel detects
> a sequential workload

it puzzles me how to distinguish between sequential and random
reads......does the kernel actually detect and check that a series of
reads is contiguous?   that doesn't seem sensible either.   read-ahead
means reading ahead of expectation, so by the time it has detected and
checked that the series of reads is contiguous, it hardly qualifies as
"read-ahead" anymore.

anyway, i did an ftrace stack trace for reading /var/log/messages:

   1197  => ext3_get_blocks_handle
   1198  => ext3_get_block
   1199  => do_mpage_readpage
   1200  => mpage_readpages
   1201  => ext3_readpages
   1202  => __do_page_cache_readahead
   1203  => ra_submit
   1204  => filemap_fault
   1205             head-25243 [000] 20698.351148: blk_queue_bounce <-__make_request
   1206             head-25243 [000] 20698.351148: <stack trace>
   1207  => __make_request
   1208  => generic_make_request
   1209  => submit_bio
   1210  => mpage_bio_submit
   1211  => do_mpage_readpage
   1212  => mpage_readpages
   1213  => ext3_readpages
   1214  => __do_page_cache_readahead
   1215             head-25243 [000] 20698.351159: blk_rq_init <-get_request
   1216             head-25243 [000] 20698.351159: <stack trace>
   1217  => get_request
   1218  => get_request_wait
   1219  => __make_request
   1220  => generic_make_request
   1221  => submit_bio
   1222  => mpage_bio_submit
   1223  => do_mpage_readpage
   1224  => mpage_readpages

so from the above, we can guess __do_page_cache_readahead() is the key
function involved:

cut and pasted below (and read the comments in it):


/*
 * do_page_cache_readahead actually reads a chunk of disk.  It allocates all
 * the pages first, then submits them all for I/O. This avoids the very bad
 * behaviour which would occur if page allocations are causing VM writeback.
 * We really don't want to intermingle reads and writes like that.
 *
 * Returns the number of pages requested, or the maximum amount of I/O allowed.
 *
 * do_page_cache_readahead() returns -1 if it encountered request queue
 * congestion.
 */
static int
__do_page_cache_readahead(struct address_space *mapping, struct file *filp,
			pgoff_t offset, unsigned long nr_to_read)
{
	struct inode *inode = mapping->host;
	struct page *page;
	unsigned long end_index;	/* The last page we want to read */
	LIST_HEAD(page_pool);
	int page_idx;
	int ret = 0;
	loff_t isize = i_size_read(inode);

	if (isize == 0)
		goto out;

	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);

	/*
	 * Preallocate as many pages as we will need.
	 */
	read_lock_irq(&mapping->tree_lock);
	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
		pgoff_t page_offset = offset + page_idx;

		if (page_offset > end_index)
			break;

		page = radix_tree_lookup(&mapping->page_tree, page_offset);
		if (page)
			continue;

		read_unlock_irq(&mapping->tree_lock);
		page = page_cache_alloc_cold(mapping);
		read_lock_irq(&mapping->tree_lock);
		if (!page)
			break;
		page->index = page_offset;
		list_add(&page->lru, &page_pool);
		ret++;
	}
	read_unlock_irq(&mapping->tree_lock);

	/*
	 * Now start the IO.  We ignore I/O errors - if the page is not
	 * uptodate then the caller will launch readpage again, and
	 * will then handle the error.
	 */
	if (ret)
		read_pages(mapping, filp, &page_pool, ret);
	BUG_ON(!list_empty(&page_pool));
out:
	return ret;
}

the HEART OF the algo is in the last few lines ---> read_pages(), and there
is no conditional logic in it, it just reads ahead blindly.
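
from memory (paraphrased, please check mm/readahead.c for the exact code),
read_pages() just hands the whole batch to the filesystem's ->readpages()
if it provides one -- that is the ext3_readpages() -> mpage_readpages()
path in the trace above -- otherwise it falls back to ->readpage() one page
at a time:

/* paraphrased from memory, not an exact copy of mm/readahead.c */
static int read_pages(struct address_space *mapping, struct file *filp,
		      struct list_head *pages, unsigned nr_pages)
{
	unsigned page_idx;

	/* batch path: the fs submits the whole list itself */
	if (mapping->a_ops->readpages) {
		int ret = mapping->a_ops->readpages(filp, mapping,
						    pages, nr_pages);
		put_pages_list(pages);	/* free whatever the fs did not take */
		return ret;
	}

	/* fallback: add each page to the cache and submit it one by one */
	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
		struct page *page = list_entry(pages->prev, struct page, lru);

		list_del(&page->lru);
		if (!add_to_page_cache_lru(page, mapping, page->index,
					   GFP_KERNEL))
			mapping->a_ops->readpage(filp, page);
		page_cache_release(page);
	}
	return 0;
}

so whatever window size the caller decided on is what gets submitted here;
i believe the ramping up/down happens earlier, in ondemand_readahead()
(reached via page_cache_sync_readahead() / ra_submit(), which you can see
in the filemap_fault stack above).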

> it does not restrict itself to 32 pages.
>
> I looked around and saw an old patch that tried to account for the actual
> memory on the system and set max_readahead according to that. Restricting
> to an arbitrary limit -- think of a 512MB system vs a 4GB system -- is not
> sane IMO.
>

interesting....can you share the link so perhaps i can learn something?
thanks pal!!!
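
btw, on the original question of bumping the max: from userspace i think
you can already hint the kernel per file with posix_fadvise() /
readahead(2) -- rough sketch below, not tested, error handling mostly
omitted -- and per device i believe the knob is
/sys/block/<dev>/queue/read_ahead_kb (or blockdev --setra), which should be
where the 128KB default shows up:

/* rough userspace sketch, not tested: hint sequential access and
 * explicitly pre-read a chunk into the page cache.  as far as i know
 * POSIX_FADV_SEQUENTIAL lets the kernel use a bigger readahead window
 * for this file, and readahead(2) populates the page cache for the
 * given range right away. */
#define _GNU_SOURCE		/* for readahead(2) */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* tell the VM we expect to read this file sequentially */
	posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

	/* ask for the first 1MB to be read into the page cache now */
	readahead(fd, 0, 1024 * 1024);

	/* ... normal read() loop would go here ... */
	close(fd);
	return 0;
}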

>
>
> Shailesh Jain
>
>
> On Fri, Sep 25, 2009 at 6:00 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>
>> On Fri, Sep 25, 2009 at 12:05 AM, shailesh jain
>> <coolworldofshail@xxxxxxxxx> wrote:
>> > Hi,
>> >   Is the maximum limit of readahead 128KB ? ..  Can it be changed by FS
>> > kernel module ?
>> >
>> >
>> > Shailesh Jain
>> >
>>
>> not sure why you want to change that?   for a specific performance
>> tuning scenario (lots of sequential reads)?   this readahead feature is
>> useful only if you intend to read large files.   But if you switch to
>> different files, say many small files, you defeat the purpose of
>> readahead.   i think this is an OS-independent feature, which is
>> specifically tuned to the normal usage of the filesystem.
>>
>> so, for example for AIX:
>>
>>
>> http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/seq_read_perf_tuning.htm
>>
>> their readahead is only (max) 16 x pagesize.   not sure how big that is,
>> but our 128KB should be > 16 x pagesize (how big is our IO blocksize
>> anyway?)
>>
>> for another reputable references:
>>
>> http://www.dba-oracle.com/t_read_ahead_cache_windows.htm
>>
>> (in Oracle database).
>>
>> The problem is that if you read ahead too much, and the entire buffer
>> then gets thrown away unused, a lot of time is wasted on the
>> read-ahead.
>>
>> --
>> Regards,
>> Peter Teoh
>
>



-- 
Regards,
Peter Teoh



