Re: Readahead - is 128KB the limit ?

mm/readahead.c has the logic for the ramp-up. It detects sequentiality:

http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/3318.html

On Sat, Sep 26, 2009 at 12:48 AM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
On Fri, Sep 25, 2009 at 11:29 PM, shailesh jain
<coolworldofshail@xxxxxxxxx> wrote:
> Yes I understand that. For random reads and other non-sequential
> workloads, the readahead logic will not ramp up to the max size anyway.
> What I want is to bump up the max size, so that when the kernel detects
> a sequential workload

it puzzles me how the kernel distinguishes between sequential and random
reads... does it actually detect and check that a series of reads is
contiguous? That doesn't seem sensible either. Read-ahead means reading
ahead of expectation, so by the time it has detected that a series of
reads is contiguous, it doesn't really qualify as "read-ahead" anymore.

anyway, I did an ftrace stack trace while reading /var/log/messages:

 => ext3_get_blocks_handle
 => ext3_get_block
 => do_mpage_readpage
 => mpage_readpages
 => ext3_readpages
 => __do_page_cache_readahead
 => ra_submit
 => filemap_fault
            head-25243 [000] 20698.351148: blk_queue_bounce <-__make_request
            head-25243 [000] 20698.351148: <stack trace>
 => __make_request
 => generic_make_request
 => submit_bio
 => mpage_bio_submit
 => do_mpage_readpage
 => mpage_readpages
 => ext3_readpages
 => __do_page_cache_readahead
            head-25243 [000] 20698.351159: blk_rq_init <-get_request
            head-25243 [000] 20698.351159: <stack trace>
 => get_request
 => get_request_wait
 => __make_request
 => generic_make_request
 => submit_bio
 => mpage_bio_submit
 => do_mpage_readpage
 => mpage_readpages

so from the above, we can guess that __do_page_cache_readahead() is the
key function involved:

cut and paste from mm/readahead.c (and read the comments below):

/*
 * __do_page_cache_readahead() actually reads a chunk of disk.  It
 * allocates all the pages first, then submits them all for I/O.  This
 * avoids the very bad behaviour which would occur if page allocations
 * are causing VM writeback.  We really don't want to intermingle reads
 * and writes like that.
 *
 * Returns the number of pages requested, or the maximum amount of I/O
 * allowed.
 *
 * do_page_cache_readahead() returns -1 if it encountered request queue
 * congestion.
 */
static int
__do_page_cache_readahead(struct address_space *mapping, struct file *filp,
			pgoff_t offset, unsigned long nr_to_read)
{
	struct inode *inode = mapping->host;
	struct page *page;
	unsigned long end_index;	/* The last page we want to read */
	LIST_HEAD(page_pool);
	int page_idx;
	int ret = 0;
	loff_t isize = i_size_read(inode);

	if (isize == 0)
		goto out;

	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);

	/*
	 * Preallocate as many pages as we will need.
	 */
	read_lock_irq(&mapping->tree_lock);
	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
		pgoff_t page_offset = offset + page_idx;

		if (page_offset > end_index)
			break;

		page = radix_tree_lookup(&mapping->page_tree, page_offset);
		if (page)
			continue;

		read_unlock_irq(&mapping->tree_lock);
		page = page_cache_alloc_cold(mapping);
		read_lock_irq(&mapping->tree_lock);
		if (!page)
			break;
		page->index = page_offset;
		list_add(&page->lru, &page_pool);
		ret++;
	}
	read_unlock_irq(&mapping->tree_lock);

	/*
	 * Now start the IO.  We ignore I/O errors - if the page is not
	 * uptodate then the caller will launch readpage again, and
	 * will then handle the error.
	 */
	if (ret)
		read_pages(mapping, filp, &page_pool, ret);
	BUG_ON(!list_empty(&page_pool));
out:
	return ret;
}

the HEART OF the algorithm is in the last few lines ---> read_pages();
there is no conditional logic in it, it just reads ahead blindly.

> it does not restrict itself to 32 pages.
>
> I looked around and saw an old patch that tried to account for the actual
> memory on the system and set max_readahead accordingly. Restricting to
> arbitrary limits -- for instance, think of a 512MB system vs a 4GB system
> -- is not sane IMO.
>
>

interesting.... can you share the link, so perhaps I can learn something?
 Thanks, pal!
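For what it's worth on the original question: on reasonably recent kernels the per-device maximum readahead is tunable from userspace, no filesystem module needed. A sketch, assuming the disk is /dev/sda (adjust the device name for your system):

```shell
# The classic default max readahead is 32 pages; with 4 KB pages that is:
echo "$((32 * 4096 / 1024)) KB"        # prints: 128 KB

# Per-device value, in 512-byte sectors (device name is an assumption):
DEV=/dev/sda
if [ -b "$DEV" ] && command -v blockdev >/dev/null; then
    blockdev --getra "$DEV"            # current readahead in sectors
    # blockdev --setra 512 "$DEV"      # bump it to 256 KB (needs root)
fi

# The same knob, in KB, via sysfs:
if [ -r /sys/block/sda/queue/read_ahead_kb ]; then
    cat /sys/block/sda/queue/read_ahead_kb
fi
```

As I understand it, the per-file maximum is derived from the backing device's ra_pages, which is what these knobs adjust; the sysfs route is the easy one for tuning experiments.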

>
>
> Shailesh Jain
>
>
> On Fri, Sep 25, 2009 at 6:00 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>
>> On Fri, Sep 25, 2009 at 12:05 AM, shailesh jain
>> <coolworldofshail@xxxxxxxxx> wrote:
>> > Hi,
>> >   Is the maximum limit of readahead 128KB ? ..  Can it be changed by FS
>> > kernel module ?
>> >
>> >
>> > Shailesh Jain
>> >
>>
>> not sure why you want to change that? For a specific performance
>> tuning scenario (lots of sequential reads)? This readahead feature is
>> useful only if you intend to read large files. But if you switch
>> between many small files, you defeat the purpose of readahead. I think
>> this is an OS-independent feature, which is tuned to the normal usage
>> of the filesystem.
>>
>> so, for example for AIX:
>>
>>
>> http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/seq_read_perf_tuning.htm
>>
>> their readahead is only (max) 16 x pagesize. Not sure how big that is,
>> but our 128KB should be > 16 x pagesize (how big is our IO blocksize
>> anyway?)
>>
>> for another reputable reference:
>>
>> http://www.dba-oracle.com/t_read_ahead_cache_windows.htm
>>
>> (in Oracle database).
>>
>> The problem is that if you read ahead too much, and the entire buffer
>> then gets thrown away unused, a lot of time is wasted on reading
>> ahead.
>>
>> --
>> Regards,
>> Peter Teoh
>
>



--
Regards,
Peter Teoh

