On Sat, Sep 26, 2009 at 12:53 PM, shailesh jain <coolworldofshail@xxxxxxxxx> wrote:
> mm/readahead.c has the logic for ramp-up. It detects sequentiality..
>
> http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/3318.html

I see, that patch tried to scale the readahead window size based on the
total amount of memory. Sounds logical, but then I think he did not
receive any ACK for the patch? The discussion thread I saw is here:

http://www.gossamer-threads.com/lists/linux/kernel/798505?search_string=readahead%3A%20scale%20max%20readahead%20size%20depending%20on%20memory%20size;#798505

Worth noting from the discussion is how much of the readahead data goes
unused; for a 1024K window the miss rate is as high as 49%:

readahead size   miss
128K             38%
512K             45%
1024K            49%
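Just to make the idea concrete, I imagine the scaling in that patch
looked roughly like this (my own sketch based on the description, not
the actual patch; the function name and the constants are made up):

/* Sketch only: pick a max readahead size from the machine's RAM,
 * clamped to sane bounds. Purely illustrative. */
static unsigned long max_readahead_kb(unsigned long ram_mb)
{
        unsigned long ra_kb = (ram_mb / 512) * 128; /* 128K per 512MB */

        if (ra_kb < 128)        /* never go below the current default */
                ra_kb = 128;
        if (ra_kb > 1024)       /* cap it: the miss rate at 1024K is ~49% */
                ra_kb = 1024;
        return ra_kb;
}

So a 512MB box would stay at today's 128K, while a 4GB box would hit
the 1024K cap -- which is exactly where the miss-rate numbers above
start to look worrying.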
>
> On Sat, Sep 26, 2009 at 12:48 AM, Peter Teoh <htmldeveloper@xxxxxxxxx>
> wrote:
>>
>> On Fri, Sep 25, 2009 at 11:29 PM, shailesh jain
>> <coolworldofshail@xxxxxxxxx> wrote:
>> > Yes, I understand that. For random reads and other non-sequential
>> > workloads, the readahead logic will not ramp up to the max size
>> > anyway. What I want is to bump up the max size, so that when the
>> > kernel detects a sequential workload
>>
>> It puzzles me how to distinguish between a sequential and a random
>> read... does the kernel actually detect and check that a series of
>> reads is contiguous? That does not seem sensible either: readahead
>> means reading ahead of expectation, so by the time it has detected
>> and checked that a series of reads is contiguous, it hardly
>> qualifies as "read-ahead" anymore.
>>
>> Anyway, I did an ftrace stack trace while reading /var/log/messages:
>>
>> => ext3_get_blocks_handle
>> => ext3_get_block
>> => do_mpage_readpage
>> => mpage_readpages
>> => ext3_readpages
>> => __do_page_cache_readahead
>> => ra_submit
>> => filemap_fault
>> head-25243 [000] 20698.351148: blk_queue_bounce <-__make_request
>> head-25243 [000] 20698.351148: <stack trace>
>> => __make_request
>> => generic_make_request
>> => submit_bio
>> => mpage_bio_submit
>> => do_mpage_readpage
>> => mpage_readpages
>> => ext3_readpages
>> => __do_page_cache_readahead
>> head-25243 [000] 20698.351159: blk_rq_init <-get_request
>> head-25243 [000] 20698.351159: <stack trace>
>> => get_request
>> => get_request_wait
>> => __make_request
>> => generic_make_request
>> => submit_bio
>> => mpage_bio_submit
>> => do_mpage_readpage
>> => mpage_readpages
>>
>> So from the above we can guess that __do_page_cache_readahead() is
>> the key function involved. Cut and paste from mm/readahead.c (read
>> the comments):
>>
>> /*
>>  * do_page_cache_readahead actually reads a chunk of disk.  It
>>  * allocates all the pages first, then submits them all for I/O.
>>  * This avoids the very bad behaviour which would occur if page
>>  * allocations are causing VM writeback.  We really don't want to
>>  * intermingle reads and writes like that.
>>  *
>>  * Returns the number of pages requested, or the maximum amount of
>>  * I/O allowed.
>>  *
>>  * do_page_cache_readahead() returns -1 if it encountered request
>>  * queue congestion.
>>  */
>> static int
>> __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
>>                         pgoff_t offset, unsigned long nr_to_read)
>> {
>>         struct inode *inode = mapping->host;
>>         struct page *page;
>>         unsigned long end_index;        /* The last page we want to read */
>>         LIST_HEAD(page_pool);
>>         int page_idx;
>>         int ret = 0;
>>         loff_t isize = i_size_read(inode);
>>
>>         if (isize == 0)
>>                 goto out;
>>
>>         end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
>>
>>         /*
>>          * Preallocate as many pages as we will need.
>>          */
>>         read_lock_irq(&mapping->tree_lock);
>>         for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
>>                 pgoff_t page_offset = offset + page_idx;
>>
>>                 if (page_offset > end_index)
>>                         break;
>>
>>                 page = radix_tree_lookup(&mapping->page_tree, page_offset);
>>                 if (page)
>>                         continue;
>>
>>                 read_unlock_irq(&mapping->tree_lock);
>>                 page = page_cache_alloc_cold(mapping);
>>                 read_lock_irq(&mapping->tree_lock);
>>                 if (!page)
>>                         break;
>>                 page->index = page_offset;
>>                 list_add(&page->lru, &page_pool);
>>                 ret++;
>>         }
>>         read_unlock_irq(&mapping->tree_lock);
>>
>>         /*
>>          * Now start the IO.  We ignore I/O errors - if the page is
>>          * not uptodate then the caller will launch readpage again,
>>          * and will then handle the error.
>>          */
>>         if (ret)
>>                 read_pages(mapping, filp, &page_pool, ret);
>>         BUG_ON(!list_empty(&page_pool));
>> out:
>>         return ret;
>> }
>>
>> The heart of the algorithm is the last few lines ---> read_pages(),
>> and there is no conditional logic in it; it just reads ahead blindly.
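>>
>> Thinking about it more, I believe the sequential-vs-random decision
>> is made one level up, before __do_page_cache_readahead() is ever
>> reached: the on-demand readahead code compares the offset of the
>> current read against where the previous readahead window ended. A
>> much-simplified sketch of that idea (my own illustration, not the
>> real kernel code; the struct and field names are made up):
>>
>> struct ra_state {
>>         unsigned long prev_start;       /* start of the last window */
>>         unsigned long prev_size;        /* pages in the last window */
>> };
>>
>> static unsigned long next_window(struct ra_state *ra,
>>                                  unsigned long offset,
>>                                  unsigned long max_pages)
>> {
>>         /* sequential: this read starts where the last window ended */
>>         if (offset == ra->prev_start + ra->prev_size) {
>>                 unsigned long size = ra->prev_size * 2; /* ramp up */
>>
>>                 if (size > max_pages)
>>                         size = max_pages;
>>                 ra->prev_start = offset;
>>                 ra->prev_size = size;
>>                 return size;
>>         }
>>         /* random: fall back to a small initial window */
>>         ra->prev_start = offset;
>>         ra->prev_size = 4;
>>         return ra->prev_size;
>> }
>>
>> That would fit what shailesh said at the top: the ramp-up (and hence
>> the max size, max_pages here) only ever matters once the reads turn
>> out to be contiguous.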
>>
>> > it does not restrict itself to 32 pages.
>> >
>> > I looked around and saw an old patch that tried to account for the
>> > actual memory on the system and set max_readahead according to
>> > that. Restricting to arbitrary limits -- for instance, think of a
>> > 512MB system vs a 4GB system -- is not sane IMO.
>>
>> Interesting... can you share the link so perhaps I can learn
>> something? Thanks, pal!
>>
>> >
>> > Shailesh Jain
>> >
>> > On Fri, Sep 25, 2009 at 6:00 PM, Peter Teoh <htmldeveloper@xxxxxxxxx>
>> > wrote:
>> >>
>> >> On Fri, Sep 25, 2009 at 12:05 AM, shailesh jain
>> >> <coolworldofshail@xxxxxxxxx> wrote:
>> >> > Hi,
>> >> > Is the maximum limit of readahead 128KB? Can it be changed by an
>> >> > FS kernel module?
>> >> >
>> >> > Shailesh Jain
>> >>
>> >> Not sure why you want to change that? For a specific performance
>> >> tuning scenario (lots of sequential reads)? This readahead feature
>> >> is useful only if you intend to read large files. If you switch to
>> >> a different workload, say many small files, you defeat the purpose
>> >> of readahead. I think this is an OS-independent feature,
>> >> specifically tuned to the normal usage of the filesystem.
>> >>
>> >> So, for example, for AIX:
>> >>
>> >> http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/seq_read_perf_tuning.htm
>> >>
>> >> Their readahead is only (at most) 16 x pagesize. Not sure how big
>> >> that is, but our 128KB should be more than 16 x pagesize (how big
>> >> is our I/O blocksize anyway?).
>> >>
>> >> For another reputable reference (for the Oracle database):
>> >>
>> >> http://www.dba-oracle.com/t_read_ahead_cache_windows.htm
>> >>
>> >> The problem is that if you read ahead too much, and the entire
>> >> buffer is then thrown away unused, a lot of time is wasted on the
>> >> readahead.
>> >>
>> >> --
>> >> Regards,
>> >> Peter Teoh
>>
>> --
>> Regards,
>> Peter Teoh

--
Regards,
Peter Teoh
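P.S. For anyone who just wants a bigger readahead window without
patching the kernel: the per-device default can be changed from
userspace (e.g. "blockdev --setra" on the block device, or
/sys/block/<dev>/queue/read_ahead_kb), and a program can hint its own
access pattern with posix_fadvise(). A minimal sketch -- the file path
is just an example:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/var/log/messages", O_RDONLY);
        char buf[4096];
        int err;

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* Tell the kernel we will read sequentially; it may enlarge
         * the readahead window for this file descriptor. */
        err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        if (err != 0)   /* returns the error number directly */
                fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

        while (read(fd, buf, sizeof(buf)) > 0)
                ;       /* ... consume the data ... */

        close(fd);
        return 0;
}

The flip side is POSIX_FADV_RANDOM, which disables readahead for the
fd -- useful exactly in the "buffer thrown away unused" case above.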