http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/3318.html
On Sat, Sep 26, 2009 at 12:48 AM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
On Fri, Sep 25, 2009 at 11:29 PM, shailesh jain
<coolworldofshail@xxxxxxxxx> wrote:
> Yes I understand that. For random reads and other non-sequential
> workloads, the readahead logic will not ramp up to the max size
> anyway. What I want is to bump up the max size, so that when the
> kernel detects a sequential workload
it puzzles me how one would distinguish between sequential and random
reads... does the kernel actually detect and check that a series of
reads is contiguous? that does not seem sensible either: read-ahead
means reading ahead of expectation, so by the time it has detected
that the series of reads is contiguous, it no longer really qualifies
as "read-ahead".
anyway, i did an ftrace stack trace while reading /var/log/messages:
1197 => ext3_get_blocks_handle
1198 => ext3_get_block
1199 => do_mpage_readpage
1200 => mpage_readpages
1201 => ext3_readpages
1202 => __do_page_cache_readahead
1203 => ra_submit
1204 => filemap_fault
1205 head-25243 [000] 20698.351148: blk_queue_bounce <-__make_request
1206 head-25243 [000] 20698.351148: <stack trace>
1207 => __make_request
1208 => generic_make_request
1209 => submit_bio
1210 => mpage_bio_submit
1211 => do_mpage_readpage
1212 => mpage_readpages
1213 => ext3_readpages
1214 => __do_page_cache_readahead
1215 head-25243 [000] 20698.351159: blk_rq_init <-get_request
1216 head-25243 [000] 20698.351159: <stack trace>
1217 => get_request
1218 => get_request_wait
1219 => __make_request
1220 => generic_make_request
1221 => submit_bio
1222 => mpage_bio_submit
1223 => do_mpage_readpage
1224 => mpage_readpages
so from the above, we can guess that __do_page_cache_readahead() is the
key function involved. cut-and-pasted below (read the comments):
/*
 * do_page_cache_readahead actually reads a chunk of disk.  It allocates all
 * the pages first, then submits them all for I/O.  This avoids the very bad
 * behaviour which would occur if page allocations are causing VM writeback.
 * We really don't want to intermingle reads and writes like that.
 *
 * Returns the number of pages requested, or the maximum amount of I/O allowed.
 *
 * do_page_cache_readahead() returns -1 if it encountered request queue
 * congestion.
 */
static int
__do_page_cache_readahead(struct address_space *mapping, struct file *filp,
			pgoff_t offset, unsigned long nr_to_read)
{
	struct inode *inode = mapping->host;
	struct page *page;
	unsigned long end_index;	/* The last page we want to read */
	LIST_HEAD(page_pool);
	int page_idx;
	int ret = 0;
	loff_t isize = i_size_read(inode);

	if (isize == 0)
		goto out;

	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);

	/*
	 * Preallocate as many pages as we will need.
	 */
	read_lock_irq(&mapping->tree_lock);
	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
		pgoff_t page_offset = offset + page_idx;

		if (page_offset > end_index)
			break;

		page = radix_tree_lookup(&mapping->page_tree, page_offset);
		if (page)
			continue;

		read_unlock_irq(&mapping->tree_lock);
		page = page_cache_alloc_cold(mapping);
		read_lock_irq(&mapping->tree_lock);
		if (!page)
			break;
		page->index = page_offset;
		list_add(&page->lru, &page_pool);
		ret++;
	}
	read_unlock_irq(&mapping->tree_lock);

	/*
	 * Now start the IO.  We ignore I/O errors - if the page is not
	 * uptodate then the caller will launch readpage again, and
	 * will then handle the error.
	 */
	if (ret)
		read_pages(mapping, filp, &page_pool, ret);
	BUG_ON(!list_empty(&page_pool));
out:
	return ret;
}
the HEART OF the algo is the last few lines --> read_pages(), and there
is no conditional logic in it; it just reads ahead blindly.
interesting... can you share the link, so perhaps i can learn something?
> it does not restrict itself to 32 pages.
>
> I looked around and saw an old patch that tried to account for the actual
> memory on the system and set max_readahead accordingly. Restricting it to
> arbitrary limits -- for instance, think of a 512MB system vs a 4GB system
> -- is not sane IMO.
>
thanks pal!!!
--
>
>
> Shailesh Jain
>
>
> On Fri, Sep 25, 2009 at 6:00 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>
>> On Fri, Sep 25, 2009 at 12:05 AM, shailesh jain
>> <coolworldofshail@xxxxxxxxx> wrote:
>> > Hi,
>> > Is the maximum limit of readahead 128KB ? .. Can it be changed by FS
>> > kernel module ?
>> >
>> >
>> > Shailesh Jain
>> >
>>
>> not sure why you want to change that? for a specific performance
>> tuning scenario (lots of sequential reads)? this readahead feature is
>> useful only if you are intending to read large files. But if you
>> switch to different files, say many small files, you defeat the
>> purpose of readahead. i think this is an OS-independent feature,
>> which is specifically tuned to the normal usage of the filesystem.
>>
>> so, for example for AIX:
>>
>>
>> http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/seq_read_perf_tuning.htm
>>
>> their readahead is only 16 x pagesize at most. not sure how big that
>> is, but our 128KB should be > 16 x pagesize (how big is our I/O
>> block size anyway?)
>>
>> for another reputable references:
>>
>> http://www.dba-oracle.com/t_read_ahead_cache_windows.htm
>>
>> (in Oracle database).
>>
>> The problem is that if you read ahead too much, and the entire
>> buffer then gets thrown away unused, a lot of time is wasted on the
>> readahead itself.
>>
>> --
>> Regards,
>> Peter Teoh
>
>
Regards,
Peter Teoh