On Fri, Sep 25, 2009 at 11:29 PM, shailesh jain <coolworldofshail@xxxxxxxxx> wrote:
> Yes I understand that. In cases of random reads and other non-sequential
> workloads the readahead logic will not ramp up to max size anyway. What I
> want is to bump up the max size, so that when the kernel detects a
> sequential workload

it puzzles me how the kernel distinguishes between sequential and random
reads...does it actually detect and check that a series of reads is
contiguous?  that does not seem sensible either: read-ahead means reading
ahead of expectation, so by the time it has detected and checked that the
series of reads is contiguous, it hardly qualifies as "read-ahead" anymore.

anyway, i did an ftrace stacktrace for reading /var/log/messages:

 => ext3_get_blocks_handle
 => ext3_get_block
 => do_mpage_readpage
 => mpage_readpages
 => ext3_readpages
 => __do_page_cache_readahead
 => ra_submit
 => filemap_fault
 head-25243 [000] 20698.351148: blk_queue_bounce <-__make_request
 head-25243 [000] 20698.351148: <stack trace>
 => __make_request
 => generic_make_request
 => submit_bio
 => mpage_bio_submit
 => do_mpage_readpage
 => mpage_readpages
 => ext3_readpages
 => __do_page_cache_readahead
 head-25243 [000] 20698.351159: blk_rq_init <-get_request
 head-25243 [000] 20698.351159: <stack trace>
 => get_request
 => get_request_wait
 => __make_request
 => generic_make_request
 => submit_bio
 => mpage_bio_submit
 => do_mpage_readpage
 => mpage_readpages

so from above, we can guess that __do_page_cache_readahead() is the key
function involved.  cut and paste (and read the comments below):

/*
 * do_page_cache_readahead actually reads a chunk of disk.  It allocates all
 * the pages first, then submits them all for I/O.  This avoids the very bad
 * behaviour which would occur if page allocations are causing VM writeback.
 * We really don't want to intermingle reads and writes like that.
 *
 * Returns the number of pages requested, or the maximum amount of I/O allowed.
 *
 * do_page_cache_readahead() returns -1 if it encountered request queue
 * congestion.
 */
static int
__do_page_cache_readahead(struct address_space *mapping, struct file *filp,
                        pgoff_t offset, unsigned long nr_to_read)
{
        struct inode *inode = mapping->host;
        struct page *page;
        unsigned long end_index;        /* The last page we want to read */
        LIST_HEAD(page_pool);
        int page_idx;
        int ret = 0;
        loff_t isize = i_size_read(inode);

        if (isize == 0)
                goto out;

        end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);

        /*
         * Preallocate as many pages as we will need.
         */
        read_lock_irq(&mapping->tree_lock);
        for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
                pgoff_t page_offset = offset + page_idx;

                if (page_offset > end_index)
                        break;

                page = radix_tree_lookup(&mapping->page_tree, page_offset);
                if (page)
                        continue;

                read_unlock_irq(&mapping->tree_lock);
                page = page_cache_alloc_cold(mapping);
                read_lock_irq(&mapping->tree_lock);
                if (!page)
                        break;
                page->index = page_offset;
                list_add(&page->lru, &page_pool);
                ret++;
        }
        read_unlock_irq(&mapping->tree_lock);

        /*
         * Now start the IO.  We ignore I/O errors - if the page is not
         * uptodate then the caller will launch readpage again, and
         * will then handle the error.
         */
        if (ret)
                read_pages(mapping, filp, &page_pool, ret);
        BUG_ON(!list_empty(&page_pool));
out:
        return ret;
}

the HEART OF the algo is the last few lines--->read_pages(), and there is
no conditional logic in it; it just reads ahead blindly.
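so my guess (not verified against the actual mm/readahead.c, take it as a
sketch only) is that any sequential-vs-random decision has to live one
level up in that stack, above __do_page_cache_readahead()/ra_submit(), in
whatever decides nr_to_read: it would remember the last readahead window
per open file, and if the next read starts exactly where that window ended
it assumes a sequential reader and grows the window, otherwise it resets
to a small one.  roughly something like this (struct and field names are
made up for illustration, only loosely modelled on file_ra_state; compiles
and runs in userspace just to show the ramp-up):

#include <stdio.h>

/*
 * Sketch only -- NOT the real kernel code.  Guess: the caller of
 * __do_page_cache_readahead() remembers the previous readahead window
 * (per open file), and if the next read starts exactly where that window
 * ended it assumes a sequential reader and doubles the window, clamped
 * to the max.  Anything else resets to a small initial window.
 */
struct ra_state {               /* hypothetical, loosely modelled on file_ra_state */
        unsigned long start;    /* first page of the previous window */
        unsigned long size;     /* pages in that window */
        unsigned long max;      /* cap, e.g. 128KB / 4KB = 32 pages */
};

static unsigned long next_window(struct ra_state *ra, unsigned long offset)
{
        if (ra->size && offset == ra->start + ra->size)
                /* continues where the last window ended -> looks sequential */
                ra->size = (2 * ra->size > ra->max) ? ra->max : 2 * ra->size;
        else
                /* random-looking access -> restart with a small window */
                ra->size = 4;
        ra->start = offset;
        return ra->size;
}

int main(void)
{
        struct ra_state ra = { 0, 0, 32 };      /* 32 pages = 128KB max */
        unsigned long off = 0;
        int i;

        for (i = 0; i < 6; i++) {               /* simulate a sequential reader */
                unsigned long n = next_window(&ra, off);
                printf("read at page %3lu -> readahead window %2lu pages\n", off, n);
                off += n;
        }
        return 0;
}

if that guess is roughly right, then the "max size" u want to bump is just
the clamp value above (128KB worth of pages by default), and raising it
only changes behaviour once the window has already ramped up.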
> it does not restrict itself to 32 pages.
>
> I looked around and saw an old patch that tried to account for the actual
> memory on the system and set max_readahead according to that. Restricting
> it to arbitrary limits -- think of a 512MB system vs a 4GB system, for
> instance -- is not sane IMO.
>

interesting....can u share the link so perhaps i can learn something?
thanks pal!!!

>
>
> Shailesh Jain
>
>
> On Fri, Sep 25, 2009 at 6:00 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>
>> On Fri, Sep 25, 2009 at 12:05 AM, shailesh jain
>> <coolworldofshail@xxxxxxxxx> wrote:
>> > Hi,
>> > Is the maximum limit of readahead 128KB?  Can it be changed by an FS
>> > kernel module?
>> >
>> >
>> > Shailesh Jain
>> >
>>
>> not sure why u want to change that?  for a specific performance-tuning
>> scenario (lots of sequential reads)?  this readahead feature is useful
>> only if u are intending to read large files.  but if u switch to
>> different files, say many small files, u defeat the purpose of
>> readahead.  i think this is an OS-independent feature, which is
>> specifically tuned to the normal usage of the filesystem.
>>
>> so, for example for AIX:
>>
>> http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/seq_read_perf_tuning.htm
>>
>> their readahead is only (max) 16 x pagesize.  not sure how big that is,
>> but our 128KB should be > 16 x pagesize (how big is our IO blocksize
>> anyway?)
>>
>> for another reputable reference:
>>
>> http://www.dba-oracle.com/t_read_ahead_cache_windows.htm
>>
>> (in the Oracle database).
>>
>> The problem is that if u read ahead too much, and after that the entire
>> buffer gets thrown away unused, then a lot of time is wasted in reading
>> ahead.
>>
>> --
>> Regards,
>> Peter Teoh
>
>

--
Regards,
Peter Teoh
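P.S. one more thought -- if the goal is simply more readahead for one
particular sequential file (rather than raising the global 128KB default),
i believe u can also do it from userspace without touching the kernel: the
per-device default can be changed via /sys/block/<dev>/queue/read_ahead_kb
(or blockdev --setra), and per open file u can hint the kernel with
posix_fadvise().  a small sketch of the latter (untested here, and it just
reuses /var/log/messages as the example file):

#define _XOPEN_SOURCE 600       /* for posix_fadvise() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "/var/log/messages";
        int fd, err;

        fd = open(path, O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* hint: this file will be read sequentially, so the kernel may use
         * a more aggressive readahead window for this file descriptor */
        err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        if (err)
                fprintf(stderr, "fadvise(SEQUENTIAL): %s\n", strerror(err));

        /* hint: we will need the first 1MB soon -- start reading it into
         * the page cache now */
        err = posix_fadvise(fd, 0, 1024 * 1024, POSIX_FADV_WILLNEED);
        if (err)
                fprintf(stderr, "fadvise(WILLNEED): %s\n", strerror(err));

        /* ... normal read() loop on fd would go here ... */

        close(fd);
        return 0;
}

whether that makes a measurable difference for a given workload i have no
idea -- same caveat as the AIX/Oracle links above: readahead that ends up
unused is just wasted I/O.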