On Sat, Sep 26, 2009 at 12:53 PM, shailesh jain <coolworldofshail@xxxxxxxxx> wrote:
> mm/readahead.c has the logic for ramp-up. It detects sequentiality..
>
> http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/3318.html

I see, that patch tried to scale the readahead window size based on the
total amount of memory. Sounds logical, but then I think he did not
receive any ACK for the patch? The discussion thread I saw is here:

http://www.gossamer-threads.com/lists/linux/kernel/798505?search_string=readahead%3A%20scale%20max%20readahead%20size%20depending%20on%20memory%20size;#798505

Worth noting from the discussion is how much of the readahead data goes
unused; for a 1024K window the miss rate is as high as 49%:

readahead size   miss
128K             38%
512K             45%
1024K            49%
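Just to make the idea concrete, I imagine the scaling in that patch
looked roughly like this (my own sketch based on the description, not
the actual patch; the function name and the constants are made up):

/* Sketch only: pick a max readahead size from the machine's RAM,
 * clamped to sane bounds. Purely illustrative. */
static unsigned long max_readahead_kb(unsigned long ram_mb)
{
        unsigned long ra_kb = (ram_mb / 512) * 128; /* 128K per 512MB */

        if (ra_kb < 128)        /* never go below the current default */
                ra_kb = 128;
        if (ra_kb > 1024)       /* cap it: the miss rate at 1024K is ~49% */
                ra_kb = 1024;
        return ra_kb;
}

So a 512MB box would stay at today's 128K, while a 4GB box would hit
the 1024K cap -- which is exactly where the miss-rate numbers above
start to look worrying.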
>
> On Sat, Sep 26, 2009 at 12:48 AM, Peter Teoh <htmldeveloper@xxxxxxxxx>
> wrote:
>>
>> On Fri, Sep 25, 2009 at 11:29 PM, shailesh jain
>> <coolworldofshail@xxxxxxxxx> wrote:
>> > Yes, I understand that. For random reads and other non-sequential
>> > workloads, the readahead logic will not ramp up to the max size
>> > anyway. What I want is to bump up the max size, so that when the
>> > kernel detects a sequential workload
>>
>> It puzzles me how to distinguish between a sequential and a random
>> read... does the kernel actually detect and check that a series of
>> reads is contiguous? That does not seem sensible either: readahead
>> means reading ahead of expectation, so by the time it has detected
>> and checked that a series of reads is contiguous, it hardly
>> qualifies as "read-ahead" anymore.
>>
>> Anyway, I did an ftrace stack trace while reading /var/log/messages:
>>
>> => ext3_get_blocks_handle
>> => ext3_get_block
>> => do_mpage_readpage
>> => mpage_readpages
>> => ext3_readpages
>> => __do_page_cache_readahead
>> => ra_submit
>> => filemap_fault
>> head-25243 [000] 20698.351148: blk_queue_bounce <-__make_request
>> head-25243 [000] 20698.351148: <stack trace>
>> => __make_request
>> => generic_make_request
>> => submit_bio
>> => mpage_bio_submit
>> => do_mpage_readpage
>> => mpage_readpages
>> => ext3_readpages
>> => __do_page_cache_readahead
>> head-25243 [000] 20698.351159: blk_rq_init <-get_request
>> head-25243 [000] 20698.351159: <stack trace>
>> => get_request
>> => get_request_wait
>> => __make_request
>> => generic_make_request
>> => submit_bio
>> => mpage_bio_submit
>> => do_mpage_readpage
>> => mpage_readpages
>>
>> So from the above we can guess that __do_page_cache_readahead() is
>> the key function involved. Cut and paste from mm/readahead.c (read
>> the comments):
>>
>> /*
>>  * do_page_cache_readahead actually reads a chunk of disk.  It
>>  * allocates all the pages first, then submits them all for I/O.
>>  * This avoids the very bad behaviour which would occur if page
>>  * allocations are causing VM writeback.  We really don't want to
>>  * intermingle reads and writes like that.
>>  *
>>  * Returns the number of pages requested, or the maximum amount of
>>  * I/O allowed.
>>  *
>>  * do_page_cache_readahead() returns -1 if it encountered request
>>  * queue congestion.
>>  */
>> static int
>> __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
>>                         pgoff_t offset, unsigned long nr_to_read)
>> {
>>         struct inode *inode = mapping->host;
>>         struct page *page;
>>         unsigned long end_index;        /* The last page we want to read */
>>         LIST_HEAD(page_pool);
>>         int page_idx;
>>         int ret = 0;
>>         loff_t isize = i_size_read(inode);
>>
>>         if (isize == 0)
>>                 goto out;
>>
>>         end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
>>
>>         /*
>>          * Preallocate as many pages as we will need.
>>          */
>>         read_lock_irq(&mapping->tree_lock);
>>         for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
>>                 pgoff_t page_offset = offset + page_idx;
>>
>>                 if (page_offset > end_index)
>>                         break;
>>
>>                 page = radix_tree_lookup(&mapping->page_tree, page_offset);
>>                 if (page)
>>                         continue;
>>
>>                 read_unlock_irq(&mapping->tree_lock);
>>                 page = page_cache_alloc_cold(mapping);
>>                 read_lock_irq(&mapping->tree_lock);
>>                 if (!page)
>>                         break;
>>                 page->index = page_offset;
>>                 list_add(&page->lru, &page_pool);
>>                 ret++;
>>         }
>>         read_unlock_irq(&mapping->tree_lock);
>>
>>         /*
>>          * Now start the IO.  We ignore I/O errors - if the page is
>>          * not uptodate then the caller will launch readpage again,
>>          * and will then handle the error.
>>          */
>>         if (ret)
>>                 read_pages(mapping, filp, &page_pool, ret);
>>         BUG_ON(!list_empty(&page_pool));
>> out:
>>         return ret;
>> }
>>
>> The heart of the algorithm is the last few lines ---> read_pages(),
>> and there is no conditional logic in it; it just reads ahead blindly.
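>>
>> Thinking about it more, I believe the sequential-vs-random decision
>> is made one level up, before __do_page_cache_readahead() is ever
>> reached: the on-demand readahead code compares the offset of the
>> current read against where the previous readahead window ended. A
>> much-simplified sketch of that idea (my own illustration, not the
>> real kernel code; the struct and field names are made up):
>>
>> struct ra_state {
>>         unsigned long prev_start;       /* start of the last window */
>>         unsigned long prev_size;        /* pages in the last window */
>> };
>>
>> static unsigned long next_window(struct ra_state *ra,
>>                                  unsigned long offset,
>>                                  unsigned long max_pages)
>> {
>>         /* sequential: this read starts where the last window ended */
>>         if (offset == ra->prev_start + ra->prev_size) {
>>                 unsigned long size = ra->prev_size * 2; /* ramp up */
>>
>>                 if (size > max_pages)
>>                         size = max_pages;
>>                 ra->prev_start = offset;
>>                 ra->prev_size = size;
>>                 return size;
>>         }
>>         /* random: fall back to a small initial window */
>>         ra->prev_start = offset;
>>         ra->prev_size = 4;
>>         return ra->prev_size;
>> }
>>
>> That would fit what shailesh said at the top: the ramp-up (and hence
>> the max size, max_pages here) only ever matters once the reads turn
>> out to be contiguous.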
>>
>> > it does not restrict itself to 32 pages.
>> >
>> > I looked around and saw an old patch that tried to account for the
>> > actual memory on the system and set max_readahead according to
>> > that. Restricting to arbitrary limits -- for instance, think of a
>> > 512MB system vs a 4GB system -- is not sane IMO.
>>
>> Interesting... can you share the link so perhaps I can learn
>> something? Thanks, pal!
>>
>> >
>> > Shailesh Jain
>> >
>> > On Fri, Sep 25, 2009 at 6:00 PM, Peter Teoh <htmldeveloper@xxxxxxxxx>
>> > wrote:
>> >>
>> >> On Fri, Sep 25, 2009 at 12:05 AM, shailesh jain
>> >> <coolworldofshail@xxxxxxxxx> wrote:
>> >> > Hi,
>> >> > Is the maximum limit of readahead 128KB? Can it be changed by an
>> >> > FS kernel module?
>> >> >
>> >> > Shailesh Jain
>> >>
>> >> Not sure why you want to change that? For a specific performance
>> >> tuning scenario (lots of sequential reads)? This readahead feature
>> >> is useful only if you intend to read large files. If you switch to
>> >> a different workload, say many small files, you defeat the purpose
>> >> of readahead. I think this is an OS-independent feature,
>> >> specifically tuned to the normal usage of the filesystem.
>> >>
>> >> So, for example, for AIX:
>> >>
>> >> http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/seq_read_perf_tuning.htm
>> >>
>> >> Their readahead is only (at most) 16 x pagesize. Not sure how big
>> >> that is, but our 128KB should be more than 16 x pagesize (how big
>> >> is our I/O blocksize anyway?).
>> >>
>> >> For another reputable reference (for the Oracle database):
>> >>
>> >> http://www.dba-oracle.com/t_read_ahead_cache_windows.htm
>> >>
>> >> The problem is that if you read ahead too much, and the entire
>> >> buffer is then thrown away unused, a lot of time is wasted on the
>> >> readahead.
>> >>
>> >> --
>> >> Regards,
>> >> Peter Teoh
>>
>> --
>> Regards,
>> Peter Teoh

--
Regards,
Peter Teoh
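P.S. For anyone who just wants a bigger readahead window without
patching the kernel: the per-device default can be changed from
userspace (e.g. "blockdev --setra" on the block device, or
/sys/block/<dev>/queue/read_ahead_kb), and a program can hint its own
access pattern with posix_fadvise(). A minimal sketch -- the file path
is just an example:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/var/log/messages", O_RDONLY);
        char buf[4096];
        int err;

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* Tell the kernel we will read sequentially; it may enlarge
         * the readahead window for this file descriptor. */
        err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        if (err != 0)   /* returns the error number directly */
                fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

        while (read(fd, buf, sizeof(buf)) > 0)
                ;       /* ... consume the data ... */

        close(fd);
        return 0;
}

The flip side is POSIX_FADV_RANDOM, which disables readahead for the
fd -- useful exactly in the "buffer thrown away unused" case above.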