On Wed, Sep 26, 2012 at 08:28:20AM +0530, Raghavendra D Prabhu wrote:
> Hi,
>
> * On Sat, Sep 22, 2012 at 09:15:07PM +0800, Fengguang Wu <fengguang.wu@xxxxxxxxx> wrote:
> >On Sat, Sep 22, 2012 at 04:03:14PM +0530, raghu.prabhu13@xxxxxxxxx wrote:
> >>From: Raghavendra D Prabhu <rprabhu@xxxxxxxxxxx>
> >>
> >>Instead of running radix_tree_lookup in a loop and locking/unlocking in the
> >>process, find_get_pages is called once, which returns a page_list, some of
> >>whose entries are non-NULL pages that are in core.
> >>
> >>Also, since find_get_pages returns the number of pages, if all pages are
> >>already cached, it can return early.
> >>
> >>This will be mostly helpful when a higher proportion of nr_to_read pages are
> >>already in the cache, which will mean less locking for page cache hits.
> >
> >Do you mean the rcu_read_lock()? But it's a no-op for most archs. So
> >the benefit of this patch is questionable. Will need real performance
> >numbers to support it.
>
> Aside from the RCU lock/unlock, isn't it better not to make separate
> calls to radix_tree_lookup and instead merge them into one call? A
> similar approach is used with pagevec_lookup, which is the usual tool
> when one needs to deal with a set of pages.

Yeah, batching is generally good; however, find_get_pages() is not the
right tool. It costs:

- get/release page counts

- likely a lot more searches in the address space, because it does not
  limit the end index of the search.

radix_tree_next_hole() will be the right tool, and I have a patch to
make it actually smarter than the current dumb loop. (A rough sketch of
what such a loop could look like is appended at the end of this message.)

> >>Signed-off-by: Raghavendra D Prabhu <rprabhu@xxxxxxxxxxx>
> >>---
> >> mm/readahead.c | 31 +++++++++++++++++++++++--------
> >> 1 file changed, 23 insertions(+), 8 deletions(-)
> >>
> >>diff --git a/mm/readahead.c b/mm/readahead.c
> >>index 3977455..3a1798d 100644
> >>--- a/mm/readahead.c
> >>+++ b/mm/readahead.c
> >>@@ -157,35 +157,42 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
> >> {
> >> 	struct inode *inode = mapping->host;
> >> 	struct page *page;
> >>+	struct page **page_list = NULL;
> >> 	unsigned long end_index;	/* The last page we want to read */
> >> 	LIST_HEAD(page_pool);
> >> 	int page_idx;
> >> 	int ret = 0;
> >> 	int ret_read = 0;
> >>+	unsigned long num;
> >>+	pgoff_t page_offset;
> >> 	loff_t isize = i_size_read(inode);
> >>
> >> 	if (isize == 0)
> >> 		goto out;
> >>
> >>+	page_list = kzalloc(nr_to_read * sizeof(struct page *), GFP_KERNEL);
> >>+	if (!page_list)
> >>+		goto out;
> >
> >That costs one more memory allocation and adds code to maintain the
> >page list. The original code also doesn't have the cost of grabbing the
> >page count, which eliminates the trouble of releasing the pages.
> >
> >> 	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
> >>+	num = find_get_pages(mapping, offset, nr_to_read, page_list);
> >
> >Assume we want to readahead pages for indexes [0, 100] and the cached
> >pages are in [1000, 1100]. find_get_pages() will return the latter,
> >which is probably not the result you expected.
>
> I thought if I ask for pages in the range [0,100] it will return a
> sparse array [0,100] but with holes (NULL) for pages not in cache
> and references to pages in cache. Isn't that the expected behavior?

Nope. The comment above find_get_pages() makes it clear that it's
limited by the number of pages rather than by the end page index.

> >> 	/*
> >> 	 * Preallocate as many pages as we will need.
> >> 	 */
> >> 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
> >>-		pgoff_t page_offset = offset + page_idx;
> >>+		if (page_list[page_idx]) {
> >>+			page_cache_release(page_list[page_idx]);
> >>+			continue;
> >>+		}
> >>+
> >>+		page_offset = offset + page_idx;
> >>
> >> 		if (page_offset > end_index)
> >> 			break;
> >>
> >>-		rcu_read_lock();
> >>-		page = radix_tree_lookup(&mapping->page_tree, page_offset);
> >>-		rcu_read_unlock();
> >>-		if (page)
> >>-			continue;
> >>-
> >> 		page = page_cache_alloc_readahead(mapping);
> >>-		if (!page)
> >>+		if (unlikely(!page))
> >> 			break;
> >
> >That break will leave the remaining pages' page_count lifted and lead
> >to a memory leak.
>
> Thanks. Yes, I realized that now.
>
> >
> >> 		page->index = page_offset;
> >> 		list_add(&page->lru, &page_pool);
> >>@@ -194,6 +201,13 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
> >> 			lookahead_size = 0;
> >> 		}
> >> 		ret++;
> >>+
> >>+		/*
> >>+		 * Since num pages are already returned, bail out after
> >>+		 * nr_to_read - num pages are allocated and added.
> >>+		 */
> >>+		if (ret == nr_to_read - num)
> >>+			break;
> >
> >Confused. That break seems unnecessary?
>
> I fixed that:
>
> -		pgoff_t page_offset = offset + page_idx;
> -
> -		if (page_offset > end_index)
> -			break;
> -
> -		rcu_read_lock();
> -		page = radix_tree_lookup(&mapping->page_tree, page_offset);
> -		rcu_read_unlock();
> -		if (page)
> +		if (page_list[page_idx]) {
> +			page_cache_release(page_list[page_idx]);

No, you cannot expect:

	page_list[page_idx]->index == page_idx

Thanks,
Fengguang

> +			num--;
> 			continue;
> +		}
> +
> +		page_offset = offset + page_idx;
> +
> +		/*
> +		 * Break only if all the previous
> +		 * references have been released.
> +		 */
> +		if (page_offset > end_index) {
> +			if (!num)
> +				break;
> +			else
> +				continue;
> +		}
>
> 		page = page_cache_alloc_readahead(mapping);
> -		if (!page)
> -			break;
> +		if (unlikely(!page))
> +			continue;
>
> >
> >Thanks,
> >Fengguang
> >
> >> 	}
> >>
> >> 	/*
> >>@@ -205,6 +219,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
> >> 	ret_read = read_pages(mapping, filp, &page_pool, ret);
> >> 	BUG_ON(!list_empty(&page_pool));
> >> out:
> >>+	kfree(page_list);
> >> 	return (ret_read < 0 ? ret_read : ret);
> >> }
> >>
> >>--
> >>1.7.12.1
> >>
> >>--
> >>To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >>the body to majordomo@xxxxxxxxx. For more info on Linux MM,
> >>see: http://www.linux-mm.org/ .
> >>Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>
>
> Regards,
> --
> Raghavendra Prabhu
> GPG Id : 0xD72BE977
> Fingerprint: B93F EBCB 8E05 7039 CD3C A4B8 A616 DCA1 D72B E977
> www: wnohang.net

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>
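
[A minimal sketch of what a radix_tree_next_hole()-based loop could look like,
as referred to above. It assumes the usual __do_page_cache_readahead() context
from mm/readahead.c (mapping, offset, nr_to_read, end_index, page_pool); the
helper name readahead_fill_holes_sketch and its parameter list are made up for
illustration and are not the actual patch.]

	/* Illustrative only: skip cached runs with radix_tree_next_hole(). */
	static int readahead_fill_holes_sketch(struct address_space *mapping,
					       pgoff_t offset,
					       unsigned long nr_to_read,
					       pgoff_t end_index,
					       struct list_head *page_pool)
	{
		pgoff_t index = offset;
		pgoff_t last = offset + nr_to_read;	/* one past the window */
		int ret = 0;

		if (last > end_index + 1)
			last = end_index + 1;		/* do not read past EOF */

		while (index < last) {
			struct page *page;
			unsigned long hole;

			/* First index in [index, last) with no page cached. */
			rcu_read_lock();
			hole = radix_tree_next_hole(&mapping->page_tree, index,
						    last - index);
			rcu_read_unlock();
			if (hole >= last)
				break;	/* whole window is already cached */

			index = hole;
			page = page_cache_alloc_readahead(mapping);
			if (!page)
				break;
			page->index = index;
			list_add(&page->lru, page_pool);
			ret++;
			index++;
		}
		return ret;
	}

Unlike the find_get_pages() variant, the search is bounded by the end of the
requested window and no page references are taken, so nothing has to be
released when page allocation fails part way through.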