Hi,

* On Fri, Sep 28, 2012 at 08:18:50PM +0800, Fengguang Wu <fengguang.wu@xxxxxxxxx> wrote:
> On Wed, Sep 26, 2012 at 08:28:20AM +0530, Raghavendra D Prabhu wrote:
>> Hi,
>>
>> * On Sat, Sep 22, 2012 at 09:15:07PM +0800, Fengguang Wu <fengguang.wu@xxxxxxxxx> wrote:
>>> On Sat, Sep 22, 2012 at 04:03:14PM +0530, raghu.prabhu13@xxxxxxxxx wrote:
>>>> From: Raghavendra D Prabhu <rprabhu@xxxxxxxxxxx>
>>>>
>>>> Instead of running radix_tree_lookup in a loop and locking/unlocking in
>>>> the process, find_get_pages is called once. It returns a page_list, some
>>>> of whose entries are non-NULL and are the pages already in core.
>>>>
>>>> Also, since find_get_pages returns the number of pages found, if all the
>>>> pages are already cached, it can return early.
>>>>
>>>> This will be mostly helpful when a higher proportion of nr_to_read pages
>>>> are already in the cache, which will mean less locking for page cache
>>>> hits.
>>>
>>> Do you mean the rcu_read_lock()? But it's a no-op for most archs, so the
>>> benefit of this patch is questionable. It will need real performance
>>> numbers to support it.
>>
>> Aside from the rcu lock/unlock, isn't it better not to make separate
>> calls to radix_tree_lookup, and instead merge them into one call? A
>> similar approach is used with pagevec_lookup, which is usually used when
>> one needs to deal with a set of pages.
>
> Yeah, batching is generally good, however find_get_pages() is not the
> right tool. It costs:
>
> - get/release page counts
> - likely a lot more searches in the address space, because it does not
>   limit the end index of the search
>
> radix_tree_next_hole() will be the right tool, and I have a patch to make
> it actually smarter than the current dumb loop.
Good to know.
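Just so I follow the direction: below is a rough, untested sketch of how I
read the radix_tree_next_hole() suggestion -- jumping straight to the missing
pages instead of probing every index. It reuses the locals of
__do_page_cache_readahead() and is only my guess at the idea, not your actual
patch:

        pgoff_t idx = offset;
        pgoff_t last = min(end_index, offset + nr_to_read - 1);

        while (idx <= last) {
                /* find the first hole (missing page) at or after idx */
                rcu_read_lock();
                idx = radix_tree_next_hole(&mapping->page_tree, idx,
                                           last - idx + 1);
                rcu_read_unlock();

                if (idx > last)
                        break;  /* no hole left: the whole window is cached */

                page = page_cache_alloc_readahead(mapping);
                if (!page)
                        break;
                page->index = idx;
                list_add(&page->lru, &page_pool);
                ret++;

                idx++;
        }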
>>>> Signed-off-by: Raghavendra D Prabhu <rprabhu@xxxxxxxxxxx>
>>>> ---
>>>>  mm/readahead.c | 31 +++++++++++++++++++++++--------
>>>>  1 file changed, 23 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/mm/readahead.c b/mm/readahead.c
>>>> index 3977455..3a1798d 100644
>>>> --- a/mm/readahead.c
>>>> +++ b/mm/readahead.c
>>>> @@ -157,35 +157,42 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
>>>>  {
>>>>          struct inode *inode = mapping->host;
>>>>          struct page *page;
>>>> +        struct page **page_list = NULL;
>>>>          unsigned long end_index;        /* The last page we want to read */
>>>>          LIST_HEAD(page_pool);
>>>>          int page_idx;
>>>>          int ret = 0;
>>>>          int ret_read = 0;
>>>> +        unsigned long num;
>>>> +        pgoff_t page_offset;
>>>>          loff_t isize = i_size_read(inode);
>>>>
>>>>          if (isize == 0)
>>>>                  goto out;
>>>>
>>>> +        page_list = kzalloc(nr_to_read * sizeof(struct page *), GFP_KERNEL);
>>>> +        if (!page_list)
>>>> +                goto out;
>>>
>>> That costs one more memory allocation, and adds code to maintain the
>>> page list. The original code also doesn't have the cost of grabbing the
>>> page count, which eliminates the trouble of page release.
>>>
>>>>          end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
>>>> +        num = find_get_pages(mapping, offset, nr_to_read, page_list);
>>>
>>> Assume we want to readahead pages for indexes [0, 100] and the cached
>>> pages are in [1000, 1100]. find_get_pages() will return the latter,
>>> which is probably not your expected result.
>>
>> I thought that if I ask for pages in the range [0, 100], it will return a
>> sparse array over [0, 100], with holes (NULL) for pages not in cache and
>> references to the pages that are. Isn't that the expected behavior?
>
> Nope. The comments above find_get_pages() make it clear that it's limited
> by the number of pages rather than the end page index.
Yes, I noticed that; however, nr_to_read in this case is what is passed as nr_pages.
However, I think I understand what you are saying -- i.e. if offset + nr_pages exceeds end_index, then it will return pages not belonging to the file. Is that right?
In that case, won't capping nr_pages do? I.e. check whether offset + nr_pages > end_index and, if so, reduce nr_to_read so that the lookup does not go past end_index. Won't that work?
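Concretely, something like this (untested) is what I mean -- end_index is
already computed right before the find_get_pages() call in the patch, so the
cap would slot in between:

        end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);

        if (offset > end_index)
                goto out;       /* nothing left to read ahead */

        /* don't let the lookup window run past the end of the file */
        if (offset + nr_to_read - 1 > end_index)
                nr_to_read = end_index - offset + 1;

        num = find_get_pages(mapping, offset, nr_to_read, page_list);

(Capping before the kzalloc would also shrink the page_list allocation.)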
>>>>          /*
>>>>           * Preallocate as many pages as we will need.
>>>>           */
>>>>          for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
>>>> -                pgoff_t page_offset = offset + page_idx;
>>>> +                if (page_list[page_idx]) {
>>>> +                        page_cache_release(page_list[page_idx]);
>>>> +                        continue;
>>>> +                }
>>>> +
>>>> +                page_offset = offset + page_idx;
>>>>
>>>>                  if (page_offset > end_index)
>>>>                          break;
>>>>
>>>> -                rcu_read_lock();
>>>> -                page = radix_tree_lookup(&mapping->page_tree, page_offset);
>>>> -                rcu_read_unlock();
>>>> -                if (page)
>>>> -                        continue;
>>>> -
>>>>                  page = page_cache_alloc_readahead(mapping);
>>>> -                if (!page)
>>>> +                if (unlikely(!page))
>>>>                          break;
>>>
>>> That break will leave the remaining pages' page_count lifted and lead
>>> to a memory leak.
>>
>> Thanks. Yes, I realize that now.
>>
>>>>                  page->index = page_offset;
>>>>                  list_add(&page->lru, &page_pool);
>>>> @@ -194,6 +201,13 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
>>>>                          lookahead_size = 0;
>>>>                  }
>>>>                  ret++;
>>>> +
>>>> +                /*
>>>> +                 * Since num pages are already returned, bail out after
>>>> +                 * nr_to_read - num pages are allocated and added.
>>>> +                 */
>>>> +                if (ret == nr_to_read - num)
>>>> +                        break;
>>>
>>> Confused. That break seems unnecessary?
>>
>> I fixed that:
>>
>> -                pgoff_t page_offset = offset + page_idx;
>> -
>> -                if (page_offset > end_index)
>> -                        break;
>> -
>> -                rcu_read_lock();
>> -                page = radix_tree_lookup(&mapping->page_tree, page_offset);
>> -                rcu_read_unlock();
>> -                if (page)
>> +                if (page_list[page_idx]) {
>> +                        page_cache_release(page_list[page_idx]);
>
> No, you cannot expect:
>
>         page_list[page_idx]->index == page_idx
>
> Thanks,
> Fengguang
>
>> +                        num--;
>>                          continue;
>> +                }
>> +
>> +                page_offset = offset + page_idx;
>> +
>> +                /*
>> +                 * Break only if all the previous
>> +                 * references have been released
>> +                 */
>> +                if (page_offset > end_index) {
>> +                        if (!num)
>> +                                break;
>> +                        else
>> +                                continue;
>> +                }
>>
>>                  page = page_cache_alloc_readahead(mapping);
>> -                if (!page)
>> -                        break;
>> +                if (unlikely(!page))
>> +                        continue;
>>>
>>> Thanks,
>>> Fengguang
>>>
>>>>          }
>>>>
>>>>          /*
>>>> @@ -205,6 +219,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
>>>>          ret_read = read_pages(mapping, filp, &page_pool, ret);
>>>>          BUG_ON(!list_empty(&page_pool));
>>>>  out:
>>>> +        kfree(page_list);
>>>>          return (ret_read < 0 ? ret_read : ret);
>>>>  }
>>>>
>>>> --
>>>> 1.7.12.1
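Also, point taken that page_list[page_idx]->index cannot be assumed to line
up with the loop index: find_get_pages() packs whatever it finds at the front
of the array. So the loop would need its own cursor into the dense result and
would have to compare page->index as it goes -- an untested sketch, dropping
the leftover references on every exit path to avoid the leak you mentioned:

        unsigned long i = 0;    /* cursor into the dense page_list[0..num) */

        for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
                pgoff_t page_offset = offset + page_idx;

                if (page_offset > end_index)
                        break;

                if (i < num && page_list[i]->index == page_offset) {
                        /* already cached: drop the find_get_pages() ref */
                        page_cache_release(page_list[i]);
                        i++;
                        continue;
                }

                page = page_cache_alloc_readahead(mapping);
                if (unlikely(!page))
                        break;
                page->index = page_offset;
                list_add(&page->lru, &page_pool);
                ret++;
        }

        /* release whatever the loop did not consume */
        while (i < num)
                page_cache_release(page_list[i++]);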
Regards,
-- 
Raghavendra Prabhu
GPG Id : 0xD72BE977
Fingerprint: B93F EBCB 8E05 7039 CD3C A4B8 A616 DCA1 D72B E977
www: wnohang.net