* Andrew Morton (akpm@xxxxxxxxxxxxxxxxxxxx) wrote:
> On Mon, 17 Jun 2013 17:39:36 -0400 Raphaël Beamonte <raphael.beamonte@xxxxxxxxx> wrote:
> 
> > 2013/6/17 Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > 
> > > That change wasn't terribly efficient - if there are any unpopulated
> > > pages in the range (which is quite likely), fadvise() will now always
> > > call invalidate_mapping_pages() a second time.
> > >
> > > Perhaps this is fixable.  Say, make lru_add_drain_all() return a
> > > success code, or even teach lru_add_drain_all() to return a code
> > > indicating that one of the spilled pages was (or might have been) on a
> > > particular mapping.
> > >
> > 
> > Following our test results, it was the call to lru_add_drain_all() that
> > caused the problem. The second call to invalidate_mapping_pages() isn't
> > really important. We tried to compile a kernel with the commit introducing
> > this change but with the "lru_add_drain_all()" line removed, and the
> > problem disappeared, even though we called invalidate_mapping_pages()
> > twice (as the rest of the commit was still there).
> 
> Ah, OK, schedule_on_each_cpu() could certainly do that - it has to wait
> for every CPU to context switch and schedule the worker function.
> 
> There's a lot we could do here.  Such as not doing the schedule_work()
> at all for a cpu which has an empty lru_add_pvec.

First approach proposed, submitted as RFC. Compile-tested only.

---
 mm/swap.c |   35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

Index: linux/mm/swap.c
===================================================================
--- linux.orig/mm/swap.c
+++ linux/mm/swap.c
@@ -652,7 +652,40 @@ static void lru_add_drain_per_cpu(struct
  */
 int lru_add_drain_all(void)
 {
-	return schedule_on_each_cpu(lru_add_drain_per_cpu);
+	int cpu;
+	struct work_struct __percpu *works;
+
+	works = alloc_percpu(struct work_struct);
+	if (!works)
+		return -ENOMEM;
+
+	get_online_cpus();
+
+	for_each_online_cpu(cpu) {
+		struct pagevec *pvecs = per_cpu(lru_add_pvecs, cpu);
+		struct work_struct *work = per_cpu_ptr(works, cpu);
+		struct pagevec *pvec;
+		int lru;
+
+		INIT_WORK(work, lru_add_drain_per_cpu);
+		for_each_lru(lru) {
+			pvec = &pvecs[lru - LRU_BASE];
+			if (pagevec_count(pvec)) {
+				schedule_work_on(cpu, work);
+				break;
+			}
+		}
+	}
+
+	for_each_online_cpu(cpu) {
+		struct work_struct *work = per_cpu_ptr(works, cpu);
+		if (work_pending(work))
+			flush_work(work);
+	}
+
+	put_online_cpus();
+	free_percpu(works);
+	return 0;
 }
 
 /*

> Or even pass down
> the address_space and only schedule the work for CPUs which have a page
> from *this mapping* in their lru_add_pvec.  That will all be highly
> racy, but as long as the failure mode is "flushed unnecessarily" then
> that's OK.

Second approach, submitted as RFC with some questions left unanswered in
the code. Compile-tested only.
---
 include/linux/swap.h |    1 
 mm/fadvise.c         |    2 -
 mm/swap.c            |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 64 insertions(+), 1 deletion(-)

Index: linux/include/linux/swap.h
===================================================================
--- linux.orig/include/linux/swap.h
+++ linux/include/linux/swap.h
@@ -242,6 +242,7 @@ extern void mark_page_accessed(struct pa
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern int lru_add_drain_all(void);
+extern int lru_add_drain_mapping(struct address_space *mapping);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_page(struct page *page);
 extern void swap_setup(void);

Index: linux/mm/swap.c
===================================================================
--- linux.orig/mm/swap.c
+++ linux/mm/swap.c
@@ -689,6 +689,68 @@ int lru_add_drain_all(void)
 }
 
 /*
+ * Returns 0 for success
+ */
+int lru_add_drain_mapping(struct address_space *mapping)
+{
+	int cpu;
+	struct work_struct __percpu *works;
+
+	works = alloc_percpu(struct work_struct);
+	if (!works)
+		return -ENOMEM;
+
+	get_online_cpus();
+
+	for_each_online_cpu(cpu) {
+		struct pagevec *pvecs = per_cpu(lru_add_pvecs, cpu);
+		struct work_struct *work = per_cpu_ptr(works, cpu);
+		struct pagevec *pvec;
+		int lru;
+
+		INIT_WORK(work, lru_add_drain_per_cpu);
+		for_each_lru(lru) {
+			int i;
+
+			pvec = &pvecs[lru - LRU_BASE];
+			/*
+			 * Race failure mode is to flush unnecessarily.
+			 * We use PAGEVEC_SIZE rather than pvec->nr to
+			 * stay on the safe side wrt pvec resize.
+			 */
+			for (i = 0; i < PAGEVEC_SIZE; i++) {
+				struct page *page;
+
+				/*
+				 * Racy access to page. TODO: is it OK
+				 * to access it from the remote CPU's
+				 * lru without any kind of ownership or
+				 * synchronization ?
+				 */
+				page = ACCESS_ONCE(pvec->pages[i]);
+				if (!page)
+					continue;
+				if (page->mapping == mapping) {
+					schedule_work_on(cpu, work);
+					goto next_cpu;
+				}
+			}
+		}
+	next_cpu:	;	/* TODO: coding ugliness */
+	}
+
+	for_each_online_cpu(cpu) {
+		struct work_struct *work = per_cpu_ptr(works, cpu);
+		if (work_pending(work))
+			flush_work(work);
+	}
+
+	put_online_cpus();
+	free_percpu(works);
+	return 0;
+}
+
+/*
  * Batched page_cache_release().  Decrement the reference count on all the
  * passed pages.  If it fell to zero then remove the page from the LRU and
  * free it.

Index: linux/mm/fadvise.c
===================================================================
--- linux.orig/mm/fadvise.c
+++ linux/mm/fadvise.c
@@ -132,7 +132,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, l
 		 * pagevecs and try again.
 		 */
 		if (count < (end_index - start_index + 1)) {
-			lru_add_drain_all();
+			lru_add_drain_mapping(mapping);
 			invalidate_mapping_pages(mapping, start_index,
 						end_index);
 		}

Feedback is welcome,

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
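
For reference, a minimal user-space sketch of the posix_fadvise(POSIX_FADV_DONTNEED)
call that reaches the fadvise64_64() path patched above. The file name, buffer size,
and error handling are illustrative assumptions, not taken from this thread.

/*
 * Hypothetical reproducer sketch: populate some page cache, then ask the
 * kernel to drop it.  POSIX_FADV_DONTNEED is the advice that ends up in
 * the fadvise64_64() hunk above (illustrative only).
 */
#define _POSIX_C_SOURCE 200112L

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 1 << 20;	/* 1 MiB of page cache, arbitrary size */
	char *buf = malloc(len);
	int fd = open("fadvise-testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
	int ret;

	if (fd < 0 || !buf) {
		perror("setup");
		return 1;
	}
	memset(buf, 0xaa, len);
	if (write(fd, buf, len) != (ssize_t)len) {
		perror("write");
		return 1;
	}
	fsync(fd);	/* dirty pages cannot be invalidated, so flush first */

	/*
	 * This is where the kernel may have to drain the per-CPU lru_add
	 * pagevecs before invalidate_mapping_pages() can drop every page.
	 */
	ret = posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED);
	if (ret)
		fprintf(stderr, "posix_fadvise: %s\n", strerror(ret));

	close(fd);
	free(buf);
	return ret ? 1 : 0;
}

Writing the file and issuing the advice right afterwards makes it likely that some
freshly added pages still sit in the per-CPU lru_add pagevecs, which is exactly the
case where the drain (and the latency discussed above) kicks in.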