On 8/14/2013 2:46 AM, Andrew Morton wrote: > On Tue, 13 Aug 2013 19:32:37 -0400 Chris Metcalf <cmetcalf@xxxxxxxxxx> wrote: > >> On 8/13/2013 7:29 PM, Tejun Heo wrote: >>> Hello, >>> >>> On Tue, Aug 13, 2013 at 06:53:32PM -0400, Chris Metcalf wrote: >>>> int lru_add_drain_all(void) >>>> { >>>> - return schedule_on_each_cpu(lru_add_drain_per_cpu); >>>> + return schedule_on_each_cpu_cond(lru_add_drain_per_cpu, >>>> + lru_add_drain_cond, NULL); > This version looks nice to me. It's missing the conversion of > schedule_on_each_cpu(), but I suppose that will be pretty simple. I assume you were thinking of something like a NULL "cond" function pointer meaning "no condition" in the implementation, and then schedule_on_each_cpu just calls schedule_on_each_cpu_cond(work, NULL, NULL)? >>> It won't nest and doing it simultaneously won't buy anything, right? >> Correct on both counts, I think. > I'm glad you understood the question :( > > What does "nest" mean? lru_add_drain_all() calls itself recursively, > presumably via some ghastly alloc_percpu()->alloc_pages(GFP_KERNEL) > route? If that ever happens then we'd certainly want to know about it. > Hopefully PF_MEMALLOC would prevent infinite recursion. Well, I think "it won't nest" is true in exactly the way you say - or maybe we want to say "it shouldn't nest". Catching the condition would obviously be a good thing. > If "nest" means something else then please enlighten me! > > As for "doing it simultaneously", I assume we're referring to > concurrent execution from separate threads. If so, why would that "buy > us anything"? Confused. As long as each thread sees "all pages which > were in pagevecs at the time I called lru_add_drain_all() get spilled > onto the LRU" then we're good. afaict the implementation will do this. If we can avoid doing it simultaneously, we avoid having to do any allocation at all, which seems like a win. Let's consider if we simply do a mutex_lock() around the routine. If we also then test to see which cpus need to be drained before issuing the drain, the result for two cpus that both try to drain simultaneously would be that one succeeds and drains everything; the second blocks on the mutex until the first is done, then quickly enters the code, finds that no cpus require draining, and returns, faster than would have been the case in the current code when it would also have fired drain requests at every cpu, concurrently with the earlier cpus firing its drain requests. Frankly, this seems like a pretty reasonable solution. We can implement it purely in swap.c using existing workqueue APIs. Note that lru_add_drain_all() always returns success this way, so I've changed the API to make it a void function. Considering none of the callers currently check the error return value anyway, that seemed like a straightforward thing to do. :-) Unfortunately DEFINE_PER_CPU doesn't work at function scope, so I have to make it file-static. Assuming the existence of my "need_activate_page_drain()" helper from the earlier versions of the patch set, lru_add_drain_all() looks like the following: static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work); void lru_add_drain_all(void) { static DEFINE_MUTEX(lock); int cpu; mutex_lock(&lock); get_online_cpus(); for_each_online_cpu(cpu) { struct work_struct *work = &per_cpu(lru_add_drain_work, cpu); if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) || pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) || pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) || need_activate_page_drain(cpu)) { INIT_WORK(work, lru_add_drain_per_cpu); schedule_work_on(cpu, work); } else { work->entry.next = NULL; } } for_each_online_cpu(cpu) { struct work_struct *work = &per_cpu(lru_add_drain_work, cpu); if (work->entry.next) flush_work(work); } put_online_cpus(); mutex_unlock(&lock); } Tejun, I don't know if you have a better idea for how to mark a work_struct as being "not used" so we can set and test it here. Is setting entry.next to NULL good? Should we offer it as an API in the workqueue header? We could wrap the whole thing in a new workqueue API too, of course (schedule_on_each_cpu_cond_sequential??) but it seems better at this point to wait until we find another caller with similar needs, and only then factor the code into a new workqueue API. -- Chris Metcalf, Tilera Corp. http://www.tilera.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>