On Tue 16-02-21 09:03:47, Minchan Kim wrote:
> The LRU pagevec holds a refcount on its pages until the pagevec is
> drained. That can prevent migration, since the refcount of such a page
> is greater than what the migration logic expects. To mitigate the
> issue, callers of migrate_pages drain the LRU pagevec via migrate_prep
> or lru_add_drain_all before calling migrate_pages.
>
> However, that is not enough, because pages that enter the pagevec
> after the draining call can still sit there and keep preventing page
> migration. Since some callers of migrate_pages have retry logic with
> LRU draining, the page would migrate on a later attempt, but this is
> fragile in that it doesn't close the fundamental race between pages
> entering the pagevec and migration, so the migration failure can still
> cause a contiguous memory allocation failure in the end.

Please put some numbers on how often this happens here.

> The other concern is that migration keeps retrying until the pages in
> the pagevec are drained. During that time, migration repeatedly
> allocates a target page, unmaps the source page from the page tables
> of processes, then learns of the failure, restores the original page
> into the page tables and frees the target page, which is also not
> good.

This is not good for performance you mean, right?

> To solve the issue, this patch tries to close the race rather than
> relying on retries and luck. The idea is to introduce a
> migration-in-progress tracking count, with an IPI barrier after
> atomically updating the count, to minimize the read-side overhead.
>
> migrate_prep increases migrate_pending_count under the lock and uses
> an IPI call to guarantee that every CPU sees the up-to-date value of
> migrate_pending_count. Then it drains the pagevec via
> lru_add_drain_all. From now on, no LRU page can reach a pagevec, since
> the LRU handling functions skip the batching while migration is in
> progress, as reported by migrate_pending() (IOW, the pagevec should
> stay empty until migration is done). Every caller of migrate_prep
> should pair it with a call to migrate_finish to decrease the migration
> tracking count.

migrate_prep already schedules draining on each cpu which has pages
queued. Why isn't it enough to disable the pcp lru caches right before
draining in migrate_prep? More on the IPI side below.

[...]

> +static DEFINE_SPINLOCK(migrate_pending_lock);
> +static unsigned long migrate_pending_count;
> +static DEFINE_PER_CPU(struct work_struct, migrate_pending_work);
> +
> +static void read_migrate_pending(struct work_struct *work)
> +{
> +	/* TODO : not sure it's needed */
> +	unsigned long dummy = __READ_ONCE(migrate_pending_count);
> +	(void)dummy;

What are you trying to achieve here? Are you just trying to enforce a
read memory barrier here?

> +}
> +
> +bool migrate_pending(void)
> +{
> +	return migrate_pending_count;
> +}
> +
>  /*
>   * migrate_prep() needs to be called before we start compiling a list of pages
>   * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
> @@ -64,11 +80,27 @@
>   */
>  void migrate_prep(void)
>  {
> +	unsigned int cpu;
> +
> +	spin_lock(&migrate_pending_lock);
> +	migrate_pending_count++;
> +	spin_unlock(&migrate_pending_lock);

I suspect you do not want to add atomic_read inside hot paths, right?
Is this really something that we have to microoptimize for? atomic_read
is a simple READ_ONCE on many archs.
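
Completely untested, but I would expect something as simple as the
following to be sufficient (reusing the counter name from the patch
just for illustration):

	static atomic_t migrate_pending_count = ATOMIC_INIT(0);

	bool migrate_pending(void)
	{
		/* effectively a plain load (READ_ONCE) on many archs */
		return atomic_read(&migrate_pending_count);
	}

with migrate_prep doing atomic_inc(&migrate_pending_count) rather than
the spinlock dance above.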
> +
> +	for_each_online_cpu(cpu) {
> +		struct work_struct *work = &per_cpu(migrate_pending_work, cpu);
> +
> +		INIT_WORK(work, read_migrate_pending);
> +		queue_work_on(cpu, mm_percpu_wq, work);
> +	}
> +
> +	for_each_online_cpu(cpu)
> +		flush_work(&per_cpu(migrate_pending_work, cpu));

I also do not follow this scheme. Where is the IPI you are mentioning
above?

> +	/*
> +	 * From now on, every online cpu will see the uptodate
> +	 * migrate_pending_count.
> +	 */
>  	/*
>  	 * Clear the LRU lists so pages can be isolated.
> -	 * Note that pages may be moved off the LRU after we have
> -	 * drained them. Those pages will fail to migrate like other
> -	 * pages that may be busy.
>  	 */
>  	lru_add_drain_all();

Overall, this looks rather heavy weight to my taste. Have you tried to
play with a simple atomic counter approach? atomic_read when adding to
the cache and atomic_inc inside migrate_prep followed by
lru_add_drain_all.
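
To spell out the sketch I have in mind (completely untested;
lru_cache_disabled is a made-up name for whatever check the pagevec add
paths would perform before batching):

	static atomic_t migrate_pending_count = ATOMIC_INIT(0);

	/* pagevec add paths would bypass batching while this is true */
	static bool lru_cache_disabled(void)
	{
		return atomic_read(&migrate_pending_count);
	}

	void migrate_prep(void)
	{
		atomic_inc(&migrate_pending_count);
		/* flush pages batched before the counter became visible */
		lru_add_drain_all();
	}

	void migrate_finish(void)
	{
		atomic_dec(&migrate_pending_count);
	}

-- 
Michal Hocko
SUSE Labs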