On Fri, May 20, 2011 at 12:19 PM, Minchan Kim <minchan.kim@xxxxxxxxx> wrote:
> On Fri, May 20, 2011 at 12:01:12PM -0400, Andrew Lutomirski wrote:
>> On Fri, May 20, 2011 at 11:33 AM, Minchan Kim <minchan.kim@xxxxxxxxx> wrote:
>>
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 8bfd450..a5c01e9 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1430,7 +1430,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>> >
>> >         /* Check if we should syncronously wait for writeback */
>> >         if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
>> > +               unsigned long nr_active;
>> >                 set_reclaim_mode(priority, sc, true);
>> > +               nr_active = clear_active_flags(&page_list, NULL);
>> > +               count_vm_events(PGDEACTIVATE, nr_active);
>> >                 nr_reclaimed += shrink_page_list(&page_list, zone, sc);
>> >         }
>> >
>> > --
>>
>> I'm now running that patch *without* the pgdat_balanced fix or the
>> need_resched check.  The VM_BUG_ON doesn't happen but I still get
>
> Please forget need_resched.
> Instead of it, could you test shrink_slab patch with !pgdat_balanced?
>
> @@ -231,8 +231,11 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>         if (scanned == 0)
>                 scanned = SWAP_CLUSTER_MAX;
>
> -       if (!down_read_trylock(&shrinker_rwsem))
> -               return 1;       /* Assume we'll be able to shrink next time */
> +       if (!down_read_trylock(&shrinker_rwsem)) {
> +               /* Assume we'll be able to shrink next time */
> +               ret = 1;
> +               goto out;
> +       }
>
>         list_for_each_entry(shrinker, &shrinker_list, list) {
>                 unsigned long long delta;
> @@ -286,6 +289,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>                 shrinker->nr += total_scan;
>         }
>         up_read(&shrinker_rwsem);
> +out:
> +       cond_resched();
>         return ret;
> }
>
>> incorrect OOM kills.
>>
>> However, if I replace the check with:
>>
>>   if (false &&should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
>>
>> then my system lags under bad memory pressure but recovers without
>> OOMs or oopses.
>
> I agree you can see OOM but oops? Did you see any oops?

No oops.  I've now reproduced the OOPS with both the if (false) change
and the clear_active_flags change.

Also, would this version be better?  I think your version overcounts
nr_scanned, but I'm not sure what effect that would have.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3f44b81..d1dabc9 100644
@@ -1426,8 +1437,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,

         /* Check if we should syncronously wait for writeback */
         if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
+                unsigned long nr_active, old_nr_scanned;
                 set_reclaim_mode(priority, sc, true);
+                nr_active = clear_active_flags(&page_list, NULL);
+                count_vm_events(PGDEACTIVATE, nr_active);
+                old_nr_scanned = sc->nr_scanned;
                 nr_reclaimed += shrink_page_list(&page_list, zone, sc);
+                sc->nr_scanned = old_nr_scanned;
         }

         local_irq_disable();

I just tested 2.6.38.6 with the attached patch.  It survived
dirty_ram and test_mempressure without any problems other than
slowness, but when I hit ctrl-c to stop test_mempressure, I got the
attached oom.

--Andy
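
P.S. To spell out the nr_scanned point: here is a minimal sketch (not
the actual kernel code) of the surrounding flow in
shrink_inactive_list(), assuming shrink_page_list() does
sc->nr_scanned++ for every page it processes, which is how I read
2.6.38.

        /* First pass: each page on page_list bumps sc->nr_scanned once. */
        nr_reclaimed = shrink_page_list(&page_list, zone, sc);

        /* Check if we should synchronously wait for writeback */
        if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
                unsigned long old_nr_scanned = sc->nr_scanned;

                set_reclaim_mode(priority, sc, true);
                /*
                 * The second pass walks the very same page_list, so
                 * without the save/restore those pages would be
                 * counted a second time.
                 */
                nr_reclaimed += shrink_page_list(&page_list, zone, sc);
                sc->nr_scanned = old_nr_scanned;
        }

With the restore, nr_scanned stays at one count per isolated page;
without it, the stall path inflates whatever the callers later derive
from sc->nr_scanned.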
Attachment:
test.patch
Description: Binary data
Attachment:
oom.txt.xz
Description: application/xz