On 05/09/2013 02:55 PM, Mel Gorman wrote:
> On Thu, May 09, 2013 at 10:06:17AM +0400, Glauber Costa wrote:
>> [ Sending again, forgot to CC fsdevel. Shame on me ]
>> To Mel
>> ======
>>
>
> I'm surprised Dave Chinner is not on the cc. He may or may not see it
> on fsdevel.
>

Yeah, I have been screwing up the CCs =( Included one, forgot the other.
Still, I would expect Dave to see most of the patches here, since he is
automatically on the CC for most of the individual patches.

>> Mel, I have identified the overly aggressive behavior you noticed as a
>> bug in the at-least-one-pass patch, which would ask the shrinkers to
>> scan a full batch even when total_scan < batch. They would do their
>> best to comply, and eventually succeed. I also went further and made
>> that the behavior of direct reclaim only - the only case that really
>> matters for memcg, and one in which we could argue that we are more or
>> less desperate for small squeezes in memory. Thank you very much for
>> spotting this.
>>
>
> I haven't seen the relevant code yet but in general I do not think it
> is a good idea for direct reclaim to potentially reclaim all of the
> slabs like this. Direct reclaim does not necessarily mean the system is
> desperate for small amounts of memory. Let's take a few examples where
> it would be a poor decision to reclaim all the slab pages within direct
> reclaim.
>
> 1. Direct reclaim triggers because kswapd is stalled writing pages for
>    memcg (see the code near the comment "memcg doesn't have any dirty
>    pages throttling"). A memcg dirtying its limit of pages may cause a
>    lot of direct reclaim and dump all the slab pages.
>
> 2. Direct reclaim triggers because kswapd is writing pages out to swap.
>    Similar to memcg above, kswapd failing to make forward progress
>    triggers direct reclaim, which then potentially reclaims all slab.
>
> 3. Direct reclaim triggers because kswapd waits on congestion as there
>    are too many pages under writeback. In this case, a large amount of
>    writes to slow storage like USB could result in all slab being
>    reclaimed.
>
> 4. The system has been up a long time, memory is fragmented and the
>    page allocator enters direct reclaim/compaction to allocate THPs. It
>    would be very unfortunate if allocating a THP reclaimed all the
>    slabs.
>

For the record: we are no longer reclaiming *all the slabs*; that was
the bug, and it is fixed here. We are scanning a bit more, but my
preliminary tests indicate that this is not the reason. Today is a
holiday here, so I am half in the office, half outside. I plan to have
news soon.

That said, this is a minor part of the patchset, and I don't intend to
make a case for it. I am totally fine with the conservative route of
making it memcg-specific, plus a comment explaining why this is
happening.
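To make the scan-loop change concrete, here is a rough sketch of the
kind of loop being discussed. This is an illustration only, not the
actual patch: the names loosely follow shrink_slab() in mm/vmscan.c
with the scan_objects() interface from this series, and
"in_direct_reclaim" is a made-up flag standing in for however the
final code decides it is running from direct reclaim.

	/*
	 * Illustrative sketch, not the actual patch.  The bug was that
	 * the extra "at least one pass" iteration still asked the
	 * shrinker for a full batch_size even when total_scan was
	 * smaller than batch_size, so small caches could be scanned
	 * (and emptied) far too aggressively.
	 */
	while (total_scan >= batch_size ||
	       (in_direct_reclaim && total_scan > 0)) {
		unsigned long nr_to_scan = min(total_scan, batch_size);

		/* the buggy version passed batch_size unconditionally */
		shrinkctl->nr_to_scan = nr_to_scan;
		freed += shrinker->scan_objects(shrinker, shrinkctl);

		count_vm_events(SLABS_SCANNED, nr_to_scan);
		total_scan -= nr_to_scan;
		cond_resched();
	}

The stop/error handling of the real loop is elided; the only point is
that the tail pass is capped at total_scan and restricted to direct
reclaim, rather than rounding up to a full batch.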
> All that is potentially bad and likely to make Dave put on his cranky
> pants. I would much prefer it if direct reclaim and kswapd treated slab
> similarly and did not ask the shrinkers to do a full scan unless the
> alternative is an OOM kill.
>
>> Running postmark on the final result (at least on my 2-node box) shows
>> something a lot saner. We are still stealing more inodes than before,
>> but by around 15%. Since the correct balance is somewhat heuristic
>> anyway, I personally think this is acceptable. But I am waiting to
>> hear from you on this matter. Meanwhile, I am investigating further to
>> try to pinpoint where exactly this comes from. It might be either
>> because of the new node-aware behavior, or because of the increased
>> calculation precision in the first patch.
>>
>
> I'm going to defer to Dave as to whether that increased level of slab
> reclaim is acceptable or not.
>

Dave? In any case, it is probably sleep time for Dave, so I hope to be
able to provide a more definite explanation of why we are seeing this
increase by the time he sees this.

>> In particular, I haven't done anything about your comment regarding
>> the MAX_NODES array. After the memcg patches are applied, fixing this
>> is a lot easier, because memcg already departs from a static MAX_NODES
>> array to a dynamic one. I wanted, however, to keep the noise down in
>> something that I expect to be merged soon. I would suggest merging a
>> patch that fixes that on top of the series, instead of in the middle,
>> if you really think it matters. I, of course, commit to doing this in
>> that case.
>>
>
> I think fixing it on top would be reasonable assuming the other memcg
> people are happy with the memcg parts of the series. I didn't get a
> chance to look at them last time and focused more on the API and
> per-node list changes.
>

Great, thanks
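Going back to the MAX_NODES point above, here is a rough sketch of the
static-versus-dynamic idea that the follow-up patch would implement.
This is only an illustration, not the actual follow-up: the nr_deferred
field name, doing the allocation in register_shrinker(), and sizing it
by nr_node_ids are all assumptions on my part.

	/*
	 * Illustration only.  With a static layout the per-node state
	 * lives in an array sized for the largest possible system:
	 *
	 *	atomic_long_t nr_deferred[MAX_NUMNODES];
	 *
	 * A follow-up could instead allocate it at registration time,
	 * sized by the number of nodes actually present:
	 */
	int register_shrinker(struct shrinker *shrinker)
	{
		size_t size = sizeof(atomic_long_t) * nr_node_ids;

		shrinker->nr_deferred = kzalloc(size, GFP_KERNEL);
		if (!shrinker->nr_deferred)
			return -ENOMEM;

		down_write(&shrinker_rwsem);
		list_add_tail(&shrinker->list, &shrinker_list);
		up_write(&shrinker_rwsem);
		return 0;
	}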