On Wed, 7 Apr 2010, KOSAKI Motohiro wrote: > > > > oom-badness-heuristic-rewrite.patch > > > > > > Do you have any specific feedback that you could offer on why you decided > > > to nack this? > > > > > > > I like this patch. But I think no one can't Ack this because there is no > > "correct" answer. At least, this show good behavior on my environment. > > see diffstat. that's perfectly crap, obviously need to make separate patches > individual one. Who can review it? > > Documentation/filesystems/proc.txt | 95 ++++---- > Documentation/sysctl/vm.txt | 21 + > fs/proc/base.c | 98 ++++++++ > include/linux/memcontrol.h | 8 > include/linux/oom.h | 17 + > include/linux/sched.h | 3 > kernel/fork.c | 1 > kernel/sysctl.c | 9 > mm/memcontrol.c | 18 + > mm/oom_kill.c | 319 ++++++++++++++------------- > 10 files changed, 404 insertions(+), 185 deletions(-) > > additional commets is in below. > This specific change cannot be broken down into individual patches as much as I'd like to. It's a complete rewrite of the badness() function and requires two new tunables to be introduced, determination of the amount of memory available to current, formals being changed around, and documentation. A review tip: the change itself is in the rewrite of the function now called oom_badness(), so I recommend applying downloading mmotm and reading it there as well as the documentation change. The remainder of the patch fixes up the various callers of that function and isn't interesting. > If you suggest to revert pagefault_oom itself, it is considerable. but > even though I don't think so. > > quote nick's mail > > The thing I should explain is that user interfaces are most important > for their intended semantics. We don't generally call bugs or oversights > part of the interface, and they are to be fixed unless some program > relies on them. > I disagree, I believe the long-standing semantics of user interfaces such as panic_on_oom are more important than what the name implies or what it was intended for when it was introduced. > Nowhere in the vm documentation does it say anything about "pagefault > ooms", and even in the kernel code, even to mm developers (who mostly > don't care about oom killer) probably wouldn't immediately think of > pagefault versus any other type of oom. > > Given that, do you think it is reasonable, when panic_on_oom is set, > to allow a process to be killed due to oom condition? Or do you think > that was an oversight of the implementation? > Users have a well-defined and long-standing method of protecting their applications from oom kill and that is OOM_DISABLE. With my patch, if current is unkillable because it is OOM_DISABLE, then we fallback to a tasklist scan iff panic_on_oom is unset. > Regardless of what architectures currently do. Yes there is a > consistency issue, and it should have been fixed earlier, but the > consistency issue goes both ways now. Some (the most widely tested > and used, if that matters) architectures, do it the right way. > > So, this patch is purely backstep. it break panic_on_oom. > If anyone post "pagefault_out_of_memory() aware pagefault for ppc" or > something else architecture, I'm glad and ack it. > It's not a backstep, it's making all architectures consistent as it sits right now in mmotm. If someone would like to change all VM_FAULT_OOM handlers to do a tasklist scan and not default to killing current, that is an extension of this patchset. Likewise, if we want to ensure panic_on_oom is respected even for pagefault ooms, then we need to do that on all architectures so that we don't have multiple definitions depending on machine type. The semantics of a sysctl shouldn't depend on the architecture and right now it does, so this patch fixes that. In other words: if you want to extend the definition of panic_on_oom, then do so completely for all architectures first and then add it to the documentation. > > > > oom-deprecate-oom_adj-tunable.patch > > > > > > Alan had a concern about removing /proc/pid/oom_adj, or redefining it with > > > different semantics as I originally did, and then I updated the patchset > > > to deprecate the old tunable as Andrew suggested. > > > > > > My somewhat arbitrary time of removal was approximately 18 months from > > > the date of deprecation which would give us 5-6 major kernel releases in > > > between. If you think that's too early of a deadline, then I'd happily > > > extend it by 6 months or a year. > > > > > > Keeping /proc/pid/oom_adj around indefinitely isn't very helpful if > > > there's a finer grained alternative available already unless you want > > > /proc/pid/oom_adj to actually mean something in which case you'll never be > > > able to seperate oom badness scores from bitshifts. I believe everyone > > > agrees that a more understood and finer grained tunable is necessary as > > > compared to the current implementation that has very limited functionality > > > other than polarizing tasks. > > The problem is, oom_adj is one of most widely used knob. it is not only used > admin, but also be used applications. in addition, oom_score_adj is bad interface > and no good to replace oom_adj. kamezawa-san, as following your mentioned. > oom_adj is retained but deprecated, so I'm not sure what you're suggesting here. Do you think we should instead keep oom_adj forever in parallel with oom_score_adj? It's quite clear that a more powerful, finer-grained solution is necessary than what oom_adj provides. I believe the deprecation for 5-6 major kernel releases is enough, but we can certainly talk about extending that by a year if you'd like. Can you elaborate on why you believe oom_score_adj is a bad interface or have had problems with it in your personal use? > agreed. oom_score_adj is completely crap. should gone. > but also following pseudo scaling adjustment is crap too. it don't consider > both page sharing and mlock pages. iow, it never works correctly. > > > + points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 / > + totalpages; > That baseline actually does work much better than total_vm as we've discussed multiple times on LKML leading up to the development of this series, but if you'd like to propose additional considerations into the heuristic, than please do so. > > > > oom-replace-sysctls-with-quick-mode.patch > > > > > > > > IIRC, alan and nick and I NAKed such patch. everybody explained the reason. > > > > > > Which patch of the four you listed are you referring to here? > > > > > replacing used sysctl is bad idea, in general. > > > > I have no _strong_ opinion. I welcome the patch series. But aboves are my concern. > > Thank you for your work. > > I really hate "that is _inteltional_ regression" crap. now almost developers > ignore a bug report and don't join problem investigate works. I and very few > people does that. (ok, I agree you are in such few developers, thanks) > > Why can't we discard it simplely? please don't make crap. > Perhaps you don't understand. The users of oom_kill_allocating_task are those systems that have extremely large tasklists and so iterating through it comes at a substantial cost. It was originally requested by SGI because they preferred an alternative to the tasklist scan used for cpuset-constrained ooms and were satisfied with simply killing something quickly instead of iterating the tasklist. This patchset, however, enables oom_dump_tasks by default because it provides useful information to the user to understand the memory use of their applications so they can hopefully determine why the oom occurred. This requires a tasklist scan itself, so those same users of oom_kill_allocating_task are no longer protected from that cost by simply setting this sysctl. They must also disable oom_dump_tasks or we're at the same efficiency that we were before oom_kill_allocating_task was introduced. Since they must modify their startup scripts, and since the users of both of these sysctls are the same and nobody would use one without the other, it should be possible to consolidate them into a single sysctl. If additional changes are made to the oom killer in the future, it would then be possible to test for this single sysctl, oom_kill_quick, instead without introducing additional sysctls and polluting procfs. Thus, it's completely unnecessary to keep oom_kill_allocating_task and we can redefine it for those systems. What alternatives do you have in mind or what part of this logic do you not agree with? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>