On Fri, 01 Mar 2024, Kent Overstreet wrote:
> On Thu, Feb 29, 2024 at 10:52:06PM -0500, Kent Overstreet wrote:
> > On Fri, Mar 01, 2024 at 10:33:59AM +0700, James Bottomley wrote:
> > > On Thu, 2024-02-29 at 22:09 -0500, Kent Overstreet wrote:
> > > > Or maybe you just want the syscall to return an error instead of
> > > > blocking for an unbounded amount of time if userspace asks for
> > > > something silly.
> > >
> > > Warn on allocation above a certain size without MAY_FAIL would seem
> > > to cover all those cases.  If there is a case for requiring instant
> > > allocation, you always have GFP_ATOMIC, and, I suppose, we could
> > > even do a bounded reclaim allocation where it tries for a certain
> > > time then fails.
> >
> > Then you're baking a weird constant into all your algorithms that
> > doesn't scale as machine memory sizes and working set sizes increase.
> >
> > > > Honestly, relying on the OOM killer and saying that because of it
> > > > we now don't have to write and test error paths is a lazy cop
> > > > out.
> > >
> > > The OOM killer is the most extreme outcome.  Usually reclaim
> > > (hugely simplified) dumps clean cache first, tries the shrinkers,
> > > then tries to write out dirty cache.  Only after that hasn't found
> > > anything for a few iterations will the OOM killer get activated.
> >
> > All your caches dumped and the machine grinds to a halt, and then a
> > random process gets killed instead of simply _failing the
> > allocation_.
> >
> > > > The same kind of thinking got us overcommit, where yes, we got an
> > > > increase in efficiency, but the cost was that everyone started
> > > > assuming and relying on overcommit, so now it's impossible to run
> > > > without overcommit enabled except in highly controlled
> > > > environments.
> > >
> > > That might be true for your use case, but it certainly isn't true
> > > for a cheap hosting cloud using containers: overcommit is where you
> > > make your money, so it's absolutely standard operating procedure.
> > > I wouldn't call cheap hosting a "highly controlled environment";
> > > they're just making a bet they won't get caught out too often.
> >
> > Reading comprehension fail.  Reread what I wrote.
> >
> > > > And that means allocation failure as an effective signal is just
> > > > completely busted in userspace.  If you want to write code in
> > > > userspace that uses as much memory as is available and no more,
> > > > you _can't_, because system behaviour goes to shit if you have
> > > > overcommit enabled, or a bunch of memory gets wasted if
> > > > overcommit is disabled, because everyone assumes that's just what
> > > > you do.
> > >
> > > OK, this seems to be specific to your use case again, because if
> > > you look at what major user space processes like web browsers do,
> > > they allocate way over the physical memory available to them for
> > > cache and assume the kernel will take care of it.  Making failure a
> > > signal for being over the working set would cause all these
> > > applications to segfault almost immediately.
> >
> > Again, reread what I wrote.  You're restating what I wrote and
> > completely missing the point.
> >
> > > > Let's _not_ go that route in the kernel.  I have pointy sticks to
> > > > brandish at people who don't want to deal with properly handling
> > > > errors.
> > >
> > > Error legs are the least exercised, and therefore the most bug- and
> > > exploit-prone, pieces of code in C.  If we can get rid of them, we
> > > should.
> >
> > Fuck no.
> >
> > Having working error paths is _basic_, and learning how to test your
> > code is also basic.  If you can't be bothered to do that, you
> > shouldn't be writing kernel code.
> >
> > We are giving up far too much by going down the route of "oh, just
> > kill stuff if we screwed the pooch and overcommitted".
> >
> > I don't fucking care if it's what the big cloud providers want
> > because it's convenient for them; some of us actually do care about
> > reliability.
> >
> > By just saying "oh, the OOM killer will save us", what you're doing
> > is making it nearly impossible to fully utilize a machine without
> > having stuff randomly killed.
>
> And besides all that, as a practical matter you can't just "not have
> error paths", because, like you said, you'd still have to have a max
> size where you WARN() - and _fail the allocation_ - and you've still
> got to unwind.

No.  You warn and DON'T fail the allocation, just like lockdep warns of
possible deadlocks but lets you continue.  These will be found in
development (mostly) and changed to use __GFP_RETRY_MAYFAIL, with
appropriate error-handling paths.

> The OOM killer can't kill processes while they're stuck blocking on an
> allocation that will never return in the kernel.

But it can depopulate the user address space (I think).

NeilBrown

>
> I think we can safely nip this idea in the bud.
>
> Test your damn error paths...
>