Re: Postgres server crash

Richard Troy <rtroy@xxxxxxxxxxxxxxxx> · Sat, 18 Nov 2006 17:28:46 -0800 (PST)

On Thu, 16 Nov 2006, Tom Lane wrote:
>
> "Craig A. James" <cjames@xxxxxxxxxxxxxxxx> writes:
> > OOM?  Can you give me a quick pointer to what this acronym stands for
> > and how I can reconfigure it?
>
> See "Linux Memory Overcommit" at
> http://www.postgresql.org/docs/8.1/static/kernel-resources.html#AEN18128
> or try googling for "OOM kill" for non-Postgres-specific coverage.

I did that - spent about two f-ing hours looking for what I wanted. (Guess
I entered poor choices for my searches. -frown- ) There are a LOT of
articles that TALK ABOUT OOM, but precious few actually tell you what you
can do about it.

Trying to save you some time:

On Linux you can use the sysctl utility to muck with vm.overcommit_memory.
Setting it to 2 disables the "feature."

Google _that_ for more info!
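
For anyone who wants to check a box before changing anything, here is a
minimal sketch (not a definitive tool) that reads the current setting and
says what it means. The /proc path and the three mode values are the
standard Linux ones; the function name describe_overcommit is just
something I made up for illustration.

```python
from pathlib import Path

# Meanings of vm.overcommit_memory as described in the kernel's
# overcommit accounting documentation.
MODES = {
    0: "heuristic overcommit (the default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def describe_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Return (mode, description) for the overcommit policy stored at path."""
    mode = int(Path(path).read_text().strip())
    return mode, MODES.get(mode, "unknown mode")


if __name__ == "__main__":
    try:
        mode, text = describe_overcommit()
        print(f"vm.overcommit_memory = {mode}: {text}")
        if mode != 2:
            # This is the command the PostgreSQL docs suggest; run it as root.
            print("To disable overcommit: sysctl -w vm.overcommit_memory=2")
    except (OSError, ValueError) as exc:
        print(f"could not read the setting (not Linux?): {exc}")
```

A change made with sysctl -w lasts only until reboot; to make it stick,
put vm.overcommit_memory=2 in /etc/sysctl.conf.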

>
> > It sounds like a "feature" old UNIX
> > systems like SGI IRIX had, where the system would allocate virtual
> > memory that it didn't really have, then kill your process if you tried
> > to use it.  I.e. malloc() would never return NULL even if swap space
> > was over allocated.  Is this what you're talking about?  Having this
> > enabled on a server is deadly for reliability.
>
> No kidding :-(.  The default behavior in Linux is extremely unfortunate.
>
> 			regards, tom lane

That's a major understatement.

The reason I spent a couple of hours looking for what I could learn on
this is that I've been absolutely beside myself on this "extremely
unfortunate" "feature." I had a badly behaving app (but didn't know which
app it was), so Linux would kill lots of things, like, oh, say, inetd.
Good luck sshing into the box. You just had to suffer with pushing the
damned reset button... It must have taken me at least a week to figure
out what not to do. (What I couldn't/can't understand is why the system
wouldn't just refuse the bad app the memory when it was short - no, you've
had enough!)
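
As far as I can tell, refusing the memory is what strict mode
(vm.overcommit_memory=2) does: allocations fail once the total committed
memory would exceed the kernel's commit limit. A rough sketch, assuming
the CommitLimit and Committed_AS fields that Linux reports in
/proc/meminfo, shows how much headroom is left. The helper name
commit_headroom_kb is hypothetical, and the limit is only enforced in
strict mode.

```python
def commit_headroom_kb(meminfo_text):
    """Return CommitLimit minus Committed_AS, in kB, from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            # The first token after the colon is the numeric value.
            fields[key.strip()] = int(parts[0])
    return fields["CommitLimit"] - fields["Committed_AS"]


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            print(f"commit headroom: {commit_headroom_kb(fh.read())} kB")
    except (OSError, KeyError, ValueError) as exc:
        print(f"could not compute headroom: {exc}")
```

A negative number means the box has already promised more memory than the
limit allows, which can only happen when overcommit is enabled.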

<soapbox> ...I read a large number of articles on this subject and am
absolutely dumbfounded by the -ahem- idiots who think killing a random
process is an appropriate action. I'm just taking their word for it that
the existing Linux kernel has no way to avoid getting itself into a
potentially hung situation because it didn't reserve any memory for
itself. Frankly, if it takes a complete kernel rewrite to fix
the problem that the damned operating system can't manage its own needs,
then the kernel needs to be rewritten! </soapbox>

These kernel hackers could learn something from VAX/VMS.

Richard

-- 
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy@xxxxxxxxxxxxxxxx, http://ScienceTools.com/