On Thu, Aug 28, 2008 at 5:08 PM, <david@xxxxxxx> wrote:
> On Thu, 28 Aug 2008, Scott Marlowe wrote:
>
>> On Thu, Aug 28, 2008 at 2:29 PM, Matthew Wakeling <matthew@xxxxxxxxxxx> wrote:
>>
>>> Another point is that from a business perspective, a database that has
>>> stopped responding is equally bad regardless of whether that is because
>>> the OOM killer has appeared or because the machine is thrashing. In both
>>> cases, there is a maximum throughput that the machine can handle, and if
>>> requests appear quicker than that the system will collapse, especially
>>> if the requests start timing out and being retried.
>>
>> But there's a HUGE difference between a machine that has bogged down
>> under load so badly that you have to reset it and a machine that's had
>> the postmaster slaughtered by the OOM killer. In the first situation,
>> while the machine is unresponsive, it should come right back up with a
>> coherent database after the restart.
>>
>> OTOH, a machine with a dead postmaster is far more likely to have a
>> corrupted database when it gets restarted.
>
> wait a min here, postgres is supposed to be able to survive a complete box
> failure without corrupting the database, if killing a process can corrupt
> the database it sounds like a major problem.

Yes, it's a major problem, but not with PostgreSQL. It's a major problem
with the Linux OOM killer killing processes that should not be killed.

Would it be PostgreSQL's fault if it corrupted data because my machine had
bad memory? Or a bad hard drive? This is the same kind of failure.

The postmaster should never be killed. It's the one thing holding it all
together.
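
For anyone who wants to defend against this in practice, here's a minimal
sketch of one common workaround: marking the postmaster's PID as exempt
from the OOM killer through /proc. The pidfile path and the function name
are assumptions for illustration, and it has to run as root. Kernels of
the 2.6 era expose /proc/<pid>/oom_adj (-17 disables OOM kills); newer
kernels use /proc/<pid>/oom_score_adj (-1000 disables).

    import os

    # Assumed pidfile location; adjust for your PGDATA.
    PIDFILE = "/var/lib/pgsql/data/postmaster.pid"

    def protect_postmaster(pidfile=PIDFILE):
        # The first line of postmaster.pid is the postmaster's PID.
        with open(pidfile) as f:
            pid = int(f.readline().strip())

        newer = "/proc/%d/oom_score_adj" % pid  # newer kernel interface
        older = "/proc/%d/oom_adj" % pid        # older 2.6-era interface
        if os.path.exists(newer):
            path, value = newer, "-1000"
        else:
            path, value = older, "-17"

        # Tell the OOM killer never to pick this process.
        with open(path, "w") as f:
            f.write(value + "\n")

    if __name__ == "__main__":
        protect_postmaster()  # must run as root

One caveat: the setting is inherited across fork(), so backends started
after the change become unkillable too unless something resets it in the
child processes; ideally only the postmaster itself gets the exemption.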