Re: select on 22 GB table causes "An I/O error occured while sending to the backend." exception

david@xxxxxxx · Thu, 28 Aug 2008 18:16:16 -0700 (PDT)

On Thu, 28 Aug 2008, Scott Marlowe wrote:

On Thu, Aug 28, 2008 at 5:08 PM,  <david@xxxxxxx> wrote:
On Thu, 28 Aug 2008, Scott Marlowe wrote:

On Thu, Aug 28, 2008 at 2:29 PM, Matthew Wakeling <matthew@xxxxxxxxxxx> wrote:

Another point is that from a business perspective, a database that has
stopped responding is equally bad regardless of whether that is because
the OOM killer has appeared or because the machine is thrashing. In both
cases, there is a maximum throughput that the machine can handle, and if
requests appear quicker than that the system will collapse, especially if
the requests start timing out and being retried.

But there's a HUGE difference between a machine that has bogged down
under load so badly that you have to reset it and a machine that's had
the postmaster slaughtered by the OOM killer.  In the first situation,
while the machine is unresponsive, it should come right back up with a
coherent database after the restart.

OTOH, a machine with a dead postmaster is far more likely to have a
corrupted database when it gets restarted.

wait a min here, postgres is supposed to be able to survive a complete box
failure without corrupting the database. if killing a process can corrupt
the database, it sounds like a major problem.

Yes it is a major problem, but not with postgresql.  It's a major
problem with the linux OOM killer killing processes that should not be
killed.

Would it be postgresql's fault if it corrupted data because my machine
had bad memory?  Or a bad hard drive?  This is the same kind of
failure.  The postmaster should never be killed.  It's the one thing
holding it all together.

the ACID guarantees that postgres is making are supposed to mean that even
if the machine dies, the CPU goes up in smoke, etc., the transactions that
are completed will not be corrupted.

if killing the process voids all the ACID protection then something is 
seriously wrong.

it may lose transactions that are in flight, but it should not corrupt
the database.

David Lang