Re: Next steps in debugging database storage problems?

Jacob Bunk Nielsen <jacob@xxxxxxx> · Thu, 03 Jul 2014 10:26:02 +0200

Hi

Jacob Bunk Nielsen <jacob@xxxxxxx> writes:

> We have a PostgreSQL 9.3.4 running in an LXC container on Debian
> Wheezy on a Linux 3.10.43 kernel on a Dell R620 server. Data are
> stored on a XFS file system. We are seeing problems such as:
>
> unexpected data beyond EOF in block 2 of relation base/805208133/1238511128
>
> and
>
> could not read block 5 in file "base/805208348/1259338118": read only
> 0 of 8192 bytes
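
One quick check for the "read only 0 of 8192 bytes" case is whether the
file on disk is long enough to contain the requested block at all. The
following is only a minimal sketch, not a definitive tool. It assumes
the path from the message is taken relative to the data directory, that
the block lives in the first segment file, and that the block size is
the 8192 bytes shown in the message. The function names are made up for
illustration.

```python
import os

BLOCK_SIZE = 8192  # assumption: block size as reported in the error message


def describe_segment(path, block_size=BLOCK_SIZE):
    """Return (full_blocks, trailing_bytes) for one relation segment file."""
    size = os.path.getsize(path)
    return divmod(size, block_size)


def block_is_present(path, block_no, block_size=BLOCK_SIZE):
    """Return True if block_no lies entirely within the file at path."""
    full_blocks, _trailing = describe_segment(path, block_size)
    return block_no < full_blocks
```

Called as block_is_present("base/805208348/1259338118", 5) from inside
the data directory, a False result would mean the file is shorter than
six full blocks. A non-zero trailing byte count from describe_segment()
would point at a partially written block.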

We use streaming replication to a different server on different
hardware. That server had been up for 300+ days and just had an incident
of:

LOG:  consistent recovery state reached at 226/E7DE1680
WARNING:  page 0 of relation base/805208133/1274861078 does not exist
CONTEXT:  xlog redo insert: rel 1663/805208133/1274861078; tid 0/1
PANIC:  WAL contains references to invalid pages
LOG:  database system is ready to accept read only connections
CONTEXT:  xlog redo insert: rel 1663/805208133/1274861078; tid 0/1
LOG:  startup process (PID 2308) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes
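
To find out which table or index these messages refer to, the numbers
can be pulled out of the log lines and turned into catalog lookups. The
sketch below is a hedged illustration only. It assumes that
"rel A/B/C" means tablespace OID / database OID / relfilenode and that
"base/B/C" means database OID / relfilenode. The helper names are
hypothetical.

```python
import re

# "rel 1663/805208133/1274861078" as seen in the CONTEXT lines
REL_RE = re.compile(r"rel (\d+)/(\d+)/(\d+)")
# "base/805208133/1274861078" as seen in the WARNING and error lines
PATH_RE = re.compile(r"base/(\d+)/(\d+)")


def parse_identifiers(line):
    """Return a dict with tablespace, database and relfilenode, or None."""
    m = REL_RE.search(line)
    if m:
        spc, db, node = (int(g) for g in m.groups())
        return {"tablespace": spc, "database": db, "relfilenode": node}
    m = PATH_RE.search(line)
    if m:
        db, node = (int(g) for g in m.groups())
        # The base/ form does not spell out the tablespace OID.
        return {"tablespace": None, "database": db, "relfilenode": node}
    return None


def lookup_queries(ids):
    """Build two SQL strings: one for the database, one for the relation."""
    return (
        "SELECT datname FROM pg_database WHERE oid = %d;" % ids["database"],
        "SELECT oid::regclass, relkind FROM pg_class WHERE relfilenode = %d;"
        % ids["relfilenode"],
    )
```

The second query has to be run while connected to the database named by
the first one. If it returns no rows, the relation may have been dropped
or rewritten since the log line was produced.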

We've rebooted that server now and restarted the replication. We'll see
how it goes in a few hours.

I'm still very interested in hearing any hints you guys may have as to
how I should debug these problems.

> I've tried writing a program to simulate a workload that resembles the
> workload on the problematic tables, but I can't get that to fail. So
> what should be my next step in debugging this?

That program has been running for 24+ hours now, and everything just
works as expected, so still no luck in reproducing this problem.

Best regards

Jacob

P.S. Sorry about the double post with a different subject - my initial
post was held up for several hours due to putting "Help" in the subject,
so I thought it had been discarded by a list admin.