On Fri, 12 Feb 2016 10:56:04 +0100 "Oliver Stöneberg" <oliverst@xxxxxxxxx> wrote:

> We are running a 64-bit PostgreSQL 9.4.5 server on Windows Server
> 2012. The system is a virtual machine on a VMware ESX 6.0 server and
> has 24 GB of memory. The database server is only accessed locally by
> two services, and there is only a single database in the server. The
> disk is located on a storage system that is shared with lots of other
> servers. The database server has fsync enabled.
>
> A few weeks ago we already had a data corruption when the disk was
> full. There are other services running on the same machine that could
> cause the disk to fill up (e.g. local caching when the network is
> acting up). That has happened a few times so far, but the database was
> never compromised. In that case we thought it was, but fortunately we
> only lost a huge table/TOAST (300+ GB) that stores very verbose data
> which is not essential. That happened with an earlier 9.4 version.
>
> Today we encountered another data corruption after the disk was full.
> It's much worse this time around, since data that is essential for the
> applications to run is affected. After truncating the 300+ GB table
> already mentioned above, all the services were restarted, and one of
> the applications failed to start with the following database error:
>
> Caused by: org.postgresql.util.PSQLException: ERROR: invalid memory alloc request size 18446744073709551613
>         at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
>         at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
>         at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
>         at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:622)
>         at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:472)
>         at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:386)
>         at joulex.cloud.data.tables.JxObjectStoreJsonTable.getAllJsonsAsHashMap(JxObjectStoreJsonTable.java:215)
>         ... 16 more
>
> Checking the database logs, the problems seem to have started two
> days ago:
>
> 2016-02-10 16:00:34 GMTERROR: invalid page in block 1255 of relation base/16387/3634911224
> 2016-02-10 16:00:34 GMTCONTEXT: automatic vacuum of table "cloud.t_63d1c1eb806f4e25904b659b75176093.y2016_utilization"
>
> For some reason it doesn't seem to be caused by the disk being full,
> since the database server was still able to write another 8 GB of log
> files over the next two days, and the system was still working
> yesterday afternoon. It also doesn't appear to be a hardware problem,
> since all the other systems sharing the virtual host system and the
> storage show no issues at all.
>
> Unfortunately we don't have a recent backup of the database (a tool
> to back up all the relevant data was only finished recently and had
> not been set up for this system yet).
>
> Something else worth noting is that we disabled the autovacuum on the
> TOAST table of the 300+ GB table, since we perform INSERT INTO on
> that table and the vacuum on it was causing a performance hit. The
> autovacuum for it is still being performed from time to time to
> prevent wraparound, and that autovacuum was still running after the
> machine ran out of disk space and the services were restarted.
>
> Any help in recovering the data is appreciated; if more information
> is needed, I will try to provide it. Thanks in advance.

You most likely have byte-level corruption.
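That "invalid memory alloc request size 18446744073709551613" is the
tell: it is 2^64 - 3, i.e. -3 reinterpreted as an unsigned 64-bit
number, which is what you typically see when a varlena (variable-length
value) length header has been scribbled on. If you want to narrow the
damage down to individual rows first, here is a rough sketch (the table
name is a placeholder, and this only works while the page headers
themselves are still readable): force each row through detoasting and
catch the failures.

    DO $$
    DECLARE
        r record;
    BEGIN
        FOR r IN SELECT ctid FROM my_table LOOP
            BEGIN
                -- Casting the whole row to text forces every column,
                -- including TOASTed ones, to be fetched and decompressed.
                PERFORM t::text FROM my_table AS t WHERE t.ctid = r.ctid;
            EXCEPTION WHEN OTHERS THEN
                RAISE NOTICE 'damaged tuple at ctid %: %', r.ctid, SQLERRM;
            END;
        END LOOP;
    END $$;

Rows that survive can then be copied into a fresh table with
INSERT INTO ... SELECT, skipping the ctids that raised notices.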
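The "invalid page in block 1255 of relation base/16387/3634911224"
error pins down a physical location, and mapping it to a file is
mechanical (assuming the default 8 kB block size):

    -- Which relation does filenode 3634911224 belong to?
    -- (pg_filenode_relation is available as of 9.4; tablespace 0 means
    -- the database's default tablespace.)
    SELECT pg_filenode_relation(0::oid, 3634911224::oid);

    -- Path of the underlying file, relative to the data directory:
    SELECT pg_relation_filepath(pg_filenode_relation(0::oid, 3634911224::oid));

    -- Block 1255 starts at byte 1255 * 8192 = 10280960 inside that file.
    -- (Relations larger than 1 GB are split into numbered 1 GB segments;
    -- this offset is well inside the first segment.)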
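If you just need the table readable again and can afford to lose
whatever rows sit on the broken pages, there is also a built-in,
deliberately lossy hammer. Only ever run it against a copy of the
cluster, because it destroys data irrecoverably:

    -- Pages that fail header validation are replaced with zeroes as
    -- they are read, so the rows they contained are gone for good.
    -- (Requires superuser; table name taken from your vacuum log.)
    SET zero_damaged_pages = on;
    VACUUM FULL t_63d1c1eb806f4e25904b659b75176093.y2016_utilization;
    RESET zero_damaged_pages;

Note that zero_damaged_pages only applies to pages whose headers fail
validation; it does nothing for bad varlena lengths inside otherwise
valid pages, which is what your alloc error points at.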
If you need that data back, your best bet is to hire a company with
PostgreSQL experts who know how the data is structured on disk and can
manipulate the files directly to recover whatever hasn't been destroyed.

If you want to do it yourself, it will require you to understand the
actual byte sequences as they are stored on disk, as well as the scheme
PostgreSQL uses to map database pages onto files. First you will have
to identify the file that contains the corrupt page (the filenode
lookup above gives you that), then you will have to modify the bytes in
that page to make it valid again. Do this with Postgres shut down. In
any event, that sort of thing is touchy work even if you understand it
well, so make a full copy of all the database files first so you can
roll back if you make things worse.

Long term, you need to fix your hardware. Postgres doesn't corrupt
itself just because the disks fill up, so your hardware must be lying
about which writes completed successfully; otherwise, Postgres would
have been able to recover after a restart. Beyond that, running
Postgres on a filesystem that frequently fills up is going to be
problematic all around. If you don't improve the hardware situation,
you're going to keep having problems like this.

--
Bill Moran