On Fri, 12 Feb 2016 10:56:04 +0100 "Oliver Stöneberg" <oliverst@xxxxxxxxx> wrote:

> We are running a 64-bit PostgreSQL 9.4.5 server on Windows Server
> 2012. The system is a virtual machine on a VMware ESX 6.0 server and
> has 24 GB of memory. The database server is only accessed locally by
> two services, and there is only a single database in the server. The
> disk is located on a storage system that is shared with lots of other
> servers. The database server has fsync enabled.
>
> A few weeks ago we already had a data corruption when the disk was
> full. There are other services running on the same machine that could
> cause the disk to fill up (e.g. local caching when the network is
> acting up). That has happened a few times so far, but the database was
> never compromised. In that case we thought it was, but fortunately we
> only lost a huge table/TOAST (300+ GB) that stores very verbose data
> which is not essential. That happened with an earlier 9.4 version.
>
> Today we encountered another data corruption after the disk was full.
> It's much worse this time around, since data that is essential for the
> applications to run is affected. After truncating the 300+ GB table
> already mentioned above, all the services were restarted, and one of
> the applications failed to start with the following database error:
>
> Caused by: org.postgresql.util.PSQLException: ERROR: invalid memory alloc request size 18446744073709551613
>         at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
>         at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
>         at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
>         at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:622)
>         at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:472)
>         at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:386)
>         at joulex.cloud.data.tables.JxObjectStoreJsonTable.getAllJsonsAsHashMap(JxObjectStoreJsonTable.java:215)
>         ... 16 more
>
> Checking the database logs, the problems seem to have started two
> days ago:
>
> 2016-02-10 16:00:34 GMTERROR: invalid page in block 1255 of relation base/16387/3634911224
> 2016-02-10 16:00:34 GMTCONTEXT: automatic vacuum of table "cloud.t_63d1c1eb806f4e25904b659b75176093.y2016_utilization"
>
> For some reason it doesn't seem to be caused by the disk being full,
> since the database server was still able to write another 8 GB of log
> files over the next two days, and the system was still working
> yesterday afternoon. It also doesn't appear to be a hardware problem,
> since all the other systems sharing the virtual host system and the
> storage show no issues at all.
>
> Unfortunately we don't have a recent backup of the database (a tool
> to back up all the relevant data was only finished recently and had
> not been set up for this system yet).
>
> Something else worth noting is that we disabled the autovacuum on the
> TOAST table of the 300+ GB table, since we perform INSERT INTO on
> that table and the vacuum on it was causing a performance hit. The
> autovacuum for it is still being performed from time to time to
> prevent wraparound, and that autovacuum was still running after the
> machine ran out of disk space and the services were restarted.
>
> Any help in recovering the data is appreciated; if more information
> is needed, I will try to provide it. Thanks in advance.

You most likely have byte-level corruption.
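That "invalid memory alloc request size 18446744073709551613" is the
tell: it is 2^64 - 3, i.e. -3 reinterpreted as an unsigned 64-bit
number, which is what you typically see when a varlena (variable-length
value) length header has been scribbled on. If you want to narrow the
damage down to individual rows first, here is a rough sketch (the table
name is a placeholder, and this only works while the page headers
themselves are still readable): force each row through detoasting and
catch the failures.

    DO $$
    DECLARE
        r record;
    BEGIN
        FOR r IN SELECT ctid FROM my_table LOOP
            BEGIN
                -- Casting the whole row to text forces every column,
                -- including TOASTed ones, to be fetched and decompressed.
                PERFORM t::text FROM my_table AS t WHERE t.ctid = r.ctid;
            EXCEPTION WHEN OTHERS THEN
                RAISE NOTICE 'damaged tuple at ctid %: %', r.ctid, SQLERRM;
            END;
        END LOOP;
    END $$;

Rows that survive can then be copied into a fresh table with
INSERT INTO ... SELECT, skipping the ctids that raised notices.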
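The "invalid page in block 1255 of relation base/16387/3634911224"
error pins down a physical location, and mapping it to a file is
mechanical (assuming the default 8 kB block size):

    -- Which relation does filenode 3634911224 belong to?
    -- (pg_filenode_relation is available as of 9.4; tablespace 0 means
    -- the database's default tablespace.)
    SELECT pg_filenode_relation(0::oid, 3634911224::oid);

    -- Path of the underlying file, relative to the data directory:
    SELECT pg_relation_filepath(pg_filenode_relation(0::oid, 3634911224::oid));

    -- Block 1255 starts at byte 1255 * 8192 = 10280960 inside that file.
    -- (Relations larger than 1 GB are split into numbered 1 GB segments;
    -- this offset is well inside the first segment.)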
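If you just need the table readable again and can afford to lose
whatever rows sit on the broken pages, there is also a built-in,
deliberately lossy hammer. Only ever run it against a copy of the
cluster, because it destroys data irrecoverably:

    -- Pages that fail header validation are replaced with zeroes as
    -- they are read, so the rows they contained are gone for good.
    -- (Requires superuser; table name taken from your vacuum log.)
    SET zero_damaged_pages = on;
    VACUUM FULL t_63d1c1eb806f4e25904b659b75176093.y2016_utilization;
    RESET zero_damaged_pages;

Note that zero_damaged_pages only applies to pages whose headers fail
validation; it does nothing for bad varlena lengths inside otherwise
valid pages, which is what your alloc error points at.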
If you need that data back, your best bet is to hire a company with
PostgreSQL experts who know how the data is structured on disk and can
manipulate the files directly to recover whatever hasn't been destroyed.

If you want to do it yourself, it will require you to understand the
actual byte sequences as they are stored on disk, as well as the scheme
PostgreSQL uses to map database pages onto files. First you will have
to identify the file that contains the corrupt page (the filenode
lookup above gives you that), then you will have to modify the bytes in
that page to make it valid again. Do this with Postgres shut down. In
any event, that sort of thing is touchy work even if you understand it
well, so make a full copy of all the database files first so you can
roll back if you make things worse.

Long term, you need to fix your hardware. Postgres doesn't corrupt
itself just because the disks fill up, so your hardware must be lying
about which writes completed successfully; otherwise, Postgres would
have been able to recover after a restart. Beyond that, running
Postgres on a filesystem that frequently fills up is going to be
problematic all around. If you don't improve the hardware situation,
you're going to keep having problems like this.

--
Bill Moran