On Sep 1 I found strange errors in my database:

Sep 1 18:55:52 <local0.notice> edge postgres[1691]: [7-1] :[] WARNING: archiving transaction log file "0000000100000022000000C9" failed too many times, will try again later
Sep 1 19:01:11 <local0.warn> edge postgres[77410]: [11-1] bacula:bacula[192.168.98.2(37891)] ERROR: could not read block 80461 in file "pg_tblspc/16391/PG_9.5_201510051/16394/1435214": read only 0 of 8192 bytes
Sep 1 19:01:11 <local0.warn> edge postgres[77410]: [11-2] bacula:bacula[192.168.98.2(37891)] STATEMENT: INSERT INTO File ...
Sep 1 19:04:59 <local0.warn> edge postgres[68877]: [8-1] bacula:bacula[192.168.98.2(36617)] ERROR: could not read block 80461 in file "pg_tblspc/16391/PG_9.5_201510051/16394/1435214": read only 0 of 8192 bytes
Sep 1 19:04:59 <local0.warn> edge postgres[68877]: [8-2] bacula:bacula[192.168.98.2(36617)] STATEMENT: INSERT INTO File ...
Sep 1 19:43:48 <local0.notice> edge postgres[1691]: [12-1] :[] WARNING: archiving transaction log file "0000000100000022000000CE" failed too many times, will try again later
Sep 1 19:57:48 <local0.warn> edge postgres[92864]: [14-1] bacula:bacula[192.168.98.2(12726)] ERROR: deadlock detected at character 13
Sep 1 19:57:48 <local0.warn> edge postgres[92864]: [14-2] bacula:bacula[192.168.98.2(12726)] STATEMENT: DELETE FROM jobmedia WHERE jobid IN
Sep 1 19:57:48 <local0.warn> edge postgres[92864]: [14-3] (SELECT jobid FROM idliste);
Sep 1 19:57:48 <local0.warn> edge postgres[92864]: [15-1] bacula:bacula[192.168.98.2(12726)] ERROR: current transaction is aborted, commands ignored until end of transaction block
Sep 1 19:57:48 <local0.warn> edge postgres[92864]: [15-2] bacula:bacula[192.168.98.2(12726)] STATEMENT: DELETE FROM job WHERE jobid IN
Sep 1 19:57:48 <local0.warn> edge postgres[92864]: [15-3] (SELECT jobid FROM idliste);
Sep 1 20:07:25 <local0.warn> edge postgres[76452]: [8-1] bacula:bacula[192.168.98.2(53112)] ERROR: could not read block 80461 in file "pg_tblspc/16391/PG_9.5_201510051/16394/1435214": read only 0 of 8192 bytes
Sep 1 20:07:25 <local0.warn> edge postgres[76452]: [8-2] bacula:bacula[192.168.98.2(53112)] STATEMENT: INSERT INTO File ...

I found that the unreadable relation 16394/1435214 is an index on the "File" table, found it to be much too short on disk (less than 80461 blocks), did a REINDEX on it (see the sketch further below), and since then everything seems fine again (except that I am missing some payload data, because the INSERT commands obviously did not execute).

Now I am trying to do the forensics, because in my opinion such behaviour should not occur.

What had happened: at about 3am a daily VACUUM was started (there is no autovacuum in place). Some other housekeeping jobs were also started, which made the system very slow. Then around 5.30am a monthly VACUUM FULL was started, and from then on the two VACUUMs were busily competing for locks, until finally after 7pm both got through:

Sep 1 19:29:10 <local0.info> edge postgres[76238]: [15-1] pgsql:bacula[192.168.97.9(21132)] LOG: duration: 45103183.199 ms statement: VACUUM (ANALYZE);
Sep 1 19:30:26 <local0.info> edge postgres[76730]: [20-1] pgsql:bacula[192.168.97.9(25710)] LOG: duration: 44924828.356 ms statement: VACUUM (FULL, ANALYZE);

Neither of them reported any errors, and there are no other errors in the system logs. But something must have gone wrong before 19:01:11, leaving a truncated index. And the only thing that should have rewritten the index is the VACUUM FULL.
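For reference, a rough sketch of how the file from the error messages can be mapped back to its relation, compared against its expected size, and rebuilt (the REINDEX target at the end is only a placeholder for whatever index name the lookup returns):

    -- Which relation does the broken file belong to?
    -- 16391 is the tablespace OID and 1435214 the file node, both taken
    -- from the path "pg_tblspc/16391/PG_9.5_201510051/16394/1435214".
    SELECT pg_filenode_relation(16391, 1435214);

    -- Compare the last-known page count with what is actually on disk
    -- (8192 is the block size reported in the error message):
    SELECT relname, relpages,
           pg_relation_size(oid) / 8192 AS blocks_on_disk
    FROM pg_class
    WHERE relfilenode = 1435214;

    -- Rebuild the damaged index; "file_jobid_idx" is just a placeholder
    -- for the index name found above.
    REINDEX INDEX file_jobid_idx;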
Doing the math, I figure that the disk might have been too small: the bacula database is on a separate tablespace, which was 48% full when I recognized the issue. Considering that VACUUM FULL will need double the space, and that the index is temporarily written to the pgsql_tmp directory (on the same disk/tablespace), it may well have hit the limit. The log shows the temporary files for the indexes being used, all with their proper sizes - so it must have been the final creation of the index file that failed. The base postgres cluster installation with the default tablespace (and pg_xlog etc.) cannot have been harmed by a disk-full on that tablespace.

The reason why there is also a WAL archiving failure at 18:55 should be that my temporary staging area was full due to the huge amount of journals created by the VACUUM FULL, with bacula at the same time unable to move them to permanent storage because it was blocked by that very VACUUM FULL. But there is plenty of reserve space in pg_xlog, and since the complaint at 18:55 is the first of its kind, that cannot have run full. So these complaints should be harmless "works as designed" and not related to the issue.

Finally, I cannot get any clue from the deadlock at 19:57:48. It might somehow be related to the broken index, but I do not see how.

-----------------------------------

Concluding: there was a broken index (which I would consider database corruption, although nothing too bad has arisen from it), and there are NO messages/errors of any kind giving a clue as to how this might have happened. There MAY have been a disk-full event, but then again there are NO messages/errors reporting a disk-full.

The underlying file system is ZFS, and from what I know it gets annoyingly slow when filling up, but I do not know precisely what it does when it actually gets full. Probably it does not report directly to syslog, but at least it should return a failed system call (which should then yield some error being reported) - it should not happen that files are silently truncated.

So, I might assume there is a bug hidden somewhere here. But then, is the problem in postgres (9.5.7), in the OS (FreeBSD 11.1), or in ZFS? Anyway, documentation section 28.2 ("disk-full will not result in data corruption") seems questionable.

PMc

--
Peter Much * Fichtenstr. 28 * 65527 Niedernhausen
home +49-6127-967111 * road +49-151-21416768