> Are you running the latest minor release for your PostgreSQL version?
> Were there any crashes recently?
> Do you have "fsync = on"?
> Did you make sure that you have a reliable storage system?
No. 9.3.9 whereas the lastest is 9.3.10
> Were there any crashes recently?
No[*]. See comments below about LVM snapshots.
> Do you have "fsync = on"?
Yes.
> Did you make sure that you have a reliable storage system?
RAID-6 SSDs on ext4
I've observed the problem on other than the "pg_depend" table. A normal table (called "blocks") had the problem. A "REINDEX TABLE blocks" did not fix the problem, however a "VACUUM FULL" did "fix" it.
We had thought we figured out what might have caused the problem. We were taking LVM snapshots while the database was running and then at sometime later we reverted the snapshot (while postgres was stopped). Our theory was that since postgres was running and therefore we captured its backing store in an indeterminate state. We changed our snapshot process to shutdown postgres, take the LVM snapshot, and then start postgres.
Unfortunately, the problem cropped up again.
We had an empty database except for our created tables, took the snapshot, and then populated the tables using normal INSERT/UPDATE/DELETEs while the system ran. Then, we reboot and revert the LVM snapshot--back to the empty database. We then populated the tables by using a series "COPY <table> FROM <file>". The only things special about this is we dropped one foreign key constraint and that with our data files is that we may issue back-to-back COPY's to the same table (each COPY pulling in different data as we pre-process the data before COPY'ing it). The amount of data is relatively small from our COPY's:
dirtable - 1 row
dirpath - 1 row
cloud - 940 rows
blocks - 176 rows (before it died)
Indexes were not disabled during the bulk import.
Chris
On Mon, Nov 23, 2015 at 2:59 AM, Albe Laurenz <laurenz.albe@xxxxxxxxxx> wrote:
Chris Richards wrote:
> Adrian is correct. This worked by itself whereas using it in the creation of the temporary table
> failed.
> mdb-> SELECT pq.* FROM policyqueue AS pq
> mdb-> JOIN seed_progress AS sp ON pq.id <http://pq.id/> =sp.polidx;
>
> I checked the query Albe suggested; there were two `relfilenode`s (11936 and 11937) that exhibited the
> error. Respectively, they were pg_depend_depender_index and pg_depend_reference_index.
>
> Unfortunately, I didn't disable the nightly processes and something must have(?) fixed the glitch; at
> midnight GMT the query ran successfully. Ugh.
>
> If it crops up again, I have some tools to try and capture data immediately, and the suggested REINDEX
> since both appear to be indices.
These indexes are both on the system table "pg_depend".
That explains why the query ran without problems but the CREATE TABLE didn't:
Creating an object creates dependencies, and you have data corruption in the
system table that tracks dependencies.
I would be pretty worried in your place if I had a corrupted catalog, even if
it seems to have "fixed itself". There might be other tables with corruption.
I would do two things:
1) Wait for a moment when there is little database traffic and
run "REINDEX TABLE pg_depend;" to rebuild these indexes.
2) As soon as you can schedule some downtime, pg_dumpall the database
cluster, stop the server, throw away the database cluster, create a new
one with "initdb" and restore the dump into that.
That will get rid of any lurking data corruption.
Watch out for error messages during the pg_dumpall!
The main question is of course how you got the corruption in the first place.
Are you running the latest minor release for your PostgreSQL version?
Were there any crashes recently?
Do you have "fsync = on"?
Did you make sure that you have a reliable storage system?
Yours,
Laurenz Albe