Tom,
Here is the data you requested. It took a little while to gather, as this kind of corruption doesn't happen all the time.
The first sign that something is wrong is our application freezing when communicating with the DB.
If we then issue a vacuum, here's the output:
sonar=# VACUUM FULL;
WARNING: index "user_session_pkey" contains 85613 row versions, but table contains 85513 row versions
HINT: Rebuild the index with REINDEX.
WARNING: index "user_session_pkey" contains 85613 row versions, but table contains 85513 row versions
HINT: Rebuild the index with REINDEX.
VACUUM
If we then try a reindex, here's the output:
sonar=# REINDEX TABLE user_session;
ERROR: could not create unique index
DETAIL: Table contains duplicated values.
There are duplicates and massive amounts of erroneous data in most columns of the table. From here, no amount of row deletion seems to clear the issue; the only remedy is to truncate the table and repopulate it. In fact, if we try even a SELECT, we get this:
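For reference, this is how we locate the duplicated key pairs: a GROUP BY over the primary-key columns with HAVING count(*) > 1. A minimal sketch below uses Python's built-in sqlite3 module standing in for a live Postgres connection (the data is made up for illustration); on the real server the same SQL runs unchanged via psql.

```python
# Sketch: find duplicate (user_id, session_id) pairs with GROUP BY/HAVING.
# sqlite3 stands in for a live Postgres connection; the table and column
# names match the user_session schema, but the rows here are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_session (user_id INTEGER, session_id INTEGER)")
# Simulate the corruption: row (1, 10) appears twice despite being "keyed".
conn.executemany(
    "INSERT INTO user_session VALUES (?, ?)",
    [(1, 10), (1, 10), (2, 20), (3, 30)],
)

# The same query works unchanged on PostgreSQL.
dupes = conn.execute(
    """
    SELECT user_id, session_id, count(*) AS n
    FROM user_session
    GROUP BY user_id, session_id
    HAVING count(*) > 1
    """
).fetchall()
print(dupes)  # -> [(1, 10, 2)]
```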
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
Here is the schema for the table:
sonar=# \d user_session;
Table "public.user_session"
     Column     |            Type             |                     Modifiers
----------------+-----------------------------+----------------------------------------------------
 user_id        | integer                     | not null
 session_id     | integer                     | not null
 group_id       | integer                     |
 src            | character varying(22)       | not null
 hit_bytes_in   | bigint                      |
 miss_bytes_in  | bigint                      |
 logon_time     | timestamp without time zone | default ('now'::text)::timestamp(6) with time zone
 logoff_time    | timestamp without time zone | default ('now'::text)::timestamp(6) with time zone
 account_active | character(1)                | default 'Y'::bpchar
 hit_bytes_out  | bigint                      |
 miss_bytes_out | bigint                      |
 cost_bytes_in  | double precision            |
 cost_bytes_out | double precision            |
 time_cost      | double precision            |
Indexes:
"user_session_pkey" primary key, btree (user_id, session_id)
Foreign-key constraints:
"$1" FOREIGN KEY (user_id) REFERENCES user_table(user_id) ON DELETE CASCADE
Triggers:
delsessionusagetrigger AFTER DELETE ON user_session FOR EACH ROW EXECUTE PROCEDURE delsessionusagefunc()
I'm most concerned that a primary key constraint is being actively violated, and I don't understand how this is possible.
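To underline the point: at the SQL level, a composite primary key on (user_id, session_id) should reject a duplicate insert before it ever reaches the table, so duplicates in the heap suggest the problem is below the SQL level (the index and the table disagreeing). A minimal sketch, again with sqlite3 standing in for Postgres (the rejection behaviour is the same for this case):

```python
# Sketch: a composite primary key rejects a duplicate (user_id, session_id)
# pair at insert time. sqlite3 stands in for a live Postgres connection;
# the schema is trimmed to the key columns for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE user_session (
        user_id    INTEGER NOT NULL,
        session_id INTEGER NOT NULL,
        PRIMARY KEY (user_id, session_id)
    )
    """
)
conn.execute("INSERT INTO user_session VALUES (1, 10)")
try:
    conn.execute("INSERT INTO user_session VALUES (1, 10)")
    outcome = "accepted"
except sqlite3.IntegrityError:
    outcome = "rejected"
print(outcome)  # -> rejected
```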
Any help appreciated.
----- Original Message ----- From: "Tom Lane" <tgl@xxxxxxxxxxxxx>
To: "Andrew Hall" <temp02@xxxxxxxxxxxxxxx>
Cc: <pgsql-general@xxxxxxxxxxxxxx>
Sent: Friday, February 04, 2005 10:12 AM
Subject: Re: Lost rows/data corruption?
"Andrew Hall" <temp02@xxxxxxxxxxxxxxx> writes:
We have a long running DB application using PG7.4.6. We do a VACUUM FULL
every night and a normal 'maintenance' VACUUM every hour. We do nothing with
any indexes. Every now and then we get errors from the database whereby an
update will fail on a table saying that there is duplicate violation on a
primary keyed row. Theoretically this is impossible as the constraint should
prevent any duplication, ever. When the next vacuum is run, we see an error
saying that there are more rows than the database can account for (or words
to that effect) and we should either REINDEX or TRUNCATE the table.
Could we see the exact log trace, rather than handwaving? Also the schemas of the table(s)/index(es) involved might be useful.
regards, tom lane