Hello Everyone,
We a have a DB running on PostgreSQL 8.0.12 that has been running into
a string of issues and as of now we have ran out solutions, so we
would like to hear some input from some more knowledgeable people.
Initially we found out that our DB had the xid-wrap problem. Given
some advice from the folks of the IRC channel we went for a whole-DB
vacuum. Starting with this process we ran into tables that had invalid
pages. Given that we could recover the data in these table, we decided
to zero out the tables with invalid pages. Later we ran into some
tables being unable to be vacuumed because we got this error:
INFO: vacuuming "public.cdrs_part_2007_10_08"
ERROR: could not access status of transaction 3591307276
DETAIL: could not open file "/home/postgres/data/pg_clog/0D60": No
such file or directory
We decided to drop these tables as well, since we could rebuild that
data without problems.
Given these errors, we started suspecting (A little bit late) hardware
errors. Well, the only issue we found was some FileSystem corruption
in one of the partitions were the DB resides, which we seem to have
been able to recover from, although I don't know if we might have lost
some data in that process. After that we have searched for bad blocks
on the drives and bad RAM and have found none. Although we are
suspecting that the culprit is either SCSI controller on the storage
array or either a failing drive that the storage array is not
reporting as bad.
Anyway we were able to finish the whole DB vacuum and recover from the
xid-wrap. Having done this we decided to do a dump of the DB as a
backup measure. While doing so we got this error:
pg_dump ts -t cdrs_part_2007_03_01 | gzip >
/home/postgres/cdrs2/backup/cdrs_part_2007_03_01.gz
pg_dump: ERROR: invalid memory alloc request size 18446744073709551613
pg_dump: SQL command to dump the contents of table
"cdrs_part_2007_03_01"
failed: PQendcopy() failed.
pg_dump: Error message from server: ERROR: invalid memory alloc request
size 18446744073709551613
pg_dump: The command was: COPY public.cdrs_part_2007_03_01 (calling_num,
calling_ser_num, dialed_num, called_num, called_ser_num, call_type,
billing_num, treatment_code, billing_ser_num, event_info,
first_orig_site,
first_orig_sect, first_orig, first_orig_memb, prev_orig_site,
prev_orig_sect, prev_orig, prev_orig_memb, last_orig_site,
last_orig_sect,
last_orig, last_orig_memb, orig_time, handoffs, first_term_site,
first_term_sect, first_term, first_term_memb, prev_term_site,
prev_term_sect, prev_term, prev_term_memb, last_term_site,
last_term_sect,
last_term, last_term_memb, discon_time, call_duration, completion_code,
term_num, answer_time, carrier_id_code, redirecting_num,
lnp_calling_msid,
lnp_called_msid, lnp_billing_msid, cdr_file) TO stdout;
This error happens on data was on the partition whose Filesystem has
corrected. At least another table on the same filesystem gives the
same error. Or tables in other filesystems do not give error this when
dumping. Given this we checked the FS of this partition and the drives
of the RAID unit that forms the partition and haven't found any more
problems.
Any ideas on what we might do to recover from this problem and finish
the dump? Any other things we might do to test the consistency of our
database?
Thanks,
Pepe
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend