Hey Tom,
I've finished some vacuums on the table with the following results. I'll
continue to run a couple more of these vacuums, but if you don't see another
post from me in the next couple of hours, you can assume that they continue
to look clean as well.
Thanks again for any clues on this. I'd be interested in any alternatives,
like a trace if that's an option. One thing I'm going to try next is to run
the COPY TO/FROM test on a redhat version of postgresql v8.1.4 that was
compiled with gcc 3.4.5 instead of gcc 3.4.3 & gmake version 3.80.
We are considering a test with the data on a Solaris build of postgres as
well, but that's coming slowly.
~DjK
1st vacuum - 70 minutes
INFO: vacuuming "_test"
INFO: "_test": found 0 removable, 12149265 nonremovable row versions in
492247 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.25s/0.05u sec elapsed 2374.30 sec.
INFO: analyzing "_test"
INFO: "_test": scanned 3000 of 492247 pages, containing 74030 live rows and
0 dead rows; 3000 rows in sample, 12147015 estimated total rows
Total query runtime: 4175635 ms.
2nd vacuum - 40 minutes
INFO: vacuuming "_test"
INFO: "_test": found 0 removable, 12149265 nonremovable row versions in
492247 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 1.98s/0.33u sec elapsed 2419.27 sec.
INFO: analyzing "_test"
INFO: "_est": scanned 3000 of 492247 pages, containing 74065 live rows and
0 dead rows; 3000 rows in sample, 12152758 estimated total rows
Total query runtime: 2433517 ms.
From: Tom Lane <tgl@xxxxxxxxxxxxx>
To: "Mr. Dan" <bitsandbytes88@xxxxxxxxxxx>
CC: pgsql-admin@xxxxxxxxxxxxxx
Subject: Re: [ADMIN] COPY FROM command v8.1.4 Date: Wed, 13 Sep 2006
12:50:26 -0400
"Mr. Dan" <bitsandbytes88@xxxxxxxxxxx> writes:
>> How are you doing the copies, exactly? SQL COPY command, psql \copy,
>> something else?
> We've tried SQL COY and psql \copy and always get random results - 0,1,
or 2
> blocks of 25 rows missing.
Hmph. If it happens with a SQL COPY command then psql seems to be off
the hook, and that also eliminates some theories about dropped TCP
packets and such.
Would you check back in the source table for the COPY and see what the
ctid values are for the missing rows? I'm wondering about a pattern
like "the dropped rows of a group are all on the same disk page", ie,
what's being missed is one whole page at a time.
If that's what's happening, the only very plausible theory I can think
of is that your disk drive is sometimes glitching and returning a page
of all-zeroes instead of what it should return. Postgres will not
complain about this in normal operation (because there are legitimate
error-recovery scenarios where a zero page can be in a table); it'll
just treat the page as empty. VACUUM will complain though, so the next
step would be to set up a test table by copying your large table and
then repeatedly run plain VACUUM on the test table. If you get sporadic
warnings "relation foo page N is uninitialized --- fixing" then we have
the smoking gun. Don't run this test directly on a valuable table, as
each such message would mean you just lost another page of data :-(
FWIW, I spent several hours yesterday evening copying 6GB tables around
to see if I could reproduce any such problem, and I couldn't...
regards, tom lane