Re: Help - corruption issue?

Tomas Vondra <tv@xxxxxxxx> · Wed, 20 Apr 2011 22:11:42 +0200

On 20.4.2011 12:56, Phoenix Kiula wrote:
>> On a fast network it should only take a few minutes.  Now rsyncing
>> live 2.4 TB databases, that takes time. :)  Your raptors, if they're
>> working properly, should be able to transfer at around 80 to
>> 100Megabytes a second.  10 to 15 seconds a gig.  30 minutes or so via
>> gig ethernet.  I'd run iostat and see how well my drive array was
>> performing during a large, largely sequential copy.
> 
> 
> OK. An update.
> 
> We have changed all the hardware except disks.

OK, so the card is working and the drives are fine. Have you run the
tw_cli tool to check the drives? They're probably the last thing that
might be faulty and has not been replaced.

> REINDEX still gave this problem:
> 
> --
> server closed the connection unexpectedly
> 	This probably means the server terminated abnormally
> 	before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> --

Hm, have you checked if there's something else in the logs? More details
about the crash or something like that.

I'd probably try to run strace on the backend, to get more details about
where it crashes. Just find out the PID of the backend serving your
psql session, then run

$ strace -p PID > crash.log 2>&1

and then run the REINDEX. Once it crashes, the last few lines of the
logfile should show what the backend was doing at that point.
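
A rough sketch of those steps in shell form. pg_backend_pid() is the
standard function that reports the PID of the backend serving the
current session. The helper names trace_backend and show_crash_tail and
the default file name crash.log are made up for illustration only. The
-o option makes strace write the trace straight to a file, and -tt adds
timestamps to each line.

```shell
# Step 1 (inside the psql session that will run the REINDEX):
#   SELECT pg_backend_pid();

# Step 2 (second terminal, as root or the postgres user): attach strace
# to that backend. Hypothetical helper; it blocks until the backend exits.
trace_backend() {
    pid="$1"
    log="${2:-crash.log}"
    strace -tt -p "$pid" -o "$log"
}

# Step 3 (after the crash): print the last N lines of the trace (default 20).
show_crash_tail() {
    log="${1:-crash.log}"
    count="${2:-20}"
    tail -n "$count" "$log"
}
```

For example, "trace_backend 12345" (12345 being a placeholder PID), then
the REINDEX in psql, then "show_crash_tail crash.log 30".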

> So I rebooted and logged back in a single user mode. All services
> stopped. All networking stopped. Only postgresql started. I tried the
> REINDEX again.
> 
> Same problem :(
> 
> This means the problem is likely with data?

Well, maybe. It might be a problem with the data, it might be a bug in
postgres ...

> I do have a "pg_dumpall" dump from 1 day before. Will lose some data,
> but should have most of it.
> 
> Is it worth it for me to try and restore from there? What's the best
> thing to do right now?

So have you done the file backup? That's the first thing I'd do.

Anyway, what's best depends on how important the missing piece of
data is. We still don't know how to fix the problem, but it sure looks
like corrupted data.

I think you already know which table is corrupted, right? In that case
you may actually try to find the bad block and erase it (and maybe make
a copy first so that we can see what's wrong with it and how it might
have happened). There's a very nice guide on how to do that:

http://blog.endpoint.com/2010/06/tracking-down-database-corruption-with.html

It sure seems to describe the problem you have (invalid alloc request etc.).
The really annoying part is locating the block, as you have to scan
through the table (which sucks with such a big table).
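
To give an idea of the scanning part, here is a minimal sketch of a
bisection over row offsets. It is not taken verbatim from the guide: the
names chunk_ok and find_first_bad_row, the RUN_SQL variable and the use
of OFFSET/LIMIT probes are all assumptions made for illustration. It
also assumes that nobody else touches the table, so a plain sequential
scan returns rows in the same order each time, and that the server is
back up before each probe (a crashed backend restarts the whole
cluster, so expect to wait or retry between probes).

```shell
# Assumption: RUN_SQL is any command that takes one SQL string and exits
# non-zero when the query fails. The default uses standard psql options.
RUN_SQL="${RUN_SQL:-psql -X -q -o /dev/null -v ON_ERROR_STOP=1 -c}"

# Hypothetical helper: succeeds if rows [offset, offset+limit) can be read.
# Arguments: table offset limit
chunk_ok() {
    $RUN_SQL "SELECT * FROM $1 OFFSET $2 LIMIT $3" >/dev/null 2>&1
}

# Hypothetical helper: prints the offset of the first unreadable row.
# Arguments: table total_rows
find_first_bad_row() {
    table="$1"
    lo=0
    hi="$2"
    # Nothing to bisect if the whole range reads fine.
    if chunk_ok "$table" 0 "$hi"; then
        echo "no failure in the first $hi rows" >&2
        return 1
    fi
    # Invariant: rows [0, lo) read fine, the first bad row is in [lo, hi).
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if chunk_ok "$table" "$lo" $((mid - lo)); then
            lo=$mid
        else
            hi=$mid
        fi
    done
    echo "$lo"
}
```

With a made-up table name, "find_first_bad_row mytable 50000000" should
print the offset of the first unreadable row. Something like
"SELECT ctid FROM mytable OFFSET <n> LIMIT 1" may then give the block
number (the first part of the ctid), provided that reading just the
ctid does not hit the damaged data.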

And yes, if there's corruption, there might be more corrupted blocks.

regards
Tomas

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)