Re: VMWare file system / database corruption

Scott Marlowe <scott.marlowe@xxxxxxxxx> · Mon, 21 Sep 2009 13:04:09 -0600

On Mon, Sep 21, 2009 at 12:46 PM, Tom Duffey <tduffey@xxxxxxxxxxxxxxxx> wrote:
>
> On Sep 21, 2009, at 12:40 PM, Scott Marlowe wrote:
>
>> On Mon, Sep 21, 2009 at 11:09 AM, Tom Duffey <tduffey@xxxxxxxxxxxxxxxx>
>> wrote:
>>>
>>> Hi All,
>>>
>>> We're having numerous problems with a PostgreSQL 8.3.7 database
>>> running on a virtual Linux server w/VMWare ESX.  This is not by
>>> choice and I have been asking the operator of this equipment for
>>> details about the disk setup and here's what I got:
>>>
>>> "We have a SAN that is presenting an NFS share.  VMWare sees that
>>> share and reads the VMDK file that make up the virtual file system."
>>>
>>> Does anyone with a better understanding of PostgreSQL and VMWare know if
>>> this is an unreliable setup for PostgreSQL?  I see things like "NFS" and
>>> "VMWare" and start to get worried.
>>
>> I see VMWare and think performance issues; I see NFS and think dear
>> god help us all.  Even if properly set up, NFS is a problem waiting to
>> happen, and it's not reliable storage for a database in my opinion.
>> That said, lots of folks do it.  Ask for the NFS mount options from
>> the sysadmin.
>
> Thanks to everyone so far for the insight.  I'm trying to get more details
> about the hardware setup but am not making much progress.
>
> Here are some of the errors we're getting.  I searched through archives and
> they all seem to point at hardware trouble but is there anything else I
> should be looking at?
>
> ERROR:  invalid page header in block 2 of relation "pg_toast_19466_index"
>
> ERROR:  invalid memory alloc request size 1667592311
> STATEMENT:  COPY public.version_bundle (node_id_hi, node_id_lo, bundle_data)
> TO stdout;
>
> ERROR:  unexpected chunk number 1632 (expected 1629) for toast value 19711
> in pg_toast_19184
> STATEMENT:  COPY public.data_binval (binval_id, binval_data) TO stdout;
>
> ERROR:  invalid page header in block 414 of relation "pg_toast_19460_index"
>
> ERROR:  could not open segment 1 of relation 1663/16386/16535 (target block
> 3966127611): No such file or directory
>
> I dealt with some of the above by reindexing or finding and deleting bad
> rows.  I can now successfully dump the database but of course have missing
> data so the application is toast.  What I'm really wondering now is how to
> prevent this from happening again and if that means moving the database to
> new hardware.

Definitely sounds like file system corruption to me.  And who knows
what's gotten hammered that hasn't caused an error, eh?  Time to move
to a standalone db server or get a sysadmin who knows how to set up
VMWare to make pgsql happy.
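
If it helps to see which relations have already complained, here is a
rough Python sketch that tallies the kinds of errors quoted above from
a server log.  The names scan_log and PATTERNS are made up for this
example, and the patterns are simply copied from the messages in this
thread, so treat it as a starting point and not a complete corruption
check.  It can only find damage that has already raised an error.

```python
import re
from collections import Counter

# Patterns copied from the error messages quoted in this thread.
# Any other log line is ignored.
PATTERNS = {
    "bad_page_header": re.compile(
        r'invalid page header in block (\d+) of relation "([^"]+)"'),
    "bad_alloc": re.compile(
        r"invalid memory alloc request size (\d+)"),
    "toast_chunk": re.compile(
        r"unexpected chunk number (\d+) \(expected (\d+)\) "
        r"for toast value (\d+)"),
    "missing_segment": re.compile(
        r"could not open segment (\d+) of relation (\S+)"),
}


def scan_log(lines):
    """Return (count per symptom, set of relations named in errors)."""
    counts = Counter()
    relations = set()
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match is None:
                continue
            counts[name] += 1
            # Only these two messages name a relation in group 2.
            if name in ("bad_page_header", "missing_segment"):
                relations.add(match.group(2))
    return counts, relations
```

Pass it any iterable of log lines, for example an open file handle on
wherever your server log lives.  Relations that show up are candidates
for a reindex or a closer look; a relation that does not show up is not
thereby proven clean.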
