Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

daveg <daveg@xxxxxxxxx> · Wed, 2 Mar 2011 13:30:34 -0800

On Tue, Mar 01, 2011 at 01:20:43PM -0800, daveg wrote:
> On Tue, Mar 01, 2011 at 12:00:54AM +0200, Heikki Linnakangas wrote:
> > On 28.02.2011 23:28, daveg wrote:
> > >On Wed, Jan 12, 2011 at 10:46:14AM +0200, Heikki Linnakangas wrote:
> > >>We'll likely need to go back and forth a few times with various
> > >>debugging patches until we get to the heart of this..
> > >
> > >Anything new on this? I'm seeing it on one of my client's production boxes.
> > 
> > I haven't heard anything from the OP since.
> > 
> > >Also, what is the significance, i.e. what is the risk or damage potential if
> > >this flag is set incorrectly?
> > 
> > Sequential scans will honor the flag, so you might see some dead rows 
> > incorrectly returned by a sequential scan. That's the only "damage", but 
> > an incorrectly set flag could be a sign of something more sinister, like 
> > corrupt tuple headers. The flag should never be set incorrectly, so if 
> > you see that message you have hit a bug in PostgreSQL, or you have bad 
> > hardware.
> > 
> > This flag is quite new, so a bug in PostgreSQL is quite possible. If you 
> > still have a backup that contains those incorrectly set flags, I'd like 
> > to see what the page looks like.
> 
> 
> I ran vacuums on all the affected tables last night. I plan to take a downtime
> to clear the buffer cache and then to run vacuums on all the dbs in the
> cluster.
> 
> Most but not all the tables involved are catalogs.
> 
> However, I could probably pick up your old patch sometime next week if it
> recurs and send you page images.

After a restart and a vacuum of all dbs with no other activity, things were
quiet for a couple of hours, and then we started seeing these PD_ALL_VISIBLE
messages again.

Going back through the logs, we have been getting these since some time before
mid-January. Oddly, this only happens on four systems, all of which are new
Dell 32-core Nehalem machines with 512GB, using iSCSI partitions served off a
NetApp. Our older 8-core 64GB hosts have never logged any of these errors. I'm
not saying it is related to the hardware: these hosts are doing a lot more work
than the old hosts, so it may be a concurrency problem that just never came up
at lower load levels before.
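
To make the log sweep repeatable across hosts, something like the following
could tally the warnings per relation. This is only a rough sketch: the
function name tally_warnings is made up for illustration, and the regular
expression assumes the warning text has the form
'PD_ALL_VISIBLE flag was incorrectly set in relation "name" page N', which
should be checked against the actual log lines before relying on it.

```python
import re
from collections import Counter

# Assumed shape of the warning text; adjust if the server logs differ.
WARNING_RE = re.compile(
    r'PD_ALL_VISIBLE flag was incorrectly set in relation '
    r'"(?P<rel>[^"]+)" page (?P<page>\d+)'
)


def tally_warnings(lines):
    """Count matching warnings per relation and collect the pages named."""
    counts = Counter()
    pages = {}
    for line in lines:
        match = WARNING_RE.search(line)
        if match is None:
            continue  # not a PD_ALL_VISIBLE warning
        rel = match.group("rel")
        counts[rel] += 1
        pages.setdefault(rel, set()).add(int(match.group("page")))
    return counts, pages
```

Feeding it an open log file (any iterable of lines works) gives a per-relation
count plus the set of page numbers, which would show whether the same pages
keep coming back after each vacuum or whether new ones appear.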

PostgreSQL version is 8.4.4.

I'll pick up Heikki's page-logging patch and run it for a bit to get some
damaged page images. What else could I be doing to track this down?

-dg

-- 
David Gould       daveg@xxxxxxxxx      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
