Maxim Boguk <maxim.boguk@xxxxxxxxx> writes:
> On Tue, Feb 21, 2012 at 5:32 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>> If the index key values are not private information, could we see that
>> with "pg_filedump -i -f" not just bare?

> There is it:
> http://maximboguk.com/static/etc/agency_statistics_pkey.pg_full_filedump.gz

Thanks.  The index definitely contains about a hundred apparently-live
index entries, which shouldn't be there considering the table is empty,
and it doesn't seem to be corrupted in any obvious way.

However, I noticed that all the pages have btpo_cycleid zero, which
seems a bit improbable given that some of them have BTP_SPLIT_END set,
and the only way for that flag to become set is if there are adjacent
index pages with different btpo_cycleid values.  This led me to look
for bugs that might reset btpo_cycleid incorrectly, and I believe I see
one.  _bt_delitems_delete has this code that was copied-and-pasted from
its previous incarnation as _bt_delitems:

    /*
     * We can clear the vacuum cycle ID since this page has certainly been
     * processed by the current vacuum scan.
     */
    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    opaque->btpo_cycleid = 0;

That comment is now a lie: this code is reached from _bt_findinsertloc,
not from vacuum.  So the series of events I'm imagining is:

1. Autovacuum starts.

2. A page gets split; the page in question is beyond where autovacuum
has reached in its index scan, but a page from before that point gets
recycled to become the new right half.  Both these pages get marked
with the vacuum's cycle ID.

3. Whatever's doing the insertions continues to insert into the same
range of the index.  (I notice that your INSERT query has a GROUP BY
that uses the high-order columns of your pkey, which makes it fairly
likely that insertions occur in something close to index order.  So
this is not unlikely.)  It hits a couple of duplicate index entries,
finds they're dead, and hence marks them dead and sets the
BTP_HAS_GARBAGE flag on the new right-half page (we have to assume this
happens, because the split would've cleared that flag).

4. It keeps on inserting and eventually fills the new right-half page.
At that point, since BTP_HAS_GARBAGE is set, it runs
_bt_delitems_delete, which incorrectly clears btpo_cycleid on that
page.

5. Finally, the concurrent vacuum reaches the originally split page.
It notes that btpo_cycleid matches its own cycle ID and correctly
realizes it has to go back and process the right-half page.  However,
when it reaches that page, it finds btpo_cycleid already zero and
concludes it doesn't need to scan that page.  Result: any entries that
should have been removed, but were not already marked DEAD at the time
_bt_delitems_delete ran, are never removed from the index.

Now obviously I can't prove that this is exactly what happened in your
index.  It would have to have happened several times during a single
vacuum run, I think, because the remaining index entries are spread
across half a dozen index pages.  However, all those pages are
relatively early in the index --- the first such page is block 2, and
the last one block 49, out of a 605-page index.  And we know that your
usage pattern is such that a whole lot of inserts would be going on at
the same time that vacuum had been removing a whole lot of old entries
(and hence causing low-numbered pages to get freed up and returned to
the free space map).  So it seems to fit.

Anyway, clearing btpo_cycleid there is unquestionably wrong.
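To make the asymmetry concrete, here is a sketch of which deletion path
may safely clear the cycle ID.  This is an illustration, not a proposed
patch: the real functions live in src/backend/access/nbtree/nbtpage.c
and do much more (actually deleting the items, WAL logging, etc.), and
the _sketch names below are made up, reduced to just the cycle-ID
handling at issue:

    #include "postgres.h"
    #include "access/nbtree.h"

    /* Modeled on _bt_delitems_vacuum: called from VACUUM's own index scan */
    static void
    bt_delitems_vacuum_sketch(Page page)
    {
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        /*
         * Safe here: VACUUM itself is processing this page, so the
         * split-tracking cycle ID has served its purpose and can be
         * discarded.
         */
        opaque->btpo_cycleid = 0;
    }

    /* Modeled on _bt_delitems_delete: reached from _bt_findinsertloc */
    static void
    bt_delitems_delete_sketch(Page page)
    {
        /*
         * Must NOT clear btpo_cycleid: an ordinary insertion knows
         * nothing about any concurrent vacuum's progress.  Zeroing the
         * cycle ID here is exactly what lets the vacuum skip the
         * right-half page in step 5 above.
         */
        (void) page;            /* nothing cycle-ID-related to do */
    }

The point is that only the vacuum-driven path can know the page has
been visited by the current vacuum cycle; the insert-driven
microvacuum cannot, so it has no business touching the field.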
			regards, tom lane