Re: Stats collector frozen?

Tom Lane <tgl@xxxxxxxxxxxxx> · Sat, 27 Jan 2007 21:51:40 -0500

Magnus Hagander <magnus@xxxxxxxxxxxx> writes:
> On Fri, Jan 26, 2007 at 09:55:39AM -0500, Tom Lane wrote:
>> Keep in mind also that we have seen the stats-test failure on
>> non-Windows machines, so we still need to explain that ...

> Yeah. But it *could* be two different stats issues lurking. Perhaps the
> issue we've seen on non-windows can be fixed by the settings Alvaro had
> me try (increasing autovacuum_vacuum_cost_delay or the delay in the
> regression test).

I had a sudden thought about that: the stats machinery is designed to be
unreliable, i.e., to drop messages under load.  Maybe the occasional stats
failures we see are just an artifact of that happening.  It would be
pretty unfortunate if the stats test and autovacuum together were
sufficient load to cause message drops, but I doubt that's the
explanation.  I think the important change here has been the default
enablement of stats_row_level.  That means that some of the tests
terminating just before the stats test starts may still be trying to
dump statistics out to the collector at the same time the stats test is.
(Keep in mind that psql does not wait around for the backend to be
actually gone before it exits, hence backend-exit cleanup is very likely
to happen in parallel with the start of the next test.)  This idea
explains why we mostly see the failure in the parallel schedule, not the
serial one: in the serial schedule there's no opportunity to have a gang
of backends all exiting at the critical time.

If this theory is correct, then we can improve the reliability of the
stats test a good deal if we put a sleep() at the *start* of the test,
to let any old backends get out of the way.  It seems worth a try
anyway.  I'll add this to HEAD and if the stats failure noise seems to
go down, we can back-port it.
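
To illustrate the theory, here is a toy model in Python.  It is only a
sketch and not how the collector is actually implemented: the function
name, the queue capacity, the drain rate, and the message counts are all
made-up assumptions.  It models a collector with a bounded buffer that
silently drops messages when full, a gang of exiting backends that all
flush at tick 0, and a stats test that sends its messages at a chosen
tick.

```python
from collections import deque


def dropped_test_messages(test_start, backends=20, msgs_per_backend=5,
                          test_msgs=10, capacity=32, drain_per_tick=8):
    """Count how many of the stats test's messages the toy collector drops.

    All parameters are hypothetical; they only need to make the burst
    from the exiting backends larger than the collector's buffer.
    """
    queue = deque()
    dropped = 0
    for tick in range(max(test_start, 0) + 1):
        arrivals = []
        if tick == 0:
            # Every exiting backend dumps its stats at the same moment.
            arrivals += ["old"] * (backends * msgs_per_backend)
        if tick == test_start:
            # The stats test sends its own messages.
            arrivals += ["test"] * test_msgs
        for msg in arrivals:
            if len(queue) < capacity:
                queue.append(msg)
            elif msg == "test":
                # Buffer full: the message is lost without any error.
                dropped += 1
        # The collector processes a fixed number of messages per tick.
        for _ in range(min(drain_per_tick, len(queue))):
            queue.popleft()
    return dropped


if __name__ == "__main__":
    for start in (0, 1, 5):
        print(start, dropped_test_messages(start))
```

With these made-up numbers, a test that starts together with the exiting
backends loses all of its messages, one that starts a tick later loses
some, and one that waits until the backlog has drained loses none.  That
is the effect a sleep at the start of the test is meant to buy.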

			regards, tom lane