Re: Hanging queries on dual CPU windows

"Magnus Hagander" <mha@xxxxxxxxxxxxxx> · Fri, 10 Mar 2006 16:11:00 +0100

> > > >  I dunno
> > > >
> > > > > if you've got anything gdb-equivalent under Windows, but that's
> > > > > the first thing I'd be interested in ...
> > > >
> > > > Here ya go:
> > > >
> > > > http://www.devisser-siderius.com/stack1.jpg
> > > > http://www.devisser-siderius.com/stack2.jpg
> > > > http://www.devisser-siderius.com/stack3.jpg
> > > >
> > > > There are three threads in the process. I guess thread 1
> > > > (stack1.jpg) is the most interesting.
> > > >
> > > > I also noted that cranking up concurrency in my app reproduces the
> > > > problem in about 4 minutes ;-)
> >
> > Just reproduced again.
> >
> > > Actually, stack2 looks very interesting. Does it "stay stuck" in 
> > > pg_queue_signal? That's really not supposed to happen.
> >
> > Yes it does.
> 
> An update on that: There are actually *two* processes in this 
> state, both hanging in pg_queue_signal. I've looked at the 
> source of that, and the obvious candidate for hanging is 
> EnterCriticalSection. I also found this:
> 
> http://blogs.msdn.com/larryosterman/archive/2005/03/02/383685.aspx
> 
> where they say:
> 
> "
> In addition, for Windows 2003, SP1, the EnterCriticalSection 
> API has a subtle change that's intended to resolve many of 
> the lock convoy issues.  Before Win2003 SP1, if 10 threads 
> were blocked on EnterCriticalSection and all 10 threads had 
> the same priority, then EnterCriticalSection would service 
> those threads in a FIFO (first-in, first-out) basis.  Starting in 
> Windows 2003 SP1, the EnterCriticalSection will wake up a 
> random thread from the waiting threads.  If all the threads 
> are doing the same thing (like a thread pool) this won't make 
> much of a difference, but if the different threads are doing 
> different work (like the critical section protecting a widely 
> accessed object), this will go a long way towards removing 
> lock convoy semantics.
> "
> 
> Could it be they broke it when they did that????

In theory, yes, but it still seems a bit far-fetched :-(

If you have a build environment, can you try swapping the order of these two lines:
	ResetEvent(pgwin32_signal_event);
	LeaveCriticalSection(&pg_signal_crit_sec);

in backend/port/win32/signal.c?
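
To make the swap concrete, here is a rough, simplified sketch of the pattern with the two lines reordered. It is not the actual signal.c source: the sketch_* functions, the plain-int signal queue and the non-Windows stand-ins (there only so it compiles anywhere) are all made up for illustration. Only pg_signal_crit_sec, pgwin32_signal_event and the Win32 calls are taken from the real thing.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Stand-ins so the sketch compiles off Windows. These are NOT the real
 * Win32 API; they only mimic the shape of the calls used below. */
#define TRUE 1
#define FALSE 0
struct fake_event { int signaled; };
typedef struct fake_event *HANDLE;
typedef struct { int held; } CRITICAL_SECTION;
static struct fake_event fake_event_storage;

static void InitializeCriticalSection(CRITICAL_SECTION *cs) { cs->held = 0; }
static void EnterCriticalSection(CRITICAL_SECTION *cs) { cs->held = 1; }
static void LeaveCriticalSection(CRITICAL_SECTION *cs) { cs->held = 0; }
static HANDLE CreateEvent(void *attrs, int manual_reset, int initial,
                          const char *name)
{
    (void) attrs; (void) manual_reset; (void) name;
    fake_event_storage.signaled = initial;
    return &fake_event_storage;
}
static int SetEvent(HANDLE ev) { ev->signaled = 1; return 1; }
static int ResetEvent(HANDLE ev) { ev->signaled = 0; return 1; }
#endif

static CRITICAL_SECTION pg_signal_crit_sec;
static HANDLE pgwin32_signal_event;
static int sketch_signal_queue;   /* hypothetical: bitmask of queued signals */

/* Hypothetical one-time setup: lock plus a manual-reset event. */
static void sketch_init(void)
{
    InitializeCriticalSection(&pg_signal_crit_sec);
    pgwin32_signal_event = CreateEvent(NULL, TRUE, FALSE, NULL);
    assert(pgwin32_signal_event != NULL);
    sketch_signal_queue = 0;
}

/* Sender side: queue a signal under the lock, then wake the receiver. */
static void sketch_queue_signal(int signum)
{
    EnterCriticalSection(&pg_signal_crit_sec);
    sketch_signal_queue |= 1 << signum;
    LeaveCriticalSection(&pg_signal_crit_sec);
    SetEvent(pgwin32_signal_event);
}

/* Receiver side: drain the queue and return the bitmask that was pending. */
static int sketch_dispatch_reordered(void)
{
    int pending;

    EnterCriticalSection(&pg_signal_crit_sec);
    pending = sketch_signal_queue;
    sketch_signal_queue = 0;

    /* The experiment: leave the critical section first and reset the
     * event afterwards (the two lines quoted above, swapped). */
    LeaveCriticalSection(&pg_signal_crit_sec);
    ResetEvent(pgwin32_signal_event);
    return pending;
}
```

Note that in this order, a signal queued between the two calls would get its event reset while still sitting in the queue, so treat this as a diagnostic experiment, not a proposed fix.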

If not, can you also try disabling the stats collector to see if that makes a difference? (It could be a workaround.)
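
For the stats collector test, something like this in postgresql.conf should do it, assuming an 8.1-era server where the collector is controlled by stats_start_collector. The setting is only read at server start, so restart the server after changing it:

```
# postgresql.conf (assumes an 8.1-era server)
stats_start_collector = off
```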

//Magnus