On Mon, 1 Aug 2005, Paul Howarth wrote:

> I see that a number of jobs have now made it into the queue, including
> both of my requests (and some duplicates from other people too). I tried
> killing one of my duplicate jobs about 20 minutes ago by doing:
>
> $ plague-client kill 282
>
> Shortly afterwards I received an email stating that the job had been
> killed. However, the page
> http://buildsys.fedoraproject.org/build-status/job.psp?uid=282 still
> shows that job as "building" and in fact the plague-client command has
> still not exited. This doesn't seem right...

It appears that (as of last night) the build server was stuck in
SSL_BIO_read(), trying to receive data from hammer3. I killed the hammer3
plague-builder process, but the server didn't notice because it was
blocked in that function.

The fix is to use socket timeouts, which essentially make the sockets
non-blocking, but that leads to other problems that need to be dealt with
as well (i.e., socket.makefile() doesn't work well with
socket.settimeout(), but we have to use makefile because the SSL sockets
don't have a dup2()). I hope that I can come up with some non-blocking
solution to deal with these issues. The worst thing is that these
problems are completely non-reproducible and occur at random.

The immediate solution is to restart the build server.

Dan
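
For illustration only, here is a minimal Python sketch of the
socket-timeout idea mentioned above. It is not plague's code:
recv_or_none() and demo() are hypothetical names, and it uses a plain
local socketpair rather than an SSL connection, so it does not touch the
makefile()/SSL complications at all. It only shows how a read from a
silent peer can give up after a deadline instead of blocking forever.

```python
import socket


def recv_or_none(sock, nbytes, timeout):
    """Read up to nbytes from sock; return None if the peer stays silent.

    Hypothetical helper: a timeout turns an indefinite block in recv()
    into an exception the caller can handle.
    """
    sock.settimeout(timeout)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # Nothing arrived within `timeout` seconds.
        return None


def demo():
    # Stand-in for the server <-> builder connection (assumption: plain
    # sockets, no SSL).
    server_side, builder_side = socket.socketpair()
    try:
        # The "builder" says nothing, so the read times out.
        first = recv_or_none(server_side, 1024, 0.2)
        # Now the "builder" reports, and the read succeeds.
        builder_side.sendall(b"status: done")
        second = recv_or_none(server_side, 1024, 0.2)
        return first, second
    finally:
        server_side.close()
        builder_side.close()


if __name__ == "__main__":
    print(demo())
```

With a timeout in place, the caller gets control back and can decide
whether to retry or mark the builder as unreachable. Reading through a
makefile() object after a timeout is a separate matter, since its
internal buffer may be left in an inconsistent state.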