Hi

Recently NetBSD regression tests started hanging quite frequently.
Here is an example:
http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/1679/

The offending test is root-squash-self-heal.t, which starts a
never-ending glfsheal process:

  PID LID WCHAN   STAT     LTIME COMMAND
28554   5 parked  Rl     0:00.04 /build/install/sbin/glfsheal patchy
28554   4 nanoslp Rl     0:01.28 /build/install/sbin/glfsheal patchy
28554   3 -       Rl     0:00.00 /build/install/sbin/glfsheal patchy
28554   1 -       Rl   754:21.27 /build/install/sbin/glfsheal patchy

Thread 1 ate a lot of CPU time. It is looping on failed writes:

 28554      1 glfsheal CALL  __gettimeofday50(0xbf7fe650,0)
 28554      1 glfsheal RET   __gettimeofday50 0
 28554      1 glfsheal CALL  write(9,0xbb7c63fb,6)
 28554      1 glfsheal RET   write -1 errno 35 Resource temporarily unavailable

Running a standalone glfsheal process shows it first writes "dummy"
before it hits the same error. This suggests we are in
event_dispatch_destroy():

        /* Write to pipe(fd[1]) and then wait for 1 second or until
         * a poller thread that is dying, broadcasts.
         */
        while (event_pool->activethreadcount > 0) {
                write (fd[1], "dummy", 6);
                sleep_till.tv_sec = time (NULL) + 1;
                ret = pthread_cond_timedwait (&event_pool->cond,
                                              &event_pool->mutex,
                                              &sleep_till);
        }

Obviously something went wrong. Perhaps there should be a timeout
there, and/or a check that write() does not fail?

diff --git a/libglusterfs/src/event.c b/libglusterfs/src/event.c
index f19d43a..b956d25 100644
--- a/libglusterfs/src/event.c
+++ b/libglusterfs/src/event.c
@@ -235,10 +235,14 @@ event_dispatch_destroy (struct event_pool *event_pool)
         pthread_mutex_lock (&event_pool->mutex);
         {
                 /* Write to pipe(fd[1]) and then wait for 1 second or until
-                 * a poller thread that is dying, broadcasts.
+                 * a poller thread that is dying, broadcasts. Make sure we
+                 * do not loop forever by limiting to 10 retries
                  */
-                while (event_pool->activethreadcount > 0) {
-                        write (fd[1], "dummy", 6);
+                int retry = 0;
+
+                while (event_pool->activethreadcount > 0 && retry++ < 10) {
+                        if (write (fd[1], "dummy", 6) == -1)
+                                break;
                         sleep_till.tv_sec = time (NULL) + 1;
                         ret = pthread_cond_timedwait (&event_pool->cond,
                                                       &event_pool->mutex,

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@xxxxxxxxxx