On Mon, 28 Apr 2014 18:51:20 -0700
Andy Grover <agrover@xxxxxxxxxx> wrote:

> This patch and the next are somewhat a revert of 318e9f2, but the previous
> fix didn't quite close the race. This only happens when we create threads
> for a backstore that turns out to be invalid, which we then tear down.
>
> See https://bugzilla.redhat.com/show_bug.cgi?id=848585 .
>
> This is occurring because there's still a window where a thread misses
> seeing info->stop == 1 but is not yet in cond_wait so it misses the
> broadcast:
>
> thread_close:                  thread_worker_fn:
>                                info->stop is seen as 0
> info->stop = 1
> pthread_cond_broadcast         -- misses broadcast
>                                pthread_cond_wait
> pthread_join (hangs)
>
> I believe the solution is to go back to using pthread_cancel. We can call
> it before pthread_cond_wait is called (or after) and it will do the right
> thing: pop out and exit. The only tricky bit is we need to use the
> pthread_cleanup_push mechanism to properly release info->pending_lock.
>
> Signed-off-by: Andy Grover <agrover@xxxxxxxxxx>
> ---
>  usr/bs.c        | 25 ++++++++++++++-----------
>  usr/bs_thread.h |  2 --
>  2 files changed, 14 insertions(+), 13 deletions(-)

Thanks a lot for the fixes and detailed explanation. Surely, looks like
there is a race. The whole patchset looks good.

Applied, thanks!
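
For readers following the thread, here is a minimal sketch of the approach the patch description outlines: cancel the worker with pthread_cancel and rely on a pthread_cleanup_push handler to release the lock if cancellation lands inside pthread_cond_wait. This is not the code from usr/bs.c. The struct layout and the names worker_info, pending_cond, unlock_pending, worker_fn and demo_cancel_worker are assumptions made up for illustration; only pending_lock and the pthread calls come from the description above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-backstore thread info. */
struct worker_info {
    pthread_mutex_t pending_lock;
    pthread_cond_t pending_cond;
    int pending;                /* number of queued work items */
};

/* Cleanup handler: releases pending_lock, both on the normal path
 * (via pthread_cleanup_pop(1)) and when the thread is cancelled. */
static void unlock_pending(void *arg)
{
    struct worker_info *info = arg;

    pthread_mutex_unlock(&info->pending_lock);
}

static void *worker_fn(void *arg)
{
    struct worker_info *info = arg;

    for (;;) {
        pthread_mutex_lock(&info->pending_lock);
        pthread_cleanup_push(unlock_pending, info);

        /* pthread_cond_wait is a cancellation point. A cancel request
         * that arrives before or during the wait is acted on here, and
         * the mutex is held again when the handler runs. */
        while (info->pending == 0)
            pthread_cond_wait(&info->pending_cond, &info->pending_lock);
        info->pending--;        /* "process" one item */

        pthread_cleanup_pop(1); /* pop the handler and run it: unlock */
    }
    return NULL;
}

/* Starts a worker with no work queued, cancels it, and checks that
 * the join completes and the lock was released. Returns 0 on success. */
int demo_cancel_worker(void)
{
    struct worker_info info = { .pending = 0 };
    pthread_t tid;
    void *ret = NULL;

    pthread_mutex_init(&info.pending_lock, NULL);
    pthread_cond_init(&info.pending_cond, NULL);

    if (pthread_create(&tid, NULL, worker_fn, &info) != 0)
        return -1;

    /* No stop flag and no broadcast: the request stays pending until
     * the worker reaches pthread_cond_wait, so it cannot be missed. */
    pthread_cancel(tid);
    if (pthread_join(tid, &ret) != 0 || ret != PTHREAD_CANCELED)
        return -1;

    /* If the handler had not run, the lock would still be held. */
    if (pthread_mutex_trylock(&info.pending_lock) != 0)
        return -1;
    assert(info.pending == 0);  /* nothing was queued or consumed */
    pthread_mutex_unlock(&info.pending_lock);

    pthread_cond_destroy(&info.pending_cond);
    pthread_mutex_destroy(&info.pending_lock);
    return 0;
}
```

Calling demo_cancel_worker() from a test program should return 0 on a POSIX system (build with -pthread). The sketch relies on the default deferred cancellation type, which is why the request is only acted on at a cancellation point such as pthread_cond_wait and not while the worker is between the lock and the wait.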