On Aug 14, 2012, Alexandre Oliva <oliva@xxxxxxxxxxxxxxxxx> wrote: > Don't wait for a signal if the dispatch queue is non-empty already, > and support a non-NULL pipe for local responses. If we were to > require pipe to be NULL for that, the response would be lost. Although this patch applies to 0.50, I developed it on 0.48argonaut, I failed to mention. It occurred to me that I haven't even tried to duplicate on 0.50 the problem I'd observed consistently on 0.48, namely, that accessing (stat) hardlinks that were not in the MDS cache would hard-block the client, because the MDS would issue an archortable request to itself, and the response wouldn't find its way back. At first, I thought the response had been queued but not picked up because it would be signalled before the dispatcher returned and Wait()ed for the signal, so it could be a lost wake-up race condition. That turned out to not fix the problem, but I figured I'd leave the hunk in, for it appeared to make sense on its own (the spelling of the empty() call is different in 0.50 and 0.48, FWIW). > - if (!stop) > + if (queued_pipes.empty() && !stop) > cond.Wait(lock); //wait for something to be put on queue Debugging further, I noticed that the anchortable request was issued the first time (and never re-issued, for the request remained in the pending queue), but the response didn't make it back: it was handled as a remote request, rather than a local one, because we find a pipe at the time of posting the response. However, since there aren't threads for such local pipes, the response is AFAICT queued but never sent. Once I enabled the destination for the response to be recognized as local in spite of the existence of the pipe, the response found its way back to the (local) requester and the client didn't hang any more. At that point, I stopped digging. > // local? > - if (!pipe && my_inst.addr == dest_addr) { > + if (my_inst.addr == dest_addr) { I ran a 0.48 with this (or rather the equivalent) change for a bit, then upgraded to 0.50, ported the patches, tested them for a bit, and posted the patches. The client hang was 100% reproducible with the unpatched 0.48, as long as the hardlinked inode was out of the MDS cache, and it could only be temporarily averted by accessing the hardlinked file by another name that, I suppose, didn't require a self-request of anchortable lookup. After the patch, the problem is gone for good. -- Alexandre Oliva, freedom fighter http://FSFLA.org/~lxoliva/ You must be the change you wish to see in the world. -- Gandhi Be Free! -- http://FSFLA.org/ FSF Latin America board member Free Software Evangelist Red Hat Brazil Compiler Engineer -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html