On Mon, Jan 09, 2017 at 08:39:31PM +0200, Tuomas Tynkkynen wrote:

> Yes, this does seem to be related to this or otherwise MAX_REQ related!
> - Bumping MAX_REQ up to 1024 makes the hang go away (on 4.7).
> - Dropping it to 64 makes the same hang happen on kernels where it worked
>   before (I tried 4.4.x).
> - Doing s/(MAX_REQ - 1)/MAX_REQ/ makes the hang go away.

Note that it's still possible to trigger the same situation with that
off-by-one taken care of; if the client sends 64 Treadlink and 64 Tflush
(one for each of those), then follows with another pile of Treadlink
(feeding them in as soon as free slots appear), the server can end up
with pdu_alloc() failing - completion of readlink will release its slot
immediately, but the pdu won't get freed until the corresponding flush
has gotten the CPU.

I'm _not_ familiar with scheduling in qemu and I don't quite understand
the mechanism of getting from "handle_9p_output() bailed out with some
requests still not processed" to "no further requests get processed", so
it might be that for some reason triggering the former as described above
won't escalate to the latter, but I wouldn't count upon that.

Another thing I'm very certain about is that

	9 0 0 0 108 1 0 1 0

sent by a broken client (Tflush tag = 1 oldtag = 1) will do nasty things
to the qemu server. v9fs_flush() will try to find the pdu of the request
to cancel, find its own argument, put itself on its ->complete and yield
the CPU, expecting to get back once the victim gets through to
pdu_complete(). Since the victim is itself...

AFAICS, the things a client can expect wrt Tflush handling are
	* in no case should Rflush be sent before the reply to the request
	  it's trying to cancel
	* the only case when the server _may_ not send Rflush is the arrival
	  of more than one Tflush with the same oldtag; in that case it is
	  allowed to suppress replies to earlier ones. If they are not
	  suppressed, replies should come in the order of Tflush arrivals.
	* if a reply to Tflush is sent (see above), it must be Rflush.
	* multiple Tflush with the same oldtag are allowed; the Linux kernel
	  client does not issue those, but other clients might. As a matter
	  of fact, the Plan 9 kernel client *does* issue those.
	* Tflush to Tflush is a no-op; it still needs a reply, and ordering
	  constraints apply (it can't be sent before the reply to the Tflush
	  it's been referring to, which, in turn, can't be sent before the
	  reply to the request the first Tflush refers to). Normally such
	  requests are not sent, but in principle they are allowed.
	* Tflush to a request that isn't being processed should be answered
	  immediately. The same goes for a Tflush referring to itself. The
	  former is not an error (we might have already sent a reply), but
	  the latter might be worth a loud warning - clients are definitely
	  not supposed to do that. It still needs Rflush in response -
	  Rerror is not allowed.
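
For illustration, here is how the nine bytes quoted earlier decode, plus
the "answer immediately" rule from the list above. This is a minimal
Python sketch, not anything taken from qemu: it assumes the usual 9P
framing of size[4] type[1] tag[2] oldtag[2], all little-endian, and
parse_tflush(), flush_action() and the returned strings are hypothetical
names made up for the example.

```python
import struct

TFLUSH = 108  # 9P message type of Tflush, as in the byte dump


def parse_tflush(msg: bytes):
    """Decode size[4] type[1] tag[2] oldtag[2] (little-endian)."""
    if len(msg) != 9:
        raise ValueError("Tflush is exactly 9 bytes long")
    size, mtype, tag, oldtag = struct.unpack("<IBHH", msg)
    if size != len(msg) or mtype != TFLUSH:
        raise ValueError("not a well-formed Tflush")
    return tag, oldtag


def flush_action(tag, oldtag, in_flight):
    """Hypothetical policy: when may Rflush for this Tflush be sent?

    in_flight is the set of tags the server is currently processing.
    """
    if oldtag == tag:
        # Self-referential flush: never wait on ourselves; reply with
        # Rflush right away and complain loudly about the client.
        return "reply-now-and-warn"
    if oldtag not in in_flight:
        # Nothing to cancel (the reply may already be out): not an error.
        return "reply-now"
    # Rflush must not precede the reply to the request being cancelled.
    return "reply-after-victim"


msg = bytes([9, 0, 0, 0, 108, 1, 0, 1, 0])
tag, oldtag = parse_tflush(msg)
print(tag, oldtag, flush_action(tag, oldtag, in_flight={tag}))
```

The point of checking oldtag == tag first is that the flush request
itself is in flight when it is being handled, so a plain lookup by
oldtag would find it and wait on it forever.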